[00:16:26] (03PS1) 10Ottomata: Writing Kafka stats to jmxtrans outfile at /var/log/kafka/kafka-jmx.log [operations/puppet] - 10https://gerrit.wikimedia.org/r/93897 [00:17:29] (03CR) 10Ottomata: [C: 032 V: 032] Writing Kafka stats to jmxtrans outfile at /var/log/kafka/kafka-jmx.log [operations/puppet] - 10https://gerrit.wikimedia.org/r/93897 (owner: 10Ottomata) [00:25:10] every single custom ganglia view is empty at the moment [00:25:13] at least for me [00:25:38] ori-l: same [00:25:43] well, of the two I just tested [00:25:52] this sometimes happens when the front-end gives up on waiting for the back end service to return the XML document that describes all the metrics [00:25:59] I can click more, but I think it's not needed [00:26:06] (03PS1) 10Eloquence: Increase upload size limit for chunked and URL uploads to 1000MB. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93900 [00:26:17] ottomata: is it possible that you are adding a lot metrics and that it is overwhelming ganglia? [00:27:21] hm, ori-l, i doubt it, i'm actually reducing the number of metrics compared to what was (sortof) there last week [00:27:40] reducing how? just reporting fewer metrics? [00:28:06] yea [00:28:24] i have been restarting gmetad occaisionally on nickel as I blast away old .rrds there [00:28:28] i'm bascialy done though [00:29:06] i wish i understood why this happens [00:29:09] i've seen it before [00:29:29] it has fixed itself in the past, i think for now i'll just wait before poking nickel [00:29:35] !log aaron synchronized php-1.23wmf1/includes/clientpool '313457c3ed8d58d7193d806e0155daea59adada4' [00:29:52] Logged the message, Master [00:30:49] !log aaron synchronized php-1.23wmf2/includes/clientpool 'f1df92c9308c3350098dc9ca9e7ac1a221810cd7' [00:31:05] Logged the message, Master [00:35:18] paravoid: has any of that temp data cleared? [00:39:07] (03PS1) 10Ottomata: Giving analytics102{1,2} a public address. [operations/dns] - 10https://gerrit.wikimedia.org/r/93903 [00:40:03] (03CR) 10Ottomata: "Also, I noticed that analytics1003 and analytics1004 (which currently have public IPs), did not have their .eqiad internal addresses remov" [operations/dns] - 10https://gerrit.wikimedia.org/r/93903 (owner: 10Ottomata) [00:43:21] (03PS1) 10Ottomata: analytics102{1,2} are Kafka brokers and need public addresses. [operations/puppet] - 10https://gerrit.wikimedia.org/r/93904 [00:47:54] (03PS1) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/93905 [00:48:30] (03Abandoned) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/93905 (owner: 10Ottomata) [00:49:47] (03PS14) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [00:59:43] ...and the graphs are back. [01:16:17] 0 5 */3 * * flock -n /var/lock/update-special-pages /usr/local/bin/update-special-pages > /var/log/updateSpecialPages.log 2>&1 [01:16:38] Can someone look for evidence of that cron job running on terbium as the apache user? I can't read the syslog [01:18:36] Reedy: confirmed it is in the cron of user apache, but that log file doesn't exist [01:19:32] mutante: Yup, as I had and also commented on https://bugzilla.wikimedia.org/show_bug.cgi?id=53227#c37 :P [01:19:52] 39 even [01:20:33] Could it be failing because it can't write to that log file? [01:21:41] Reedy: no, it works [01:22:09] it creates the logfile when i run it as apache [01:22:36] i stopped it right away, but; [01:22:42] Uncategorizedcategories got 24 rows in 0.01s [01:23:03] That log file should still be there after [01:23:20] It's not helped by it running on over 800 wikis, doing tens of jobs per wiki... [01:23:41] "It doesn't work" [01:23:45] It's fine on aawiki! [01:24:21] Reedy: sudo -u apache flock -n ... bla bla [01:24:35] Yeah [01:24:39] logfile ends up being [01:24:41] -rw-r--r-- 1 root root 6.7K Nov 6 01:21 updateSpecialPages.log [01:24:52] I believe I did something like that last time for the last manual run [01:25:31] i can run it in a screen, as apache [01:25:51] but why did it not run by itself [01:26:06] or is the logfile just removed since the last run [01:27:14] I don't think so [01:27:51] The following data is cached, and was last updated 05:57, 10 September 2013. A maximum of 2,000 results are available in the cache. [01:28:24] That's apparently the last run, but run by what exactly, I'm not sure [01:29:42] Hour 5, every 3 days? [01:42:39] Reedy: ok, pretty sure it's the permissions, i commented the actual exection in the script, so all it lists is db names [01:42:51] then touched the logfile, let apache write to it [01:42:58] change cronjob to "in a minute".. and works [01:47:42] Reedy: i say next run it will work and the root cause is having to puppetize that logfile.. commented on ticket.. ttyl [01:53:19] (03PS1) 10Yuvipanda: Labs: Rename confusing role name to more accurate role name [operations/puppet] - 10https://gerrit.wikimedia.org/r/93910 [01:53:32] woo, a patch to ops/puppet after so long! [02:00:55] YuviPanda, mutante RobH re the earlier ping - I don't think petan has access, it's just myself and jeremyb [02:02:16] Thehelpfulone: he has signed an NDA, though [02:02:27] Thehelpfulone: since he's admin on toollabs [02:03:10] yeah I knew that - I think a number of volunteers have signed NDAs in the past though for various reasons - and the newer ones specify exactly what the NDA covers AFAIK, so you'd want to double check with legal each time [02:03:23] ah, right [02:21:04] there isn't just one type of NDA... [02:21:07] afaict [02:21:52] !log LocalisationUpdate completed (1.23wmf2) at Wed Nov 6 02:21:52 UTC 2013 [02:22:10] Logged the message, Master [02:32:14] mutante, yeah, I meant the newer ones for volunteers [02:33:36] Thehelpfulone: when asked i said they should copy the one you have :p [02:33:44] or unify [02:33:59] see reply on bz on spam filter.. bbl [02:35:33] !log LocalisationUpdate completed (1.23wmf1) at Wed Nov 6 02:35:33 UTC 2013 [02:35:48] Logged the message, Master [03:23:59] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Nov 6 03:23:59 UTC 2013 [03:24:16] Logged the message, Master [05:05:32] (03CR) 10Ori.livneh: [C: 032] udp2log: demux now filter groups stricly [operations/puppet] - 10https://gerrit.wikimedia.org/r/93864 (owner: 10Hashar) [05:09:17] !log on tin: running bug 53687 cleanup script (WikimediaMaintenance/bug-53687/fixOrphans.php) on 37 wikis [05:09:35] Logged the message, Master [06:57:55] (03PS1) 10Spage: Enable Flow on QA pages on beta labs. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93927 [07:01:03] (03CR) 10Hashar: [C: 032] "Go go go :]" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93927 (owner: 10Spage) [07:01:13] (03Merged) 10jenkins-bot: Enable Flow on QA pages on beta labs. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93927 (owner: 10Spage) [07:40:54] (03PS1) 10ArielGlenn: mark stuff in decomm.pp with their rt tickets for easier tracking [operations/puppet] - 10https://gerrit.wikimedia.org/r/93930 [07:41:44] (03CR) 10ArielGlenn: "this is not really meant to get merged, it's just here to aid in eventual removal of this file" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93930 (owner: 10ArielGlenn) [07:43:31] apergos: I wouldn't mind seeing it merged. [07:43:42] seems useful [07:44:10] well this is here mostly so robh can look at it, he was planning n getting rid of this file within days [07:44:30] however if there is a delay on that I'll go ahead [07:46:44] I guess palladium is still in process of being set up? (noticing puppet errors like 'no /var/lib/puppet/server/ssl/certs/puppet.pem') [07:46:58] akosiaris: ^ [07:47:28] ah, rain... [07:47:43] yeah [07:47:59] I was so tired yesterday that slept through the night without waking up at all from the storm [07:48:05] I just woke up and saw my balcony wet [07:48:21] paravoid: apergos: yes palladium is almost setup [07:48:28] I woke at one point, thunder etc, it was nice... not enough to be sleepless, just enough to be lulled back to sleep again [07:48:34] ok, I will ignore then, thanks [07:48:42] the only manual things left to do is copy private and /var/lib/puppet/server/ssl [07:48:49] cool! [07:49:12] private needs some cleanup too [07:49:27] let's move the canonical location in /var/lib/git or something (but maybe keep a checkout in ~) [07:49:38] the hook is buggy iirc [07:49:43] I just got derailed by strontium not having any *working* connection to the switch. [07:49:46] not sure if it uses the gitpuppet account nowadays [07:49:48] paravoid: yes i agree [07:49:55] cool [07:49:58] andrewbogott_afk actually made it better [07:50:13] because it was bad imho but it needs a little more love [07:50:35] i don't think though that cding to /var/lib/git/operations/private is the best way forward [07:50:51] nod [07:51:01] that's why I said "maybe keep a checkout in ~" [07:51:32] probably the repo in ~ needs to be the "gerrit" is this case [07:51:47] and /var/lib/git/operations/private in machines need to be... well those [07:51:47] for priate? [07:51:51] fuck no :) [07:52:00] where would you have it then ? [07:52:03] I don't trust gerrit for *all* of our passwords [07:52:16] it was "gerrit" for a reason [07:52:29] i meant it as the primary source of information [07:52:38] it'd still be stored there [07:52:45] the place where we do changes [07:52:56] gerrit is publically accessible [07:53:04] stop talking about gerrit [07:53:08] it was a metaphor [07:53:11] ah [07:53:27] and a bad one as it seems [07:53:34] :-D [07:53:42] so... reiterating [07:53:48] yeah, start over :) [07:54:11] I think that ~ is a good place for the primary repo to be in [07:54:17] the repo where we commit [07:54:25] and the /var/lib/etcetc repos [07:54:29] just pulling from that one [07:54:36] agreed [07:54:39] :-) [07:54:54] well, probably pushing the other way around, not pulling [07:54:57] but still, same thing [07:55:02] the process though still needs some love though [07:55:09] it does :) [07:55:12] some hooks to avoid commits in /var/lib/etc [07:55:29] some informative messages and a clear documentation [07:56:31] how does this differ from what we have now? [07:57:03] everyday use ? it does not [07:57:25] but it sure was not clear to me when i first joined [07:57:32] sure [07:57:48] and I have seen more than once the private dirs in sockpuppet and stafford being out of sync [07:58:05] maybe after I get 'done' with cleanup, I should go on a documentation spree, updating docs for all our processes [07:58:14] and that's what i want to avoid in the future [07:58:19] ah [07:59:04] I think I've been lucky (but I always check that my stuff synced too before leaving) [07:59:05] well ... many of our processes are not clearly documented so if you did that it would be great [07:59:23] I will think about it... it's another long thankless task. but maybe! [07:59:43] i can say thank you. As many times as you 'd like :-) [07:59:49] (recently the number of 'this isn't what the docs say' has gone up) [07:59:51] hahaha [08:00:09] we'll see :-D [08:06:01] (03PS1) 10ArielGlenn: horrible hack to fix up puppet on formey til the host goes away [operations/puppet] - 10https://gerrit.wikimedia.org/r/93932 [08:07:05] (03CR) 10ArielGlenn: [C: 032] horrible hack to fix up puppet on formey til the host goes away [operations/puppet] - 10https://gerrit.wikimedia.org/r/93932 (owner: 10ArielGlenn) [08:15:52] is there currently any ganglia work going on? [08:16:17] some hosts seem to have disappeared all of a sudden [08:16:34] for example cerium and xenon [08:17:50] not by me [08:18:33] was following the cassandra test at https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=ch&vn=&hreg%5B%5D=%28cerium%7Cxenon%7Cpraseodymium%29 [08:23:52] Nov 6 06:31:19 neon kernel: [2556701.921304] EXT3-fs error (device md0): ext3_lookup: deleted inode referenced: 21526 [08:24:06] (sorry, on another track right now) [08:26:19] Nov 6 06:31:18 neon kernel: [2556701.046954] EXT3-fs error (device md0): ext3_lookup: deleted inode referenced: 21531 [08:26:19] Nov 6 06:31:18 neon kernel: [2556701.055236] Aborting journal on device md0. [08:26:19] Nov 6 06:31:18 neon kernel: [2556701.080431] EXT3-fs (md0): error: ext3_journal_start_sb: Detected aborted journal [08:26:20] joy [08:26:37] Nov 6 06:31:18 neon icinga: Warning: Unable to move file '/tmp/checkJELVaB' to check results queue. [08:27:28] (03CR) 10Akosiaris: "Any reason we haven't merged this yet ?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88441 (owner: 10Reedy) [08:28:24] need a little help here: least disruptive way to fsck root filesystem remotely? [08:29:25] apart from rebooting ? [08:29:46] well yeah.. it's neon, I'll do it if I have to but ugh [08:30:08] if that's the best way then ok [08:30:23] it is [08:30:26] bummer [08:30:30] * apergos gets on it [08:31:11] just touch /forcefsck [08:31:16] and it should it on reboot [08:31:28] k [08:31:47] I assume it would anyways, filesystem errors. but done [08:31:58] heh can't touch it cause [08:32:01] ro filesystem :-D [08:32:19] (03PS8) 10Faidon Liambotis: Update font packages to not use virtual packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/88441 (owner: 10Reedy) [08:32:27] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Update font packages to not use virtual packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/88441 (owner: 10Reedy) [08:33:45] well let's see what happens on the reboot [08:35:15] !Log rebooting neon, ext3 errors on root filesystem [08:35:36] Logged the message, Master [08:36:37] yep it's checking *whew* [08:40:02] I wouldn't "whew" [08:40:04] its memory is bad [08:40:09] I've said so two months ago [08:40:15] there's an RT about it [08:40:19] it's going to keep corrupting [08:41:16] I'd rather have it recover now than die a horrible death though [08:42:47] had a bunch of these on reboot: [08:42:49] http://pastebin.com/4ptJMpN0 [08:43:11] and a bunch of [08:43:13] [ 134.306849] md/raid1:md0: redirecting sector 2047336 to other mirror: sda1 [08:43:14] ah, could be a disk failure then [08:43:15] for various sectors [08:43:26] I'll put it all into rt [08:43:52] first let's make sure that everything's running over there [08:50:43] (03CR) 10Akosiaris: "To be clear, parsoid is not a system group (per definition, system group => uid >= 1000, whereas parsoid gid == 1002 )." [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [08:56:33] (03PS1) 10ArielGlenn: add empty pmtpa lucene frontends and indexers [operations/puppet] - 10https://gerrit.wikimedia.org/r/93937 [08:58:10] Aaron|home: hey [08:59:19] * Aaron|home lurks at 1AM [08:59:19] sorry, I wasn't going to ask you anything [08:59:19] (03CR) 10ArielGlenn: [C: 032] add empty pmtpa lucene frontends and indexers [operations/puppet] - 10https://gerrit.wikimedia.org/r/93937 (owner: 10ArielGlenn) [08:59:19] you just asked me a question while I was asleep and wanted to answer :) [08:59:19] but it can wait until (your) tomorrow, sure :) [09:04:59] gwicke_away: how's ganglia now? (well.. whenever you're around again) [09:07:15] PROBLEM - search indices - check lucene status page on search1003 is CRITICAL: Connection timed out [09:08:15] RECOVERY - search indices - check lucene status page on search1003 is OK: HTTP OK: HTTP/1.1 200 OK - 269 bytes in 0.002 second response time [09:10:55] (03PS2) 10Akosiaris: sudo -u parsoid access for parsoid admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [09:15:05] (03PS1) 10Hashar: beta: auto update Parsoid dependencies using npm [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 [09:20:47] (03CR) 10Hashar: "Roan, Gabriel, that change would make the Jenkins job updating the code to use 'npm install' for Parsoid, thus automatically updating the " [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 (owner: 10Hashar) [09:21:15] (03CR) 10ArielGlenn: "not sure if 303 is the way to go, see comment on bug" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [09:31:11] apergos: may you blindly merge in a beta change for me please? https://gerrit.wikimedia.org/r/#/c/93939/ :D [09:31:15] apergos: tested already [09:31:38] no, but I'll look at in a little while (doing other reviews right now) [09:31:45] thx :) [09:31:49] * apergos not interested in blind gerrit dates :-P [09:32:00] ahah [09:32:10] I will double check my change meanwhile [09:44:25] (03CR) 10Akosiaris: [C: 032] sudo -u parsoid access for parsoid admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [09:50:15] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:25] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:35] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:54:05] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [09:54:15] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [09:54:25] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [09:58:43] (03PS2) 10ArielGlenn: toy scripts playing with long-range compression [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63139 (owner: 10Twotwotwo) [10:03:30] (03CR) 10ArielGlenn: [C: 032] "Thanks! If you have more stuff like this hidden away, please get it in." [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63139 (owner: 10Twotwotwo) [10:04:07] apergos: it was originally posted on the mailing list, good thing I convinced him to submit it as patch ;) [10:04:11] thanks for merging [10:05:20] paravoid: is there documention about why we use gerrit? I remember the debate to replace it, but can't find the outcome. [10:06:12] ok, before I start the next one let me look at yours hashar [10:08:27] matanya: see the history of https://www.mediawiki.org/wiki/Git/Conversion/issues and discussions linked from there, there's plenty [10:08:51] yes, and you can keep on convincing him and others to submit their tools [10:08:55] ideally they would do two things [10:09:03] 1) get themselves git accounts, ask for branches [10:09:12] Nemo_bis: i mean this: https://www.mediawiki.org/wiki/Git/Gerrit_evaluation [10:09:21] 2) put their tools in their branches and ask for merge into my branch when they like [10:09:33] matanya: yes, there's also the parent page of the page I linked etc. [10:09:52] but no result listed there [10:10:00] result? [10:10:15] I"m happy to have copies of their stuff in with the production dump etc stuff, but if they have their own branches they can commit at will, which is a plus [10:11:31] Nemo_bis: the tool of choice after the discussion [10:11:32] matanya: ah, you mean for gerrit vs. other similar tools evaluation; there was an email [10:11:46] yes Nemo_bis [10:11:47] wow really? npm install? :-( [10:13:12] Nemo_bis: and the content of it was "gerrit" ? [10:14:53] hashar: can some sort of git deploy scheme not work for this? [10:16:39] matanya: it wasn't hard to find with the information on that page (like at 5th line)... I copied it on the page though, thanks for heads up [10:18:23] thank you [10:19:42] (03CR) 10ArielGlenn: "I'll try an email ping, would like to get her active and committing." [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63390 (owner: 10Sanja pavlovic) [10:23:01] has anyone played with forcing all traffic on mobile to HTTPS? [10:23:01] apergos: git deploy does not work on beta / labs :( [10:23:08] apergos: but yeah, eventually we will migrate to git deploy [10:23:27] ah at all? well that sucks majorly [10:24:08] ok well I hate it but I'll merge it (untested) [10:24:51] mark, paravoid IMPORTANT: has there been any auto-redirect changes that force HTTPS for mobile/zero? [10:25:10] (03CR) 10ArielGlenn: [C: 032] beta: auto update Parsoid dependencies using npm [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 (owner: 10Hashar) [10:25:15] not that I'm aware of [10:25:25] not from the ops side for sure [10:25:45] paravoid: i'm trying to test zero, and i get bounced to HTTPS - which means ALL of zero right now might be broken [10:25:50] and everyone is getting charged [10:29:10] same, not that I'm aware of - but if anywhere, it's probably done by mediawiki ;) [10:33:18] so.. etherpad upgrade time [10:42:25] !log rebooted zirconium for kernel upgrade [10:42:42] !log upgraded etherpad-lite to 1.3.0 on zirconium [10:42:45] (03PS1) 10Mark Bergsma: Setup LVS monitoring of text-lb.esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/93943 [10:42:45] Logged the message, Master [10:43:01] Logged the message, Master [10:43:57] PROBLEM - Host zirconium is DOWN: CRITICAL - Host Unreachable (208.80.154.41) [10:44:14] (03CR) 10Mark Bergsma: [C: 032] Setup LVS monitoring of text-lb.esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/93943 (owner: 10Mark Bergsma) [10:45:27] RECOVERY - Host zirconium is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [10:46:44] mark: so, shall I merge the US West -> ulsfo? [10:46:49] yes [10:46:55] what does the new epl version have in it, anything nice? [10:46:58] mark: I don't think much would have been done in core or centralauth since the large https only push / yurik [10:47:07] although i don't follow development that much [10:47:24] (03PS2) 10Faidon Liambotis: Point US West Coast states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/93752 [10:47:59] (03CR) 10Faidon Liambotis: [C: 032] Point US West Coast states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/93752 (owner: 10Faidon Liambotis) [10:48:40] (03PS2) 10Faidon Liambotis: Add Canada and its provinces/territories [operations/dns] - 10https://gerrit.wikimedia.org/r/93756 [10:49:23] mark: canada? it's 4 million more [10:49:43] ok [10:49:57] can you watch the graphs by the time US/CA wake up? [10:50:02] I can [10:50:17] observium? [10:50:25] what are the graphs you're specifically monitoring? [10:50:35] (03CR) 10Faidon Liambotis: [C: 032] Add Canada and its provinces/territories [operations/dns] - 10https://gerrit.wikimedia.org/r/93756 (owner: 10Faidon Liambotis) [10:50:44] yeah, that tinet link [10:50:57] it shouldn't go above 60% or so [10:52:32] 529. xe-0/0/3.98 [10:52:33] Transit: cr1-ulsfo [10:52:43] right? [10:53:20] no, xe-0/0/3 [10:53:55] well yeah, the other one was the vlan [10:54:21] i know, that doesn't include the other vlan [10:54:32] right, I'm just looking at that [10:54:42] ok [11:03:51] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.192) [11:04:12] hey [11:05:48] (03CR) 10ArielGlenn: [C: 031] "why inherit (from jobrunner::base) when you could include? is there inheritance you are using?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77034 (owner: 10Hashar) [11:11:02] p858snake|l: when was that push? [11:11:35] when we went https only months ago? [11:11:39] have a look at the blog [11:12:47] akosiaris1: you perhaps don't even know you tweet, but you got a RT ;) https://twitter.com/EtherpadOrg/status/398042986308980736 [11:12:52] p858snake|l: when was that push? [11:13:01] sorry, dup [11:13:33] i tried the local carrier - they are fine [11:13:46] so it must be just non-carriers that get bounced somehow [11:15:20] NEW: Localisation updates from http://translatewiki.net. ohhhh that's nice! [11:15:32] yurik: https://blog.wikimedia.org/2013/09/10/https-by-default-beta-program/ [11:27:31] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 95.62 ms [11:27:45] Nemo_bis: I also identi.ca don't I ? [11:28:08] akosiaris1: I doubt it, that was killed [11:28:13] thank god :-) [11:28:19] or whatever... [11:28:44] seems like we are a good user of etherpad foundation :-) [11:28:55] you can find its swan scream (?) here https://identi.ca/notice/101377209 [11:29:01] yes, they mention WMF in their homepage [11:29:24] yeah but they used to only mention the labs instance [11:29:33] let's see... did they update that info? [11:39:21] PROBLEM - LVS HTTP IPv4 on wikidata-lb.esams.wikimedia.org is CRITICAL: Connection timed out [11:39:31] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: Connection timed out [11:40:59] mark & paravoid, & p858snake|l - thanks for your help - somehow on my browser i had forceHTTPS cookie set. Removing it fixed the issue, but i still don't know how i could have received it [11:44:12] (03PS1) 10Mark Bergsma: Add AAAA records & reverse DNS for the esams text Varnish servers [operations/dns] - 10https://gerrit.wikimedia.org/r/93945 [11:45:10] (03CR) 10Mark Bergsma: [C: 032] Add AAAA records & reverse DNS for the esams text Varnish servers [operations/dns] - 10https://gerrit.wikimedia.org/r/93945 (owner: 10Mark Bergsma) [11:46:44] akosiaris1: I think their info is more generic than that [11:51:22] dont appear to be able to get to http://www.wikidata.org/wiki/ :/ [11:52:48] looks like something somewhere is wrong for half or the world or so :P [12:00:45] huh [12:00:49] mark: ^ [12:00:53] ak^ [12:00:56] what? [12:01:02] can't access wikidata [12:01:07] 11:39 <+icinga-wm> PROBLEM - LVS HTTP IPv4 on wikidata-lb.esams.wikimedia.org is CRITICAL: Connection timed out [12:01:12] :< [12:01:14] from europe [12:01:20] seems to work on my proxy, via us [12:01:40] what IP do you see wikidata being resolved to? [12:01:43] can someone help? [12:01:56] 208.80.152.218 [12:02:04] www.wikidata.org [12:02:08] works for :208.80.154.242 [12:02:08] with the www [12:02:19] PING wikidata-lb.esams.wikimedia.org (91.198.174.237): 56 data bytes [12:02:22] 91.198.174.237 [12:02:32] and that does not work for you? [12:02:39] nope [12:02:41] nope [12:02:53] ok, because it works for me [12:02:54] from the us PING wikidata-lb.eqiad.wikimedia.org (208.80.154.242) 56(84) bytes of data. [12:02:54] looking [12:03:02] i can ping, but not access wikidata [= [12:03:04] esams does not work [12:03:29] oh, it works for me because of ipv6, heh [12:03:33] oh [12:03:51] looking [12:03:54] k [12:05:04] huh [12:06:05] is https://gerrit.wikimedia.org/r/#/c/93945/ the one to blame? [12:06:07] paravoid: http only also :P [12:06:12] I know [12:06:14] this is weird [12:07:14] RECOVERY - LVS HTTP IPv4 on wikidata-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 301 Moved Permanently - 719 bytes in 0.192 second response time [12:07:18] ok, I switched traffic to the other LVS box [12:07:28] working again [= [12:07:55] !log stopping pybal on amslvs1, wikidata-lb:80 issues, pending further investigation [12:07:56] of course it just redirects us straight to https :P [12:08:10] Logged the message, Master [12:11:45] paravoid: same thing happened to text-lb.esams.wikimedia.org ipv4 http lvs at the same time (not sure if you want to do the same for that too as it still times out (not sure where it is used though..)) [12:11:54] nowhere [12:11:55] it's new [12:12:17] [= [12:18:42] !log started pybal on amslvs1 again; seems to work again, root cause unknown [12:18:47] oh well... [12:18:55] Logged the message, Master [12:18:57] huh [12:19:07] hah! [12:40:46] (03PS2) 10ArielGlenn: decommision db3[29] db4[2-6] db5[1235689] [operations/puppet] - 10https://gerrit.wikimedia.org/r/93052 (owner: 10Springle) [12:41:17] (03PS1) 10Yurik: Made netmapper updates more frequent [operations/puppet] - 10https://gerrit.wikimedia.org/r/93946 [12:42:03] bblack ^ [12:52:14] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: NRPE: Call to popen() failed [12:52:14] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:52:34] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:05] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [12:54:15] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:54:24] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [13:53:29] akosiaris: can you please merge: https://gerrit.wikimedia.org/r/#/c/93798/ [13:54:36] dr0ptp4kt: can you please comment on https://gerrit.wikimedia.org/r/#/c/92288/ [13:55:43] matanya: is the queue drained indeed ? [13:55:55] yes, confimed by gwicke_away [13:56:04] see: https://gerrit.wikimedia.org/r/#/c/93800/ [13:57:03] (03CR) 10Akosiaris: [C: 032] jobrunner: remove legacy parsoid jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93798 (owner: 10Matanya) [13:57:51] done [13:58:08]