[00:00:59] (PS1) Reedy: WIP: Change wikiversions to use json [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114108 [00:01:11] That's my TODO [00:01:39] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 5.976 second response time [00:03:59] ^demon|away: the init script is gone, what happened [00:04:30] !log restarting gitblit on antimony [00:04:37] Logged the message, Master [00:04:55] ok, i see, init script info was outdated on https://wikitech.wikimedia.org/wiki/Gitblit [00:07:12] ok, it's back [00:07:39] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 152215 bytes in 6.430 second response time [00:08:10] and .. be back later .. [00:58:19] PROBLEM - Varnish HTTP text-backend on cp1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:58:29] PROBLEM - Varnish HTCP daemon on cp1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:58:39] PROBLEM - Varnish traffic logger on cp1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:09] RECOVERY - Varnish HTTP text-backend on cp1053 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.510 second response time [01:06:19] RECOVERY - Varnish HTCP daemon on cp1053 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [01:06:29] RECOVERY - Varnish traffic logger on cp1053 is OK: PROCS OK: 2 processes with command name varnishncsa [01:10:03] (PS2) Jforrester: Enable VisualEditor for legalteamwiki by default [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112717 [01:10:05] (PS2) Jforrester: Initial setup for legalteamwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112850 (owner: TTO) [01:10:42] (CR) Jforrester: "PS2 is a rebase." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112850 (owner: TTO) [01:10:57] <^d> I wonder if nsawiki uses VisualEditor. [01:11:26] ^d: I doubt it. [01:11:39] ^d: ISTR they were using FCKEditor. [01:11:50] <^d> Cirrus? [01:11:52] ^d: Which is OK if most people are just writing basic content. [01:11:53] <^d> Or lsearchd? [01:11:55] <^d> :) [01:12:08] (I duped their system set-up for ours.) [01:34:44] James_F: interesting, they are using CKEditor with Parsoid? [01:35:13] gwicke: No, directly. [01:35:23] gwicke: This was a few years back; Parsoid didn't exist. [01:36:03] hmm, they must have allowed tags then [01:36:15] or done some ad-hoc serialization [02:03:41] !log LocalisationUpdate completed (1.23wmf14) at 2014-02-19 02:03:41+00:00 [02:03:55] Logged the message, Master [02:16:59] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [02:27:49] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-19 02:27:49+00:00 [02:27:57] Logged the message, Master [02:55:36] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-19 02:55:36+00:00 [02:55:45] Logged the message, Master [03:02:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:22:49] morebots, are you any better over here? [03:22:50] I am a logbot running on tools-exec-02. [03:22:50] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [03:22:50] To log a message, type !log <msg>. [03:23:03] !log testing the log by logging a test [03:23:09] dammit [03:23:11] Logged the message, Master [03:23:20] ah!
[03:43:09] (PS4) MZMcBride: Gzip SVGs on front & back upload varnishes [operations/puppet] - https://gerrit.wikimedia.org/r/108484 (owner: Ori.livneh) [04:01:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [04:25:40] (PS2) Dzahn: remove db9 from dsh and dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/113082 [04:27:47] (CR) Springle: [C: 2] remove db9 from dsh and dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/113082 (owner: Dzahn) [04:56:52] (CR) Chad: [C: 2] Turn on checkDelay for a cirrus job [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113954 (owner: Manybubbles) [04:57:01] (Merged) jenkins-bot: Turn on checkDelay for a cirrus job [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113954 (owner: Manybubbles) [04:58:41] !log demon synchronized wmf-config/CirrusSearch-common.php 'Turn on checkDelay for cirrus links update secondary jobs' [04:58:50] Logged the message, Master [05:11:52] (Abandoned) Chad: Hold fluorine logs for 90 days instead of 180 days [operations/puppet] - https://gerrit.wikimedia.org/r/111127 (owner: Chad) [05:17:05] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [05:58:55] (PS1) Andrew Bogott: Handle role::labs::instance properly in pmtpa. [operations/puppet] - https://gerrit.wikimedia.org/r/114127 [06:01:38] !log flowdb schema changes gerrit 111671 [06:01:46] Logged the message, Master [06:07:16] (PS2) Andrew Bogott: Handle role::labs::instance properly in pmtpa. [operations/puppet] - https://gerrit.wikimedia.org/r/114127 [06:09:07] (CR) Andrew Bogott: [C: 2] Handle role::labs::instance properly in pmtpa. [operations/puppet] - https://gerrit.wikimedia.org/r/114127 (owner: Andrew Bogott) [06:10:05] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [06:11:15] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 36.02 ms [06:35:25] PROBLEM - DPKG on cp1054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:38:45] RECOVERY - Varnish HTCP daemon on cp1054 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [06:38:45] RECOVERY - Varnish traffic logger on cp1054 is OK: PROCS OK: 2 processes with command name varnishncsa [06:39:25] RECOVERY - DPKG on cp1054 is OK: All packages OK [06:39:45] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.002 second response time [06:42:35] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [06:42:59] ] [06:43:05] that's me [06:43:28] if you're not careful i'll corner you with some geoip patches :P [06:43:35] :P [06:46:26] honestly, I don't yet even understand why that's being done in varnish. I guess something to do with geoip-varied content? [06:50:13] bblack: VCL is the new JavaScript [06:50:21] hah [06:50:49] < bblack> that's me <--- the 5xx spike? [06:51:17] * jeremyb points to https://gdash.wikimedia.org/dashboards/reqerror/ for the more visual version [06:51:26] jeremyb: definitely the cp1054 stuff above it, which I assume is the cause of the spike...
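[editor's note] The "inline C is 'C{' away" quip a few messages above refers to a real feature of the Varnish of this era: VCL lets you embed raw C between C{ ... }C markers, which is part of why "VCL is the new JavaScript". A minimal sketch, assuming Varnish 3-style VCL; the syslog call is purely illustrative, not anything that was deployed.

    # VCL with an embedded C block; the C is compiled straight into
    # the generated VCL code.
    C{
        #include <syslog.h>
    }C

    sub vcl_deliver {
        C{
            /* runs on every delivery -- illustrative only */
            syslog(LOG_INFO, "vcl_deliver hit");
        }C
    }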
[06:51:44] well i got a 503 from frontend 1065 [06:51:57] basically web applications have gotten so amazingly complicated that it's hard to do anything without making the walls fall down on your head [06:52:12] varnish is just too tempting [06:52:33] inline C is 'C{' away [06:52:40] hmmm I haven't gotten to 1065 in my list yet [06:52:56] was 8 mins ago [06:53:00] I'm rolling through them upgrading the varnish package, sometimes there are complications, but it's rare [06:53:07] !log s5 slaves online reindexing wikidatawiki wb_terms [06:53:14] Logged the message, Master [06:53:17] got it again from 1065 [06:53:58] now i got 1054 [06:54:12] anyway, i can't imagine it's not also affecting other people besides me [06:54:12] maybe it's unrelated to what I'm doing [06:54:36] hmmm [06:54:48] at the varnish level, I've upgraded 1054 (which had some complications along the way, above), but I haven't touched 1065 yet. both look basically sane [06:54:50] 1052 [06:55:04] (PS4) Matanya: remove shell access and key for mgrover [operations/puppet] - https://gerrit.wikimedia.org/r/113636 [06:55:11] 1068 [06:56:12] can anyone replicate this? [06:56:21] what are you doing to replicate it? [06:56:30] loading https://en.wikipedia.org/wiki/Abraham_Lincoln while logged in [06:57:32] hmmmm [06:57:59] 1054 again [06:58:31] it's not broke every time. but maybe more than half [06:58:35] well, what's reported in the 5xx page is the frontend, but it would be more interesting to know which backend [06:58:45] right [06:59:02] i can give XID... [06:59:26] X-Cache header doesn't list the backend. only frontend. [07:00:07] yeah, I got it too [07:00:08] but if i'm logged in do i hit a backend for that? or i go straight to app servers? [07:00:18] well [07:00:39] I'm reproducing the same thing on the same article logged in (https) [07:00:45] what's odd is I always see this: [07:00:45] Forwarded for: 127.0.0.1, 69.203.7.188, 172.8.229.29, 208.80.154.75 [07:00:58] the 172.x is me, and the 208.x is ssl1005 [07:01:22] but I don't have a proxy on localhost, and the 69.x is some Comcast customer IP, which isn't even my ISP? [07:01:55] Who is Brandon Black? [07:01:57] Is he new? [07:02:09] Bsadowski1: on the 1 year scale, yes :) [07:02:15] Ah :) [07:02:16] uhhh, so 127.0.0.1 could be localssl? [07:02:17] Welcome! [07:02:25] [07:02:31] thanks [07:02:45] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: Connection refused [07:02:48] yeah but I'm on AT&T, why is a comcast customer IP in an XFF header for me? [07:03:20] bblack: so maybe the frontends' healthchecks on the backends are failing? [07:03:56] ori: varnishadm debug.health [07:05:12] I'm going to start downgrading upgraded hosts and see if 5xx goes away, for now [07:07:05] also, varnishlog -i Backend_health | ts | fgrep -e 'Backend_health' [07:08:30] bblack: so, 1) this happened with a range of hosts not just upgraded ones. 2) maybe this is the sort of thing that will recover as cache fills in after your restarts but doing more restarting will prolong the problem? [07:08:46] * jeremyb is too sleepy to usefully debug. good luck [07:09:08] jeremyb: well, a range of mixed (upgraded -vs- not) frontends, but they could be hashing only to upgraded backends when hitting 5xx [07:09:27] oh, i didn't realize you had also upgraded backends [07:09:41] well, backend varnish I mean, there's two layers of varnish. not backend as in apps [07:09:49] yes, i understand :) [07:10:27] (there can be even more than 2 layers AIUI. e.g.
ams/sfo -> frontend -> backend) [07:10:42] well, when I upgrade a machine it upgrades both the front+back on that host, I mean [07:10:52] oh, riiiight [07:11:24] hmmm [07:11:40] I just went to check 5xx and it seems to have dropped back off. I didn't even get halfway through the downgrading process... [07:13:42] I'm just going to finish downgrading and then look at this again in the morning, safest option for now [07:13:49] the timing can't be coincidental [07:14:41] * jeremyb wonders if there's a handbook to introduce new people to all the kinds of outages. e.g. michael jackson/barack obama [07:14:46] :D [07:15:07] was i right about ~1 year? [07:15:24] yeah, I started on Apr 1 last year [07:15:46] they had to put a sentence in my offer letter that it wasn't an april fool's joke lol :) [07:16:38] uhhh, you've seen our jokes? [07:17:02] michael jackson/barack obama. essentially cache stampede: https://wikitech.wikimedia.org/wiki/PoolCounter [07:17:06] back on topic, though: this upgrade is a version that's been running on ulsfo a while. I wouldn't have expected to run into any major problems [07:17:27] huh [07:17:41] i guess it's the only version ulsfo ever ran? [07:17:44] it could be something subtle, or just a one-off glitch with one misbehaving backend shortly after upgrade that hashed some popular articles + Abraham_Lincoln? [07:17:49] right [07:18:21] (well, there's been one small patch since ulsfo's version, but that only affects mobile) [07:18:56] but anyways, whatever it is, it's too subtle for 1:18AM here, so I'll just sort it out in detail tomorrow and leave them back on the versions they had earlier for tonight [07:19:12] https://en.wikipedia.org/wiki/Wikipedia:Today%27s_featured_article/April_1,_2007 is pretty good [07:19:14] yeah, wish i could offer some insight [07:19:24] https://en.wikipedia.org/w/index.php?title=Wikipedia:Today%27s_featured_article/April_1,_2007&diff=119303193&oldid=119292660 [07:19:51] good night moon [07:20:09] bye jeremyb [07:20:15] * ori heads off too [07:20:55] matanya: https://gerrit.wikimedia.org/r/#/c/108484/ reminds me, RT: in footer *does* work, better amend those multiple changes with a ticket in common to ease search [07:21:38] Nemo_bis: did i say it doesn't? see my commit above [07:25:27] yes you did [07:26:55] hmm, if you say so :) [07:33:11] ori: is the package wikimedia-job-runner installed on any job runner? i would like to remove the absent statement [08:05:44] akosiaris: morning. want to have a deal? [08:08:44] (CR) Gilles: "Who can +2 this?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112452 (owner: Gilles) [08:08:57] good morning. a deal ? what kind of a deal ? [08:13:37] (PS1) KartikMistry: Remove $wgULSNoWebfontsSelectors [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114131 [08:18:05] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [08:20:29] akosiaris: you donate 5 minutes to glance at https://etherpad.wikimedia.org/p/absents and i'll push patches to all un-needed statements, works for you? :) [08:23:35] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [08:24:05] RECOVERY - Disk space on virt6 is OK: DISK OK [08:24:28] matanya: it wont take 5 minutes for sure but I will give it a try [08:25:21] thanks, i only asked for 5 minutes, since not all of the stuff there is ops related. and i don't want to waste too much of your time [08:29:25] meh...
those mailman archives take forever to be regenerated... [08:30:21] (CR) Nikerabbit: [C: 1] Remove $wgULSNoWebfontsSelectors [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114131 (owner: KartikMistry) [08:32:02] so cool: http://people.debian.org/~stapelberg//2014/02/19/SSH-keys-in-your-TPM.html [08:41:33] So, I need help to get https://gerrit.wikimedia.org/r/114131 deployed. [08:41:58] Nikerabbit suggested to me that it has to be deployed first and can then be merged. [08:45:38] Sounds unlikely [08:46:48] https://wikitech.wikimedia.org/wiki/How_to_do_a_configuration_change [08:47:29] Last edit is by him one week ago :) not so unreasonably outdated hopefully [08:51:28] kart_: review first, then it should be merged only if someone is going to deploy it right after [08:53:29] Nikerabbit: ah. Thanks for clarification! [08:55:28] Nemo_bis: :) [09:06:57] It would be nice if we kept discussions for changes to somewhere other than CR notes… Since they aren't searchable [09:15:02] (PS1) Matanya: jenkins: remove accesslog rotation absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114135 [09:20:40] (PS1) Matanya: standard-packages: remove absent statements of unused packages [operations/puppet] - https://gerrit.wikimedia.org/r/114136 [09:28:22] (PS1) Matanya: parsoid: remove init script absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114137 [09:52:34] (PS27) Alexandros Kosiaris: site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [09:53:11] (CR) jenkins-bot: [V: -1] site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [09:54:27] (CR) Hashar: [C: -1] "Please keep the comment around for later reference :]" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/114135 (owner: Matanya) [09:56:51] hashar_: i'll rephrase it as FIXME, is this ok? [09:57:30] matanya: whatever works for you, as long as you keep the comment :-] [09:57:38] thanks :) [09:58:00] and yeah, one day I will look at Jenkins init script and logrotation [09:58:22] hashar_: btw, why SIGALRM ? [09:58:27] (PS28) Alexandros Kosiaris: site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [09:58:50] let's try and diff the freaking catalogs now... [09:59:58] (PS2) Matanya: jenkins: remove accesslog rotation absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114135 [10:20:16] akosiaris: Jenkins responds to SIGALRM for log rotation apparently [10:20:57] the jenkins source tree has material for both rpm and deb packages [10:21:04] and apparently the init scripts / logrotate files are not shared [10:21:16] so they might fix a bug in the rpm logrotate while leaving the issue in the deb logrotate :D [10:21:26] example of bug fixed for the rpm logrotate : https://github.com/jenkinsci/jenkins/commit/8fb2f5162e499b5c0c4dc8f88af761b7c490bc38 [10:22:26] hmmm ok. But your comment says that SIGALRM kills jenkins, so it does not work as expected, right ? [10:50:45] akosiaris: yup [10:51:07] akosiaris: I think SIGALRM is sent to start-stop-daemon instead of jenkins and that kills it [10:51:12] akosiaris: got to investigate that in labs [10:51:23] akosiaris: and that is at the bottom of my pile :-] [10:52:38] hashar: I kind of doubt start-stop-daemon is even there to receive the signal in the first place.
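[editor's note] The Jenkins logrotate thread above hinges on basic signal semantics: SIGALRM's default disposition is to terminate the receiving process. Jenkins installs its own handler that reopens the access log, so if the logrotate postrotate hook delivers the signal to the wrong process (e.g. a wrapper, or whatever a stale pidfile points at), that process simply dies. A minimal C sketch of the difference; the log-reopen body is a placeholder, not Jenkins's actual handler.

    /* sigalrm_demo.c: without the signal() line, `kill -ALRM <pid>`
     * terminates this process -- SIGALRM's default action is "Term". */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t reopen = 0;

    static void on_alarm(int signo)
    {
        (void)signo;
        reopen = 1;              /* only async-signal-safe work here */
    }

    int main(void)
    {
        signal(SIGALRM, on_alarm);   /* comment out to see the default */
        for (;;) {
            pause();                 /* sleep until a signal arrives */
            if (reopen) {
                reopen = 0;
                fprintf(stderr, "would reopen access log here\n");
            }
        }
    }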
[10:52:51] anyway, if you need help investigating this, let me know [10:56:05] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:10] ^ me [10:57:00] akosiaris: thanks :) [11:00:45] RECOVERY - Host virt1001 is UP: PING WARNING - Packet loss = 93%, RTA = 0.25 ms [11:01:07] matanya: do you see whitespace at https://gerrit.wikimedia.org/r/#/c/114135/2/modules/jenkins/manifests/init.pp ? :-] [11:01:24] arrg [11:01:28] * matanya is blind [11:01:28] :D [11:01:41] I got vim to highlight them in red as well [11:02:06] yes, me too. but it helps if you look [11:02:45] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:04] (PS3) Matanya: jenkins: remove accesslog rotation absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114135 [11:04:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1525.93335 [11:06:35] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:07:49] You can always enable the default pre-commit hook; that guards against committing changes with trailing whitespace. [11:10:25] (CR) Hashar: [C: 1] "Thank you. Feel free to merge at anytime." [operations/puppet] - https://gerrit.wikimedia.org/r/114135 (owner: Matanya) [11:19:05] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [11:22:16] ACKNOWLEDGEMENT - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC alexandros kosiaris decommissioned [11:22:54] "alexandros kosiaris decommissioned" [11:23:09] we need a new alex [11:26:15] finally! I was beginning to fear that would never happen :-) [11:26:45] PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (208.80.154.18) [11:26:51] ^ also me [11:26:55] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.19) [11:28:15] oooooops [11:28:25] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [11:28:42] We changed http://lists.wikimedia.org/pipermail/wikimedia-l/ to use lighttpd? [11:28:59] Can we please make it display archives by date, decreasing? [11:29:35] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:29:36] Now it's 2013-October / 2013-September / 2014-February / 2014-January [11:29:39] doesn't make sense. [11:29:44] odder: I am reconstructing the archives of that specific list [11:30:16] i'd say that is why you have that view. [11:30:53] okay [11:36:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [11:37:21] akosiaris: reconstructing the archives? tell me you're kidding please [11:38:20] or did you find a magic solution that allows not to break links? [11:39:12] Nemo_bis: a) I am not kidding, b) please tell me you are kidding [11:39:41] How do you mean, reconstructing the archives? [11:39:42] what links get broken ?
[11:39:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1118.800049 [11:39:56] /usr/lib/mailman/bin/arch wikimedia-l [11:40:11] https://wikitech.wikimedia.org/wiki/Remove_a_message_from_mailing_list_archive [11:40:19] after changing something in mbox per legal's request [11:40:42] ok that is what i did :-) [11:41:26] all links are definitely broken [11:41:29] I want to cry [11:41:49] luckily I almost never link pipermail, because it gets broken so often [11:42:19] * odder still seeing lighttpd [11:42:41] that only means you're missing the index, the messages are mostly there already [11:42:44] odder: in 2012, please hold on [11:42:52] they've all been renumbered [11:43:59] What does one have a legal department for if they can't even defend our pipermail [11:46:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [11:49:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1733.599976 [11:57:37] (CR) Alexandros Kosiaris: PostgreSQL/Postgis module (7 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/112469 (owner: Alexandros Kosiaris) [12:01:19] odder: done [12:04:16] A pipermail renumber? Didn't mutante screw that up some years ago as well? [12:04:56] yeah. Seems like I am following in his steps [12:05:37] * Nemo_bis proposes to nuke archives rebuild script from orbit [12:06:16] Well, if you have the backup, it should be amendable: Take the backup, follow the instructions in Nemo's link and regenerate the archives. That should make the links work again. [12:06:49] He said he's already following that [12:07:32] Not deleting the message is a required condition, not a sufficient condition though. It's not clear how numbers are chosen by the script, doesn't seem to be deterministic (!) [12:07:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:08:18] Just the other day mutante said so in the channel, I was lazy and I didn't document it "who can ever think of rebuilding mailman archives in 2014 anyway" [12:08:43] (Without looking at it:) Isn't it just the sequence number in the mbox file? 1, 2, 3, etc.? [12:08:54] i'd wish [12:09:40] Someone got a link to the script? [12:09:42] wikimedia-l archives started from about 60k, among other things [12:10:01] (i.e. the number of the last foundation-l post I think) [12:11:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3007.600098 [12:19:16] (CR) Nemo bis: "https://gerrit.wikimedia.org/r/#/c/114142/" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107008 (owner: Reedy) [12:21:17] Hmmm. Mailman/Archiver/pipermail.py looks pretty unrandom to me. [12:21:35] Whatever; Gmane rules. [12:26:26] Viva gmail [12:26:30] * gmane [12:26:35] * Nemo_bis slaps self [12:26:54] with a large trout ? ah...
memories [12:29:04] (PS8) Alexandros Kosiaris: PostgreSQL/Postgis module [operations/puppet] - https://gerrit.wikimedia.org/r/112469 [12:42:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:48:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1492.199951 [13:04:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:09:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1862.666626 [13:17:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:21:40] (PS1) Hashar: misc::mwlib::packages: explode list of packages [operations/puppet] - https://gerrit.wikimedia.org/r/114143 [13:26:59] (PS1) Hashar: contint: get ocaml-nox package for Math texvc [operations/puppet] - https://gerrit.wikimedia.org/r/114144 [13:27:01] (CR) Matanya: "duplicate of https://gerrit.wikimedia.org/r/#/c/111619/" [operations/puppet] - https://gerrit.wikimedia.org/r/114143 (owner: Hashar) [13:27:40] (Abandoned) Hashar: misc::mwlib::packages: explode list of packages [operations/puppet] - https://gerrit.wikimedia.org/r/114143 (owner: Hashar) [13:27:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 996.06665 [13:28:50] (CR) Hashar: mwlib: lint (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111619 (owner: Matanya) [13:31:29] hashar: i did 2 back indents, thanks [13:31:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:33:31] (PS2) Matanya: mwlib: lint [operations/puppet] - https://gerrit.wikimedia.org/r/111619 [13:35:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 163.5 [13:36:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:45:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 789.666687 [13:48:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:52:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 480.733337 [13:53:23] (PS29) Alexandros Kosiaris: site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [13:53:49] matanya: ^ this might be the last.
I closed another set of holes and am doing a final test [13:54:07] * matanya prays and crosses fingers [13:57:44] (CR) Alexandros Kosiaris: [C: -1] parsoid: remove init script absent statement (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/114137 (owner: Matanya) [13:59:50] (PS2) Matanya: parsoid: remove init script absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114137 [14:01:26] (CR) Alexandros Kosiaris: [C: 2] contint: get ocaml-nox package for Math texvc [operations/puppet] - https://gerrit.wikimedia.org/r/114144 (owner: Hashar) [14:02:28] (CR) Alexandros Kosiaris: [C: 2] parsoid: remove init script absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114137 (owner: Matanya) [14:04:31] (CR) Alexandros Kosiaris: [C: 2] jenkins: remove accesslog rotation absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114135 (owner: Matanya) [14:04:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:08:49] (PS1) coren: Make manage-nfs-volumes handle broken instances [operations/puppet] - https://gerrit.wikimedia.org/r/114146 [14:10:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1057.06665 [14:12:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:16:27] (CR) coren: [C: 2] "Trivial fix is trivial." [operations/puppet] - https://gerrit.wikimedia.org/r/114146 (owner: coren) [14:19:51] (PS1) coren: Fix the bugfix in manage-nfs-volumes [operations/puppet] - https://gerrit.wikimedia.org/r/114147 [14:21:18] (CR) coren: [C: 2] "Death to typos!" [operations/puppet] - https://gerrit.wikimedia.org/r/114147 (owner: coren) [14:22:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1143.800049 [14:23:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:27:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 690.700012 [14:30:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:32:05] (PS1) coren: Tool Labs: Install tabix on exec_environ [operations/puppet] - https://gerrit.wikimedia.org/r/114148 [14:39:00] (CR) coren: [C: 2] "Simple package addition." [operations/puppet] - https://gerrit.wikimedia.org/r/114148 (owner: coren) [14:39:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 408.166656 [14:39:50] ottomata: ping [14:40:13] hiya [14:40:19] hi [14:40:29] cp3021 is flapping all day [14:43:08] ok thanks [14:48:36] Snaps: you around? [14:50:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:58:34] (PS1) coren: Labs: Fix permissions for nfsmanager [operations/puppet] - https://gerrit.wikimedia.org/r/114151 [14:59:49] (CR) coren: [C: 2] Labs: Fix permissions for nfsmanager [operations/puppet] - https://gerrit.wikimedia.org/r/114151 (owner: coren) [15:00:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 576.099976 [15:01:03] ottomata: anything I can help with?
[15:01:07] even brainstorming :) [15:03:16] probably yeah, pretty stumped right now, hang on, in standup [15:03:20] mark and I looked at this for a bit on sunday [15:03:26] sorry, monday [15:05:38] paravoid here are details: [15:05:40] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp30(19%7C2%5B0-2%5D)&mreg%5B%5D=kafka.rdkafka.brokers.*.rtt.avg&gtype=line&glegend=show&aggregate=1 [15:05:58] rtt is the time it takes for a kafka broker to ack a message batch back to varnishkafka [15:06:00] so far [15:06:07] this only happens on messages sent to analytics1022 [15:06:09] not analytics1021 [15:06:18] and, it usually only happens during high traffic time [15:06:24] and only on one or a few of the bits varnishes [15:06:26] not all of them [15:06:31] but it could be any of them [15:06:47] mark and I couldn't detect anything wrong with network connectivity to analytics1022 [15:06:54] compared to analytics1021 [15:06:58] but they are in different rows [15:07:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:08:25] (CR) Jgreen: [C: 1] mwlib: lint [operations/puppet] - https://gerrit.wikimedia.org/r/111619 (owner: Matanya) [15:13:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1205.800049 [15:14:53] ja paravoid, it is very strange too [15:15:02] because even though it does happen during high load [15:15:21] i don't think that more traffic is the cause, i think that these brokers and producers can handle more traffic [15:15:23] on monday [15:15:30] I depooled the offending varnish nonde [15:15:30] node [15:15:39] and let its vk buffer slowly empty [15:15:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:15:53] when I depooled the node, that balanced more traffic on the remaining 3 bits varnishes [15:16:05] brokers, producers, leaders, elections, ... meh [15:16:06] giving them each more traffic than they usually serve [15:16:13] did you investigate the code to figure out what those "txerrs" mean exactly? [15:16:18] yes [15:16:50] didn't help too much, txerrs happen in varnishkafka when the buffer is full [15:17:07] which buffer exactly? the 2M one? [15:17:09] rd_kafka_produce, which is the async call to add a message to the buffer [15:17:51] yes [15:17:54] ok [15:18:21] so yeah, when that is full, rd_kafka_produce will return -1, and varnishkafka will txerr++ [15:18:24] so does the buffer get "flushed" as soon as it can [15:18:28] (so as fast as tcp allows) [15:18:35] or does it wait on other events? [15:18:57] not sure... [15:19:13] reading, but that means I have to understand rdkafka buffers and produce threads...:) [15:19:21] so on varnishkafka nodes where the buffer is NOT full (under 2 million messages) we shouldn't be seeing any txerrs, right? [15:21:19] that's right [15:21:25] is that indeed the case? [15:21:28] well, ok so, that's not 100% true [15:21:28] but in our case yes [15:21:44] i'm reading more, i think there can be txerrs that are not caused by buffer full [15:22:00] but they will output a different error code and message when that happens [15:22:05] if you find any nodes with non-full buffers that have txerrs you'll know that for sure [15:22:06] hang on, double checking something...
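[editor's note] A sketch of the produce path being described above, against the librdkafka C API of this period; the rk/rkt setup is omitted and assumed. rd_kafka_produce() only enqueues onto the local buffer (whose depth is the queue.buffering.max.messages setting, the "2M" discussed here); when that buffer is full the call returns -1 immediately, and that rejection is what varnishkafka counts as txerr.

    #include <errno.h>
    #include <librdkafka/rdkafka.h>

    /* Returns 0 if the message was queued locally, -1 if it was
     * dropped -- roughly the txerr++ case discussed above. */
    int produce_one(rd_kafka_t *rk, rd_kafka_topic_t *rkt,
                    char *buf, size_t len)
    {
        if (rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA,
                             RD_KAFKA_MSG_F_COPY,  /* librdkafka copies buf */
                             buf, len,
                             NULL, 0,              /* no key */
                             NULL) == -1) {
            /* errno is ENOBUFS when the local queue is full, i.e. the
             * broker is acking more slowly than we produce -- exactly
             * the cp3021 symptom. */
            return -1;
        }
        rd_kafka_poll(rk, 0);   /* serve delivery reports, non-blocking */
        return 0;
    }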
[15:22:08] ok [15:23:43] txerrs: [15:23:43] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp30(19%7C2%5B0-2%5D)&mreg%5B%5D=kafka.varnishkafka.*.txerr.per_second&gtype=line&glegend=show&aggregate=1 [15:23:52] ok still checking code for something, hang on [15:28:34] (trying to find where this one error message comes from and grepping is not finding it!...) [15:28:57] it might be a standard unix error message [15:29:20] man 3 errno [15:32:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 762.666687 [15:33:19] ahhh, yeah maybe so [15:33:33] ah! [15:33:36] thank you :) [15:33:36] ENOBUFS No buffer space available [15:33:52] magnus uses both errno and his own error codes and strings in some places [15:33:56] this case uses both [15:33:59] cool danke [15:34:01] ok yeah [15:34:11] so, yes is the answer to your question, double checked a bunch [15:34:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:34:54] the txerrs we are seeing are because queue.buffering.max.messages is reached [15:35:03] that's the 2M we have set right now [15:35:21] i'm trying to confirm this as well, although something doesn't look right [15:35:40] i think that outbuf_cnt in the varnishkafka stats is this queue [15:35:52] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=cp.%2B&mreg[]=kafka.rdkafka.brokers..%2B%5C.outbuf_cnt&z=large&gtype=line&title=kafka.rdkafka.brokers..%2B%5C.outbuf_cnt&aggregate=1&r=hour [15:36:03] I need to check with Snaps about that to be sure, also reading code to find out [15:36:48] but yeah, if so, you can see in this graph that cp3021 -> analytics1022 is the only queue with messages sitting around in it [15:37:02] and of course cp3021 -> analytics1022 is the only connection with a high rtt [15:37:06] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp30(19%7C2%5B0-2%5D)&mreg%5B%5D=kafka.rdkafka.brokers.*.rtt.avg&gtype=line&glegend=show&aggregate=1 [15:38:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 315.799988 [15:39:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:44:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 731.93335 [15:45:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:46:50] sorry, i'm doing other stuff in the mean time [15:46:54] just trying to give you pointers to debug :) [15:47:46] ottomata: so, you're often seeing *unix* error message ENOBUFS? [15:48:19] if that's not generated by rdkafka but is instead coming from the tcp stream [15:48:23] that might explain some things [15:48:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1021.400024 [15:49:48] I'm back [15:51:29] hm, lemme check, i think rdkafka is manually generating it [15:51:55] yeah [15:52:11] if msg_cnt + 1 > queue_buffering_max_msgs then errno = ENOBUFS; return -1; [15:52:33] ok [15:52:39] what is that queue_buffering_max_msgs set to then?
[15:52:44] that's the 2M [15:52:52] ok [15:53:01] so, we can ignore that part, we need to figure out why the flushing of that queue is so slow [15:53:09] it's slower than the producing of messages [15:53:17] and we need to figure out why [15:53:27] yeah [15:53:28] so why can't rdkafka send fast enough [15:53:32] is it blocked on tcp? [15:53:45] or does it wait on something arbitrary? [15:54:06] seems unlikely, mainly since the bits varnishes should be handling the same amount of traffic, and should all be networked identically [15:54:20] well [15:54:22] the fact that this only happens for traffic bound to an22 is very suspicious [15:54:28] we need to figure out what exactly is going on anyway [15:54:35] yes [15:54:37] they SHOULD be identical but they aren't [15:54:49] can be anything, varnish server differences, network differences, broker differences [15:55:08] so figuring out where exactly the wait/block is is crucial [15:55:17] might simply be tcp [15:55:30] "simply", diagnosing tcp behaviour can be quite tricky too ;) [15:55:49] might just be the broker ack's being too slow, and then we need to figure out why that is [15:57:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:58:33] strace might be helpful here [16:03:19] (CR) Alexandros Kosiaris: [C: 2] "After too much deliberation, catalog compilations and diffs, merging this finally. Keep an eye for errors" [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [16:05:04] (PS1) Andrew Bogott: Remove autofs references for new, puppetized mounts in eqiad. [operations/puppet] - https://gerrit.wikimedia.org/r/114155 [16:05:18] yeah mark i'm stracing now [16:05:23] had to find the threads that were actually producing [16:06:03] (CR) Andrew Bogott: [C: 2] Remove autofs references for new, puppetized mounts in eqiad. [operations/puppet] - https://gerrit.wikimedia.org/r/114155 (owner: Andrew Bogott) [16:06:05] i don't really see much difference, kinda hard to say though [16:06:32] maybe a lot more EAGAINs on recvmsg calls on the thread that is sending to an22 vs the thread that is sending to an21, buuut, maybe not [16:06:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 733.266663 [16:06:47] that's strace -c ? [16:06:52] scfc_de: I wonder if the renumbering failure may have something to do with https://bugs.launchpad.net/mailman/+bug/265829/comments/1 [16:06:55] -p [16:07:07] ah will do -c [16:08:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:09:13] Nemo_bis: *Is* there a renumbering failure (if nobody messes with the source mbox)? [16:09:28] mark: https://gist.github.com/ottomata/9095219 [16:10:04] scfc_de: define mess with? [16:10:13] i guess more calls for an22 because it has more in the queue to send and wait for responses [16:10:29] yes [16:10:33] whoa, but only one sendmsg to an21 in about 20 secs? [16:10:38] what are those recvmsg errors, EAGAIN? [16:10:47] yes that does seem suspicious [16:10:57] yeah, but those might be normal, i do see those on an21 thread as well [16:11:00] the EAGAINs [16:12:13] on recvmsg that probably just means "no new data yet" on a nonblocking socket [16:12:24] Nemo_bis: Not replace the message body with "This message has been deleted", but delete the whole message, shuffle the messages around, etc.
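[editor's note] On the EAGAIN question in the [16:10]-[16:12] exchange: for a socket opened non-blocking, recvmsg()/recv() returning -1 with errno EAGAIN (or EWOULDBLOCK) means only "no data available yet", so a burst of EAGAINs in strace is the poll loop spinning, not a fault. A minimal C sketch of the idiom, assuming fd was set O_NONBLOCK.

    #include <errno.h>
    #include <sys/socket.h>

    /* Returns bytes read (>0), 0 on orderly peer close, -1 on a real
     * error, and -2 for "nothing yet, go back to poll()". */
    ssize_t try_recv(int fd, void *buf, size_t len)
    {
        ssize_t n = recv(fd, buf, len, 0);
        if (n >= 0)
            return n;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return -2;           /* not an error on a non-blocking fd */
        return -1;
    }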
[16:12:33] hm, i'm running with -C now so I can also see output, the an21 thread looks much different than it did a few minutes ago [16:12:39] now it just mainly has polls with 1sec timeouts happening [16:12:40] but there are polls also [16:12:43] hardly any real activity [16:13:02] scfc_de: akosiaris said he didn't mess with the mbox then [16:13:13] gonna see what this looks like on another box [16:14:24] Nemo_bis: Didn't he say he deleted a message? [16:15:47] scfc_de: not AFAICS [16:17:08] (PS1) Chad: Give wikiversities Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114156 [16:18:46] ok updated https://gist.github.com/ottomata/9095219 [16:19:07] ah i'm going to sort those calls by alpha rather than count [16:19:49] well, just make them consistent at least [16:20:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 628.93335 [16:21:03] Nemo_bis: I just grepped all archives for 2012/wikimedia-l (http://lists.wikimedia.org/pipermail/wikimedia-l/) and none of them has the word "deleted" in them that seems to have been redacted. So I think akosiaris deleted the message itself. akosiaris, can you solve that mystery, please? [16:21:05] Nemo_bis: scfc_de: no. I changed 10-15 characters in a message to X [16:21:17] oh hm, mark, maybe that does make sense, since max buffer queue messages I think is global to all brokers [16:21:17] if it is already full [16:21:22] i specifically avoided deleting a message [16:21:22] with messages all bound for analytics1022 [16:21:29] new messages are going to come in and just be dropped [16:21:37] so, there won't be any sends to an21 [16:21:40] since all the msgs are dropped [16:21:42] anyway [16:21:47] all the sends will be whatever is in the queue [16:21:48] akosiaris: a) Perfect, thanks. b) Then I don't know why they have been renumbered. [16:22:02] which is all bound to an22 [16:22:12] (Though I don't have old links at hand that used to work to test this.) [16:22:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:22:54] scfc_de: I will restore the archive from backup and we will be ok. Finishing something else first though [16:23:26] akosiaris: fantastic, thanks [16:24:32] (PS1) Jgreen: work around for bayes_999 snafu until SpamAssassin fixes their nightly rule updates [operations/puppet] - https://gerrit.wikimedia.org/r/114158 [16:24:49] LOL " This line has "-" instead of white space, and the seconds field has added hundredths of a second" https://mail.python.org/pipermail/mailman-users/2007-September/058248.html [16:27:50] (CR) Jgreen: [C: 2 V: 1] work around for bayes_999 snafu until SpamAssassin fixes their nightly rule updates [operations/puppet] - https://gerrit.wikimedia.org/r/114158 (owner: Jgreen) [16:33:25] hey [16:33:40] (PS2) Chad: Give wikiversities and itwikiquote Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114156 [16:34:58] ottomata/Mark: I think we need to diagnose TCP for this issue. [16:36:48] ottomata: re outbufs: outbufs may (will) contain multiple messages. up to 252 of them per outbuf. [16:39:35] ottomata, mark: can we use tcpdump + tcptrace perhaps?
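[editor's note] On the interleaved mailman thread above (nothing deleted, yet every article renumbered — the resolution comes at [16:41] below): the likely mechanism is the mbox format itself. An mbox reader treats any line beginning with "From " at column 0 as the start of a new message, and pipermail numbers articles by position in that stream, so a single unescaped "From " inside a message body (mailman normally stores such lines as ">From ") splits one message in two and shifts every later article number. A toy C counter demonstrating the format rule; mailman's real code is Python, this is only an illustration.

    /* Count mbox "articles" the way a naive splitter sees them:
     *     ./a.out < wikimedia-l.mbox
     * One stray unescaped "From " line in a body inflates the count
     * and renumbers every article after it. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[8192];
        long articles = 0;

        while (fgets(line, sizeof line, stdin))
            if (strncmp(line, "From ", 5) == 0)
                articles++;

        printf("%ld articles\n", articles);
        return 0;
    }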
[16:40:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 856.633362 [16:41:08] akosiaris, scfc_de, I'm now rather sure the issue is https://mail.python.org/pipermail/mailman-users/2013-June/075304.html [16:41:56] We even have a bug, filed by Gloria, about unescaped From: breaking the archives. It was "fixed" by the last archive rebuild, but I don't think the underlying mbox was actually repaired. [16:44:41] https://wikitech.wikimedia.org/w/index.php?title=Remove_a_message_from_mailing_list_archive&diff=99904&oldid=99885 [16:45:09] (CR) MarkTraceur: "People with deploy rights...because we need to sync it right after we merge, IIRC." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112452 (owner: Gilles) [16:45:22] It seems all archive havoc happens in February or March https://wikitech.wikimedia.org/w/index.php?title=Remove_a_message_from_mailing_list_archive&action=history [16:48:57] Snaps: sorry, was afk for a sec [16:52:37] (PS1) coren: Revert "Handle role::labs::instance properly in pmtpa." [operations/puppet] - https://gerrit.wikimedia.org/r/114160 [16:53:06] (PS2) coren: Revert "Handle role::labs::instance properly in pmtpa." [operations/puppet] - https://gerrit.wikimedia.org/r/114160 [16:53:47] (CR) coren: [C: 2] "That change was done in LDAP instead." [operations/puppet] - https://gerrit.wikimedia.org/r/114160 (owner: coren) [16:57:36] Snaps: I'm looking into tcptrace, lemme know if there is anything in particular you want me to look at [16:58:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:59:12] (Abandoned) Hashar: Add dependencies for Wikimetrics to Jenkins to enable CI. [operations/puppet] - https://gerrit.wikimedia.org/r/90684 (owner: Diederik) [17:01:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 918.266663
status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1335: active_shards: 3932: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 2 [17:13:26] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1340: active_shards: 3947: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [17:14:11] (CR) OliverKeyes: [C: 1] "Looks good; Opsen will strongly approve of standardising, I suspect ;p." [operations/puppet] - https://gerrit.wikimedia.org/r/111619 (owner: Matanya) [17:14:35] PROBLEM - DPKG on helium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:15:13] Snaps: https://gist.github.com/ottomata/9096667 [17:15:30] <^d> !log creating elasticsearch indexes for all wikiversities, may see some intermittent icinga spam as shards rebalance [17:15:35] RECOVERY - DPKG on helium is OK: All packages OK [17:15:38] Logged the message, Master [17:16:40] whoa interesting, Snaps, mark [17:17:04] my tcpdump on cp3021 did capture fewer packets, but still [17:17:29] packets sent from an22 -> cp3021 are proportionally way less than packets sent from an22 -> cp3019 [17:17:37] (cp3021 is the current host with problems) [17:18:00] both tcpdumps ran for about the same amount of time [17:18:14] cp3021 -> an22 and cp3019 -> an22 saw about the same number of packets [17:18:20] so about the same number of produce requests (I assume) [17:18:39] but, fewer packets sent from an22 back to vk on cp3021 [17:19:00] 5885 to cp3021 vs 11388 to cp3019 [17:19:27] you might want to actually check that assumption :) [17:19:50] hm [17:20:14] hm, yeah, i guess it'd be good to know exactly what data is supposed to go through this socket [17:20:21] i assume just produce requests and acks [17:20:24] but maybe more? [17:20:30] Snaps: come back to us! :) [17:22:20] are there retransmits? [17:22:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [17:25:06] ottomata: did the change to emery help? [17:25:35] paravoid, yes, more on cp3021, but there don't seem to be relatively that many [17:25:50] only retransmits for sending to an22 [17:25:54] rexmt data pkts: 128 [17:25:56] on cp3021 [17:25:57] and [17:26:03] rexmt data pkts: 14 [17:26:05] on cp3019 [17:26:19] sent about the same number of packets in this dump [17:26:30] but there were more retransmits on cp3021 [17:28:12] mark, as far as I can tell, in librdkafka, the only call to sendmsg comes from producing messages [17:30:57] hm ok, no, that is not true [17:31:05] the same code is used for sending and receiving from different buffers [17:31:11] there are metadata requests that come in and out [17:31:17] but those are in different threads on different sockets [17:31:41] so, for this socket, it should be just produce requests [17:33:13] I'm not on rt afaik [17:33:45] last week, and again next week, but not this week [17:34:19] physikerwelt: ^^ [17:35:40] matanya: [17:35:40] yup [17:35:40] https://graphite.wikimedia.org/render/?width=859&height=441&_salt=1392831325.675&target=reqstats.pageviews&from=00%3A00_20140218&until=23%3A59_20140219 [17:36:02] oh, well. thanks for the fix [17:36:35] Hey all, I'm getting an icinga critical about the raid in virt5, but I don't know how to tell which drive it's upset about. [17:36:37] Any ideas?
This is a cisco, unfortunately [17:40:30] um… RobH, cmjohnson1, anyone? [17:40:52] let me see [17:41:16] (PS1) EBernhardson: Update flow parsoid config in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114166 [17:43:12] urgh, ciscos [17:43:14] (CR) Matthias Mullie: [C: 2] Update flow parsoid config in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114166 (owner: EBernhardson) [17:43:23] dont we just use software raid on them? [17:43:25] (Merged) jenkins-bot: Update flow parsoid config in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114166 (owner: EBernhardson) [17:43:38] mdadm checks then no? [17:43:53] andrewbogott: ? [17:44:26] RobH: I don't… know anything [17:44:31] Just that icinga is upset [17:44:32] it is software raid [17:44:45] sda2[0](F) [17:46:14] Which drive would sda2 be? [17:46:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 680.166687 [17:46:47] speaking of drives, sbernardin, how's that ms-be disk going? [17:47:07] paravoid: that's on me..i told steve I would order it and forgot [17:47:13] ok :) [17:47:13] will do that [17:48:10] so we have not done a lot of checking [17:48:25] but i recall on initial system installs, the sda/sdb/sdc disks lined up to 0/1/2 [17:48:57] but not sure if its always the case [17:49:16] so usually have to do a check anyhow for device id before pulling. [17:49:32] ie: im not sure there is a nice way to put into icinga to say disk in slot 0 [17:49:41] since its not polling a hw raid controller for disk slot numbers [17:49:51] (PS1) Chad: Wikiversities and itwikiquote are done building [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114169 [17:49:52] andrewbogott: ^ sorry =P [17:50:12] there is a hw raid controller in the cisco systems though, we simply dont use it at all [17:50:19] cuz it is a shitty one that doesnt do much [17:50:36] it may be able to be polled, but it may also be more effort than its worth [17:50:51] matanya: this is minor, but if you make your topic branches a little less specific (rt-6845 vs. rt-6845 , bugzilla vs. bugzilla-files,..) then imho they are more useful because then i can just link to topic:foo in gerrit and get a nice list of related patches [17:50:58] all: see you on Friday, bbl [17:51:03] i think most folks know how to manually check before disk swap, and its not a normal platform where we expect to roll out more of them [17:51:17] hdparm doesn't give serial # [17:51:26] matanya: arr, first example should have been : rt-6845 vs. rt-6845-username [17:51:46] sure mutante noted and will adjust [17:51:50] thanks [17:52:13] thanks, was just an observation when i wanted to link to ALL those changes for one ticket [17:52:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [17:52:41] cya later [17:53:05] cmjohnson1: lshw -class disk [17:53:09] shows disk sn [17:53:26] but yea, hrmm
[17:54:00] cmjohnson1: which also shows what its mounted as [17:54:10] so the info is somewhere in there for parsing [17:54:24] i just use the lshw -class disk for when i manually check for the alerts [17:54:41] its more than likely not the best way. [17:55:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 107.866669 [17:56:25] robh: there is this /dev/disk/by-id/ [17:58:25] well, as long as sda always hits slot 0 then its easy [17:58:50] so if its true for all the existing ones, then its prolly fine. [17:58:52] sbernardin: i can try and make all the disks except sda blink [17:59:01] are you in the data center? [17:59:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:09:05] paravoid, mark, ottomata, Snaps: https://wikitech.wikimedia.org/wiki/File:Cp3019.png and https://wikitech.wikimedia.org/wiki/File:Cp3021.png [18:09:45] is that just packets sent? [18:11:57] 1sec [18:14:51] it's actually two lines, a green line that keeps track of the ACK values received and a yellow line that tracks the receive window but you have to zoom in to see that and that requires the original file and not a screen shot (the chart contains more, see bottom of http://www.tcptrace.org/manual/node12_mn.html) [18:19:48] paravoid: ping [18:19:54] pong [18:19:57] hey [18:20:07] hi [18:20:37] do you have some time to chat about debs? [18:20:46] not right now :( [18:21:13] k; do you have an idea on when would be better? [18:22:01] not really, I have a backlog [18:22:03] but I haven't forgot [18:22:09] I even mentioned it on the SoS [18:22:18] yeah, I saw ;) [18:22:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1306.699951 [18:22:50] I'll try to bug you just enough to maintain a position somewhere in the middle of the stack [18:22:59] ;) [18:23:00] haha [18:23:07] I'm sorry, I know it sucks [18:23:49] paravoid, is it ok with you if we go ahead and merge the current debianization & do a more thorough review later? [18:23:58] merge where? [18:24:03] in the parsoid repo [18:24:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:24:37] I wouldn't mind, but note that git-buildpackage and associated tools really prefer a separate branch for debian/ [18:24:38] https://gerrit.wikimedia.org/r/#/c/110666/ [18:24:57] yeah, git-buildpackage wants a separate upstream etc [18:25:07] so not ideal for native debian packages [18:25:23] nope [18:25:36] it is possible to configure it to use master for both though [18:25:53] the main thing it can give us is automatic tagging and changelog generation [18:25:55] that's not great [18:26:05] anything else? 
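[editor's note] To close out the virt5 question above without a hardware RAID controller to poll: Linux software RAID marks a failed member with "(F)" in /proc/mdstat — the "sda2[0](F)" pasted at [17:44:45] is that marker — and the physical slot is then found by matching the device's serial via lshw -class disk or the /dev/disk/by-id/ symlinks mentioned here. A small C sketch that prints the mdstat lines containing failed members; it assumes the standard "mdN : active raid1 sda2[0](F) sdb2[1]" layout.

    /* Print /proc/mdstat lines that mention a failed member "(F)".
     * Map the device name (e.g. sda) to a serial with
     * `lshw -class disk` or `ls -l /dev/disk/by-id/` to know which
     * tray to pull. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/mdstat", "r");
        char line[1024];

        if (!f) {
            perror("/proc/mdstat");
            return 1;
        }
        while (fgets(line, sizeof line, f))
            if (strstr(line, "(F)"))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }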
[18:26:24] I haven't looked at anything else :) [18:26:31] don't commit the .ex [18:26:40] .ex stands for "example" [18:26:56] and do not commit the .debhelper; basically do a debian/rules clean before you commit [18:26:58] yeah, I wanted to keep them in there so that they can serve as examples for a later iteration of the packaging [18:27:04] can move them to a subdir though [18:27:14] or re-generate them when needed [18:27:17] we can always regenerate them [18:27:17] yup [18:28:07] so far we don't really need git-buildpackage; normal dpkg-buildpackage works just fine [18:28:11] (.debhelper and .debhelper.log) [18:28:29] oh, you're using a native version number [18:28:30] so my question about what else was about what else git-buildpackage can help us with [18:28:34] that's Bad™ [18:28:46] why? [18:28:58] it is a native debian package after all [18:29:01] native packages are packages native to the distribution [18:29:13] yup [18:29:18] we are upstream etc [18:29:25] that's not what native means :) [18:29:41] well, that's a philosophical question [18:30:15] our goals is to distribute the debian dir along with the code, so that anybody can easily build their own debs after hacking the source [18:30:48] similar to many other projects that maintain a native debian package [18:31:04] all these packages are packagers' worst nightmares :) [18:31:17] so, as an example [18:31:21] I am a packager, and it works pretty well for me ;) [18:31:29] if Debian wants to package parsoid for inclusion into Debian proper [18:31:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1094.866699 [18:31:57] they'd have to redo the packaging (because parsoid being a native package is a big no-no for Debian, for starters) [18:31:58] greg-g, looks like we are not depl now, skipping [18:32:10] yurik: k, cool, thanks [18:32:19] but with upstream (you) shipping debian/, you have all kinds of dirty diffs [18:32:35] including things you /can't/ do with a diff, like removing a file (say, postinst) [18:32:46] paravoid, I did not notice anything about that no-no in the maintainer guide [18:32:54] trust me :) [18:33:16] Don't trust him, make him explain [18:33:25] yeah, got a link? [18:33:45] native packages is for software that has been written specifically for Debian [18:34:00] IMO we should aim to improve our packaging over time so that debian can just use it [18:34:08] I agree [18:34:13] they'll do it anyway [18:34:15] no need for a diff then [18:34:18] might just as well work together [18:35:08] well, not necessarily; someone could make an NMU, for instance [18:35:12] for a release critical bug [18:35:16] * gwicke is looking at https://wiki.debian.org/DebianMentorsFaq#What_is_the_difference_between_a_native_Debian_package_and_a_non-native_package.3F [18:35:25] also, a native package would never ever get into Debian [18:35:45] even if a DD with upload rights uploaded it, it would never get pass through the NEW queue [18:36:12] speaking of NEW, i see lfaraone is right here :) [18:36:29] so we have to pick between making things harder than necessary for devs and pleasing Debian? [18:36:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:36:36] I see at least 3 more DDs around here ;) [18:36:47] why harder? [18:36:52] paravoid: luke is particularly involved with ftp-masters though? 
[18:37:07] I think so too
[18:37:21] paravoid, to me just running dpkg-buildpackage in a checkout is about as easy as it can get
[18:37:26] PROBLEM - MySQL InnoDB on db1033 is CRITICAL: CRIT longest blocking idle transaction sleeps for 606 seconds
[18:37:42] so
[18:37:48] paravoid doesn't actually have time for this discussion right now
[18:37:49] it's also easier for us to work with in gerrit
[18:37:54] can we defer it to later please?
[18:37:57] gwicke: i'm jumping in the middle here but is there something wrong with git-buildpackage? (sp?)
[18:37:58] hi jeremyb
[18:38:02] thanks :)
[18:38:34] ohai lfaraone
[18:39:22] also keep in mind that our debian dir is not actually in the code repo
[18:39:23] So, yeah, parsoid shouldn't be debian-native, because debian-native is meant for things where the package is *released* to Debian as its own upstream, if that makes sense?
[18:39:53] like, the debian-policy package is debian-native, because by definition it is released once it lands in unstable, and you have to follow the debian-policy that is shipped with your release.
[18:40:49] so the preference is to use git-buildpackage with a separate upstream and debian branch?
[18:41:04] I'm afraid so :)
[18:41:07] with upstream really being master
[18:41:16] It is fine if your upstream tarball contains a "debian/" directory; if your package is non-native and using source format 3.0, it'll just get stripped out
[18:41:16] yup
[18:41:29] lfaraone, we don't do tarballs
[18:41:39] IMO that's silly if you are in git
[18:41:42] * lfaraone sobs softly.
[18:41:54] gwicke: do you tag releases?
[18:42:06] yes, when we build the deb
[18:42:20] that's something I hoped git-buildpackage can do for us
[18:42:29] gwicke: it isn't silly if, say, you're using autoconf and you want your users to just be able to ./configure but don't want autogenerated configure code in your repository.
[18:42:30] along with changelog generation from git logs
[18:43:28] did the MW maintainers ever respond?
[18:43:32] sure; we don't use autoconf though
[18:43:42] paravoid, sadly no
[18:43:56] (03PS1) 10Mark Bergsma: Rename subnet labs-hosts1-c-eqiad to labs-support1-c-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114179
[18:44:08] gwicke: was this on a list?
[18:44:18] (03CR) 10Mark Bergsma: [C: 032] Rename subnet labs-hosts1-c-eqiad to labs-support1-c-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114179 (owner: 10Mark Bergsma)
[18:44:22] jeremyb, I posted to the mediawiki package maintainer alias
[18:44:23] yeah, pkg-mediawiki-devel@lists.alioth.debian.org
[18:44:27] gwicke: so, if you're uploading this to *debian*, your debian/changelog should not be a list of things that changed in your release. The changelog is a list of things that changed in the *debian packaging* of the upstream work.
[18:44:42] mark, ottomata, paravoid, Snaps: check https://wikitech.wikimedia.org/wiki/File:Cp3021_outstanding_data.png and https://wikitech.wikimedia.org/wiki/File:Cp3019_outstanding_data.png
[18:45:01] lfaraone, we are uploading this to our own repo for now
[18:45:11] we haven't even agreed to that ;)
[18:45:32] so far in labs, but I hope that we can work out a longer-term solution
[18:45:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 715.133362
[18:45:42] nod
[18:45:58] labs has its own deb repos?
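A rough sketch of the tagging and changelog automation gwicke is asking about above; the exact command names depend on the git-buildpackage version (older releases ship git-dch and git-buildpackage, newer ones spell them gbp dch and gbp buildpackage), and the flags shown assume the branch layout just discussed:

    # generate debian/changelog entries from the git commits since the
    # last release tag, then build and tag the release in one pass
    git-dch --auto --release
    git-buildpackage --git-tag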
[18:46:16] I spent about 20 minutes setting up mini-dinstall on a vm
[18:46:41] I think download.mediawiki.org would be best to distribute this to users
[18:47:11] paravoid: i thought that was going away in favor of releases? (but maybe I got that wrong)
[18:47:19] yeah whatever replaces it :)
[18:47:20] i.e. releases.wm.o or something like that
[18:47:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[18:47:40] http://download.wikimedia.org/mediawiki/ redirects to http://dumps.wikimedia.org/mediawiki/ which I thought was sort of amusing.
[18:47:56] * odder remembers he once typed 'gerrit.wm.org' into his browser's address bar
[18:47:58] lfaraone: the index does, but http://download.wikimedia.org/mediawiki/1.22/mediawiki-1.22.2.tar.gz does not
[18:48:16] another option might be a separate distro in apt.wikimedia.org
[18:48:21] drdee, looking
[18:48:29] (03PS1) 10Mark Bergsma: Rename subnet labs-hosts1-c-eqiad to labs-support1-c-eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/114180
[18:48:48] no, because that's trusted by all of our servers (= root), plus you'd need high-maintenance stuff, like maintaining multiple sections per version etc.
[18:48:52] what am I looking at drdee?
[18:48:55] paravoid: yes it does :)
[18:48:55] HTTP request sent, awaiting response... 301 Moved Permanently\n Location: http://dumps.wikimedia.org/mediawiki/1.22/mediawiki-1.22.2.tar.gz [following]
[18:49:05] oh lol
[18:49:17] yeah whatever, I think it's being replaced as jeremyb said
[18:49:24] paravoid, yeah- I meant this for debs built from ops-controlled debian dirs
[18:49:25] RECOVERY - MySQL InnoDB on db1033 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[18:49:33] ottomata: The idea behind these graphs is to estimate the congestion window at the sender
[18:49:41] (03CR) 10Ragesoss: [C: 031] Enable EducationProgram on Dutch language Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/71605 (owner: 10Siebrand)
[18:50:29] outstanding means waiting to be acked?
[18:50:41] (03CR) 10Mark Bergsma: [C: 032] Rename subnet labs-hosts1-c-eqiad to labs-support1-c-eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/114180 (owner: 10Mark Bergsma)
[18:50:51] anyway, let's think about this some more & catch up later this week?
[18:50:55] yeah...
[18:51:01] ottomata: i would presume so
[18:51:38] drdee, do you know what the horizontal lines represent?
[18:51:55] i guess averages or normalization of some kind?
[18:55:50] yes those are moving averages
[18:55:53] Blue Line tracks the average outstanding data up to that point.
[18:56:03] lfaraone: one or both of those should also support HTTPS fwiw. but anyway, they come with sigs you can verify (not sure how well integrated into WoT they are though)
[18:56:05] Green Line tracks the weighted average of outstanding data up to that point.
[18:59:14] (03CR) 10Ottomata: "Ok, so." [operations/puppet] - 10https://gerrit.wikimedia.org/r/113966 (owner: 10Matanya)
[18:59:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 785.133362
[18:59:47] and yellow line?
[18:59:48] drdee?
[19:00:02] Yellow Line tracks the window advertised by the opposite end-point, i.e. the receive window
[19:00:27] hmmm
[19:00:32] oh and that is the same on both
[19:00:33] m
[19:00:35] right?
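For anyone wanting to reproduce charts like these, a sketch of the usual tcptrace workflow; the interface, port, and file names here are assumptions (-G asks tcptrace for all of its graph types, including the outstanding-data one, and the .xpl output is viewed with xplot):

    # capture the broker-bound traffic, then let tcptrace draw the graphs
    tcpdump -i eth0 -s 0 -w cp3021.pcap 'port 9092'
    tcptrace -G cp3021.pcap      # emits *_owin.xpl files, among others
    xplot a2b_owin.xpl           # outstanding-data chart for the first connection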
[19:00:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:00:46] 30K something
[19:03:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 900.06665
[19:06:29] Nemo_bis: Re archives, but then each rebuild would yield the same result, wouldn't it? Only the "stream" of new messages would be sorted differently? Do you have a link for Gloria's bug?
[19:12:36] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:17:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 680.700012
[19:21:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:27:36] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1454.233276
[19:30:48] scfc_de: not necessarily, if the behaviour changed since the archives were built
[19:32:46] scfc_de: bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=25231
[19:34:15] anyone know the story on shop.wikimedia.org being down?
[19:34:59] I only know they published something on Meta
[19:36:05] jgage, it's run by shopify. there were some issues with the site JS and CSS that they wanted to see resolved before bringing it back online. folks in #wikimedia-fundraising would know the details.
[19:36:19] "they" being wmf folks interfacing with shopify
[19:37:41] thanks, ok
[19:38:16] Nemo_bis: Oh, that just sucks. I just close my eyes and make very certain that I'll never ever link to lists.wikimedia.org.
[19:38:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:38:58] scfc_de: yes, that's safest
[19:39:35] Essentially it's as stable as linking to a pastebin
[19:40:03] Sadly, Gmane's search sometimes fails rather badly.
[19:41:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 714.299988
[19:46:33] (03PS1) 10Faidon Liambotis: Add asw-d-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114189
[19:47:09] (03CR) 10Faidon Liambotis: [C: 032] Add asw-d-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114189 (owner: 10Faidon Liambotis)
[19:51:07] apergos or anyone with image store knowledge: https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Available_on_the_App_Store_%28black%29_SVG.svg/253px-Available_on_the_App_Store_%28black%29_SVG.svg
[19:51:19] why do i get this error?
^
[19:52:33] matanya: doesn't it say itself :P
[19:52:35] https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Available_on_the_App_Store_%28black%29_SVG.svg/253px-Available_on_the_App_Store_%28black%29_SVG.svg.png
[19:52:51] svg thumb of an svg makes no sense, you want a png thumb of the svg
[19:53:02] * Nemo_bis is surprised how helpful and kind the error message is
[19:53:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:54:30] Nemo_bis: try to create a new size of https://commons.wikimedia.org/wiki/File:Athene_cunicularia_-near_Goiania,_Goias,_Brazil-8_edit.jpg
[19:55:19] (03PS1) 10Ottomata: Parameterizing 4 more replica server settings [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/114191
[19:56:25] (03PS2) 10Ottomata: Parameterizing 4 more replica server settings [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/114191
[19:56:29] matanya: done; so?
[19:56:40] i get this error
[19:56:49] I don't
[19:56:56] (03CR) 10Ottomata: [C: 032 V: 032] Parameterizing 4 more replica server settings [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/114191 (owner: 10Ottomata)
[19:57:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1189.199951
[19:59:11] (03PS4) 10Nemo bis: removed orion shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113637 (owner: 10Matanya)
[19:59:22] (03PS3) 10Nemo bis: remove smerritt shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113638 (owner: 10Matanya)
[19:59:34] (03PS3) 10Nemo bis: remove darrell shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113639 (owner: 10Matanya)
[19:59:43] (03PS5) 10Nemo bis: remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[19:59:56] (03CR) 10jenkins-bot: [V: 04-1] remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[20:00:18] Nemo_bis: you are breaking my commits? :P
[20:00:45] (03PS1) 10Ottomata: Updating kafka module and increasing replica.lag.max.messages to 10000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/114195
[20:02:22] matanya: more likely that they are breaking each other
[20:03:52] (03CR) 10Ottomata: [C: 032 V: 032] Updating kafka module and increasing replica.lag.max.messages to 10000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/114195 (owner: 10Ottomata)
[20:05:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:07:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:07:24] (03CR) 10Catrope: [C: 04-1] Factor out Parsoid config from VisualEditor config (034 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100 (owner: 10Jforrester)
[20:09:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:11:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:12:44] (03CR) 10GWicke: "We could actually consider enabling the parsoid PHP extension for all non-private wikis." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100 (owner: 10Jforrester)
[20:13:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:15:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:17:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:19:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:19:51] !log controlled shutdown of analytics1021 Kafka broker in order to reload configs for replica.lag.max.messages
[20:19:59] Logged the message, Master
[20:20:43] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[20:21:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:21:43] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 20.0
[20:23:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:25:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:26:33] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0
[20:26:43] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[20:27:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:29:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:30:34] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Wed Feb 19 20:30:31 UTC 2014
[20:31:33] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2311.60331508
[20:32:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:30:31 PM UTC
[20:32:43] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1274.133301
[20:34:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:30:31 PM UTC
[20:34:52] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1831.900024
[20:39:24] !log bd808 Started scap: no-diff scap to test script changes
[20:39:32] Logged the message, Master
[20:39:50] (03PS2) 10Ori.livneh: Add an eventual consistency call for deploy.deployment_server_init [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 (owner: 10Ryan Lane)
[20:39:58] greg-g, ori: ^
[20:40:23] (03PS3) 10Ryan Lane: Add an eventual consistency call for deploy.deployment_server_init [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749
[20:41:28] * bd808 grumbles about l10n cache generation speed
[20:43:50] (03CR) 10Ori.livneh: [C: 032] "I think that with a bit of discipline it is possible to make Puppet code idempotent, so that log items in Puppet runs represent actual mod" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 (owner: 10Ryan Lane)
[20:45:20] ori: 20:44:47 INFO - Finished mw-update-l10n (duration: 04m 51s)
[20:45:25] Timing data!
[20:47:42] 20:47:03 INFO - Finished scap-1 to proxies (duration: 02m 16s)
[20:48:05] bd808: for your reading pleasure: (pdf link)
[20:48:26] bd808: doesn't have a programmatic API -- that's the biggest downside
[20:49:25] neat-o re timing data
[20:51:46] (03CR) 10Ori.livneh: "Verified on tin:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 (owner: 10Ryan Lane)
[20:51:55] ori: Our scap-1 isn't deployed to the cluster ?!
[20:52:35] bd808: argh. yeah, it's not symlinked. let me figure out why
[20:52:42] There must be another role?
[20:52:57] modules/mediawiki/manifests/sync.pp, l35
[20:55:34] we'll deploy new parsoid code in a few minutes and will likely need root help for restarts (hopefully a last time)
[20:57:08] I guess apergos is off duty by now
[20:57:42] ori, can we draft you?
[20:57:53] no
[20:57:55] :)
[20:58:01] * bd808 still has an option to pick up ori's contract
[20:58:18] greg-g, who else?
[20:58:25] * greg-g mostly kids, just doesn't want ori to get too sucked into 'random other ops things'
[20:58:39] gwicke: ask ops :)
[20:58:46] * greg-g doesn't know
[20:58:49] * gwicke checks the channel name
[20:59:01] if what you need is easy! I can help :)
[20:59:08] !log initiating controlled shutdown of analytics1022 kafka broker in order to reload configs for replica.lag.max.messages
[20:59:12] "Platform operations" != "Platform" ;)
[20:59:15] Logged the message, Master
[20:59:18] ottomata, awesome, thanks!
[20:59:30] it won't be harder than 'service restart parsoid' as root
[20:59:44] * greg-g grumbles at core-features and core and platform ops and platform and mediawiki/wikimedia/wikiwikiwkikwikiwkikwiki
[21:00:05] greg-g, for me all that matters is that somebody has the root bit to run that command
[21:00:19] * ebernhardson wishes google would stop converting wikimedia into wikipedia
[21:00:21] * greg-g nods
[21:01:02] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:01:07] and deploying
[21:01:11] ebernhardson: y'know, i'm pretty sure that the WMF is a big enough player that you could make that happen if you really wanted ;)
[21:01:12] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Wed Feb 19 21:01:07 UTC 2014
[21:01:20] gwicke: yeah, technically, ori's root privs are "limited" (per his root approval), so having him do more random things effectively needs permission from opsen. (correct me if I'm wrong here, anyone)
[21:01:41] * bd808 has seen email threads that agree
[21:01:55] not saying this fits into that, just trying to protect ori from being pulled in 25 diff directions (vs his normal 22)
[21:01:57] greg-g, I know- it's just often hard to get hold of other roots
[21:02:12] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 24.0
[21:02:17] gwicke: that's an ops problem, unfortunately, and should be coordinated with them on
[21:02:31] gwicke, where will you need me to restart parsoid?
[21:02:36] gwicke: i'd do it, but i have to run in ~10 mins, and it would be unprofessional of me if parsoid was broken to just leave
[21:02:42] ah, ottomata's on it. great.
[21:02:45] thanks ottomata
[21:02:52] yup!
[21:02:53] !log fixing firewall bastion to payments deny clause to actually deny
[21:03:02] Logged the message, Mistress of the network gear.
[21:03:13] ottomata, just restarted the parsoids; some might not come back up right and will need to be restarted again
[21:03:19] looking at ganglia currently
[21:03:20] ah ok
[21:03:28] Deploy system requirement in the making… deployers can restart services being deployed
[21:03:30] gwicke: good luck, thanks for understanding
[21:03:32] and icinga
[21:03:54] wtp1015 and wtp1024 look critical
[21:04:02] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 42.0
[21:04:24] bd808: innnnteresting
[21:04:33] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused
[21:04:53] bd808, that's on the etherpad already..
[21:04:58] :)
[21:05:02] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:05:09] bd808: when's good for you for etherpad grooming?
[21:05:09] :)
[21:05:12] PROBLEM - Kafka Broker Messages In on analytics1022 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0
[21:05:32] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused
[21:05:43] greg-g: I was going to ask you that today. Friday?
[21:06:02] ottomata, can you do a 'service restart parsoid' on wtp1015 and wtp1024?
[21:06:12] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[21:06:13] greg-g: Once Ori leaves on vacation I'll be semi dead in the water for a week or so
[21:06:35] * greg-g checks when that is
[21:06:49] ahh, kk
[21:06:54] gwicke: done
[21:07:10] bd808: yeah, propose a time that works for you on gcal, I'm flexible on Fridays
[21:07:18] gwicke: I see the restart requirement on the etherpad. Thanks!
[21:07:19] !log deployed Parsoid deploy c73ea9d3 and code 76e9b66
[21:07:23] (03PS1) 10Ori.livneh: Symlink remaining scap scripts to the scap repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/114337
[21:07:28] Logged the message, Master
[21:07:32] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time
[21:07:32] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time
[21:07:35] (03PS2) 10Ori.livneh: Symlink remaining scap scripts to the scap repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/114337
[21:07:38] ottomata, thanks a bunch!
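A quick way to confirm the restarted workers are serving again, as a sketch; Parsoid's HTTP port is assumed to be 8000 here, matching the icinga HTTP checks above:

    # poke the freshly restarted workers; icinga should flip the Parsoid
    # checks back to RECOVERY shortly after
    for host in wtp1015 wtp1024; do
        printf '%s: ' "$host"
        curl -s -o /dev/null -w '%{http_code}\n' "http://$host:8000/"
    done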
[21:07:51] thanks ottomata
[21:08:13] yupyup
[21:09:12] RECOVERY - Kafka Broker Messages In on analytics1022 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1747.35780864
[21:09:40] (03CR) 10BryanDavis: [C: 031] Symlink remaining scap scripts to the scap repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/114337 (owner: 10Ori.livneh)
[21:09:52] (03CR) 10Ori.livneh: [C: 032] Symlink remaining scap scripts to the scap repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/114337 (owner: 10Ori.livneh)
[21:10:52] !log During scap: snapshot2: rsync error: timeout in data send/receive (code 30) at io.c(137) [sender=3.0.9]
[21:11:00] Logged the message, Master
[21:11:02] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 511.066681
[21:12:50] 21:12:40 INFO - Finished scap-1 to apaches (duration: 25m 37s)
[21:13:12] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:14:01] bd808: also a regular occurrence (snapshot2)
[21:15:02] ori: Yeah. Apergos commented on my rt ticket from last week that those boxes are just waiting to die.
[21:21:31] * bd808 notes that scap still takes too long
[21:31:25] 21:31:12 INFO - Finished scap-rebuild-cdbs (duration: 18m 31s)
[21:31:30] !log bd808 Finished scap: no-diff scap to test script changes (duration: 52m 06s)
[21:31:38] Logged the message, Master
[21:31:47] hideous!
[21:31:53] soooo loooong
[21:31:56] !log aaron synchronized php-1.23wmf14/includes/filebackend/SwiftFileBackend.php '10ba3d7caa1e4bb4d521384bebbf42976cea4a22'
[21:32:04] Logged the message, Master
[21:32:09] seriously, a no-op shouldn't take an hour
[21:32:24] It's l10n
[21:32:26] I think
[21:33:09] And the dsh fanout maybe. The next test will tell us more.
[21:33:11] I wonder if the default should be 'no l10n updates' but you can specify "yes, we need them" (ideally, it'd "just do the right thing")
[21:33:53] we can make it perform well enough to be the default
[21:34:12] better attitude
[21:34:14] Well the best thing would be to get switched over to something that actually transmits precomputed diffs … but not yet
[21:34:14] :)
[21:34:27] I'm not 100% positive that all parsoids were actually restarted
[21:35:32] yeah, dsh shows old parsoid processes
[21:36:09] could a root run 'dsh -g parsoid service parsoid restart'?
[21:36:28] greg-g, ori: ready for the second no-op scap?
[21:36:51] aye aye
[21:36:55] !log bd808 Started scap: no-diff scap to test script changes
[21:37:04] Logged the message, Master
[21:37:11] This time it should have lots more timing data
[21:37:43] !log during scap: snapshot3: ImportError: No module named argparse
[21:37:50] Logged the message, Master
[21:37:56] ottomata, can I bug you once more?
[21:39:25] sure
[21:39:45] this should get rid of all old processes: 'dsh -M -g parsoid killall nodejs'
[21:39:56] the upstart config respawns
[21:40:17] a sleep in there would be even better
[21:40:19] (03PS1) 10Ottomata: Adding controlled-shutdown command to debian/bin/kafka [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/114341
[21:40:36] 21:40:26 INFO - Finished scap-1 to apaches (duration: 03m 11s)
[21:40:36] dsh -M -g parsoid "sleep 5; killall nodejs"
[21:40:37] on those two nodes?
[21:40:43] That's better
[21:40:43] ah
[21:40:45] on all nodes
[21:40:48] where can I run that from?
[21:40:59] tin or bast1001 afaik
[21:41:01] (which machine has the dsh groups on it?)
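Spelled out, the rolling restart under discussion, as a sketch; it relies on dsh walking the group sequentially when -c is not given, on upstart's respawn stanza bringing parsoid back after the kill, and on the standard /etc/dsh/group/parsoid group file:

    # dsh form: one host at a time, sleeping remotely between kills
    dsh -M -g parsoid "sleep 5; killall nodejs"

    # an equivalent explicit loop, sleeping locally between hosts instead
    for h in $(cat /etc/dsh/group/parsoid); do
        ssh "$h" 'killall nodejs' && sleep 5
    done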
[21:41:01] ok
[21:41:26] the important thing is doing it sequentially
[21:41:43] so that only one node at a time goes down
[21:41:47] !log bd808 Finished scap: no-diff scap to test script changes (duration: 04m 52s)
[21:41:55] Logged the message, Master
[21:42:01] but the command above will do that
[21:42:07] ok running...
[21:42:10] greg-g: {{done}}
[21:43:06] (03PS2) 10Ottomata: Adding controlled-shutdown command to debian/bin/kafka [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/114341
[21:43:19] (03CR) 10Ottomata: [C: 032 V: 032] Adding controlled-shutdown command to debian/bin/kafka [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/114341 (owner: 10Ottomata)
[21:44:10] bd808: no l10n I suppose?
[21:44:41] Nope. Which shaved 47 minutes off the scap?!
[21:44:56] :/
[21:45:02] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:45:15] That's insane.
[21:45:21] yes
[21:45:30] The real test happens tomorrow ;)
[21:45:42] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused
[21:46:14] Reedy: Yes. Did you see that my attempt to fix the new branch bootstrapping bug merged?
[21:46:42] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time
[21:47:30] ok gwicke, done
[21:47:35] i did it in a for loop rather than dsh
[21:47:44] ah, k
[21:47:49] looks great, thanks!
[21:47:49] to make sure it was sequential and the sleep wasn't doing weird stuff
[21:47:56] that way I could sleep locally rather than remotely :p
[21:48:02] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1700.133301
[21:48:25] without -c dsh is sequential
[21:48:56] i think maybe i just wasn't getting any output and was not sure what was going on there, probably could have just added a hostname && to the command to see progress :p
[21:48:58] anyway
[21:48:58] done.
[21:49:12] ah, -M should have printed the host name
[21:49:23] yeah, done ;)
[21:49:32] we are now actually seeing new code
[21:53:35] (03PS1) 10Ottomata: Reducing varnishkafka queue_buffering_max_messages from 2M to 500K [operations/puppet] - 10https://gerrit.wikimedia.org/r/114346
[21:55:28] (03PS2) 10Ottomata: Reducing varnishkafka queue_buffering_max_messages from 2M to 500K [operations/puppet] - 10https://gerrit.wikimedia.org/r/114346
[21:56:07] (03CR) 10Ottomata: [C: 032 V: 032] "The large buffer size of 2M wasn't helping anyway, and only resulted in too much memory usage and long waits for queues to be emptied." [operations/puppet] - 10https://gerrit.wikimedia.org/r/114346 (owner: 10Ottomata)
[21:58:12] PROBLEM - Varnish HTTP parsoid-frontend on cp1058 is CRITICAL: Connection refused
[21:58:42] PROBLEM - Varnish HTTP parsoid-backend on cp1058 is CRITICAL: Connection refused
[21:59:02] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:59:42] RECOVERY - Varnish HTTP parsoid-backend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.002 second response time
[22:00:12] RECOVERY - Varnish HTTP parsoid-frontend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 641 bytes in 0.005 second response time
[22:05:34] (03PS1) 10Merlijn van Deen: Make local- part of project name optional [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/114350
[22:09:05] (ignore those parsoid varnish icinga warnings for now, Roan is clearing caches for us)
[22:10:22] PROBLEM - Varnish HTTP parsoid-backend on cp1045 is CRITICAL: Connection refused
[22:10:22] PROBLEM - Varnish HTTP parsoid-frontend on cp1045 is CRITICAL: Connection refused
[22:12:57] ganglia looks unhappy, returns XML errors
[22:13:22] RECOVERY - Varnish HTTP parsoid-backend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.001 second response time
[22:13:22] RECOVERY - Varnish HTTP parsoid-frontend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 643 bytes in 0.002 second response time
[22:13:33] hah, yeah
[22:14:43] Presumably that's because the varnishstat processes on those boxes are zombies now
[22:16:12] strange that this would take out all of ganglia though
[22:16:38] There was an error collecting ganglia data (127.0.0.1:8654): XML error: Invalid document end at 1
[22:18:06] Weird
[22:18:10] I restarted gmond
[22:18:18] And it seems to just always have a zombied varnishstat child
[22:18:19] (03PS1) 10Merlijn van Deen: Follow redirects when saving the page [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/114351
[22:20:23] (03CR) 10Merlijn van Deen: "Two notes:" [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/114351 (owner: 10Merlijn van Deen)
[22:51:44] (03PS3) 10Jforrester: Factor out Parsoid config from VisualEditor config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100
[22:51:51] (03CR) 10Jforrester: Factor out Parsoid config from VisualEditor config (033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100 (owner: 10Jforrester)
[23:08:37] (03PS4) 10Jforrester: Factor out Parsoid config from VisualEditor config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100
[23:25:07] ganglia still seems to be broken
[23:25:27] https://ganglia.wikimedia.org/latest/
[23:30:11] apergos: ^
[23:46:56] (03PS2) 10Gilles: Start sampling detailed network performance for Multimedia Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112452
[23:47:39] !log restarting gmetad on nickel
[23:47:48] Logged the message, Master
[23:51:10] * gwicke hoorays
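For the record, a sketch of the debugging this episode points at; the ps filter is one way to spot the zombied varnishstat children mentioned above, and the service names are assumptions (on Debian-style hosts the gmond init script is often called ganglia-monitor):

    # look for defunct varnishstat children hanging off gmond
    ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ && $4 == "varnishstat"'
    # bounce the collector on the affected cache host, then the aggregator
    service ganglia-monitor restart   # gmond on the varnish box with the zombies
    service gmetad restart            # on nickel, as in the !log entry above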