[00:00:59] (PS1) Reedy: WIP: Change wikiversions to use json [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114108 [00:01:11] That's my TODO [00:01:39] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 5.976 second response time [00:03:59] ^demon|away: the init script is gone, what happened [00:04:30] !log restarting gitblit on antimony [00:04:37] Logged the message, Master [00:04:55] ok, i see, init script info was outdated on https://wikitech.wikimedia.org/wiki/Gitblit [00:07:12] ok, it's back [00:07:39] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 152215 bytes in 6.430 second response time [00:08:10] and .. be back later .. [00:58:19] PROBLEM - Varnish HTTP text-backend on cp1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:58:29] PROBLEM - Varnish HTCP daemon on cp1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:58:39] PROBLEM - Varnish traffic logger on cp1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:09] RECOVERY - Varnish HTTP text-backend on cp1053 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.510 second response time [01:06:19] RECOVERY - Varnish HTCP daemon on cp1053 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [01:06:29] RECOVERY - Varnish traffic logger on cp1053 is OK: PROCS OK: 2 processes with command name varnishncsa [01:10:03] (PS2) Jforrester: Enable VisualEditor for legalteamwiki by default [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112717 [01:10:05] (PS2) Jforrester: Initial setup for legalteamwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112850 (owner: TTO) [01:10:42] (CR) Jforrester: "PS2 is a rebase." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112850 (owner: TTO) [01:10:57] <^d> I wonder if nsawiki uses VisualEditor. [01:11:26] ^d: I doubt it. [01:11:39] ^d: ISTR they were using FCKEditor. [01:11:50] <^d> Cirrus? [01:11:52] ^d: Which is OK if most people are just writing basic content. [01:11:53] <^d> Or lsearchd? [01:11:55] <^d> :) [01:12:08] (I duped their system set-up for ours.) [01:34:44] James_F: interesting, they are using CKEditor with Parsoid? [01:35:13] gwicke: No, directly. [01:35:23] gwicke: This was a few years back; Parsoid didn't exist. [01:36:03] hmm, they must have allowed tags then [01:36:15] or done some ad-hoc serialization [02:03:41] !log LocalisationUpdate completed (1.23wmf14) at 2014-02-19 02:03:41+00:00 [02:03:55] Logged the message, Master [02:16:59] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [02:27:49] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-19 02:27:49+00:00 [02:27:57] Logged the message, Master [02:55:36] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-19 02:55:36+00:00 [02:55:45] Logged the message, Master [03:02:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:22:49] morebots, are you any better over here? [03:22:50] I am a logbot running on tools-exec-02. [03:22:50] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [03:22:50] To log a message, type !log <msg>. [03:23:03] !log testing the log by logging a test [03:23:09] dammit [03:23:11] Logged the message, Master [03:23:20] ah!
[03:43:09] (PS4) MZMcBride: Gzip SVGs on front & back upload varnishes [operations/puppet] - https://gerrit.wikimedia.org/r/108484 (owner: Ori.livneh) [04:01:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [04:25:40] (PS2) Dzahn: remove db9 from dsh and dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/113082 [04:27:47] (CR) Springle: [C: 2] remove db9 from dsh and dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/113082 (owner: Dzahn) [04:56:52] (CR) Chad: [C: 2] Turn on checkDelay for a cirrus job [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113954 (owner: Manybubbles) [04:57:01] (Merged) jenkins-bot: Turn on checkDelay for a cirrus job [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113954 (owner: Manybubbles) [04:58:41] !log demon synchronized wmf-config/CirrusSearch-common.php 'Turn on checkDelay for cirrus links update secondary jobs' [04:58:50] Logged the message, Master [05:11:52] (Abandoned) Chad: Hold fluorine logs for 90 days instead of 180 days [operations/puppet] - https://gerrit.wikimedia.org/r/111127 (owner: Chad) [05:17:05] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [05:58:55] (PS1) Andrew Bogott: Handle role::labs::instance properly in pmtpa. [operations/puppet] - https://gerrit.wikimedia.org/r/114127 [06:01:38] !log flowdb schema changes gerrit 111671 [06:01:46] Logged the message, Master [06:07:16] (PS2) Andrew Bogott: Handle role::labs::instance properly in pmtpa. [operations/puppet] - https://gerrit.wikimedia.org/r/114127 [06:09:07] (CR) Andrew Bogott: [C: 2] Handle role::labs::instance properly in pmtpa. [operations/puppet] - https://gerrit.wikimedia.org/r/114127 (owner: Andrew Bogott) [06:10:05] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [06:11:15] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 36.02 ms [06:35:25] PROBLEM - DPKG on cp1054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:38:45] RECOVERY - Varnish HTCP daemon on cp1054 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [06:38:45] RECOVERY - Varnish traffic logger on cp1054 is OK: PROCS OK: 2 processes with command name varnishncsa [06:39:25] RECOVERY - DPKG on cp1054 is OK: All packages OK [06:39:45] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.002 second response time [06:42:35] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [06:42:59] ] [06:43:05] that's me [06:43:28] if you're not careful i'll corner you with some geoip patches :P [06:43:35] :P [06:46:26] honestly, I don't yet even understand why that's being done in varnish. I guess something to do with geoip-varied content? [06:50:13] bblack: VCL is the new JavaScript [06:50:21] hah [06:50:49] < bblack> that's me <--- the 5xx spike? [06:51:17] * jeremyb points to https://gdash.wikimedia.org/dashboards/reqerror/ for the more visual version [06:51:26] jeremyb: definitely the cp1054 stuff above it, which I assume is the cause of the spike...
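[editor's note] The "inline C is 'C{' away" quip a few messages above refers to a real feature of the Varnish of this era: VCL lets you embed raw C between C{ ... }C markers, which is part of why "VCL is the new JavaScript". A minimal sketch, assuming Varnish 3-style VCL; the syslog call is purely illustrative, not anything that was deployed.

    # VCL with an embedded C block; the C is compiled straight into
    # the generated VCL code.
    C{
        #include <syslog.h>
    }C

    sub vcl_deliver {
        C{
            /* runs on every delivery -- illustrative only */
            syslog(LOG_INFO, "vcl_deliver hit");
        }C
    }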
[06:51:44] well i got a 503 from frontend 1065 [06:51:57] basically web applications have gotten so amazingly complicated that it's hard to do anything without making the walls fall down on your head [06:52:12] varnish is just too tempting [06:52:33] inline C is 'C{' away [06:52:40] hmmm I haven't gotten to 1065 in my list yet [06:52:56] was 8 mins ago [06:53:00] I'm rolling through them upgrading the varnish package, sometimes there are complications, but it's rare [06:53:07] !log s5 slaves online reindexing wikidatawiki wb_terms [06:53:14] Logged the message, Master [06:53:17] got it again from 1065 [06:53:58] now i got 1054 [06:54:12] anyway, i can't imagine it's not also affecting other people besides me [06:54:12] maybe it's unrelated to what I'm doing [06:54:36] hmmm [06:54:48] at the varnish level, I've upgraded 1054 (which had some complications along the way, above), but I haven't touched 1065 yet. both look basically sane [06:54:50] 1052 [06:55:04] (PS4) Matanya: remove shell access and key for mgrover [operations/puppet] - https://gerrit.wikimedia.org/r/113636 [06:55:11] 1068 [06:56:12] can anyone replicate this? [06:56:21] what are you doing to replicate it? [06:56:30] loading https://en.wikipedia.org/wiki/Abraham_Lincoln while logged in [06:57:32] hmmmm [06:57:59] 1054 again [06:58:31] it's not broke every time. but maybe more than half [06:58:35] well, what's reported in the 5xx page is the frontend, but it would be more interesting to know which backend [06:58:45] right [06:59:02] i can give XID... [06:59:26] X-Cache header doesn't list the backend. only frontend. [07:00:07] yeah, I got it too [07:00:08] but if i'm logged in do i hit a backend for that? or i go straight to app servers? [07:00:18] well [07:00:39] I'm reproducing the same thing on the same article logged in (https) [07:00:45] what's odd is I always see this: [07:00:45] Forwarded for: 127.0.0.1, 69.203.7.188, 172.8.229.29, 208.80.154.75 [07:00:58] the 172.x is me, and the 208.x is ssl1005 [07:01:22] but I don't have a proxy on localhost, and the 69.x is some Comcast customer IP, which isn't even my ISP? [07:01:55] Who is Brandon Black? [07:01:57] Is he new? [07:02:09] Bsadowski1: on the 1 year scale, yes :) [07:02:15] Ah :) [07:02:16] uhhh, so 127.0.0.1 could be localssl? [07:02:17] Welcome! [07:02:25] [07:02:31] thanks [07:02:45] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: Connection refused [07:02:48] yeah but I'm on AT&T, why is a comcast customer IP in an XFF header for me? [07:03:20] bblack: so maybe the frontends' healthchecks on the backends are failing? [07:03:56] ori: varnishadm debug.health [07:05:12] I'm going to start downgrading upgraded hosts and see if 5xx goes away, for now [07:07:05] also, varnishlog -i Backend_health | ts | fgrep -e 'Backend_health' [07:08:30] bblack: so, 1) this happened with a range of hosts not just upgraded ones. 2) maybe this is the sort of thing that will recover as cache fills in after your restarts but doing more restarting will prolong the problem? [07:08:46] * jeremyb is too sleepy to usefully debug. good luck [07:09:08] jeremyb: well, a range of mixed (upgraded -vs- not) frontends, but they could be hashing only to upgraded backends when hitting 5xx [07:09:27] oh, i didn't realize you had also upgraded backends [07:09:41] well, backend varnish I mean, there's two layers of varnish. not backend as in apps [07:09:49] yes, i understand :) [07:10:27] (there can be even more than 2 layers AIUI. e.g.
ams/sfo -> frontend -> backend) [07:10:42] well, when I upgrade a machine it upgrades both the front+back on that host, I mean [07:10:52] oh, riiiight [07:11:24] hmmm [07:11:40] I just went to check 5xx and it seems to have dropped back off. I didn't even get halfway through the downgrading process... [07:13:42] I'm just going to finish downgrading and then look at this again in the morning, safest option for now [07:13:49] the timing can't be coincidental [07:14:41] * jeremyb wonders if there's a handbook to introduce new people to all the kinds of outages. e.g. michael jackson/barack obama [07:14:46] :D [07:15:07] was i right about ~1 year? [07:15:24] yeah, I started on Apr 1 last year [07:15:46] they had to put a sentence in my offer letter that it wasn't an april fool's joke lol :) [07:16:38] uhhh, you've seen our jokes? [07:17:02] michael jackson/barack obama. essentially cache stampede: https://wikitech.wikimedia.org/wiki/PoolCounter [07:17:06] back on topic, though: this upgrade is a version that's been running on ulsfo a while. I wouldn't have expected to run into any major problems [07:17:27] huh [07:17:41] i guess it's the only version ulsfo ever ran? [07:17:44] it could be something subtle, or just a one-off glitch with one misbehaving backend shortly after upgrade that hashed some popular articles + Abraham_Lincoln? [07:17:49] right [07:18:21] (well, there's been one small patch since ulsfo's version, but that only affects mobile) [07:18:56] but anyways, whatever it is, it's too subtle for 1:18AM here, so I'll just sort it out in detail tomorrow and leave them back on the versions they had earlier for tonight [07:19:12] https://en.wikipedia.org/wiki/Wikipedia:Today%27s_featured_article/April_1,_2007 is pretty good [07:19:14] yeah, wish i could offer some insight [07:19:24] https://en.wikipedia.org/w/index.php?title=Wikipedia:Today%27s_featured_article/April_1,_2007&diff=119303193&oldid=119292660 [07:19:51] good night moon [07:20:09] bye jeremyb [07:20:15] * ori heads off too [07:20:55] matanya: https://gerrit.wikimedia.org/r/#/c/108484/ reminds me, RT: in footer *does* work, better amend those multiple changes with a ticket in common to ease search [07:21:38] Nemo_bis: did i say it doesn't? see my commit above [07:25:27] yes you did [07:26:55] hmm, if you say so :) [07:33:11] ori: is the package wikimedia-job-runner installed on any job runner? i would like to remove the absent statement [08:05:44] akosiaris: morning. want to have a deal? [08:08:44] (CR) Gilles: "Who can +2 this?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112452 (owner: Gilles) [08:08:57] good morning. a deal ? what kind of a deal ? [08:13:37] (PS1) KartikMistry: Remove $wgULSNoWebfontsSelectors [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114131 [08:18:05] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [08:20:29] akosiaris: you donate 5 minutes to glance at https://etherpad.wikimedia.org/p/absents and i'll push patches to all un-needed statements, works for you? :) [08:23:35] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [08:24:05] RECOVERY - Disk space on virt6 is OK: DISK OK [08:24:28] matanya: it wont take 5 minutes for sure but I will give it a try [08:25:21] thanks, i only asked for 5 minutes, since not all of the stuff there is ops related. and i don't want to waste too much of your time [08:29:25] meh...
those mailman archives take forever to be regenerated... [08:30:21] (CR) Nikerabbit: [C: 1] Remove $wgULSNoWebfontsSelectors [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114131 (owner: KartikMistry) [08:32:02] so cool: http://people.debian.org/~stapelberg//2014/02/19/SSH-keys-in-your-TPM.html [08:41:33] So, I need help to get https://gerrit.wikimedia.org/r/114131 deployed. [08:41:58] Nikerabbit suggested to me that it has to be deployed first and can then be merged. [08:45:38] Sounds unlikely [08:46:48] https://wikitech.wikimedia.org/wiki/How_to_do_a_configuration_change [08:47:29] Last edit is by him one week ago :) not so unreasonably outdated hopefully [08:51:28] kart_: review first, then it should be merged only if someone is going to deploy it right after [08:53:29] Nikerabbit: ah. Thanks for clarification! [08:55:28] Nemo_bis: :) [09:06:57] It would be nice if we kept discussions for changes to somewhere other than CR notes… Since they aren't searchable [09:15:02] (PS1) Matanya: jenkins: remove accesslog rotation absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114135 [09:20:40] (PS1) Matanya: standard-packages: remove absent statements of unused packages [operations/puppet] - https://gerrit.wikimedia.org/r/114136 [09:28:22] (PS1) Matanya: parsoid: remove init script absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114137 [09:52:34] (PS27) Alexandros Kosiaris: site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [09:53:11] (CR) jenkins-bot: [V: -1] site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [09:54:27] (CR) Hashar: [C: -1] "Please keep the comment around for later reference :]" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/114135 (owner: Matanya) [09:56:51] hashar_: i'll rephrase it as FIXME, is this ok? [09:57:30] matanya: whatever works for you, as long as you keep the comment :-] [09:57:38] thanks :) [09:58:00] and yeah, one day I will look at Jenkins init script and logrotation [09:58:22] hashar_: btw, why SIGALRM ? [09:58:27] (PS28) Alexandros Kosiaris: site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [09:58:50] let's try and diff the freaking catalogs now... [09:59:58] (PS2) Matanya: jenkins: remove accesslog rotation absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114135 [10:20:16] akosiaris: Jenkins responds to SIGALRM for log rotation apparently [10:20:57] the jenkins source tree has material for both rpm and deb packages [10:21:04] and apparently the init scripts / logrotate files are not shared [10:21:16] so they might fix a bug in the rpm logrotate while leaving the issue in the deb logrotate :D [10:21:26] example of bug fixed for the rpm logrotate : https://github.com/jenkinsci/jenkins/commit/8fb2f5162e499b5c0c4dc8f88af761b7c490bc38 [10:22:26] hmmm ok. But your comment says that SIGALRM kills jenkins, so it does not work as expected, right ? [10:50:45] akosiaris: yup [10:51:07] akosiaris: I think SIGALRM is sent to start-stop-daemon instead of jenkins and that kills it [10:51:12] akosiaris: got to investigate that in labs [10:51:23] akosiaris: and that is at the bottom of my pile :-] [10:52:38] hashar: I kind of doubt start-stop-daemon is even there to receive the signal in the first place.
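[editor's note] The Jenkins logrotate thread above hinges on basic signal semantics: SIGALRM's default disposition is to terminate the receiving process. Jenkins installs its own handler that reopens the access log, so if the logrotate postrotate hook delivers the signal to the wrong process (e.g. a wrapper, or whatever a stale pidfile points at), that process simply dies. A minimal C sketch of the difference; the log-reopen body is a placeholder, not Jenkins's actual handler.

    /* sigalrm_demo.c: without the signal() line, `kill -ALRM <pid>`
     * terminates this process -- SIGALRM's default action is "Term". */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t reopen = 0;

    static void on_alarm(int signo)
    {
        (void)signo;
        reopen = 1;              /* only async-signal-safe work here */
    }

    int main(void)
    {
        signal(SIGALRM, on_alarm);   /* comment out to see the default */
        for (;;) {
            pause();                 /* sleep until a signal arrives */
            if (reopen) {
                reopen = 0;
                fprintf(stderr, "would reopen access log here\n");
            }
        }
    }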
[10:52:51] anyway, if you need help investigating this, let me know [10:56:05] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:10] ^ me [10:57:00] akosiaris: thanks :) [11:00:45] RECOVERY - Host virt1001 is UP: PING WARNING - Packet loss = 93%, RTA = 0.25 ms [11:01:07] matanya: do you see whitespace at https://gerrit.wikimedia.org/r/#/c/114135/2/modules/jenkins/manifests/init.pp ? :-] [11:01:24] arrg [11:01:28] * matanya is blind [11:01:28] :D [11:01:41] I got vim to highlight them in red as well [11:02:06] yes, me too. but it helps if you look [11:02:45] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:04] (PS3) Matanya: jenkins: remove accesslog rotation absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114135 [11:04:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1525.93335 [11:06:35] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:07:49] You can always enable the default pre-commit hook; that guards against committing changes with trailing whitespace. [11:10:25] (CR) Hashar: [C: 1] "Thank you. Feel free to merge at anytime." [operations/puppet] - https://gerrit.wikimedia.org/r/114135 (owner: Matanya) [11:19:05] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC [11:22:16] ACKNOWLEDGEMENT - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC alexandros kosiaris decommissioned [11:22:54] "alexandros kosiaris decommissioned" [11:23:09] we need a new alex [11:26:15] finally! I was beginning to fear that would never happen :-) [11:26:45] PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (208.80.154.18) [11:26:51] ^ also me [11:26:55] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.19) [11:28:15] oooooops [11:28:25] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [11:28:42] We changed http://lists.wikimedia.org/pipermail/wikimedia-l/ to use lighttpd? [11:28:59] Can we please make it display archives by date, decreasing? [11:29:35] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:29:36] Now it's 2013-October / 2013-September / 2014-February / 2014-January [11:29:39] doesn't make sense. [11:29:44] odder: I am reconstructing the archives of that specific list [11:30:16] i'd say that is why you have that view. [11:30:53] okay [11:36:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [11:37:21] akosiaris: reconstructing the archives? tell me you're kidding please [11:38:20] or did you find a magic solution that allows not to break links? [11:39:12] Nemo_bis: a) I am not kidding, b) please tell me you are kidding [11:39:41] How do you mean, reconstructing the archives? [11:39:42] what links get broken ?
[11:39:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1118.800049 [11:39:56] /usr/lib/mailman/bin/arch wikimedia-l [11:40:11] https://wikitech.wikimedia.org/wiki/Remove_a_message_from_mailing_list_archive [11:40:19] after changing something in mbox per legal's request [11:40:42] ok that is what i did :-) [11:41:26] all links are definitely broken [11:41:29] I want to cry [11:41:49] luckily I almost never link pipermail, because it gets broken so often [11:42:19] * odder still seeing lighttpd [11:42:41] that only means you're missing the index, the messages are mostly there already [11:42:44] odder: in 2012, please hold on [11:42:52] they've all been renumbered [11:43:59] What does one have a legal department for if they can't even defend our pipermail [11:46:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [11:49:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1733.599976 [11:57:37] (CR) Alexandros Kosiaris: PostgreSQL/Postgis module (7 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/112469 (owner: Alexandros Kosiaris) [12:01:19] odder: done [12:04:16] A pipermail renumber? Didn't mutante screw that up some years ago as well? [12:04:56] yeah. Seems like I am following in his steps [12:05:37] * Nemo_bis proposes to nuke archives rebuild script from orbit [12:06:16] Well, if you have the backup, it should be amendable: Take the backup, follow the instructions in Nemo's link and regenerate the archives. That should make the links work again. [12:06:49] He said he's already following that [12:07:32] Not deleting the message is a required condition, not a sufficient condition though. It's not clear how numbers are chosen by the script, doesn't seem to be deterministic (!) [12:07:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:08:18] Just the other day mutante said so in the channel, I was lazy and I didn't document it "who can ever think of rebuilding mailman archives in 2014 anyway" [12:08:43] (Without looking at it:) Isn't it just the sequence number in the mbox file? 1, 2, 3, etc.? [12:08:54] i'd wish [12:09:40] Someone got a link to the script? [12:09:42] wikimedia-l archives started from about 60k, among other things [12:10:01] (i.e. the number of the last foundation-l post I think) [12:11:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 3007.600098 [12:19:16] (CR) Nemo bis: "https://gerrit.wikimedia.org/r/#/c/114142/" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107008 (owner: Reedy) [12:21:17] Hmmm. Mailman/Archiver/pipermail.py looks pretty unrandom to me. [12:21:35] Whatever; Gmane rules. [12:26:26] Viva gmail [12:26:30] * gmane [12:26:35] * Nemo_bis slaps self [12:26:54] with a large trout ? ah...
memories [12:29:04] (PS8) Alexandros Kosiaris: PostgreSQL/Postgis module [operations/puppet] - https://gerrit.wikimedia.org/r/112469 [12:42:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [12:48:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1492.199951 [13:04:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:09:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1862.666626 [13:17:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:21:40] (PS1) Hashar: misc::mwlib::packages: explode list of packages [operations/puppet] - https://gerrit.wikimedia.org/r/114143 [13:26:59] (PS1) Hashar: contint: get ocaml-nox package for Math texvc [operations/puppet] - https://gerrit.wikimedia.org/r/114144 [13:27:01] (CR) Matanya: "duplicate of https://gerrit.wikimedia.org/r/#/c/111619/" [operations/puppet] - https://gerrit.wikimedia.org/r/114143 (owner: Hashar) [13:27:40] (Abandoned) Hashar: misc::mwlib::packages: explode list of packages [operations/puppet] - https://gerrit.wikimedia.org/r/114143 (owner: Hashar) [13:27:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 996.06665 [13:28:50] (CR) Hashar: mwlib: lint (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111619 (owner: Matanya) [13:31:29] hashar: i did 2 back indents, thanks [13:31:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:33:31] (PS2) Matanya: mwlib: lint [operations/puppet] - https://gerrit.wikimedia.org/r/111619 [13:35:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 163.5 [13:36:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:45:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 789.666687 [13:48:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [13:52:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 480.733337 [13:53:23] (PS29) Alexandros Kosiaris: site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [13:53:49] matanya: ^ this might be the last.
I closed another set of holes and am doing a final test [13:54:07] * matanya prays and crosses fingers [13:57:44] (CR) Alexandros Kosiaris: [C: -1] parsoid: remove init script absent statement (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/114137 (owner: Matanya) [13:59:50] (PS2) Matanya: parsoid: remove init script absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114137 [14:01:26] (CR) Alexandros Kosiaris: [C: 2] contint: get ocaml-nox package for Math texvc [operations/puppet] - https://gerrit.wikimedia.org/r/114144 (owner: Hashar) [14:02:28] (CR) Alexandros Kosiaris: [C: 2] parsoid: remove init script absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114137 (owner: Matanya) [14:04:31] (CR) Alexandros Kosiaris: [C: 2] jenkins: remove accesslog rotation absent statement [operations/puppet] - https://gerrit.wikimedia.org/r/114135 (owner: Matanya) [14:04:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:08:49] (PS1) coren: Make manage-nfs-volumes handle broken instances [operations/puppet] - https://gerrit.wikimedia.org/r/114146 [14:10:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1057.06665 [14:12:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:16:27] (CR) coren: [C: 2] "Trivial fix is trivial." [operations/puppet] - https://gerrit.wikimedia.org/r/114146 (owner: coren) [14:19:51] (PS1) coren: Fix the bugfix in manage-nfs-volumes [operations/puppet] - https://gerrit.wikimedia.org/r/114147 [14:21:18] (CR) coren: [C: 2] "Death to typos!" [operations/puppet] - https://gerrit.wikimedia.org/r/114147 (owner: coren) [14:22:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1143.800049 [14:23:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:27:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 690.700012 [14:30:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:32:05] (PS1) coren: Tool Labs: Install tabix on exec_environ [operations/puppet] - https://gerrit.wikimedia.org/r/114148 [14:39:00] (CR) coren: [C: 2] "Simple package addition." [operations/puppet] - https://gerrit.wikimedia.org/r/114148 (owner: coren) [14:39:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 408.166656 [14:39:50] ottomata: ping [14:40:13] hiya [14:40:19] hi [14:40:29] cp3021 is flapping all day [14:43:08] ok thanks [14:48:36] Snaps: you around? [14:50:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:58:34] (PS1) coren: Labs: Fix permissions for nfsmanager [operations/puppet] - https://gerrit.wikimedia.org/r/114151 [14:59:49] (CR) coren: [C: 2] Labs: Fix permissions for nfsmanager [operations/puppet] - https://gerrit.wikimedia.org/r/114151 (owner: coren) [15:00:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 576.099976 [15:01:03] ottomata: anything I can help with?
[15:01:07] even brainstorming :) [15:03:16] probably yeah, pretty stumped right now, hang on, in standup [15:03:20] mark and I looked at this for a bit on sunday [15:03:26] sorry, monday [15:05:38] paravoid here are details: [15:05:40] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp30(19%7C2%5B0-2%5D)&mreg%5B%5D=kafka.rdkafka.brokers.*.rtt.avg&gtype=line&glegend=show&aggregate=1 [15:05:58] rtt is the time it takes for a kafka broker to ack a message batch back to varnishkafka [15:06:00] so far [15:06:07] this only happens on messages sent to analytics1022 [15:06:09] not analytics1021 [15:06:18] and, it usually only happens during high traffic time [15:06:24] and only on one or a few of the bits varnishes [15:06:26] not all of them [15:06:31] but it could be any of them [15:06:47] mark and I couldn't detect anything wrong with network connectivity to analytics1022 [15:06:54] compared to analytics1021 [15:06:58] but they are in different rows [15:07:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:08:25] (CR) Jgreen: [C: 1] mwlib: lint [operations/puppet] - https://gerrit.wikimedia.org/r/111619 (owner: Matanya) [15:13:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1205.800049 [15:14:53] ja paravoid, it is very strange too [15:15:02] because even though it does happen during high load [15:15:21] i don't think that more traffic is the cause, i think that these brokers and producers can handle more traffic [15:15:23] on monday [15:15:30] I depooled the offending varnish nonde [15:15:30] node [15:15:39] and let its vk buffer slowly empty [15:15:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:15:53] when I depooled the node, that balanced more traffic on the remaining 3 bits varnishes [15:16:05] brokers, producers, leaders, elections, ... meh [15:16:06] giving them each more traffic than they usually serve [15:16:13] did you investigate the code to figure out what those "txerrs" mean exactly? [15:16:18] yes [15:16:50] didn't help too much, txerrs happen in varnishkafka when the buffer is full [15:17:07] which buffer exactly? the 2M one? [15:17:09] rd_kafka_produce, which is the async call to add a message to the buffer [15:17:51] yes [15:17:54] ok [15:18:21] so yeah, when that is full, rd_kafka_produce will return -1, and varnishkafka will txerr++ [15:18:24] so does the buffer get "flushed" as soon as it can [15:18:28] (so as fast as tcp allows) [15:18:35] or does it wait on other events? [15:18:57] not sure... [15:19:13] reading, but that means I have to understand rdkafka buffers and produce threads...:) [15:19:21] so on varnishkafka nodes where the buffer is NOT full (under 2 million messages) we shouldn't be seeing any txerrs, right? [15:21:19] that's right [15:21:25] is that indeed the case? [15:21:28] well, ok so, that's not 100% true [15:21:28] but in our case yes [15:21:44] i'm reading more, i think there can be txerrs that are not caused by buffer full [15:22:00] but they will output a different error code and message when that happens [15:22:05] if you find any nodes with non-full buffers that have txerrs you'll know that for sure [15:22:06] hang on, double checking something...
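[editor's note] A sketch of the produce path being described above, against the librdkafka C API of this period; the rk/rkt setup is omitted and assumed. rd_kafka_produce() only enqueues onto the local buffer (whose depth is the queue.buffering.max.messages setting, the "2M" discussed here); when that buffer is full the call returns -1 immediately, and that rejection is what varnishkafka counts as txerr.

    #include <errno.h>
    #include <librdkafka/rdkafka.h>

    /* Returns 0 if the message was queued locally, -1 if it was
     * dropped -- roughly the txerr++ case discussed above. */
    int produce_one(rd_kafka_t *rk, rd_kafka_topic_t *rkt,
                    char *buf, size_t len)
    {
        if (rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA,
                             RD_KAFKA_MSG_F_COPY,  /* librdkafka copies buf */
                             buf, len,
                             NULL, 0,              /* no key */
                             NULL) == -1) {
            /* errno is ENOBUFS when the local queue is full, i.e. the
             * broker is acking more slowly than we produce -- exactly
             * the cp3021 symptom. */
            return -1;
        }
        rd_kafka_poll(rk, 0);   /* serve delivery reports, non-blocking */
        return 0;
    }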
[15:22:08] ok [15:23:43] txerrs: [15:23:43] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp30(19%7C2%5B0-2%5D)&mreg%5B%5D=kafka.varnishkafka.*.txerr.per_second&gtype=line&glegend=show&aggregate=1 [15:23:52] ok still checking code for something, hang on [15:28:34] (trying to find where this one error message comes from and grepping is not finding it!...) [15:28:57] it might be a standard unix error message [15:29:20] man 3 errno [15:32:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 762.666687 [15:33:19] ahhh, yeah maybe so [15:33:33] ah! [15:33:36] thank you :) [15:33:36] ENOBUFS No buffer space available [15:33:52] magnus uses both errno and his own error codes and strings in some places [15:33:56] this case uses both [15:33:59] cool danke [15:34:01] ok yeah [15:34:11] so, yes is the answer to your question, double checked a bunch [15:34:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:34:54] the txerrs we are seeing are because queue.buffering.max.messages is reached [15:35:03] that's the 2M we have set right now [15:35:21] i'm trying to confirm this as well, although something doesn't look right [15:35:40] i think that outbuf_cnt in the varnishkafka stats is this queue [15:35:52] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=cp.%2B&mreg[]=kafka.rdkafka.brokers..%2B%5C.outbuf_cnt&z=large&gtype=line&title=kafka.rdkafka.brokers..%2B%5C.outbuf_cnt&aggregate=1&r=hour [15:36:03] I need to check with Snaps about that to be sure, also reading code to find out [15:36:48] but yeah, if so, you can see in this graph that cp3021 -> analytics1022 is the only queue with messages sitting around in it [15:37:02] and of course cp3021 -> analytics1022 is the only connection with a high rtt [15:37:06] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp30(19%7C2%5B0-2%5D)&mreg%5B%5D=kafka.rdkafka.brokers.*.rtt.avg&gtype=line&glegend=show&aggregate=1 [15:38:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 315.799988 [15:39:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:44:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 731.93335 [15:45:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:46:50] sorry, i'm doing other stuff in the mean time [15:46:54] just trying to give you pointers to debug :) [15:47:46] ottomata: so, you're often seeing *unix* error message ENOBUFS? [15:48:19] if that's not generated by rdkafka but is instead coming from the tcp stream [15:48:23] that might explain some things [15:48:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1021.400024 [15:49:48] I'm back [15:51:29] hm, lemme check, i think rdkafka is manually generating it [15:51:55] yeah [15:52:11] if msg_cnt + 1 > queue_buffering_max_msgs then errno = ENOBUFS; return -1; [15:52:33] ok [15:52:39] what is that queue_buffering_max_msgs set to then?
[15:52:44] that's the 2M [15:52:52] ok [15:53:01] so, we can ignore that part, we need to figure out why the flushing of that queue is so slow [15:53:09] it's slower than the producing of messages [15:53:17] and we need to figure out why [15:53:27] yeah [15:53:28] so why can't rdkafka send fast enough [15:53:32] is it blocked on tcp? [15:53:45] or does it wait on something arbitrary? [15:54:06] seems unlikely, mainly since the bits varnishes should be handling the same amount of traffic, and should all be networked identically [15:54:20] well [15:54:22] the fact that this only happens for traffic bound to an22 is very suspicious [15:54:28] we need to figure out what exactly is going on anyway [15:54:35] yes [15:54:37] they SHOULD be identical but they aren't [15:54:49] can be anything, varnish server differences, network differences, broker differences [15:55:08] so figuring out where exactly the wait/block is is crucial [15:55:17] might simply be tcp [15:55:30] "simply", diagnosing tcp behaviour can be quite tricky too ;) [15:55:49] might just be the broker ack's being too slow, and then we need to figure out why that is [15:57:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:58:33] strace might be helpful here [16:03:19] (CR) Alexandros Kosiaris: [C: 2] "After too much deliberation, catalog compilations and diffs, merging this finally. Keep an eye for errors" [operations/puppet] - https://gerrit.wikimedia.org/r/109507 (owner: Matanya) [16:05:04] (PS1) Andrew Bogott: Remove autofs references for new, puppetized mounts in eqiad. [operations/puppet] - https://gerrit.wikimedia.org/r/114155 [16:05:18] yeah mark i'm stracing now [16:05:23] had to find the threads that were actually producing [16:06:03] (CR) Andrew Bogott: [C: 2] Remove autofs references for new, puppetized mounts in eqiad. [operations/puppet] - https://gerrit.wikimedia.org/r/114155 (owner: Andrew Bogott) [16:06:05] i don't really see much difference, kinda hard to say though [16:06:32] maybe a lot more EAGAINs on recvmsg calls on the thread that is sending to an22 vs the thread that is sending to an21, buuut, maybe not [16:06:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 733.266663 [16:06:47] that's strace -c ? [16:06:52] scfc_de: I wonder if the renumbering failure may have something to do with https://bugs.launchpad.net/mailman/+bug/265829/comments/1 [16:06:55] -p [16:07:07] ah will do -c [16:08:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:09:13] Nemo_bis: *Is* there a renumbering failure (if nobody messes with the source mbox)? [16:09:28] mark: https://gist.github.com/ottomata/9095219 [16:10:04] scfc_de: define mess with? [16:10:13] i guess more calls for an22 because it has more in the queue to send and wait for responses [16:10:29] yes [16:10:33] whoa, but only one sendmsg to an21 in about 20 secs? [16:10:38] what are those recvmsg errors, EAGAIN? [16:10:47] yes that does seem suspicious [16:10:57] yeah, but those might be normal, i do see those on an21 thread as well [16:11:00] the EAGAINs [16:12:13] on recvmsg that probably just means "no new data yet" on a nonblocking socket [16:12:24] Nemo_bis: Not replace the message body with "This message has been deleted", but delete the whole message, shuffle the messages around, etc.
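[editor's note] On the EAGAIN question in the [16:10]-[16:12] exchange: for a socket opened non-blocking, recvmsg()/recv() returning -1 with errno EAGAIN (or EWOULDBLOCK) means only "no data available yet", so a burst of EAGAINs in strace is the poll loop spinning, not a fault. A minimal C sketch of the idiom, assuming fd was set O_NONBLOCK.

    #include <errno.h>
    #include <sys/socket.h>

    /* Returns bytes read (>0), 0 on orderly peer close, -1 on a real
     * error, and -2 for "nothing yet, go back to poll()". */
    ssize_t try_recv(int fd, void *buf, size_t len)
    {
        ssize_t n = recv(fd, buf, len, 0);
        if (n >= 0)
            return n;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return -2;           /* not an error on a non-blocking fd */
        return -1;
    }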
[16:12:33] hm, i'm running with -C now so I can also see output, the an21 thread looks much different than it did a few minutes ago [16:12:39] now it just mainly has polls with 1sec timeouts happening [16:12:40] but there are polls also [16:12:43] hardly any real activity [16:13:02] scfc_de: akosiaris said he didn't mess with the mbox then [16:13:13] gonna see what this looks like on another box [16:14:24] Nemo_bis: Didn't he say he deleted a message? [16:15:47] scfc_de: not AFAICS [16:17:08] (PS1) Chad: Give wikiversities Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114156 [16:18:46] ok updated https://gist.github.com/ottomata/9095219 [16:19:07] ah i'm going to sort those calls by alpha rather than count [16:19:49] well, just make them consistent at least [16:20:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 628.93335 [16:21:03] Nemo_bis: I just grepped all archives for 2012/wikimedia-l (http://lists.wikimedia.org/pipermail/wikimedia-l/) and none of them has the word "deleted" in them that seems to have been redacted. So I think akosiaris deleted the message itself. akosiaris, can you solve that mystery, please? [16:21:05] Nemo_bis: scfc_de: no. I changed 10-15 characters in a message to X [16:21:17] oh hm, mark, maybe that does make sense, since max buffer queue messages I think is global to all brokers [16:21:17] if it is already full [16:21:22] i specifically avoided deleting a message [16:21:22] with messages all bound for analytics1022 [16:21:29] new messages are going to come in and just be dropped [16:21:37] so, there won't be any sends to an21 [16:21:40] since all the msgs are dropped [16:21:42] anyway [16:21:47] all the sends will be whatever is in the queue [16:21:48] akosiaris: a) Perfect, thanks. b) Then I don't know why they have been renumbered. [16:22:02] which is all bound to an22 [16:22:12] (Though I don't have old links at hand that used to work to test this.) [16:22:45] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:22:54] scfc_de: I will restore the archive from backup and we will be ok. Finishing something else first though [16:23:26] akosiaris: fantastic, thanks [16:24:32] (PS1) Jgreen: work around for bayes_999 snafu until SpamAssassin fixes their nightly rule updates [operations/puppet] - https://gerrit.wikimedia.org/r/114158 [16:24:49] LOL " This line has "-" instead of white space, and the seconds field has added hundredths of a second" https://mail.python.org/pipermail/mailman-users/2007-September/058248.html [16:27:50] (CR) Jgreen: [C: 2 V: 1] work around for bayes_999 snafu until SpamAssassin fixes their nightly rule updates [operations/puppet] - https://gerrit.wikimedia.org/r/114158 (owner: Jgreen) [16:33:25] hey [16:33:40] (PS2) Chad: Give wikiversities and itwikiquote Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114156 [16:34:58] ottomata/Mark: I think we need to diagnose TCP for this issue. [16:36:48] ottomata: re outbufs: outbufs may (will) contain multiple messages. up to 252 of them per outbuf. [16:39:35] ottomata, mark: can we use tcpdump + tcptrace perhaps?
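[editor's note] On the interleaved mailman thread above (nothing deleted, yet every article renumbered — the resolution comes at [16:41] below): the likely mechanism is the mbox format itself. An mbox reader treats any line beginning with "From " at column 0 as the start of a new message, and pipermail numbers articles by position in that stream, so a single unescaped "From " inside a message body (mailman normally stores such lines as ">From ") splits one message in two and shifts every later article number. A toy C counter demonstrating the format rule; mailman's real code is Python, this is only an illustration.

    /* Count mbox "articles" the way a naive splitter sees them:
     *     ./a.out < wikimedia-l.mbox
     * One stray unescaped "From " line in a body inflates the count
     * and renumbers every article after it. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[8192];
        long articles = 0;

        while (fgets(line, sizeof line, stdin))
            if (strncmp(line, "From ", 5) == 0)
                articles++;

        printf("%ld articles\n", articles);
        return 0;
    }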
[16:40:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 856.633362 [16:41:08] akosiaris, scfc_de, I'm now rather sure the issue is https://mail.python.org/pipermail/mailman-users/2013-June/075304.html [16:41:56] We even have a bug, filed by Gloria, about unescaped From: breaking the archives. It was "fixed" by the last archive rebuild, but I don't think the underlying mbox was actually repaired. [16:44:41] https://wikitech.wikimedia.org/w/index.php?title=Remove_a_message_from_mailing_list_archive&diff=99904&oldid=99885 [16:45:09] (CR) MarkTraceur: "People with deploy rights...because we need to sync it right after we merge, IIRC." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112452 (owner: Gilles) [16:45:22] It seems all archive havoc happens in February or March https://wikitech.wikimedia.org/w/index.php?title=Remove_a_message_from_mailing_list_archive&action=history [16:48:57] Snaps: sorry, was afk for a sec [16:52:37] (PS1) coren: Revert "Handle role::labs::instance properly in pmtpa." [operations/puppet] - https://gerrit.wikimedia.org/r/114160 [16:53:06] (PS2) coren: Revert "Handle role::labs::instance properly in pmtpa." [operations/puppet] - https://gerrit.wikimedia.org/r/114160 [16:53:47] (CR) coren: [C: 2] "That change was done in LDAP instead." [operations/puppet] - https://gerrit.wikimedia.org/r/114160 (owner: coren) [16:57:36] Snaps: I'm looking into tcptrace, lemme know if there is anything in particular you want me to look at [16:58:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:59:12] (Abandoned) Hashar: Add dependencies for Wikimetrics to Jenkins to enable CI. [operations/puppet] - https://gerrit.wikimedia.org/r/90684 (owner: Diederik) [17:01:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 918.266663
status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1335: active_shards: 3932: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 2 [17:13:26] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1340: active_shards: 3947: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [17:14:11] (CR) OliverKeyes: [C: 1] "Looks good; Opsen will strongly approve of standardising, I suspect ;p." [operations/puppet] - https://gerrit.wikimedia.org/r/111619 (owner: Matanya) [17:14:35] PROBLEM - DPKG on helium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:15:13] Snaps: https://gist.github.com/ottomata/9096667 [17:15:30] <^d> !log creating elasticsearch indexes for all wikiversities, may see some intermittent icinga spam as shards rebalance [17:15:35] RECOVERY - DPKG on helium is OK: All packages OK [17:15:38] Logged the message, Master [17:16:40] whoa interesting, Snaps, mark [17:17:04] my tcpdump on cp3021 did capture fewer packets, but still [17:17:29] packets sent from an22 -> cp3021 are proportionally way less than packets sent from an22 -> cp3019 [17:17:37] (cp3021 is the current host with problems) [17:18:00] both tcpdumps ran for about the same amount of time [17:18:14] cp3021 -> an22 and cp3019 -> an22 saw about the same number of packets [17:18:20] so about the same number of produce requests (I assume) [17:18:39] but, fewer packets sent from an22 back to vk on cp3021 [17:19:00] 5885 to cp3021 vs 11388 to cp3019 [17:19:27] you might want to actually check that assumption :) [17:19:50] hm [17:20:14] hm, yeah, i guess it'd be good to know exactly what data is supposed to go through this socket [17:20:21] i assume just produce requests and acks [17:20:24] but maybe more? [17:20:30] Snaps: come back to us! :) [17:22:20] are there retransmits? [17:22:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [17:25:06] ottomata: did the change to emery help? [17:25:35] paravoid, yes, more on cp3021, but there don't seem to be relatively that many [17:25:50] only retransmits for sending to an22 [17:25:54] rexmt data pkts: 128 [17:25:56] on cp3021 [17:25:57] and [17:26:03] rexmt data pkts: 14 [17:26:05] on cp3019 [17:26:19] sent about the same number of packets in this dump [17:26:30] but there were more retransmits on cp3021 [17:28:12] mark, as far as I can tell, in librdkafka, the only call to sendmsg comes from producing messages [17:30:57] hm ok, no, that is not true [17:31:05] the same code is used for sending and receiving from different buffers [17:31:11] there are metadata requests that come in and out [17:31:17] but those are in different threads on different sockets [17:31:41] so, for this socket, it should be just produce requests [17:33:13] I'm not on rt afaik [17:33:45] last week, and again next week, but not this week [17:34:19] physikerwelt: ^^ [17:35:40] matanya: [17:35:40] yup [17:35:40] https://graphite.wikimedia.org/render/?width=859&height=441&_salt=1392831325.675&target=reqstats.pageviews&from=00%3A00_20140218&until=23%3A59_20140219 [17:36:02] oh, well. thanks for the fix [17:36:35] Hey all, I'm getting an icinga critical about the raid in virt5, but I don't know how to tell which drive it's upset about. [17:36:37] Any ideas?
This is a cisco, unfortunately [17:40:30] um… RobH, cmjohnson1, anyone? [17:40:52] let me see [17:41:16] (PS1) EBernhardson: Update flow parsoid config in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114166 [17:43:12] urgh, ciscos [17:43:14] (CR) Matthias Mullie: [C: 2] Update flow parsoid config in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114166 (owner: EBernhardson) [17:43:23] dont we just use software raid on them? [17:43:25] (Merged) jenkins-bot: Update flow parsoid config in beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114166 (owner: EBernhardson) [17:43:38] mdadm checks then no? [17:43:53] andrewbogott: ? [17:44:26] RobH: I don't… know anything [17:44:31] Just that icinga is upset [17:44:32] it is software raid [17:44:45] sda2[0](F) [17:46:14] Which drive would sda2 be? [17:46:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 680.166687 [17:46:47] speaking of drives, sbernardin, how's that ms-be disk going? [17:47:07] paravoid: that's on me..i told steve I would order it and forgot [17:47:13] ok :) [17:47:13] will do that [17:48:10] so we have not done a lot of checking [17:48:25] but i recall on initial system installs, the sda/sdb/sdc disks lined up to 0/1/2 [17:48:57] but not sure if its always the case [17:49:16] so usually have to do a check anyhow for device id before pulling. [17:49:32] ie: im not sure there is a nice way to put into icinga to say disk in slot 0 [17:49:41] since its not polling a hw raid controller for disk slot numbers [17:49:51] (PS1) Chad: Wikiversities and itwikiquote are done building [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114169 [17:49:52] andrewbogott: ^ sorry =P [17:50:12] there is a hw raid controller in the cisco systems though, we simply dont use it at all [17:50:19] cuz it is a shitty one that doesnt do much [17:50:36] it may be able to be polled, but it may also be more effort than its worth [17:50:51] matanya: this is minor, but if you make your topic branches a little less specific (rt-6845 vs. rt-6845 , bugzilla vs. bugzilla-files,..) then imho they are more useful because then i can just link to topic:foo in gerrit and get a nice list of related patches [17:50:58] all: see you on Friday, bbl [17:51:03] i think most folks know how to manually check before disk swap, and its not a normal platform where we expect to roll out more of them [17:51:17] hdparm doesn't give serial # [17:51:26] matanya: arr, first example should have been : rt-6845 vs. rt-6845-username [17:51:46] sure mutante noted and will adjust [17:51:50] thanks [17:52:13] thanks, was just an observation when i wanted to link to ALL those changes for one ticket [17:52:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [17:52:41] cya later [17:53:05] cmjohnson1: lshw -class disk [17:53:09] shows disk sn [17:53:26] but yea, hrmm
[17:54:00] cmjohnson1: which also shows what its mounted as [17:54:10] so the info is somewhere in there for parsing [17:54:24] i just use the lshw -class disk for when i manually check for the alerts [17:54:41] its more than likely not the best way. [17:55:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 107.866669 [17:56:25] robh: there is this /dev/disk/by-id/ [17:58:25] well, as long as sda always hits slot 0 then its easy [17:58:50] so if its true for all the existing ones, then its prolly fine. [17:58:52] sbernardin: i can try and make all the disks except sda blink [17:59:01] are you in the data center? [17:59:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:09:05] paravoid, mark, ottomata, Snaps: https://wikitech.wikimedia.org/wiki/File:Cp3019.png and https://wikitech.wikimedia.org/wiki/File:Cp3021.png [18:09:45] is that just packets sent? [18:11:57] 1sec [18:14:51] it's actually two lines, a green line that keeps track of the ACK values received and a yellow line that tracks the receive window but you have to zoom in to see that and that requires the original file and not a screen shot (the chart contains more, see bottom of http://www.tcptrace.org/manual/node12_mn.html) [18:19:48] paravoid: ping [18:19:54] pong [18:19:57] hey [18:20:07] hi [18:20:37] do you have some time to chat about debs? [18:20:46] not right now :( [18:21:13] k; do you have an idea on when would be better? [18:22:01] not really, I have a backlog [18:22:03] but I haven't forgot [18:22:09] I even mentioned it on the SoS [18:22:18] yeah, I saw ;) [18:22:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1306.699951 [18:22:50] I'll try to bug you just enough to maintain a position somewhere in the middle of the stack [18:22:59] ;) [18:23:00] haha [18:23:07] I'm sorry, I know it sucks [18:23:49] paravoid, is it ok with you if we go ahead and merge the current debianization & do a more thorough review later? [18:23:58] merge where? [18:24:03] in the parsoid repo [18:24:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:24:37] I wouldn't mind, but note that git-buildpackage and associated tools really prefer a separate branch for debian/ [18:24:38] https://gerrit.wikimedia.org/r/#/c/110666/ [18:24:57] yeah, git-buildpackage wants a separate upstream etc [18:25:07] so not ideal for native debian packages [18:25:23] nope [18:25:36] it is possible to configure it to use master for both though [18:25:53] the main thing it can give us is automatic tagging and changelog generation [18:25:55] that's not great [18:26:05] anything else? 
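[editor's note] To close out the virt5 question above without a hardware RAID controller to poll: Linux software RAID marks a failed member with "(F)" in /proc/mdstat — the "sda2[0](F)" pasted at [17:44:45] is that marker — and the physical slot is then found by matching the device's serial via lshw -class disk or the /dev/disk/by-id/ symlinks mentioned here. A small C sketch that prints the mdstat lines containing failed members; it assumes the standard "mdN : active raid1 sda2[0](F) sdb2[1]" layout.

    /* Print /proc/mdstat lines that mention a failed member "(F)".
     * Map the device name (e.g. sda) to a serial with
     * `lshw -class disk` or `ls -l /dev/disk/by-id/` to know which
     * tray to pull. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/mdstat", "r");
        char line[1024];

        if (!f) {
            perror("/proc/mdstat");
            return 1;
        }
        while (fgets(line, sizeof line, f))
            if (strstr(line, "(F)"))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }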
[18:26:24] I haven't looked at anything else :) [18:26:31] don't commit the .ex [18:26:40] .ex stands for "example" [18:26:56] and do not commit the .debhelper; basically do a debian/rules clean before you commit [18:26:58] yeah, I wanted to keep them in there so that they can serve as examples for a later iteration of the packaging [18:27:04] can move them to a subdir though [18:27:14] or re-generate them when needed [18:27:17] we can always regenerate them [18:27:17] yup [18:28:07] so far we don't really need git-buildpackage; normal dpkg-buildpackage works just fine [18:28:11] (.debhelper and .debhelper.log) [18:28:29] oh, you're using a native version number [18:28:30] so my question about what else was about what else git-buildpackage can help us with [18:28:34] that's Bad™ [18:28:46] why? [18:28:58] it is a native debian package after all [18:29:01] native packages are packages native to the distribution [18:29:13] yup [18:29:18] we are upstream etc [18:29:25] that's not what native means :) [18:29:41] well, that's a philosophical question [18:30:15] our goals is to distribute the debian dir along with the code, so that anybody can easily build their own debs after hacking the source [18:30:48] similar to many other projects that maintain a native debian package [18:31:04] all these packages are packagers' worst nightmares :) [18:31:17] so, as an example [18:31:21] I am a packager, and it works pretty well for me ;) [18:31:29] if Debian wants to package parsoid for inclusion into Debian proper [18:31:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1094.866699 [18:31:57] they'd have to redo the packaging (because parsoid being a native package is a big no-no for Debian, for starters) [18:31:58] greg-g, looks like we are not depl now, skipping [18:32:10] yurik: k, cool, thanks [18:32:19] but with upstream (you) shipping debian/, you have all kinds of dirty diffs [18:32:35] including things you /can't/ do with a diff, like removing a file (say, postinst) [18:32:46] paravoid, I did not notice anything about that no-no in the maintainer guide [18:32:54] trust me :) [18:33:16] Don't trust him, make him explain [18:33:25] yeah, got a link? [18:33:45] native packages is for software that has been written specifically for Debian [18:34:00] IMO we should aim to improve our packaging over time so that debian can just use it [18:34:08] I agree [18:34:13] they'll do it anyway [18:34:15] no need for a diff then [18:34:18] might just as well work together [18:35:08] well, not necessarily; someone could make an NMU, for instance [18:35:12] for a release critical bug [18:35:16] * gwicke is looking at https://wiki.debian.org/DebianMentorsFaq#What_is_the_difference_between_a_native_Debian_package_and_a_non-native_package.3F [18:35:25] also, a native package would never ever get into Debian [18:35:45] even if a DD with upload rights uploaded it, it would never get pass through the NEW queue [18:36:12] speaking of NEW, i see lfaraone is right here :) [18:36:29] so we have to pick between making things harder than necessary for devs and pleasing Debian? [18:36:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:36:36] I see at least 3 more DDs around here ;) [18:36:47] why harder? [18:36:52] paravoid: luke is particularly involved with ftp-masters though? 
[18:37:07] I think so too
[18:37:21] paravoid, to me just running dpkg-buildpackage in a checkout is about as easy as it can get
[18:37:26] PROBLEM - MySQL InnoDB on db1033 is CRITICAL: CRIT longest blocking idle transaction sleeps for 606 seconds
[18:37:42] so
[18:37:48] paravoid doesn't actually have time for this discussion right now
[18:37:49] it's also easier for us to work with in gerrit
[18:37:54] can we defer it to later please?
[18:37:57] gwicke: i'm jumping in the middle here but is there something wrong with git-buildpackage? (sp?)
[18:37:58] hi jeremyb
[18:38:02] thanks :)
[18:38:34] ohai lfaraone
[18:39:22] also keep in mind that our debian dir is not actually in the code repo
[18:39:23] So, yeah, parsoid shouldn't be debian-native, because debian-native is meant for things where the package is *released* to Debian as its own upstream, if that makes sense?
[18:39:53] like, the debian-policy package is debian-native, because by definition it is released once it lands in unstable, and you have to follow the debian-policy that is shipped with your release.
[18:40:49] so the preference is to use git-buildpackage with a separate upstream and debian branch?
[18:41:04] I'm afraid so :)
[18:41:07] with upstream really being master
[18:41:16] It is fine if your upstream tarball contains a "debian/" directory; if your package is non-native and using source format 3.0, it'll just get stripped out
[18:41:16] yup
[18:41:29] lfaraone, we don't do tarballs
[18:41:39] IMO that's silly if you are in git
[18:41:42] * lfaraone sobs softly.
[18:41:54] gwicke: do you tag releases?
[18:42:06] yes, when we build the deb
[18:42:20] that's something I hoped git-buildpackage can do for us
[18:42:29] gwicke: it isn't silly if, say, you're using autoconf and you want your users to just be able to ./configure but don't want autogenerated configure code in your repository.
[18:42:30] along with changelog generation from git logs
[18:43:28] did the MW maintainers ever respond?
[18:43:32] sure; we don't use autoconf though
[18:43:42] paravoid, sadly no
[18:43:56] (03PS1) 10Mark Bergsma: Rename subnet labs-hosts1-c-eqiad to labs-support1-c-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114179
[18:44:08] gwicke: was this on a list?
[18:44:18] (03CR) 10Mark Bergsma: [C: 032] Rename subnet labs-hosts1-c-eqiad to labs-support1-c-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114179 (owner: 10Mark Bergsma)
[18:44:22] jeremyb, I posted to the mediawiki package maintainer alias
[18:44:23] yeah, pkg-mediawiki-devel@lists.alioth.debian.org
[18:44:27] gwicke: so, if you're uploading this to *debian*, your debian/changelog should not be a list of things that changed in your release. The changelog is a list of things that changed in the *debian packaging* of the upstream work.
[18:44:42] mark, ottomata, paravoid, Snaps: check https://wikitech.wikimedia.org/wiki/File:Cp3021_outstanding_data.png and https://wikitech.wikimedia.org/wiki/File:Cp3019_outstanding_data.png
[18:45:01] lfaraone, we are uploading this to our own repo for now
[18:45:11] we haven't even agreed to that ;)
[18:45:32] so far in labs, but I hope that we can work out a longer-term solution
[18:45:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 715.133362
[18:45:42] nod
[18:45:58] labs has its own deb repos?
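A rough sketch of the tagging and changelog automation gwicke is asking about above; the exact command names depend on the git-buildpackage version (older releases ship git-dch and git-buildpackage, newer ones spell them gbp dch and gbp buildpackage), and the flags shown assume the branch layout just discussed:

    # generate debian/changelog entries from the git commits since the
    # last release tag, then build and tag the release in one pass
    git-dch --auto --release
    git-buildpackage --git-tag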
[18:46:16] I spent about 20 minutes setting up mini-dinstall on a vm
[18:46:41] I think download.mediawiki.org would be best to distribute this to users
[18:47:11] paravoid: i thought that was going away in favor of releases? (but maybe I got that wrong)
[18:47:19] yeah whatever replaces it :)
[18:47:20] i.e. releases.wm.o or something like that
[18:47:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[18:47:40] http://download.wikimedia.org/mediawiki/ redirects to http://dumps.wikimedia.org/mediawiki/ which I thought was sort of amusing.
[18:47:56] * odder remembers he once typed 'gerrit.wm.org' into his browser's address bar
[18:47:58] lfaraone: the index does, but http://download.wikimedia.org/mediawiki/1.22/mediawiki-1.22.2.tar.gz does not
[18:48:16] another option might be a separate distro in apt.wikimedia.org
[18:48:21] drdee, looking
[18:48:29] (03PS1) 10Mark Bergsma: Rename subnet labs-hosts1-c-eqiad to labs-support1-c-eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/114180
[18:48:48] no, because that's trusted by all of our servers (= root), plus you'd need high-maintenance stuff, like maintaining multiple sections per version etc.
[18:48:52] what am I looking at drdee?
[18:48:55] paravoid: yes it does :)
[18:48:55] HTTP request sent, awaiting response... 301 Moved Permanently\n Location: http://dumps.wikimedia.org/mediawiki/1.22/mediawiki-1.22.2.tar.gz [following]
[18:49:05] oh lol
[18:49:17] yeah whatever, I think it's being replaced as jeremyb said
[18:49:24] paravoid, yeah- I meant this for debs built from ops-controlled debian dirs
[18:49:25] RECOVERY - MySQL InnoDB on db1033 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[18:49:33] ottomata: The idea behind these graphs is to estimate the congestion window at the sender
[18:49:41] (03CR) 10Ragesoss: [C: 031] Enable EducationProgram on Dutch language Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/71605 (owner: 10Siebrand)
[18:50:29] outstanding means waiting to be acked?
[18:50:41] (03CR) 10Mark Bergsma: [C: 032] Rename subnet labs-hosts1-c-eqiad to labs-support1-c-eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/114180 (owner: 10Mark Bergsma)
[18:50:51] anyway, let's think about this some more & catch up later this week?
[18:50:55] yeah...
[18:51:01] ottomata: i would presume so
[18:51:38] drdee, do you know what the horizontal lines represent?
[18:51:55] i guess averages or normalization of some kind?
[18:55:50] yes those are moving averages
[18:55:53] Blue Line tracks the average outstanding data up to that point.
[18:56:03] lfaraone: one or both of those should also support HTTPS fwiw. but anyway, they come with sigs you can verify (not sure how well integrated into WoT they are though)
[18:56:05] Green Line tracks the weighted average of outstanding data up to that point.
[18:59:14] (03CR) 10Ottomata: "Ok, so." [operations/puppet] - 10https://gerrit.wikimedia.org/r/113966 (owner: 10Matanya)
[18:59:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 785.133362
[18:59:47] and yellow line?
[18:59:48] drdee?
[19:00:02] Yellow Line tracks the window advertised by the opposite end-point, i.e. the receive window
[19:00:27] hmmm
[19:00:32] oh and that is the same on both
[19:00:33] m
[19:00:35] right?
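For anyone wanting to reproduce charts like these, a sketch of the usual tcptrace workflow; the interface, port, and file names here are assumptions (-G asks tcptrace for all of its graph types, including the outstanding-data one, and the .xpl output is viewed with xplot):

    # capture the broker-bound traffic, then let tcptrace draw the graphs
    tcpdump -i eth0 -s 0 -w cp3021.pcap 'port 9092'
    tcptrace -G cp3021.pcap      # emits *_owin.xpl files, among others
    xplot a2b_owin.xpl           # outstanding-data chart for the first connection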
[19:00:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:00:46] 30K something
[19:03:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 900.06665
[19:06:29] Nemo_bis: Re archives, but then each rebuild would yield the same result, wouldn't it? Only the "stream" of new messages would be sorted differently? Do you have a link for Gloria's bug?
[19:12:36] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:17:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 680.700012
[19:21:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:27:36] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1454.233276
[19:30:48] scfc_de: not necessarily, if the behaviour changed since the archives were built
[19:32:46] scfc_de: bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=25231
[19:34:15] anyone know the story on shop.wikimedia.org being down?
[19:34:59] I only know they published something on Meta
[19:36:05] jgage, it's run by shopify. there were some issues with the site JS and CSS that they wanted to see resolved before bringing it back online. folks in #wikimedia-fundraising would know the details.
[19:36:19] "they" being wmf folks interfacing with shopify
[19:37:41] thanks, ok
[19:38:16] Nemo_bis: Oh, that just sucks. I just close my eyes and make very certain that I'll never ever link to lists.wikimedia.org.
[19:38:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:38:58] scfc_de: yes, that's safest
[19:39:35] Essentially it's as stable as linking to a pastebin
[19:40:03] Sadly, Gmane's search sometimes fails rather badly.
[19:41:45] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 714.299988
[19:46:33] (03PS1) 10Faidon Liambotis: Add asw-d-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114189
[19:47:09] (03CR) 10Faidon Liambotis: [C: 032] Add asw-d-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114189 (owner: 10Faidon Liambotis)
[19:51:07] apergos or anyone with image store knowledge: https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Available_on_the_App_Store_%28black%29_SVG.svg/253px-Available_on_the_App_Store_%28black%29_SVG.svg
[19:51:19] why do i get this error?
^
[19:52:33] matanya: doesn't it say itself :P
[19:52:35] https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Available_on_the_App_Store_%28black%29_SVG.svg/253px-Available_on_the_App_Store_%28black%29_SVG.svg.png
[19:52:51] svg thumb of an svg makes no sense, you want a png thumb of the svg
[19:53:02] * Nemo_bis is surprised how helpful and kind the error message is
[19:53:35] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:54:30] Nemo_bis: try to create a new size of https://commons.wikimedia.org/wiki/File:Athene_cunicularia_-near_Goiania,_Goias,_Brazil-8_edit.jpg
[19:55:19] (03PS1) 10Ottomata: Parameterizing 4 more replica server settings [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/114191
[19:56:25] (03PS2) 10Ottomata: Parameterizing 4 more replica server settings [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/114191
[19:56:29] matanya: done; so?
[19:56:40] i get this error
[19:56:49] I don't
[19:56:56] (03CR) 10Ottomata: [C: 032 V: 032] Parameterizing 4 more replica server settings [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/114191 (owner: 10Ottomata)
[19:57:35] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1189.199951
[19:59:11] (03PS4) 10Nemo bis: removed orion shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113637 (owner: 10Matanya)
[19:59:22] (03PS3) 10Nemo bis: remove smerritt shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113638 (owner: 10Matanya)
[19:59:34] (03PS3) 10Nemo bis: remove darrell shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113639 (owner: 10Matanya)
[19:59:43] (03PS5) 10Nemo bis: remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[19:59:56] (03CR) 10jenkins-bot: [V: 04-1] remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[20:00:18] Nemo_bis: you are breaking my commits? :P
[20:00:45] (03PS1) 10Ottomata: Updating kafka module and increasing replica.lag.max.messages to 10000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/114195
[20:02:22] matanya: more likely that they are breaking each other
[20:03:52] (03CR) 10Ottomata: [C: 032 V: 032] Updating kafka module and increasing replica.lag.max.messages to 10000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/114195 (owner: 10Ottomata)
[20:05:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:07:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:07:24] (03CR) 10Catrope: [C: 04-1] Factor out Parsoid config from VisualEditor config (034 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100 (owner: 10Jforrester)
[20:09:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:11:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:12:44] (03CR) 10GWicke: "We could actually consider enabling the parsoid PHP extension for all non-private wikis." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100 (owner: 10Jforrester)
[20:13:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:15:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:17:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:19:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:19:51] !log controlled shutdown of analytics1021 Kafka broker in order to reload configs for replica.lag.max.messages
[20:19:59] Logged the message, Master
[20:20:43] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[20:21:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:21:43] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 20.0
[20:23:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:25:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:26:33] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0
[20:26:43] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[20:27:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:29:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:00:23 PM UTC
[20:30:34] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Wed Feb 19 20:30:31 UTC 2014
[20:31:33] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2311.60331508
[20:32:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:30:31 PM UTC
[20:32:43] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1274.133301
[20:34:23] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Wed 19 Feb 2014 08:30:31 PM UTC
[20:34:52] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1831.900024
[20:39:24] !log bd808 Started scap: no-diff scap to test script changes
[20:39:32] Logged the message, Master
[20:39:50] (03PS2) 10Ori.livneh: Add an eventual consistency call for deploy.deployment_server_init [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 (owner: 10Ryan Lane)
[20:39:58] greg-g, ori: ^
[20:40:23] (03PS3) 10Ryan Lane: Add an eventual consistency call for deploy.deployment_server_init [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749
[20:41:28] * bd808 grumbles about l10n cache generation speed
[20:43:50] (03CR) 10Ori.livneh: [C: 032] "I think that with a bit of discipline it is possible to make Puppet code idempotent, so that log items in Puppet runs represent actual mod" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 (owner: 10Ryan Lane)
[20:45:20] ori: 20:44:47 INFO - Finished mw-update-l10n (duration: 04m 51s)
[20:45:25] Timing data!
[20:47:42] 20:47:03 INFO - Finished scap-1 to proxies (duration: 02m 16s)
[20:48:05] bd808: for your reading pleasure: (pdf link)
[20:48:26] bd808: doesn't have a programmatic API -- that's the biggest downside
[20:49:25] neat-o re timing data
[20:51:46] (03CR) 10Ori.livneh: "Verified on tin:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 (owner: 10Ryan Lane)
[20:51:55] ori: Our scap-1 isn't deployed to the cluster ?!
[20:52:35] bd808: argh. yeah, it's not symlinked. let me figure out why
[20:52:42] There must be another role?
[20:52:57] modules/mediawiki/manifests/sync.pp, l35
[20:55:34] we'll deploy new parsoid code in a few minutes and will likely need root help for restarts (hopefully a last time)
[20:57:08] I guess apergos is off duty by now
[20:57:42] ori, can we draft you?
[20:57:53] no
[20:57:55] :)
[20:58:01] * bd808 still has an option to pick up ori's contract
[20:58:18] greg-g, who else?
[20:58:25] * greg-g mostly kids, just doesn't want ori to get too sucked into 'random other ops things'
[20:58:39] gwicke: ask ops :)
[20:58:46] * greg-g doesn't know
[20:58:49] * gwicke checks the channel name
[20:59:01] if what you need is easy! I can help :)
[20:59:08] !log initiating controlled shutdown of analytics1022 kafka broker in order to reload configs for replica.lag.max.messages
[20:59:12] "Platform operations" != "Platform" ;)
[20:59:15] Logged the message, Master
[20:59:18] ottomata, awesome, thanks!
[20:59:30] it won't be harder than 'service restart parsoid' as root
[20:59:44] * greg-g grumbles at core-features and core and platform ops and platform and mediawiki/wikimedia/wikiwikiwkikwikiwkikwiki
[21:00:05] greg-g, for me all that matters is that somebody has the root bit to run that command
[21:00:19] * ebernhardson wishes google would stop converting wikimedia into wikipedia
[21:00:21] * greg-g nods
[21:01:02] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:01:07] and deploying
[21:01:11] ebernhardson: y'know, i'm pretty sure that the WMF is a big enough player that you could make that happen if you really wanted ;)
[21:01:12] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Wed Feb 19 21:01:07 UTC 2014
[21:01:20] gwicke: yeah, technically, ori's root privs are "limited" (per his root approval), so having him do more random things effectively needs permission from opsen. (correct me if I'm wrong here, anyone)
[21:01:41] * bd808 has seen email threads that agree
[21:01:55] not saying this fits into that, just trying to protect ori from being pulled in 25 diff directions (vs his normal 22)
[21:01:57] greg-g, I know- it's just often hard to get hold of other roots
[21:02:12] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 24.0
[21:02:17] gwicke: that's an ops problem, unfortunately, and should be coordinated with them on
[21:02:31] gwicke, where will you need me to restart parsoid?
[21:02:36] gwicke: i'd do it, but i have to run in ~10 mins, and it would be unprofessional of me if parsoid was broken to just leave
[21:02:42] ah, ottomata's on it. great.
[21:02:45] thanks ottomata
[21:02:52] yup!
[21:02:53] !log fixing firewall bastion to payments deny clause to actually deny
[21:03:02] Logged the message, Mistress of the network gear.
[21:03:13] ottomata, just restarted the parsoids; some might not come back up right and will need to be restarted again
[21:03:19] looking at ganglia currently
[21:03:20] ah ok
[21:03:28] Deploy system requirement in the making… deployers can restart services being deployed
[21:03:30] gwicke: good luck, thanks for understanding
[21:03:32] and icinga
[21:03:54] wtp1015 and wtp1024 look critical
[21:04:02] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 42.0
[21:04:24] bd808: innnnteresting
[21:04:33] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused
[21:04:53] bd808, that's on the etherpad already..
[21:04:58] :)
[21:05:02] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:05:09] bd808: when's good for you for etherpad grooming?
[21:05:09] :)
[21:05:12] PROBLEM - Kafka Broker Messages In on analytics1022 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0
[21:05:32] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused
[21:05:43] greg-g: I was going to ask you that today. Friday?
[21:06:02] ottomata, can you do a 'service restart parsoid' on wtp1015 and wtp1024?
[21:06:12] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0
[21:06:13] greg-g: Once Ori leaves on vacation I'll be semi dead in the water for a week or so
[21:06:35] * greg-g checks when that is
[21:06:49] ahh, kk
[21:06:54] gwicke: done
[21:07:10] bd808: yeah, propose a time that works for you on gcal, I'm flexible on Fridays
[21:07:18] gwicke: I see the restart requirement on the etherpad. Thanks!
[21:07:19] !log deployed Parsoid deploy c73ea9d3 and code 76e9b66
[21:07:23] (03PS1) 10Ori.livneh: Symlink remaining scap scripts to the scap repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/114337
[21:07:28] Logged the message, Master
[21:07:32] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time
[21:07:32] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time
[21:07:35] (03PS2) 10Ori.livneh: Symlink remaining scap scripts to the scap repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/114337
[21:07:38] ottomata, thanks a bunch!
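A quick way to confirm the restarted workers are serving again, as a sketch; Parsoid's HTTP port is assumed to be 8000 here, matching the icinga HTTP checks above:

    # poke the freshly restarted workers; icinga should flip the Parsoid
    # checks back to RECOVERY shortly after
    for host in wtp1015 wtp1024; do
        printf '%s: ' "$host"
        curl -s -o /dev/null -w '%{http_code}\n' "http://$host:8000/"
    done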
[21:07:51] thanks ottomata
[21:08:13] yupyup
[21:09:12] RECOVERY - Kafka Broker Messages In on analytics1022 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1747.35780864
[21:09:40] (03CR) 10BryanDavis: [C: 031] Symlink remaining scap scripts to the scap repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/114337 (owner: 10Ori.livneh)
[21:09:52] (03CR) 10Ori.livneh: [C: 032] Symlink remaining scap scripts to the scap repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/114337 (owner: 10Ori.livneh)
[21:10:52] !log During scap: snapshot2: rsync error: timeout in data send/receive (code 30) at io.c(137) [sender=3.0.9]
[21:11:00] Logged the message, Master
[21:11:02] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 511.066681
[21:12:50] 21:12:40 INFO - Finished scap-1 to apaches (duration: 25m 37s)
[21:13:12] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:14:01] bd808: also a regular occurrence (snapshot2)
[21:15:02] ori: Yeah. Apergos commented on my rt ticket from last week that those boxes are just waiting to die.
[21:21:31] * bd808 notes that scap still takes too long
[21:31:25] 21:31:12 INFO - Finished scap-rebuild-cdbs (duration: 18m 31s)
[21:31:30] !log bd808 Finished scap: no-diff scap to test script changes (duration: 52m 06s)
[21:31:38] Logged the message, Master
[21:31:47] hideous!
[21:31:53] soooo loooong
[21:31:56] !log aaron synchronized php-1.23wmf14/includes/filebackend/SwiftFileBackend.php '10ba3d7caa1e4bb4d521384bebbf42976cea4a22'
[21:32:04] Logged the message, Master
[21:32:09] seriously, a no-op shouldn't take an hour
[21:32:24] It's l10n
[21:32:26] I think
[21:33:09] And the dsh fanout maybe. The next test will tell us more.
[21:33:11] I wonder if the default should be 'no l10n updates' but you can specify "yes, we need them" (ideally, it'd "just do the right thing")
[21:33:53] we can make it perform well enough to be the default
[21:34:12] better attitude
[21:34:14] Well the best thing would be to get switched over to something that actually transmits precomputed diffs … but not yet
[21:34:14] :)
[21:34:27] I'm not 100% positive that all parsoids were actually restarted
[21:35:32] yeah, dsh shows old parsoid processes
[21:36:09] could a root run 'dsh -g parsoid service parsoid restart'?
[21:36:28] greg-g, ori: ready for the second no-op scap?
[21:36:51] aye aye
[21:36:55] !log bd808 Started scap: no-diff scap to test script changes
[21:37:04] Logged the message, Master
[21:37:11] This time it should have lots more timing data
[21:37:43] !log during scap: snapshot3: ImportError: No module named argparse
[21:37:50] Logged the message, Master
[21:37:56] ottomata, can I bug you once more?
[21:39:25] sure
[21:39:45] this should get rid of all old processes: 'dsh -M -g parsoid killall nodejs'
[21:39:56] the upstart config respawns
[21:40:17] a sleep in there would be even better
[21:40:19] (03PS1) 10Ottomata: Adding controlled-shutdown command to debian/bin/kafka [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/114341
[21:40:36] 21:40:26 INFO - Finished scap-1 to apaches (duration: 03m 11s)
[21:40:36] dsh -M -g parsoid "sleep 5; killall nodejs"
[21:40:37] on those two nodes?
[21:40:43] That's better
[21:40:43] ah
[21:40:45] on all nodes
[21:40:48] where can I run that from?
[21:40:59] tin or bast1001 afaik
[21:41:01] (which machine has the dsh groups on it?)
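Spelled out, the rolling restart under discussion, as a sketch; it relies on dsh walking the group sequentially when -c is not given, on upstart's respawn stanza bringing parsoid back after the kill, and on the standard /etc/dsh/group/parsoid group file:

    # dsh form: one host at a time, sleeping remotely between kills
    dsh -M -g parsoid "sleep 5; killall nodejs"

    # an equivalent explicit loop, sleeping locally between hosts instead
    for h in $(cat /etc/dsh/group/parsoid); do
        ssh "$h" 'killall nodejs' && sleep 5
    done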
[21:41:01] ok
[21:41:26] the important thing is doing it sequentially
[21:41:43] so that only one node at a time goes down
[21:41:47] !log bd808 Finished scap: no-diff scap to test script changes (duration: 04m 52s)
[21:41:55] Logged the message, Master
[21:42:01] but the command above will do that
[21:42:07] ok running...
[21:42:10] greg-g: {{done}}
[21:43:06] (03PS2) 10Ottomata: Adding controlled-shutdown command to debian/bin/kafka [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/114341
[21:43:19] (03CR) 10Ottomata: [C: 032 V: 032] Adding controlled-shutdown command to debian/bin/kafka [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/114341 (owner: 10Ottomata)
[21:44:10] bd808: no l10n I suppose?
[21:44:41] Nope. Which shaved 47 minutes off the scap?!
[21:44:56] :/
[21:45:02] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:45:15] That's insane.
[21:45:21] yes
[21:45:30] The real test happens tomorrow ;)
[21:45:42] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused
[21:46:14] Reedy: Yes. Did you see that my attempt to fix the new branch bootstrapping bug merged?
[21:46:42] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time
[21:47:30] ok gwicke, done
[21:47:35] i did it in a for loop rather than dsh
[21:47:44] ah, k
[21:47:49] looks great, thanks!
[21:47:49] to make sure it was sequential and the sleep wasn't doing weird stuff
[21:47:56] that way I could sleep locally rather than remotely :p
[21:48:02] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1700.133301
[21:48:25] without -c dsh is sequential
[21:48:56] i think maybe i just wasn't getting any output and was not sure what was going on there, probably could have just added a hostname && to the command to see progress :p
[21:48:58] anyway
[21:48:58] done.
[21:49:12] ah, -M should have printed the host name
[21:49:23] yeah, done ;)
[21:49:32] we are now actually seeing new code
[21:53:35] (03PS1) 10Ottomata: Reducing varnishkafka queue_buffering_max_messages from 2M to 500K [operations/puppet] - 10https://gerrit.wikimedia.org/r/114346
[21:55:28] (03PS2) 10Ottomata: Reducing varnishkafka queue_buffering_max_messages from 2M to 500K [operations/puppet] - 10https://gerrit.wikimedia.org/r/114346
[21:56:07] (03CR) 10Ottomata: [C: 032 V: 032] "The large buffer size of 2M wasn't helping anyway, and only resulted in too much memory usage and long waits for queues to be emptied." [operations/puppet] - 10https://gerrit.wikimedia.org/r/114346 (owner: 10Ottomata)
[21:58:12] PROBLEM - Varnish HTTP parsoid-frontend on cp1058 is CRITICAL: Connection refused
[21:58:42] PROBLEM - Varnish HTTP parsoid-backend on cp1058 is CRITICAL: Connection refused
[21:59:02] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:59:42] RECOVERY - Varnish HTTP parsoid-backend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.002 second response time
[22:00:12] RECOVERY - Varnish HTTP parsoid-frontend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 641 bytes in 0.005 second response time
[22:05:34] (03PS1) 10Merlijn van Deen: Make local- part of project name optional [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/114350
[22:09:05] (ignore those parsoid varnish icinga warnings for now, Roan is clearing caches for us)
[22:10:22] PROBLEM - Varnish HTTP parsoid-backend on cp1045 is CRITICAL: Connection refused
[22:10:22] PROBLEM - Varnish HTTP parsoid-frontend on cp1045 is CRITICAL: Connection refused
[22:12:57] ganglia looks unhappy, returns XML errors
[22:13:22] RECOVERY - Varnish HTTP parsoid-backend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.001 second response time
[22:13:22] RECOVERY - Varnish HTTP parsoid-frontend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 643 bytes in 0.002 second response time
[22:13:33] hah, yeah
[22:14:43] Presumably that's because the varnishstat processes on those boxes are zombies now
[22:16:12] strange that this would take out all of ganglia though
[22:16:38] There was an error collecting ganglia data (127.0.0.1:8654): XML error: Invalid document end at 1
[22:18:06] Weird
[22:18:10] I restarted gmond
[22:18:18] And it seems to just always have a zombied varnishstat child
[22:18:19] (03PS1) 10Merlijn van Deen: Follow redirects when saving the page [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/114351
[22:20:23] (03CR) 10Merlijn van Deen: "Two notes:" [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/114351 (owner: 10Merlijn van Deen)
[22:51:44] (03PS3) 10Jforrester: Factor out Parsoid config from VisualEditor config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100
[22:51:51] (03CR) 10Jforrester: Factor out Parsoid config from VisualEditor config (033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100 (owner: 10Jforrester)
[23:08:37] (03PS4) 10Jforrester: Factor out Parsoid config from VisualEditor config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114100
[23:25:07] ganglia still seems to be broken
[23:25:27] https://ganglia.wikimedia.org/latest/
[23:30:11] apergos: ^
[23:46:56] (03PS2) 10Gilles: Start sampling detailed network performance for Multimedia Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112452
[23:47:39] !log restarting gmetad on nickel
[23:47:48] Logged the message, Master
[23:51:10] * gwicke hoorays
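For the record, a sketch of the debugging this episode points at; the ps filter is one way to spot the zombied varnishstat children mentioned above, and the service names are assumptions (on Debian-style hosts the gmond init script is often called ganglia-monitor):

    # look for defunct varnishstat children hanging off gmond
    ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ && $4 == "varnishstat"'
    # bounce the collector on the affected cache host, then the aggregator
    service ganglia-monitor restart   # gmond on the varnish box with the zombies
    service gmetad restart            # on nickel, as in the !log entry above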