[00:00:32] !log reedy synchronized wmf-config/
[00:00:44] Logged the message, Master
[00:02:07] !log reedy synchronized docroot and w
[00:02:18] Logged the message, Master
[00:11:22] (PS12) Hashar: Jenkins validation (please ignore) [operations/debs/pybal] - https://gerrit.wikimedia.org/r/84932
[00:12:46] could anybody with the requisite rights restart the parsoids for me?
[00:13:42] we have some hanging workers in production, the fix for it will go out tomorrow morning
[00:13:44] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn=
[00:17:25] Ryan's commandline: salt -b 5 -G 'deployment_target:parsoid' parsoid.restart_parsoid parsoid
[00:37:32] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 00:37:24 UTC 2013
[00:38:02] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[00:47:19] (PS1) Ori.livneh: webperf: count number of CSS rules, log to Ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/90076
[00:47:28] ^ Krinkle
[00:48:37] ori-l: cool
[00:48:41] ori-l: 2 things
[00:49:31] ori-l: 1) considering this is a much nicer interface (ganglia, puppet) than jenkins, maybe I could do the error checking in here? It wouldn't give details, but at least something to look for.
[00:50:03] 2) gut tells me styleSheets might be problematic with cross-domain (throws if accessed straight), not sure if that's guarded against here and/or if this is affected by that
[00:50:30] #1 maybe track numbers for: http errors, uncaught exceptions, mw.log.deprecate traces
[00:50:56] doesn't for me in prod: Array.prototype.reduce.call( document.styleSheets, function ( total, styleSheet ) { return styleSheet.cssRules ? total + styleSheet.cssRules.length : total; }, 0 );
[00:51:14] allow origin on bits for css
[00:51:28] try debug=true when it doesn't get load.php headers
[00:51:59] I'd be happy to write that stuff (was going to from jenkins). The advantage of jenkins would be having the urls or errors in question there. We could do both and re-use the script.
[00:52:02] re: modifying this to log errors, yes, totally, i'd love to have your help with that
[00:52:09] yes, go for it
[00:52:23] unless we have a place for this script to log to, then we'd avoid Jenkins. That'd be nicer.
[00:52:40] e.g. log the stacktrace and/or url of 404 error
[00:53:05] as well as exceptions caught by RL (module state missing or error)
[00:53:24] $wmfUdp2logDest
[00:53:38] udp2log on fluorine doesn't care if the datagram comes from mediawiki or $fooscript
[00:53:51] prefix with 'js' and you get /a/mw-log/js.log
[00:53:57] Right
[00:54:10] Cool
[00:55:01] ori-l: you recommend I set up separate logs or embed the data inside of 1 log type? not sure what the convention is there.
[00:55:22] does it allow a slash (implied directory, hack hack :P)? dash would be fine otherwise
[00:59:00] ori-l: hm.. looks like something might have recently regressed in the way exceptions are logged. There's 100s of lines of wikitext in exception.log
[00:59:10] e.g. [[Category:Wikimedia policies and guidelines]]', Object(Title), Object(ParserOptions), true, true, xxx)
[00:59:21] but with 100+ lines wikitext before that
[01:00:42] I saw this before, it was some template, I think
[01:01:12] the log contains the string representation of the arguments passed to the function in which the exception was thrown
[01:01:30] and it happened to be a function that took a page of wikitext as input
[01:06:03] Ryan_Lane: could you merge https://gerrit.wikimedia.org/r/#/c/90076/ ? (tested, etc.)
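The udp2log exchange above (00:53) suggests that any script can log to fluorine by sending a prefixed datagram. A minimal sketch of that idea follows. The host, the port, and the exact `prefix message` framing are assumptions made for illustration; the log only confirms that a `js` prefix ends up in /a/mw-log/js.log and that the real destination lives in $wmfUdp2logDest.

```python
import socket

# Hypothetical destination; the real value of $wmfUdp2logDest is not shown in the log.
UDP2LOG_DEST = ("127.0.0.1", 8420)


def format_udp2log_line(prefix: str, message: str) -> bytes:
    """Build one datagram as '<prefix> <message>\\n' (framing is an assumption)."""
    # Collapse embedded newlines so that one event stays on one log line.
    flat = " ".join(message.splitlines())
    return f"{prefix} {flat}\n".encode("utf-8")


def send_udp2log(prefix: str, message: str, dest=UDP2LOG_DEST) -> int:
    """Fire-and-forget send; returns the number of bytes handed to the socket."""
    payload = format_udp2log_line(prefix, message)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, dest)
```

A caller would use something like `send_udp2log("js", "404 http://example.org/missing.js")`. UDP gives no delivery guarantee, which is presumably acceptable for best-effort error counting of this kind.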
[01:06:58] (CR) Faidon Liambotis: [C: 2] webperf: count number of CSS rules, log to Ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/90076 (owner: Ori.livneh)
[01:07:11] thanks paravoid
[01:07:19] heh
[01:07:23] was in the middle of reviewing it
[01:07:29] paravoid: shouldn't you be sleeping? :)
[01:07:31] heh, sorry :)
[01:07:33] I should
[01:07:37] talks start in 5h
[01:07:50] and it's the open source working group, so I want to go there
[01:07:56] go to sleep! :D
[01:07:57] that leaves about 3-4h of sleep
[01:08:27] but, but
[01:08:29] this is more interesting
[01:09:08] (CR) Krinkle: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/90076 (owner: Ori.livneh)
[01:10:12] what is?
[01:11:45] reviewing code? your code in particular?
[01:12:30] irc, backlog, etc.
[01:12:56] AaronSchulz: how did you count this?
[01:15:38] ori-l: yeah, I thought we summarised it though. I know we do so for sql errors and analytics (ellipsis right or middle or something like foo... or foo..bar)
[01:16:07] should be used for php trace as well I suppose. Unless we want the complete multi-line strings for debugging purposes.
[01:19:14] ori-l: Which is the ULS bug?
[01:19:40] I don't think there is one. I should probably file it.
[01:19:57] Yes, please.
[01:20:07] Every time I run .inspect() it's irking me.
[01:20:21] Krinkle: formatting is done by http://us2.php.net/manual/en/exception.gettraceasstring.php
[01:20:47] Elsie: I'm glad you're using it often enough for there to be an "every time" :P
[01:21:15] Well, I was considering bothering Krinkle with questions.
[01:21:34] Like, Special:BlankPage loads ext.visualEditor.viewPageTarget.init.
[01:21:40] And I'm not sure if that's a bug.
[01:22:17] it sounds like one
[01:22:22] ori-l: BTW, overall, I'm not sure if I'm bad at checking or what, but en.wiki seems to be pretty slim.
[01:22:45] enwiki could go on a cookie diet :D
[01:22:54] Heh.
[01:23:02] I was reading about the lemon diet today.
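The 01:15 remark about summarising long arguments ("foo..." or "foo..bar") can be illustrated with a small helper. This is only a sketch of the general idea; the function name, the default limit, and the two modes are hypothetical and are not taken from MediaWiki's own code.

```python
def summarise(text: str, limit: int = 60, where: str = "right") -> str:
    """Shorten text to at most `limit` characters (limit should be >= 3).

    where="right"  keeps the start and appends "..."  (foo...)
    where="middle" keeps both ends around ".."        (foo..bar)
    """
    if len(text) <= limit:
        return text
    if where == "middle":
        keep = limit - 2          # characters left once ".." is accounted for
        head = (keep + 1) // 2    # the start gets the extra character on odd splits
        tail = keep - head
        return text[:head] + ".." + (text[-tail:] if tail else "")
    return text[: limit - 3] + "..."
```

Applied to each stringified argument in a trace, a page of wikitext would shrink to a single short token while its start (and, in middle mode, its end) stays recognisable.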
[01:26:40] ori-l: not our version of that?
[01:26:53] I thought we didn't use getraceasstring
[01:27:16] Two Ts.
[01:27:30] That's an awfully difficult name for something.
[01:27:37] I guess until https://gerrit.wikimedia.org/r/#/c/89621/ various places do
[01:27:39] It has the word "ass" in it.
[01:27:48] Elsie: not a bad
[01:27:50] bug*
[01:28:05] The VE init script should be everywhere?
[01:28:11] not init
[01:28:14] init init
[01:28:20] can you verify?
[01:28:30] I just know what the table says.
[01:29:04] It's 6.95 KB uncompressed for ext.visualEditor.viewPageTarget.init.
[01:29:35] Yeah, confusingly ext.ve.viewpage is ve.init.mw, and ext.ve.viewpage.init is ve.init.mw.init
[01:29:36] so its fine
[01:30:05] init.mw.init is a mini module that allows us to avoid depending on page cache to change its configuration. Though not explicitly for special pages, I'd rather not make exceptions for it. E.g. if we get a surface on some special page (Special:Edit? :D ) or whatever it may be.
[01:30:18] could be removed but as it is was intentionally made unconditional.
[01:30:25] * Elsie nods.
[01:30:36] The rest of this noise is mostly ULS.
[01:30:40] should be shared in cache with non-special pages so shouldn't add any load in requests, and size is small. though 7KB is bigger then it should be
[01:30:59] than
[01:31:06] yeah, I know :P
[01:31:13] :P
[01:31:16] modules like that only add size, not requests. 7KB seems more than the 4KB I measured originally.
[01:31:37] I'll check it out tomorrow if anything stands out there, thx
[01:31:41] Dunno how .inspect() measures.
[01:31:47] Cool, thanks.
[01:37:21] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 01:37:18 UTC 2013
[01:38:01] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[02:09:26] !log LocalisationUpdate completed (1.22wmf21) at Wed Oct 16 02:09:25 UTC 2013
[02:09:43] Logged the message, Master
[02:16:35] !log LocalisationUpdate completed (1.22wmf20) at Wed Oct 16 02:16:35 UTC 2013
[02:16:47] Logged the message, Master
[02:24:20] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:11] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[02:29:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Oct 16 02:29:45 UTC 2013
[02:29:59] Logged the message, Master
[02:37:20] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 02:37:17 UTC 2013
[02:38:00] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[02:58:44] (CR) Ottomata: [C: 1] "Cool, looks good." [operations/software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/90030 (owner: Edenhill)
[03:34:18] (PS1) Ori.livneh: Use standard env var names for MW_COMMON / MW_COMMON_SOURCE dirs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/90083
[03:37:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 03:37:19 UTC 2013
[03:37:55] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[04:35:50] (CR) Yurik: [C: 1] Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - https://gerrit.wikimedia.org/r/88261 (owner: Dr0ptp4kt)
[04:37:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 04:37:19 UTC 2013
[04:37:55] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[04:53:50] (PS1) Ori.livneh: multiversion/checkoutMediaWiki: create/update 'current' / 'stable' symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/90085
[04:54:56] TimStarling: I tested that patch ^ but would appreciate a once-over whenever you have the time
[04:55:43] the dependent patch (90083) is largely cosmetic; you don't need to bother with that one
[05:38:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 05:38:47 UTC 2013
[05:38:55] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[05:58:46] (CR) Ori.livneh: "On second thought, this should be a standalone script. I'll amend the patch." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/90085 (owner: Ori.livneh)
[06:21:21] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:21] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[06:25:21] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:26:11] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[06:37:31] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 06:37:21 UTC 2013
[06:37:51] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[06:44:43] (PS2) Ori.livneh: Add updateBitsBranchPointers script to create/update 'current' / 'stable' symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/90085
[06:47:38] (CR) TTO: "Re Odder: it was what was requested - maybe the users will be told to create their accounts on their "home" language wiki, to minimise con" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/89642 (owner: TTO)
[06:51:32] (CR) Nemo bis: "Is the package on apt.wm.o now?" [operations/puppet] - https://gerrit.wikimedia.org/r/84743 (owner: QChris)
[07:01:48] (PS2) Matanya: ssh: move to a module & reorganize/cleanup [operations/puppet] - https://gerrit.wikimedia.org/r/15874 (owner: Faidon Liambotis)
[07:02:14] (CR) jenkins-bot: [V: -1] ssh: move to a module & reorganize/cleanup [operations/puppet] - https://gerrit.wikimedia.org/r/15874 (owner: Faidon Liambotis)
[07:04:47] (PS3) Ori.livneh: Add updateBitsBranchPointers script to create/update 'current' / 'stable' symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/90085
[07:15:47] (PS2) Ori.livneh: Script tweaks; use standard env var names in multiversion/checkoutMediaWiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/90083
[07:37:27] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 07:37:20 UTC 2013
[07:37:57] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[07:53:48] (PS1) ArielGlenn: absence of /usr/local/sbin on some hosts was causing puppet errors, add to base [operations/puppet] - https://gerrit.wikimedia.org/r/90091
[07:56:39] (CR) ArielGlenn: [C: 2] absence of /usr/local/sbin on some hosts was causing puppet errors, add to base [operations/puppet] - https://gerrit.wikimedia.org/r/90091 (owner: ArielGlenn)
[08:13:33] apergos: what kind of puppet errors?
[08:14:36] err: /Stage[main]/Salt::Minion/File[/usr/local/sbin/grain-ensure]/ensure: change from absent to file failed: Could not set 'file on ensure: No such file or directory
[08:14:41] also, <= 50 chars on the first line of commit messages :)
[08:15:11] on lucid? precise? both?
[08:15:36] hi paravoid. do you have time to help me out modeling ssh ?
[08:15:36] precise
[08:15:39] dunno about lucid
[08:17:37] matanya: I'm at a conference this week, I can perhaps help but only asynchronously
[08:17:53] how is the conference?
[08:18:49] interesting
[08:19:01] hm, so, the precise installs I see have /usr/local/sbin alright
[08:19:14] paravoid: actually, i wrote a module and then found out you started one, so i tried rebasing your work. not with a lot of success so far :/ https://gerrit.wikimedia.org/r/#/c/15874/
[08:19:16] and I remember debian having it as well, that's why I'm curious
[08:19:24] matanya: oh that was ages ago
[08:19:41] could be completely obsolete by now
[08:19:43] there's probably some upgrade path where the directory gets tossed, it's all I can guess
[08:19:46] feel free to ignore it
[08:20:11] only a small number of hosts had the issue
[08:20:31] ok, paravoid I think i'll start from scratch. though your design is good, and i'll grab some ideas from it
[08:30:49] (PS2) Faidon Liambotis: delete zwinger and zwinger2 from wmnet [operations/dns] - https://gerrit.wikimedia.org/r/88113 (owner: Dzahn)
[08:31:07] (CR) Faidon Liambotis: [C: 2] delete zwinger and zwinger2 from wmnet [operations/dns] - https://gerrit.wikimedia.org/r/88113 (owner: Dzahn)
[08:31:31] (PS2) Faidon Liambotis: remove references to old servers like zwinger [operations/dns] - https://gerrit.wikimedia.org/r/88120 (owner: Dzahn)
[08:32:01] (CR) Faidon Liambotis: [C: 2] remove references to old servers like zwinger [operations/dns] - https://gerrit.wikimedia.org/r/88120 (owner: Dzahn)
[08:34:52] (PS1) Faidon Liambotis: Remove ns2-old, unused for a while [operations/dns] - https://gerrit.wikimedia.org/r/90093
[08:35:05] (CR) Faidon Liambotis: [C: 2] Remove ns2-old, unused for a while [operations/dns] - https://gerrit.wikimedia.org/r/90093 (owner: Faidon Liambotis)
[08:37:22] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 08:37:21 UTC 2013
[08:37:52] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[08:41:12] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 303 seconds
[08:43:12] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 305 seconds
[09:16:08] (PS1) Yurik: Add "enable_esi" setting to the beta labs instances [operations/puppet] - https://gerrit.wikimedia.org/r/90095
[09:16:37] mark, could you +2 https://gerrit.wikimedia.org/r/90095 - it enables ESI in beta cluster
[09:19:19] paravoid: ^?
[09:37:23] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 09:37:18 UTC 2013
[09:37:53] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[09:49:03] (PS1) Matanya: ssh: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/90098
[09:49:29] (CR) jenkins-bot: [V: -1] ssh: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/90098 (owner: Matanya)
[09:52:55] akosiaris: are you around?
[09:53:43] yes
[09:58:15] can you please tell me what is the problem with production as a realm? in the change above. i see puppet complains, but fail to understand why
[10:04:00] you are missing a comma
[10:04:25] line 234
[10:04:52] and line 943
[10:05:35] morning
[10:05:37] (kind of)
[10:06:02] hashar: good (kind of) morning to you
[10:06:28] spent part of night trying to polish up a Jenkins job to build debian packages :D
[10:06:45] it is not fully done :(
[10:23:27] (PS1) Mark Bergsma: Add ulsfo data center for bits & geoiplookup [operations/dns] - https://gerrit.wikimedia.org/r/90103
[10:23:34] (CR) jenkins-bot: [V: -1] Add ulsfo data center for bits & geoiplookup [operations/dns] - https://gerrit.wikimedia.org/r/90103 (owner: Mark Bergsma)
[10:37:31] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 10:37:21 UTC 2013
[10:37:51] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[10:40:05] (PS2) Mark Bergsma: Add a ulsfo-migration map for bits & geoiplookup [operations/dns] - https://gerrit.wikimedia.org/r/90103
[10:40:13] (CR) jenkins-bot: [V: -1] Add a ulsfo-migration map for bits & geoiplookup [operations/dns] - https://gerrit.wikimedia.org/r/90103 (owner: Mark Bergsma)
[10:40:47] (PS3) Mark Bergsma: Add a ulsfo-migration map for bits & geoiplookup [operations/dns] - https://gerrit.wikimedia.org/r/90103
[10:57:55] apergos: I saw https://gerrit.wikimedia.org/r/90091. How on earth was /usr/local/sbin missing from some hosts ? How did that happen ?
[10:59:32] well I saw it on a few db hosts, no idea how they would have got that way
[11:04:09] (CR) Hashar: [C: -1] "As demonstrated during our 1/1 session you will have to add cluster_options to the role::cache::mobile class and add the cluster_options t" [operations/puppet] - https://gerrit.wikimedia.org/r/90095 (owner: Yurik)
[11:07:11] (CR) Mark Bergsma: [C: -2] "Why are you enabling ESI for upload and bits?" [operations/puppet] - https://gerrit.wikimedia.org/r/90095 (owner: Yurik)
[11:11:49] (PS2) Matanya: ssh: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/90098
[11:11:52] thanks akosiaris
[11:12:35] (CR) jenkins-bot: [V: -1] ssh: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/90098 (owner: Matanya)
[11:13:53] (PS3) Matanya: ssh: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/90098
[11:14:56] finally
[11:15:39] apergos: could you please have a look at : https://gerrit.wikimedia.org/r/#/c/86855/
[11:17:59] mark, could you help out with the "enable_esi" command? I thought i did it correctly, but it seems i have no clue how cache.pp works
[11:18:25] yurik_: you have applied it on upload and bit
[11:18:29] not on mobile :-]
[11:22:43] matanya: needs rebased maybe
[11:25:05] (PS2) Hashar: "enable_esi" setting to the mobile beta labs varnish [operations/puppet] - https://gerrit.wikimedia.org/r/90095 (owner: Yurik)
[11:26:08] (PS2) Matanya: (bug 38946) : Added culmus-fancy font to help render svg [operations/puppet] - https://gerrit.wikimedia.org/r/86855
[11:26:09] (CR) Hashar: "Removed ESI from bits and upload. Applied it on mobile by introducing a per realm $cluster_options which is passed to mobile-backend and m" [operations/puppet] - https://gerrit.wikimedia.org/r/90095 (owner: Yurik)
[11:26:37] mark, is this a good fix? Could you +2 ^
[11:26:42] yurik_: go ahead and apply PS2 on the staging instance :]
[11:26:43] for testing
[11:26:48] you don't even need to get it merged now
[11:26:56] since the staging instance is using a local puppet repo
[11:27:02] (CR) ArielGlenn: [C: 2] (bug 38946) : Added culmus-fancy font to help render svg [operations/puppet] - https://gerrit.wikimedia.org/r/86855 (owner: Matanya)
[11:28:09] thanks apergos
[11:28:12] not done
[11:29:10] yurik_: next you want to use ssh port forwarding to be able to connect to port 80 on the instance
[11:29:47] why did jenkins +1 after build succeeded instead of +2
[11:30:14] hashar: ?
[11:30:39] apergos: Maybe the person who submitted the patchset isn't on the trusted list?
[11:30:45] mh
[11:31:02] (CR) ArielGlenn: [V: 2] (bug 38946) : Added culmus-fancy font to help render svg [operations/puppet] - https://gerrit.wikimedia.org/r/86855 (owner: Matanya)
[11:33:43] mark: could you review Yurik change again ? https://gerrit.wikimedia.org/r/90095
[11:34:55] ok in the next hour that will be live
[11:35:33] yurik_: ssh -L 8080:deployment-staging-cache-mobile02.pmtpa.wmflabs:80 deployment-staging-cache-mobile02.pmtpa.wmflabs
[11:35:41] yurik_: which redirect my port 8080 to there
[11:36:04] deployment-staging-cache-mobile02.pmtpa.wmflabs 10 2013-10-16T11:35:16 0.000116825 10.4.0.52 hit/200 5533 GET http://localhost:8080/wmf_logo_with_text.png - - http://localhost:8080/ - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1
[11:37:28] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 11:37:19 UTC 2013
[11:37:37] yurik_: curl -L -H 'Host: en.m.wikipedia.beta.wmflabs.org' http://localhost:8080/
[11:37:46] yurik_: that does a query to my local port 8080
[11:37:50] ssh forward it to the staging instance
[11:37:58] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[11:38:03] (CR) Mark Bergsma: [C: 2] Add a ulsfo-migration map for bits & geoiplookup [operations/dns] - https://gerrit.wikimedia.org/r/90103 (owner: Mark Bergsma)
[11:38:14] and I pass 'Host: en.m.wikipedia.beta.wmflabs.org' so that varnish properly route the request
[11:38:29] on varnish mobile staging cache I got: deployment-staging-cache-mobile02.pmtpa.wmflabs 14 2013-10-16T11:37:21 0.000086308 10.4.0.52 hit/301 0 GET http://en.m.wikipedia.beta.wmflabs.org/ - - - - curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8y zlib/1.2.5
[11:40:07] apergos: sorry, in audio with Yuri. Which change got voted +1 ?
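The tunnel-plus-Host-header trick from 11:35 to 11:38 can also be scripted. The sketch below assumes the `ssh -L 8080:...:80` forward from the log is already running in another terminal; the function name is hypothetical. It connects to the local end of the tunnel while presenting the wiki's hostname, so that Varnish routes the request as it would for a real client. Unlike `curl -L`, it does not follow redirects.

```python
import http.client


def fetch_via_tunnel(virtual_host: str, local_port: int = 8080, path: str = "/"):
    """GET `path` from 127.0.0.1:local_port while sending `virtual_host` as Host."""
    conn = http.client.HTTPConnection("127.0.0.1", local_port, timeout=10)
    try:
        # skip_host stops http.client from adding its own "Host: 127.0.0.1:<port>".
        conn.putrequest("GET", path, skip_host=True)
        conn.putheader("Host", virtual_host)
        conn.endheaders()
        resp = conn.getresponse()
        return resp.status, dict(resp.getheaders()), resp.read()
    finally:
        conn.close()
```

Calling `fetch_via_tunnel("en.m.wikipedia.beta.wmflabs.org")` would correspond to the curl command above; judging by the `hit/301` entry in the pasted Varnish log, the status returned for `/` would be a redirect rather than a page body.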
[11:40:28] https://gerrit.wikimedia.org/r/#/c/86855/
[11:40:37] but maybe it is the trusted list, I didn't think about that
[11:42:56] yurik_: X-Analytics: zero=250-99 \O/
[11:44:29] apergos: matanaya is probably not whitelisted
[11:44:38] yeah
[11:46:27] yurik_: once approved you can run puppet on deployment-cache-mobile01.pmtpa.wmflabs and you will get ESI :-]
[11:48:52] yurik_: http://en.m.wikipedia.beta.wmflabs.org/
[11:49:16] points to deployment-cache-mobile01
[11:49:31] which runs production puppet branch :)
[11:50:11] plink.exe
[11:57:13] !g Id4ba9ca9cedfe0c18ed47cbd4ec9f508f2521fda
[11:57:13] https://gerrit.wikimedia.org/r/#q,Id4ba9ca9cedfe0c18ed47cbd4ec9f508f2521fda,n,z
[11:58:17] !log Jenkins: migrate extensions jobs to install mediawiki with sqlite with a shell script instead of inlined shell in the job configuration {{gerrit|90104}}
[11:58:32] Logged the message, Master
[12:37:24] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 12:37:17 UTC 2013
[12:37:54] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[12:50:17] (Abandoned) Hashar: ssh: move to a module & reorganize/cleanup [operations/puppet] - https://gerrit.wikimedia.org/r/15874 (owner: Faidon Liambotis)
[12:56:15] when should one use $MW_COMMON_DIR_USE/multiversion/MWScript.php rather than $MW_COMMON/multiversion/MWScript.php ?
[13:10:39] is user apache of group wikidev allowed to write to /usr/local/bin/ ?
[13:26:18] (PS1) Mark Bergsma: Add IPv6 DNS entries for cp4001-cp4020 [operations/dns] - https://gerrit.wikimedia.org/r/90113
[13:26:35] Nemo_bis: i would hope not
[13:27:27] hmm
[13:27:53] mark: so how can misc::maintenance::update_special_pages work? it tries to create a file in that dir as apache/wikidev
[13:28:05] (PS2) Mark Bergsma: Add IPv6 DNS entries for cp4001-cp4020 [operations/dns] - https://gerrit.wikimedia.org/r/90113
[13:28:34] you mean puppet (as root) creates a file with owner/group apache/wikidev? :)
[13:29:08] (CR) Mark Bergsma: [C: 2] Add IPv6 DNS entries for cp4001-cp4020 [operations/dns] - https://gerrit.wikimedia.org/r/90113 (owner: Mark Bergsma)
[13:29:08] mark: ah, right :)
[13:32:45] oh damn
[13:35:39] (PS1) Mark Bergsma: Add AAAA records to not-mgmt hostnames [operations/dns] - https://gerrit.wikimedia.org/r/90116
[13:36:30] (CR) Mark Bergsma: [C: 2] Add AAAA records to not-mgmt hostnames [operations/dns] - https://gerrit.wikimedia.org/r/90116 (owner: Mark Bergsma)
[13:36:38] (PS1) Nemo bis: Simplify misc::maintenance::update_special_pages a bit [operations/puppet] - https://gerrit.wikimedia.org/r/90117
[13:37:24] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 13:37:19 UTC 2013
[13:37:54] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[13:42:52] (PS1) Mark Bergsma: Make eqiad nameservers the default [operations/puppet] - https://gerrit.wikimedia.org/r/90118
[13:43:33] (CR) jenkins-bot: [V: -1] Make eqiad nameservers the default [operations/puppet] - https://gerrit.wikimedia.org/r/90118 (owner: Mark Bergsma)
[13:44:04] (PS2) Mark Bergsma: Make eqiad nameservers the default [operations/puppet] - https://gerrit.wikimedia.org/r/90118
[13:45:33] (CR) Mark Bergsma: [C: 2] Make eqiad nameservers the default [operations/puppet] - https://gerrit.wikimedia.org/r/90118 (owner: Mark Bergsma)
[13:58:24] (PS1) Mark Bergsma: Add HTTPS service for text in ulsfo [operations/puppet] - https://gerrit.wikimedia.org/r/90120
[13:59:53] (CR) Mark Bergsma: [C: 2] Add HTTPS service for text in ulsfo [operations/puppet] - https://gerrit.wikimedia.org/r/90120 (owner: Mark Bergsma)
[14:02:50] (PS1) Mark Bergsma: Add a HTTPS service for bits in ulsfo [operations/puppet] - https://gerrit.wikimedia.org/r/90122
[14:03:54] (CR) Mark Bergsma: [C: 2] Add a HTTPS service for bits in ulsfo [operations/puppet] - https://gerrit.wikimedia.org/r/90122 (owner: Mark Bergsma)
[14:11:09] !log jenkins : upgrading php on gallium.
[14:11:23] Logged the message, Master
[14:11:24] morebots: ?
[14:11:24] I am a logbot running on tools-exec-08.
[14:11:24] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[14:11:24] To log a message, type !log .
[14:11:26] ah
[14:11:48] !log upgrading php on lanthanum.eqiad.wmnet ( jenkins CI slave )
[14:12:01] Logged the message, Master
[14:25:48] (PS1) Mark Bergsma: Send New Zealand traffic to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/90124
[14:32:22] (CR) Andrew Bogott: [C: -1] "(5 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/90098 (owner: Matanya)
[14:33:54] (CR) Mark Bergsma: [C: 2] Send New Zealand traffic to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/90124 (owner: Mark Bergsma)
[14:37:22] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 14:37:19 UTC 2013
[14:37:52] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[14:53:16] (PS1) Mark Bergsma: Send all Oceania bits traffic to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/90128
[14:53:55] (CR) Mark Bergsma: [C: 2] Send all Oceania bits traffic to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/90128 (owner: Mark Bergsma)
[14:56:13] (PS1) Mark Bergsma: Remove erroneous comment [operations/dns] - https://gerrit.wikimedia.org/r/90130
[14:56:22] (PS1) Hashar: ci: raise tmpfs dir size from 128M to 512M [operations/puppet] - https://gerrit.wikimedia.org/r/90131
[14:56:59] mark: hey :) Could you merge in https://gerrit.wikimedia.org/r/90131 that is to raise the Jenkins tmpfs disk from 128M to 512M
[14:57:03] will remount manually
[14:57:19] (CR) Mark Bergsma: [C: 2] Remove erroneous comment [operations/dns] - https://gerrit.wikimedia.org/r/90130 (owner: Mark Bergsma)
[14:57:58] (CR) Mark Bergsma: [C: 2] ci: raise tmpfs dir size from 128M to 512M [operations/puppet] - https://gerrit.wikimedia.org/r/90131 (owner: Hashar)
[14:58:57] mark: thanks
[15:00:53] interestingly, puppet has been smart enough to remount the disk
[15:01:05] it has its moments
[15:11:52] hey mark, we're talking in analytics about the varnishncsa vs varnishkafka problem
[15:11:57] Snaps had a suggestion:
[15:12:08] make varnishkafka output to udp and kafka at the same time
[15:12:42] we could then turn off varnishncsa, run one instance of varnishkafka, and still support all the downstream consumers of udp2log data without modifying them
[15:12:43] that's fine - temporarily
[15:12:56] well, yes, temporarily, we'd like to disable udp2log completely
[15:13:08] yeah
[15:13:11] but scheduling that is hard, and probably not going to happen within 6 months to a year
[15:13:16] I think that's exactly why paravoid added support for that
[15:13:23] (he did?
[15:13:24] )
[15:13:26] udp?
[15:13:27] I didn't
[15:13:32] you didn't?
[15:13:37] why do I remember that then?
[15:13:48] paravoid was playing with adding support for bson
[15:13:48] anyway :)
[15:13:54] oh that was it
[15:14:34] anyway, yeah, if we were able to turn off varnishncsa and just run one instance of varnishkafka that outputs to both kafka and udp2log...
[15:14:46] could we do that temporarily…but a long temporarily?
[15:14:50] 6 months to a year
[15:14:59] why that long?
[15:16:28] if we remove data from udp2log stream, we have to make sure we can support all of the current analysis that is using that data
[15:16:41] for mobile, that is kind of feasible
[15:16:46] but some of the uses are a little hard
[15:16:50] write a simple kafka to udp converter script?
[15:16:54] oh I did!
[15:16:57] :)
[15:17:00] https://gerrit.wikimedia.org/r/#/c/86894/
[15:17:02] what's the issue then? :)
[15:17:11] faidon doesn't like it? :p
[15:17:12] ok
[15:17:13] well
[15:17:24] this discussion was started because I was writing an email summarizing all of this
[15:17:41] i'll finish that and we can continue discussion there
[15:17:51] paravoid: why not? :)
[15:18:04] I didn't say I don't like it
[15:18:09] I said that I want us to make an informed decision
[15:18:22] so I want to know how temporary or not this solution is
[15:18:36] if it's very temporary, maybe it isn't worth it and we should just keep varnishncsa running
[15:18:46] if it's not, then we need to find another solution
[15:18:46] if the kafka to udp script is on the analytics side, it can be permanent for all i care
[15:18:57] like the one proposed
[15:19:06] what does "the analytics side" mean? :)
[15:19:10] it's all us now, isn't it
[15:19:20] "on infrastructure I don't care about" :)
[15:19:46] i.e. not core-ops
[15:20:10] RECOVERY - MySQL Replication Heartbeat on db1046 is OK: OK replication delay 0 seconds
[15:20:31] i.e. on running on an analytics box
[15:20:32] yeah
[15:20:42] that way it doesn't impact hardware
[15:20:49] caching performance
[15:20:51] bleebla
[15:20:57] yeah, paravoid, i'm writing up all that information
[15:21:07] and snaps just came up with that idea, so I'll include that as an option too
[15:21:09] dunno, if we own it, I'd prefer at least being aware of the plan
[15:21:16] yeah
[15:21:42] yeah, not sending it across the wire twice is also good
[15:21:54] oo, true, yeah
[15:22:09] hm, well in eqiad its still duplicated, right? but you're saying not sending it across datacenters twice?
[15:22:17] yeah but that matters less
[15:22:23] indeed
[15:22:23] yeah
[15:23:07] I don't have the assumption that the analytics/hadoop clusters are for running efficient jobs/queries, so that's ok with me
[15:23:20] then analytics can decide whether they want to optimize that part or not
[15:26:55] ah yay :)
[15:39:20] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 15:39:10 UTC 2013
[15:40:00] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[15:56:02] (PS1) Cmjohnson: Removing dns entries for mc1-16, sq33,35,41,42,45 and their mgmt ip's (chrisJ) [operations/dns] - https://gerrit.wikimedia.org/r/90142
[15:57:54] (PS2) Cmjohnson: Removing dns entries for mc1-16, sq33,35,41,42,45 and their mgmt ip's (chrisJ) [operations/dns] - https://gerrit.wikimedia.org/r/90142
[15:59:48] (CR) Cmjohnson: [C: 2] Removing dns entries for mc1-16, sq33,35,41,42,45 and their mgmt ip's (chrisJ) [operations/dns] - https://gerrit.wikimedia.org/r/90142 (owner: Cmjohnson)
[16:01:35] !log authdns update
[16:01:49] Logged the message, Master
[16:28:08] Reedy: you around today on a normal-ish schedule?
[16:28:17] something like that
[16:28:35] cool, I assume we'll need a hand from you to fix this https://bugzilla.wikimedia.org/show_bug.cgi?id=55779
[16:28:40] Reedy: so what happened with the frwiki collaction?
[16:28:47] collation*
[16:28:54] jon has a patch, pleasestand and him and making it better, but once it's ready, it's Reedy time ;)
[16:29:12] twkozlowski: it died
[16:29:19] I don't think jon deploys much code (why I'm asking you)
[16:29:37] it died with a database error, which looks interesting.
[16:30:16] Master switch
[16:30:18] Not interesting
[16:30:21] after almost 58 hours of running, by the way
[16:30:32] Oh, ok.
[16:31:04] I hadn't checked on it in a couple of days [16:37:24] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 16:37:18 UTC 2013 [16:37:54] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [16:37:57] !log demon synchronized php-1.22wmf20/includes/DefaultSettings.php 'I115f8cf9' [16:37:59] Reedy: can you get me a stack trace for wikidata? [16:38:07] Unexpected non-MediaWiki exception encountered, of type "Wikibase\Lib\FormattingException" [16:38:11] Logged the message, Master [16:38:33] !log demon synchronized php-1.22wmf20/includes/LinksUpdate.php 'I115f8cf9' [16:38:44] Logged the message, Master [16:39:13] <^d> Reedy: I can't do a `git fetch --all` on php-1.22wmf21. Complains about permissions on .git/objects [16:39:33] Not usually me to blame.. [16:39:49] aude: [16:39:49] reedy@fluorine:/a/mw-log$ grep FormattingException fatal.log [16:39:49] reedy@fluorine:/a/mw-log$ grep FormattingException exception.log [16:39:50] reedy@fluorine:/a/mw-log$ [16:39:51] When? [16:39:56] one sec [16:40:04] look again [16:40:10] for https://www.wikidata.org/wiki/Q4489080 [16:40:26] reedy@tin:/a/common/php-1.22wmf21$ git fetch --all [16:40:26] Fetching origin [16:40:26] Permission denied (publickey). [16:40:26] fatal: The remote end hung up unexpectedly [16:40:26] error: Could not fetch origin [16:40:50] bah, capitals [16:41:15] ^d: Blame bd808 [16:41:20] drwxr-xr-x 2 bd808 wikidev 4096 Oct 15 16:33 d8 [16:41:29] bd808: Your umask is wrong! [16:41:33] cc mutante ^ :( [16:41:42] Reedy: ack. What did I f up? [16:42:04] ottomata: what are all the things that expect UDP? [16:42:27] bd808: Not exaclty f up... did you replace your bashrc with your own? [16:42:43] <^d> Reedy: Also, https://gerrit.wikimedia.org/r/#/c/89998/ :) [16:42:46] umask 002 [16:42:49] Reedy: I did and apparently didn't notice the umask [16:43:39] Jeff_Green: is aluminium in the fundraising cluster? 
[16:43:48] Need to do chmod -R g+w on a handful of .git/object folders [16:43:58] Did you touch both current versions? [16:44:00] andrewbogott: yes. once we cut over to barium it will be moved into frack [16:44:29] Ah, you must've just grabbed that ticket... [16:44:33] guess I'll leave it to you :) [16:44:50] i didn't. this is a fast evolving request :-P [16:45:16] aude: It doesn't surface in the exception logs [16:45:28] hmmmm [16:45:29] andrewbogott: i got the same by IRC and said "you already get 2.7 on barium" [16:45:59] i wonder if that is because it's not a MWException? [16:46:03] There are a few tickets here which are marked 'new' but owned by you, which I didn't know could happen [16:46:18] Reedy: Yes I touched 20 and 21 [16:46:46] Reedy: https://github.com/bd808/wmf-kanban/issues/17#issuecomment-26353910 [16:47:42] 43 Warning: Exception caught in Message::__toString (message returnto): exception 'DBConnectionError' with message 'DB connection error: Unknown error (10.64.16.16)' [16:47:55] db1027 [16:48:56] noisy logs are noisy : [16:48:57] :( [16:49:41] andrewbogott: you can do that when you create a ticket in the webUI [16:52:04] PROBLEM - MySQL Processlist on db1027 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 848 statistics [16:53:07] Unsticking gluster is so trivial, it's weird that gluster doesn't do it itself [16:53:35] is there an auto-unstick knob that's tweaked the wrong way? [16:54:01] possibly. There actually was something like that, called 'quorum' that was the "Please don't corrupt my data" bit [16:54:05] RECOVERY - MySQL Processlist on db1027 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics [16:54:13] Things work much better with that flag set. [17:06:12] mark, mutante, is it useful to keep this ticket open? https://rt.wikimedia.org/Ticket/Display.html?id=9 [17:07:08] andrewbogott: i see no reason to close it really [17:07:16] would be nice to have that [17:07:29] ok. 
[17:07:47] it's certainly not a high prio ticket by any means ;) [17:08:27] Yeah, I was just trying to clean up 'stalled' tickets but I guess 'stalled' is an accurate description in this case. [17:08:46] ori-l: Did my email thread w/max about eventlogging make sense? Is there an easy fix for that? [17:08:54] * andrewbogott wants to fix /that/ problem so he can break something else [17:09:11] <^d> Reedy: Does bd808 need to fix, or a root? [17:09:26] He should be able to fix it [17:09:42] * ^d pokes bd808 with a stick [17:09:47] andrewbogott: um, looking [17:10:09] hey Jeff_Green [17:10:16] ya [17:10:22] does FR use the udp2log firehose? [17:10:25] andrewbogott: oh, this has nothing to do with eventlogging [17:10:35] trying to remember... [17:10:40] ottomata: yeah, for banner impression logs [17:10:53] andrewbogott: what and how puppet logs is a weird union of agent and master configs, very bizarre [17:11:00] iirc fr uses a collector on gadolinium [17:11:11] yes ok, i rememb er now [17:11:25] ok, and are any of the requests you save from mobile? [17:11:36] andrewbogott: i guess enabling reporting also had the effect of enabling disk-based reporting, but i don't know why. let me look at the docs again for a minute or six. [17:12:06] ori-l: OK… I didn't investigate deeply, just saw that puppet was trying to install a file that wasn't there [17:12:31] wait, what? [17:12:39] what issue are you referring to, exactly? [17:12:46] Jeff_Green: ^^^^ [17:12:54] ottomata: I assume so [17:12:59] interesting ok. [17:13:00] ^ andrewbogott [17:13:02] thanks [17:13:26] ori-l: puppet doesn't run cleanly on labs instance deployment-eventlogging [17:13:52] (03PS1) 10Cmjohnson: Removing mc1-16 from dsh group, dhcpd, ganglia.pp, decom.pp remvoing oc1-3 from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/90157 [17:14:04] oh. no, i didn't realize that's what you were talking about. lemme look. [17:14:21] ori-l, email thread with unhelpful subj 'lab rat needed!' 
[17:14:51] yes, that's not a great subject, is it :P [17:15:10] The conversation is pretty incoherent as well :) [17:15:44] republican marshmallow inspiration safari [17:16:57] ^d, Reedy: Need me to fix it? [17:16:57] oh, git deploy to labs works now? [17:17:16] <^d> bd808: Yes, please :) [17:17:42] (03CR) 10Cmjohnson: [C: 032] Removing mc1-16 from dsh group, dhcpd, ganglia.pp, decom.pp remvoing oc1-3 from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/90157 (owner: 10Cmjohnson) [17:18:00] ori-l: um… no idea [17:18:49] ^d: Just chmod g+w in .git? [17:19:01] <^d> Should do it. [17:19:09] <^d> Well, with -R :) [17:19:56] ^d: {{done}} [17:20:11] <^d> Thanks [17:20:13] ori-l: istr git-deploy worked well in labs and tanked in prod [17:21:03] !log demon synchronized php-1.22wmf21/includes/DefaultSettings.php 'I115f8cf9' [17:21:13] Logged the message, Master [17:21:28] !log demon synchronized php-1.22wmf21/includes/LinksUpdate.php 'I115f8cf9' [17:21:39] Logged the message, Master [17:23:10] andrewbogott: fixed [17:23:23] !log Creating/setting ACLs on eqiad swift containers for all wikis [17:23:36] Logged the message, Master [17:23:38] ori-l: need me to review a patch? Or was it somehow not puppet-related? [17:24:35] andrewbogott: the puppet manifest was referencing files in the eventlogging git-deploy target dir, but git-deploy was not deploying to it [17:24:52] ah, ok. [17:24:54] thanks for fixing! [17:25:03] i don't know how git-deploy on labs is supposed to work so for now i just git-cloned the repo from gerrit into the deployment target location [17:25:06] so it's a half-fix [17:25:30] but i'll look into how git-deploy works in labs later [17:26:33] what's the labs saltmaster? [17:26:47] oh, i found it: i-00000390.pmtpa.wmflabs [17:31:40] greg-g: can I get a slot to update CirrusSearch? 
I [17:32:00] I'll need to rebuild indexes after the update which will take a while but won't bother anyone during the process [17:32:24] so it won't take long [17:32:45] Reedy: backport & deploy plz: https://gerrit.wikimedia.org/r/90133 [17:32:47] greg-g: ^ [17:33:10] You can create the backport :p [17:33:28] Just w1? [17:33:30] *21 [17:34:52] Reedy: i think, yes [17:34:56] it's a recent regression [17:34:59] greg would know? [17:35:51] Reedy: I can deploy it [17:36:02] Reedy: backport https://gerrit.wikimedia.org/r/90169 [17:36:06] ori-l: ^ [17:36:19] jdlrobson: this is just 1.21? ^ [17:37:05] Reedy: this one is for you, though: https://gerrit.wikimedia.org/r/#/c/90085/ [17:37:24] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 16 17:37:20 UTC 2013 [17:37:37] MatmaRex: 1.22wmf21 [17:37:54] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [17:38:02] Reedy: yeah, definitely just 21, there's no such file in 20 [17:38:17] ori-l: Feel free to deploy it if you want [17:38:36] which one? [17:38:45] https://gerrit.wikimedia.org/r/#/c/90108/ also :) [17:39:02] The core backport [17:39:17] Reedy: ok, doing so [17:40:34] (03PS6) 10Andrew Bogott: Move mysql_wmf into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88666 [17:40:48] aude: your change is empty [17:40:55] huh [17:41:05] as such it is already deployed [17:41:05] it is submodule change [17:41:12] https://gerrit.wikimedia.org/r/#/c/90108/3/extensions/Wikibase,unified [17:41:18] oh [17:43:50] !log olivneh synchronized php-1.22wmf21/skins/vector/screen.less 'I412575f37fix for bug 55779' [17:44:04] Logged the message, Master [17:44:32] !log olivneh synchronized php-1.22wmf21/resources/startup.js 'Touch startup.js' [17:44:38] ori-l: rtl looks good now :) [17:44:45] Logged the message, Master [17:45:11] cool, thanks ori-l [17:45:49] yep. i am syncing the bug-fixing submodule update too, yes? [17:46:00] I believe so? 
[17:46:03] would be great [17:46:11] sure, no problem [17:46:18] both bug fixes are trivial but not good to have on wikidata [17:46:24] without fixes [17:52:31] aude: what is the actual bugfix changeset? [17:52:39] the submodule bump describes it but doesn't link to it [17:52:57] i should link them in the future [17:53:40] https://gerrit.wikimedia.org/r/#/c/90158/ [17:53:44] https://gerrit.wikimedia.org/r/#/c/90101/ [17:54:00] first one, that line of code got lost in rebase [17:54:17] second changes one character in the code :) [17:54:25] one important character [17:57:14] !log olivneh synchronized php-1.22wmf21/extensions/Wikibase 'Update Wikibase to 8903c2234d' [17:57:20] ^ aude [17:57:24] Logged the message, Master [17:57:25] it works! [17:57:26] !log stopping spamassassin on williams [17:57:32] https://www.wikidata.org/wiki/Q4489080 [17:57:37] Logged the message, Master [17:57:37] MaxSem: Was the db empty before as well? Or did the new puppet change break things? [17:57:44] thanks for deploying this [17:57:53] no problem [17:58:14] Reedy: so, does the updateBitsBranchPointers change look OK? [17:58:16] andrewbogott, no idea, actually - I did some puppet work related to that instance but it was launched by Ori:) [17:58:41] which instance are you discussing? [17:58:54] deployment-eventlogging [17:59:20] what db are you talking about? [18:00:14] well, mysql is empty on that VM, it has only information_schema and test [18:00:21] the latter is an empty DB [18:01:09] yes, so, what's the actual problem? [18:01:23] ori-l: that require() looks funny for launching a method some reason [18:01:49] ori-l: There isn't a problem, exactly… the issue is that we're testing a change to the mysql puppet manifest, and I want to know whether or not it breaks things. 
[18:01:54] *for some reason [18:01:55] andrewbogott: no, it's fine [18:01:56] Using your instance as a lab rat, hence email subject [18:02:13] Aaron|home: I looked for a php equivalent to if __name__ == '__main__' but there isn't a nice one available imo [18:02:15] But it sounds like in this case we've learned nothing since you weren't using the db anyway :) [18:02:58] !log shutting down williams (former OTRS), stab [18:03:09] Logged the message, Master [18:03:16] ori-l: check argv[0] [18:03:22] * Aaron|home grins evily :p [18:03:33] Aaron|home: and compare it with? [18:04:29] basename( $_SERVER['SCRIPT_FILENAME'] ) ? ew. [18:04:44] PROBLEM - Host williams is DOWN: PING CRITICAL - Packet loss = 100% [18:04:53] (03PS1) 10coren: Tool Labs: make the webservice nodes sumbit nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/90173 [18:11:00] ori-l: it's funny to get a linkedin email from a recruiter with a photo like http://www.linkedin.com/profile/view?id=17213838 [18:11:21] (03PS1) 10Dzahn: remove williams from DHCP and dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/90175 [18:11:23] Aaron|home: i've worked for cats before, they're assholes [18:11:39] speciest? [18:12:29] (03CR) 10coren: [C: 032] Tool Labs: make the webservice nodes sumbit nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/90173 (owner: 10coren) [18:13:07] (03CR) 10Dzahn: [C: 032] remove williams from DHCP and dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/90175 (owner: 10Dzahn) [18:13:31] Aaron|home: would you prefer https://dpaste.de/I8PE/raw/ btw? if so, i'll submit it [18:14:31] hmm, not as evil looking as I would have imagined [18:15:22] i'll take it [18:16:10] Reedy: I guess a question for you, too, is, are you actively looking at https://bugzilla.wikimedia.org/show_bug.cgi?id=38865 on a (semi-) regular basis? 
[18:17:35] dr0ptp4kt: fyi, I am on the ops list too :) [18:17:50] i've only been loosely following that thread though [18:18:25] ottomata, cool, thx [18:18:26] I mentioned your concern to mark yesterday, and he said that he was aware of the problem? not sure. [18:18:31] good to follow up though, thanks [18:18:56] greg-g: Yeah, I tend to keep an eye on it [18:19:36] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:30] Reedy: awesome. [18:22:53] greg-g: It's rare we actually get a true blocker [18:23:21] (03PS2) 10Dzahn: remove "williams" from DNS, RT #5908 [operations/dns] - 10https://gerrit.wikimedia.org/r/88147 [18:24:48] (03CR) 10Dzahn: [C: 032] "rebased, williams is shutdown now" [operations/dns] - 10https://gerrit.wikimedia.org/r/88147 (owner: 10Dzahn) [18:25:21] !log DNS update - remove williams [18:25:32] Logged the message, Master [18:30:16] (03PS4) 10Ori.livneh: Add updateBitsBranchPointers script to create/update 'current' / 'stable' symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90085 [18:30:31] ^ Aaron|home, BUT NOT REEDY [18:31:04] Reedy: yah [18:31:53] manybubbles: sorry, getting to you... [18:32:07] manybubbles: needed today or next week? 
[18:34:23] (03PS3) 10Ori.livneh: Script tweaks; use standard env var names in multiversion/checkoutMediaWiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90083 [18:34:36] also not reedy under any circumstances [18:36:18] (03CR) 10Reedy: [C: 04-2] Add updateBitsBranchPointers script to create/update 'current' / 'stable' symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90085 (owner: 10Ori.livneh) [18:36:37] hah [18:36:41] fun [18:38:23] * Reedy is being productive [18:41:25] http://hire.jobvite.com/Jobvite/Job.aspx?j=oIcUXfwv&c=qSa9VfwQ : s/Prophicency/Proficiency/ [18:42:19] greg-g: no rush [18:42:20] awight: the ideal candidate combines competence in the language with oracular powers [18:43:40] nice ;) experience in Delphi or Crystal Reports a plus [18:44:33] (03CR) 10Ottomata: "For more info see: https://wikitech.wikimedia.org/wiki/Analytics/Kafka_Udp2log" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86894 (owner: 10Ottomata) [18:46:34] (03CR) 10Reedy: [C: 032] Add updateBitsBranchPointers script to create/update 'current' / 'stable' symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90085 (owner: 10Ori.livneh) [18:46:51] Reedy: thank you :) [18:47:24] (03Merged) 10jenkins-bot: Add updateBitsBranchPointers script to create/update 'current' / 'stable' symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90085 (owner: 10Ori.livneh) [18:47:53] manybubbles: gotcha, then, I'll add it to... Monday morning, the 21st, at say 9am pacific? [18:48:05] (03PS4) 10Reedy: Script tweaks; use standard env var names in multiversion/checkoutMediaWiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90083 (owner: 10Ori.livneh) [18:48:22] (03CR) 10Reedy: [C: 032] Script tweaks; use standard env var names in multiversion/checkoutMediaWiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90083 (owner: 10Ori.livneh) [18:49:31] greg-g: great! 
[18:49:57] * Aaron|home wants frosted flakes now [18:50:25] (03Merged) 10jenkins-bot: Script tweaks; use standard env var names in multiversion/checkoutMediaWiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90083 (owner: 10Ori.livneh) [18:50:47] (03PS2) 10Reedy: Bug 39482 - Rename "chapcomwiki" to "affcomwiki" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/53922 [18:50:52] Aaron|home: oh man, now me too! jerk [18:51:04] manybubbles: noted [18:51:07] (well, going to note) [18:51:50] {{done}} [18:52:10] (03CR) 10Reedy: [C: 04-1] "Tagging -1, can't be currently done due to ES related issues" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/53922 (owner: 10Reedy) [18:52:48] (03CR) 10jenkins-bot: [V: 04-1] Rename "chapcomwiki" to "affcomwiki" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/53922 (owner: 10Reedy) [18:53:51] (03CR) 10Dzahn: [C: 032] Rename www portals files to drop wikipedia from name [operations/apache-config] - 10https://gerrit.wikimedia.org/r/89760 (owner: 10Reedy) [18:54:23] :D [18:55:29] (03PS3) 10Reedy: Rename "chapcomwiki" to "affcomwiki" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/53922 [18:55:46] (03CR) 10Reedy: [C: 04-1] "Rebased and readding -1" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/53922 (owner: 10Reedy) [18:56:16] (03PS8) 10Jforrester: wikibugs: Set up #mediawiki-visualeditor [operations/puppet] - 10https://gerrit.wikimedia.org/r/37570 [18:58:05] !log updated Parsoid to 9b52271 [18:58:14] Logged the message, Master [18:58:56] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [18:59:03] (03PS9) 10Reedy: Simplify wikimania apache confs, reuse wikimedia.org docroot. [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84707 [18:59:23] (03CR) 10Reedy: [C: 04-1] "Damn it, wikimania team can't go into the same VHOST due to the HTTPS redirect. 
Will fix" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84707 (owner: 10Reedy) [19:00:40] (03PS10) 10Reedy: Simplify wikimania apache confs, reuse wikimedia.org docroot. [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84707 [19:04:15] (03CR) 10Nemo bis: "Commit message is unclear: does this just *add* messages to #mediawiki-visualeditor or also remove them from somewhere else (#mediawiki)?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/37570 (owner: 10Jforrester) [19:06:05] (03PS11) 10Reedy: Simplify wikimania apache confs, reuse wikimedia.org docroot. [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84707 [19:06:32] (03CR) 10Jforrester: "This PS does nothing more than let wikibugs write to #mediawiki-visualeditor; you need to investigate I04e82cb392 to see where the message" [operations/puppet] - 10https://gerrit.wikimedia.org/r/37570 (owner: 10Jforrester) [19:08:36] (03CR) 10Nemo bis: "I already read the other commit and the docs above the lines it changes, they don't explain anything." [operations/puppet] - 10https://gerrit.wikimedia.org/r/37570 (owner: 10Jforrester) [19:09:31] (03PS12) 10Reedy: Simplify wikimania apache confs, reuse wikimedia.org docroot. [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84707 [19:10:27] (03CR) 10Bartosz Dziewoński: "I definitely hope it moves them instead of copying. wikibugs doesn't belong in #mediawiki at all, it should sit in #wikimedia-dev." [operations/puppet] - 10https://gerrit.wikimedia.org/r/37570 (owner: 10Jforrester) [19:14:22] (03CR) 10Nemo bis: "MatmaRex, your comment seems to be unrelated, this commit is not about a general move of wikibugs AKA bug 46144, AFAIK." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/37570 (owner: 10Jforrester) [19:14:30] (03PS1) 10Reedy: Remove nagios.conf [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90179 [19:17:07] (03PS2) 10Reedy: Remove nagios.conf and ganglia.conf [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90179 [19:19:07] (03CR) 10Bartosz Dziewoński: "I know. I am approaching that bug the Cato way ;)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/37570 (owner: 10Jforrester) [19:37:44] (03PS1) 10Edenhill: Added rate-limiting to (most) error logs generated by varnishkafka (issue #1) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/90184 [19:42:35] is anybody around with Pybal experience? After a Parsoid deployment load balancing seems to be off a bit: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn= [19:43:02] greg-g: I have a VE cherry-pick and an accompanying WikimediaEvents change to sync, when would be a good time? [19:43:02] nearly half the nodes don't seem to get traffic 10 minutes after the deployment [19:43:20] (03CR) 10Faidon Liambotis: [C: 04-1] "This won't work. Access-Control-Allow-Headers should be sent to the preflight OPTIONS request. a) cors.py four lines above skips everythin" [operations/puppet] - 10https://gerrit.wikimedia.org/r/89238 (owner: 10Aaron Schulz) [19:43:43] /var/log/pybal.log on parsoid.svc.eqiad.wmnet should have more info [19:43:50] (I can't see that) [19:44:02] gwicke: looking [19:44:19] paravoid: thanks! [19:45:35] (03Abandoned) 10Aaron Schulz: Add Range header support to CORS headers [operations/puppet] - 10https://gerrit.wikimedia.org/r/89238 (owner: 10Aaron Schulz) [19:46:47] Aaron|home: I didn't say no! 
:) [19:47:17] I said I'm not sure, I wanted to investigate [19:48:25] gwicke: pybal looks fine [19:48:37] my guess would be that Varnish does keepalives here [19:48:47] and has connections open to the servers that look as busy [19:49:04] and pipelines requests to those servers [19:50:43] (03PS1) 10Cmjohnson: Removing servers: arsenic, niobium, palladium, strontium from puppet dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/90188 [19:50:51] springle: around? [19:50:55] paravoid: makes sense- IIRC the load balancing is done with a backend LVS, and once the varnishes have enough connections they are probably lazy to open additional ones [19:51:03] I tcpdumped on wtp1001 which looks relatively idle in graphs and it get requests and responds just fine [19:51:44] it just gets 1/10th [19:52:20] there might be a knob in the varnish config that could improve this [19:52:23] hm, let me verify that hypothesis [19:53:09] it is not the first time I have seen this, was just wondering why [19:53:33] (03CR) 10Cmjohnson: [C: 032] Removing servers: arsenic, niobium, palladium, strontium from puppet dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/90188 (owner: 10Cmjohnson) [19:54:04] gwicke: were all servers restarted as part of your deployment? [19:54:19] paravoid: yes, through salt [19:54:36] hrm [19:55:27] ps aux shows the same start times both on the idleish and the busy machines [19:55:32] ori-l: well, not right now, as stuff seems broken, so... 2? [19:56:01] OK; they'll be staged until then. What's broken? [19:56:41] translate [19:56:44] see -tech [19:57:12] it's been broken for at least a few days [19:57:32] oh, well then, carry on [19:57:56] greg-g: let me know when I'm clear to start running the purges [19:58:01] bd808: I believe you are [19:58:15] bd808: was there any other feedback from Ops list I should know about? 
[19:58:26] like "don't do it" :) [19:58:54] greg-g: No, they seemed cool with it after the rate limit was added [19:58:54] are you not on the ops list? [19:59:28] ori-l: I am, just making sure I didn't miss something [19:59:32] bblack: around? greg-g says I can start running purges [19:59:54] bd808: yes, I'm here [20:00:10] I guess we'll watch ganglia for overflows, etc [20:00:29] but really I expect this to be painless [20:00:54] Seems reasonable. [20:01:25] ok. here goes nothing [20:01:30] greg-g: aside, it's a serious problem that it has been ongoing for several days, but not related to my change. Should I still hold off? [20:02:04] paravoid: the idle machines seem to slowly pick up some traffic [20:02:16] about 20 minutes after the restart [20:02:17] bblack: HTCP packets should be flowing [20:02:25] ori-l: right. so, let's give bd 5 minutes to break the site, we'll assume all's well at 1:10 (ok, 8 minutes), and you can proceed. [20:02:34] got it, thanks. [20:03:26] Script is working through the be* wikis now [20:03:34] bg* ... [20:04:13] (03PS1) 10Dzahn: fix redirects for jobs/careers.wm.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90191 [20:04:32] cawiki …. [20:04:55] well the traffic goes multicast to everywhere, so we can see the effects anywhere too [20:05:18] * ori-l redirects bd808 to /dev/null [20:05:29] bblack: "supposed" to only be talking to 91.198.174.113:4827 [20:05:57] oh right, that's the esams relay? [20:06:16] bblack: yes, the esams side of the relay [20:07:16] bd808: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=&vl=&x=&n=&hreg[]=cp3%5B0-9%5D%2B.esams&mreg[]=vhtcpd_inpkts_sane&gtype=stack&glegend=show&aggregate=1&embed=1&_=1381953966183 [20:07:51] honestly, I'm not even sure why the graph values that are several times higher than the legend's min/avg/max though :) [20:08:14] bblack: graph is stacked [20:08:24] oh, right, duh [20:08:36] (03CR) 10Dzahn: [C: 032] "full ack.
ganglia is on nickel and nagios is gone" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90179 (owner: 10Reedy) [20:08:40] in any case, you're sending at 200/s, it should be visible [20:08:59] bblack: hmmm... [20:09:15] it may take some time to filter through the monitoring, I didn't mean visible *right now* [20:09:43] The script is logging a lot of sends, but … udp [20:10:02] 2h may be a better view anyways (it's too spiky for the 1h): http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&title=&vl=&x=&n=&hreg[]=cp3%5B0-9%5D%2B.esams&mreg[]=vhtcpd_inpkts_sane&gtype=stack&glegend=show&aggregate=1&embed=1&_=1381954158299 [20:11:31] * bd808 uses a local exploit to symlink /dev/null to ori-l's tty [20:12:24] (03PS1) 10Andrew Bogott: Pick the proper mysql version, depending on distro [operations/puppet] - 10https://gerrit.wikimedia.org/r/90193 [20:14:23] !log removing unused apache config files via dsh, they have been merged "D" but sync-apache doesn't delete things [20:14:36] Logged the message, Master [20:14:42] Reedy: ^ [20:15:17] (03CR) 10Andrew Bogott: [C: 032] Pick the proper mysql version, depending on distro [operations/puppet] - 10https://gerrit.wikimedia.org/r/90193 (owner: 10Andrew Bogott) [20:15:34] bd808: can you give me some chunk of hostnames it's working through for a while currently or whatever, to loosely match traces on a cp machine of the live purges?
[20:16:04] bblack: it's doing commons.wikimedia.org at the moment [20:16:34] Logs are on terbium in ~bd808/projects/purge [20:16:35] 20:16:21.106704 recvfrom(7, "\0\\\0\0\0V\4\0\0\0\0\t\0\0\0\4HEAD\0008http://commons.wikimedia.org/wiki/User:Cyberbot_I/Run/AR\0\10HTTP/1.0\0\0\0\2", 4096, 0, NULL, NULL) = 92 [20:16:39] 20:16:21.106751 recvfrom(7, "\0x\0\0\0r\4\0\0\0\0\n\0\0\0\4HEAD\0Thttp://commons.wikimedia.org/w/index.php?title=User:Cyberbot_I/Run/AR&action=history\0\10HTTP/1.0\0\0\0\2", 4096, 0, NULL, NULL) = 120 [20:16:43] 20:16:21.111823 recvfrom(7, "\0f\0\0\0`\4\0\0\0\0\3\0\0\0\4HEAD\0Bhttp://commons.wikimedia.org/wiki/Category:The_Wellcome_Collection\0\10HTTP/1.0\0\0\0\2", 4096, 0, NULL, NULL) = 102 [20:16:47] 20:16:21.111890 recvfrom(7, "\0\202\0\0\0|\4\0\0\0\0\4\0\0\0\4HEAD\0^http://commons.wikimedia.org/w/index.php?title=Category:The_Wellcome_Collection&action=history\0\10HTTP/1.0\0\0\0\2", 4096, 0, NULL, NULL) = 130 [20:16:51] that stuff? [20:17:04] !log olivneh synchronized php-1.22wmf20/extensions/WikimediaEvents 'Updating WikimediaEvents for VE logging (1/2)' [20:17:09] ok I'll go look on terbium to sync up [20:17:19] Logged the message, Master [20:17:23] !log olivneh synchronized php-1.22wmf21/extensions/WikimediaEvents 'Updating WikimediaEvents for VE logging (2/2)' [20:17:37] Logged the message, Master [20:17:43] Reedy: kills nagios.conf, ganglia.conf .. eh and www.wikipedia.conf is already gone(?) 
[20:17:45] !log olivneh synchronized php-1.22wmf21/extensions/VisualEditor 'Updating VE for cherry-picks I54602394e & I3b58ce0f4' [20:18:00] Logged the message, Master [20:18:33] o_0 [20:19:23] bblack: purge-commonswiki-20131016T2006.log is the active log file [20:20:48] Reedy: no, it wasn't, now it is [20:20:49] bd808: I don't think things are working [20:21:00] bblack: I was getting the same feeling [20:21:02] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:20] (03CR) 10Dzahn: [C: 032] fix redirects for jobs/careers.wm.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90191 (owner: 10Dzahn) [20:21:28] in the time it took that log to grow 1407 lines, I only had 10 hits for purges on commons in cp3003's traffic, and checking a few of them, none of them matched your sends, I think they were organic [20:21:36] bblack: should I stop and change it to announce to the mcast group? [20:21:48] greg-g: i'm all done; thank you. [20:21:54] ori-l: word, thanks [20:21:57] bblack: Or should we try to figure out why I sending to the relay isn't working? [20:22:04] bd808: perhaps. at least stop it for now since it's not doing anything real I don't think [20:22:22] bblack: I killed the script [20:22:24] where is the relay, anyways? [20:22:36] bblack: It's on hooft [20:23:23] Reedy: i guess the foo.conf~ backups are actually in git and doesnt make sense either :p [20:23:33] bblack: hooft:/usr/local/bin/udpmcast.py [20:23:45] greg-g: ugh, no. forgot to sync one change, sec [20:23:53] yeah I'm reading it [20:23:59] mutante: backups are important [20:24:42] oh, another thing.. 
why is "static.conf" an untracked file on fenari git [20:24:56] git status in /h/w/conf/httpd [20:25:11] !log olivneh synchronized php-1.22wmf20/extensions/WikimediaEvents 'Updating WikimediaEvents for VE logging (1/2)' [20:25:29] !log olivneh synchronized php-1.22wmf21/extensions/WikimediaEvents 'Updating WikimediaEvents for VE logging (2/2)' [20:25:29] http://static.wikipedia.org/ ? [20:25:38] ServerName static.wikipedia.org [20:25:38] DocumentRoot "/mnt/static" [20:25:46] reedy@fenari:/h/w/conf/httpd$ ls -al /mnt/static [20:25:46] ls: cannot access /mnt/static: No such file or directory [20:25:48] Very important [20:25:57] http://static.wikipedia.org/ [20:25:59] It works! [20:26:03] yea :p [20:26:06] it works great [20:26:07] greg-g: done for real now. sorry. [20:28:12] It almost looks to be like a bits predecessor [20:28:21] bd808: I don't see any udp port 4827 traffic on that machine at all right now on eth0, even normal flow [20:28:51] (03PS1) 10Andrew Bogott: Remove generic::mysql::packages::client in favor of mysql module [operations/puppet] - 10https://gerrit.wikimedia.org/r/90194 [20:28:52] (03PS1) 10Andrew Bogott: Remove generic::mysql::packages::server in favor of mysql::server::package [operations/puppet] - 10https://gerrit.wikimedia.org/r/90195 [20:28:57] Who broke ulsfo bits? [20:28:58] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Bits%20caches%20ulsfo&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1381955305&g=network_report&z=large [20:29:06] All the traffic died [20:29:26] (03CR) 10jenkins-bot: [V: 04-1] Remove generic::mysql::packages::client in favor of mysql module [operations/puppet] - 10https://gerrit.wikimedia.org/r/90194 (owner: 10Andrew Bogott) [20:29:34] bblack: Well… that sounds like I had bad intell on the relay host. 
[20:29:51] yeah, cp3003 is getting packets from 91.198.174.106 [20:29:59] not 113 which is hooft [20:30:18] Reedy: one by one :) https://bugzilla.wikimedia.org/show_bug.cgi?id=55809 [20:30:31] Reedy: looking [20:30:33] thanks [20:30:36] bblack: Should I try sending to .106 then? [20:32:09] bd808: yes, and the machine is "nescio" btw [20:32:21] it also has udpmcast running, and seems to be passing real traffic [20:33:29] (03PS1) 10Mark Bergsma: Move ulsfo traffic back to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/90196 [20:33:45] (03CR) 10Mark Bergsma: [C: 032] Move ulsfo traffic back to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/90196 (owner: 10Mark Bergsma) [20:34:08] !log powercycling lvs4003, locked up(?) [20:34:22] Logged the message, Master [20:34:45] bblack: Ok, script changed and ready to restart (also fixed bug that was wiping out logs) [20:35:11] bd808: go ahead, I'm watching traffic on cp3003 already [20:35:28] bblack: should be flowing now [20:35:41] bblack: ar.wiki [20:36:17] bd808: I'm seeing some matches [20:36:31] (03PS2) 10Andrew Bogott: Remove generic::mysql::packages::client in favor of mysql module [operations/puppet] - 10https://gerrit.wikimedia.org/r/90194 [20:36:46] bblack: It sent 9510 for ar.wiki alone [20:36:52] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.03 ms [20:37:22] I'm not logging them all on cp3003, but when it gets to commons I'll log a several-second sample, since that one lasts a while [20:38:08] bblack: ok. 
It's in ca.wiki now [20:39:02] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 75.31 ms [20:40:32] bblack: it's working through commons [20:40:42] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:03] bd808: I'm still not sure it's reaching cp3003 [20:41:18] bblack: blerg [20:41:19] (03PS3) 10Andrew Bogott: Remove generic::mysql::packages in favor of mysql module [operations/puppet] - 10https://gerrit.wikimedia.org/r/90194 [20:41:32] trying another several-second log for matches [20:41:38] (03Abandoned) 10Andrew Bogott: Remove generic::mysql::packages::server in favor of mysql::server::package [operations/puppet] - 10https://gerrit.wikimedia.org/r/90195 (owner: 10Andrew Bogott) [20:41:51] bblack: ganglia is sure not showing any spikes like I'd expect [20:42:16] yeah still not going through [20:42:32] bblack: stopped again [20:42:43] bd808: where's the actual code at? [20:42:50] (03CR) 10Dzahn: [C: 031] "dzahn@fenari:~$ apache-fast-test wikimania.url mw1045" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84707 (owner: 10Reedy) [20:43:22] :D [20:43:26] Reedy: a single one is "Portal" instead of "Main_Page" [20:43:29] purgeChangedPages.php or whatever [20:43:30] 2009 [20:43:34] bblack: /usr/local/apache/common/php-1.22wmf21/maintenance/purgeChangedPages.php [20:43:59] (03Abandoned) 10Andrew Bogott: Use instance-proxy for the default hostname. [operations/puppet] - 10https://gerrit.wikimedia.org/r/57443 (owner: 10Andrew Bogott) [20:45:12] bblack: I know it sends packets, but it may be the $wgHTCPRouting trick that's not working [20:46:12] bd808: well, I didn't even look on nescio yet to see if packets are there, but that aside, I don't think the batching/delay-send stuff in purgeChangedPages works as I thought it did [20:46:41] I was expecting no batching at all and 5ms intervals per individual send. 
If you're sending bursts, you'll almost certainly overflow queues somewhere at some point [20:47:09] bblack: Aaron wanted that simplification in code review [20:47:38] My batch size is 100 so it should only be bursting 100 at a time [20:47:38] bd808: that's not a simplification, it's a major change [20:47:49] bblack: I don't disagree [20:48:05] 100 packets is a lot of data. if some tiny packet queue somewhere can't handle 100 shoved in all at once at some random moment, it drops things [20:49:14] the whole point here isn't so much the overall rate. e.g. it's not equivalent to send 10,000-batches and then sleep for 50 seconds between each. You'd lose most of your packets that way. [20:49:31] it's the rate over very small timescales that matters. timescales to fill small unreliable buffers [20:49:53] but this isn't our problem, it's just I noticed when looking at the logs that the rate didn't look right [20:51:06] bblack: I should have argued stronger for the deeper change but I didn't. I get all the parts you are pointing out. [20:51:30] paravoid: Dropped off again :( [20:51:39] But at least some would be coming through if there wasn't another issue [20:51:55] Reedy: we know [20:52:04] Reedy: it's depooled now [20:52:13] kernel bug [20:52:23] aha [20:52:54] bd808: yes, I agree that some (probably almost all?) should come through, although it could also be losing a few here and there, and some of the loss could be normal traffic in place of your purge. [20:53:06] I didn't see anything mentioned anywhere ;) [20:53:20] bd808: if it's not almost all though, it sort of confounds the ability to grep for individual requests as a reliable indicator that traffic made it, but we can work around that. 
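[Editor's note, not part of the log] The point about average rate versus burst size can be checked with a little arithmetic. The sketch below is only an illustration. The function name and the table of schedules are made up here, and the two schedules are the ones quoted in the discussion above (one packet every 5 ms, and 10,000-packet batches with a 50-second sleep).

```python
def average_rate(batch_size, delay_seconds):
    """Average packets per second when batch_size packets go out, then the sender sleeps."""
    return batch_size / delay_seconds


# (packets sent back to back, sleep between batches in seconds)
# Both schedules come from the conversation; the dict itself is illustrative.
schedules = {
    "one packet every 5 ms": (1, 0.005),
    "10,000 packets every 50 s": (10_000, 50.0),
}

for name, (batch, delay) in schedules.items():
    rate = average_rate(batch, delay)
    # The average is the same, but 'batch' packets hit any queue on the path at once.
    print(f"{name}: about {rate:.0f}/s on average, burst of {batch}")
```

Both schedules average roughly 200 packets per second, so the average alone cannot tell them apart. What differs is how many packets arrive back to back at a small buffer, which is the quantity the discussion says matters.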
[20:53:24] yeah, sorry [20:53:25] !log deleting from /etc/dsh/group on tin these 0 byte files not in puppet: ams.puppettmp_7604 apache-eqiad.puppettmp_2863 apaches-eqiad cp esams mail.ini.list [20:53:38] Logged the message, Master [20:54:32] bd808: just leave it running, or start it again, so we can keep looking [20:54:38] bblack: Is there an easy way to check for router/vlan ACLs that limit the source of :4827 traffic into esams? [20:54:55] I'll look at traffic on nescio first and just see what's arriving there [20:55:03] bblack: ok. I'll start it again... [20:55:04] (from terbium that is) [20:55:19] bblack: running now [20:55:35] It should produce a pretty stead stream of packets [20:55:45] *steady [20:56:00] ehm [20:56:00] wait [20:56:03] you're sending from tin.eqiad.wmnet? [20:56:13] that's an eqiad internal host... which can't talk to the internet, nor to esams [20:56:13] mark: from terbium [20:56:18] ok [20:56:26] same thing [20:56:32] ah [20:56:32] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms [20:56:37] you need a public host, .wikimedia.org [20:56:54] * bd808 stops it again [20:57:02] or just send to the multicast group [20:57:05] !log removing also from /etc/dsh/group on tin: yaseo.text yaseo.upload srv284 (note, none of these files are managed by puppet) [20:57:18] Logged the message, Master [20:57:42] Haha [20:57:47] every cache will get it done, but at least it'll arrive [20:57:53] get it then... 
[20:57:56] mark: multicast is easy but there was some concern about unnecessary load on equiad caches [20:58:18] bd808: yeah I was just about to say what mark said, no routing :) [20:58:23] oh [20:58:27] (03CR) 10Dzahn: [C: 032] delete dsh group "mediawiki-installation-precise" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88071 (owner: 10Dzahn) [20:58:28] I just had to figure it out from scratch heh [20:58:30] you could just point at dobson [20:58:38] that will send to esams relay [20:58:43] mark: send unicast to dobson, and it will relay? [20:58:44] instead of to esams relay directly [20:58:47] and dobson won't send to multicast [20:58:48] yep [20:59:10] that seems the easiest workaround right now [20:59:16] yeah [20:59:28] sorry for not realizing this earlier guys ;) [20:59:34] mark: that sounds good. What ip? [20:59:35] bd808: I don't know how we'll measure any loss problem from the bursting tbh [20:59:35] I didn't think about tin being internal either [21:00:04] bblack: just try dobson's main ip, 208.80.152.173 [21:00:09] bblack: I can crank the batch size way down. [21:00:09] bd808: it would be better to split the SQL batching from the send batching though, if that's the reason for the batching at all. batch the SQL by 1000 or whatever to local memory and then pace it out from there [21:00:53] bblack: That would take a patch :) [21:00:55] bd808: I worry that if you leave the code as-is and try to hack around things with e.g. batch-size:1 and a small delay, you'd be generating more sql traffic than you want to [21:02:08] bblack: you are probably right. [21:02:24] why would 1 be used? [21:02:38] (03PS4) 10Andrew Bogott: Cleanup the base class [operations/puppet] - 10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [21:02:53] mark: do you think batches of 100 with 500ms delay between batches won't overflow a queue somewhere, in the midst of normal traffic? 
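[Editor's note, not part of the log] The suggestion to split the SQL batching from the send batching could look roughly like the sketch below. It is written in Python for illustration and is not the code in purgeChangedPages.php. The names fetch_batches and paced_send are hypothetical, the list of rows stands in for a paged database query, and the send callback stands in for whatever emits one purge packet.

```python
import time
from typing import Callable, Iterable, Iterator, List


def fetch_batches(rows: List[str], fetch_size: int) -> Iterator[List[str]]:
    """Stand-in for a paged SQL query: yield rows in chunks of fetch_size."""
    for start in range(0, len(rows), fetch_size):
        yield rows[start:start + fetch_size]


def paced_send(
    batches: Iterable[List[str]],
    send: Callable[[str], None],
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send items one at a time with a fixed pause after each send.

    The fetch size controls only how often the database is queried;
    the send rate is governed by interval_s alone.
    """
    sent = 0
    for batch in batches:
        for url in batch:
            send(url)           # hypothetical: emit a single purge
            sleep(interval_s)   # pace every packet, not every batch
            sent += 1
    return sent


if __name__ == "__main__":
    urls = [f"http://example.org/page{i}" for i in range(5)]
    # Fetch 1000 rows per query, but pace sends at 5 ms per packet.
    total = paced_send(fetch_batches(urls, 1000), print, 0.005)
    print(f"sent {total} purges")
```

With this shape, a large fetch size keeps the query count low while the receiver never sees more than one packet per interval from this sender.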
[21:03:06] that's what we ended up with in the php code, as opposed to 5ms/packet [21:03:08] i think that's not fine grained enough yeah [21:03:50] I doubt a batch size of 10 would explode the DB [21:03:50] well hard to say [21:03:53] esp given the relays involved, too. I don't think the relay explicitly sets a huge udp buffer or anything. I'd expect it to be a possible loss point. [21:03:56] normal traffic can be really bursty too [21:04:02] yes [21:04:35] changing batch size to 10 with a 50ms delay is easy [21:04:44] definitely better [21:04:57] bd808: ok, try that, if the sql side is acceptable [21:05:08] (03CR) 10Andrew Bogott: "Sorry I didn't merge this before moving base into a module :( trying to revive." [operations/puppet] - 10https://gerrit.wikimedia.org/r/33066 (owner: 10Faidon Liambotis) [21:06:25] that shouldn't even come close to killing some random db slave [21:07:25] Ok. Ready to run at 10/batch, 50ms delay and sending to 208.80.152.173 [21:08:26] bblack: running now. You should see some traffic [21:08:42] bd808: yeah, I can definitely see the pattern live now when watching tcpdump [21:08:50] the stream of obvious alpha-order subdomains, etc [21:08:58] good luck, i'm going offline again [21:09:01] bblack: excellent [21:09:09] mark: thanks for your input [21:11:03] * bd808 wants to see spike in ganglia [21:11:29] bd808: it's starting to show up in the tiny-scale view here on the right edge: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=&vl=&x=&n=&hreg[]=cp3%5B0-9%5D%2B.esams&mreg[]=vhtcpd_inpkts_sane&gtype=stack&glegend=show&aggregate=1&embed=1&_=1381957793775 [21:12:29] 0 160.279679 query-m: UPDATE `watchlist` SET wl_notificationtimestamp = NULL WHERE wl_user = 'X' [21:12:35] * Aaron|home wonders what is up with that [21:12:55] (03PS1) 10coren: Tool Labs: make a combined exec_sumbit_host class [operations/puppet] - 10https://gerrit.wikimedia.org/r/90209 [21:13:19] bd808: given all sorts of strangeness about sample-timing, that 
tiny-scale graph is an ugly indicator, though [21:14:04] the 2hr graph looks sane though [21:14:08] I'm stats spoiled. I'm used to graphite with a 6s sample window. [21:14:47] bd808: you should stop this and we'll come up with a slower plan [21:14:54] the iowait on cp3xxx is too high [21:14:55] (03CR) 10coren: [C: 032] Tool Labs: make a combined exec_sumbit_host class [operations/puppet] - 10https://gerrit.wikimedia.org/r/90209 (owner: 10coren) [21:15:11] bblack: stopped [21:15:20] it's spiking out to large values. not so much a packet-rate problem as a purge-rate problem freeing large binary cache object from e.g. commons [21:15:39] it would probably affect site performance for esams at that rate [21:16:23] it seemed mostly fine earlier on, but once it got to commons, those just require a lot more CPU time out of varnish itself to purge each one [21:16:43] (well, io, whatever) [21:17:20] bblack: Huh. They are still just text purges, but I guess there are quite a few of them there. [21:17:27] yeah I donno [21:17:52] let me look at the historical graphs a bit, maybe those iowait spikes are "normal" there [21:19:52] (03Abandoned) 10Reedy: Wikimedia pakistan chapter redirect [operations/apache-config] - 10https://gerrit.wikimedia.org/r/65242 (owner: 10Reedy) [21:20:09] bd808: they're apparently somewhat normal, and we're on the downslope for their daily pattern, so maybe it'll just be ok [21:20:12] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Upload+caches+esams&h=cp3003.esams.wikimedia.org&jr=&js=&v=18.9&m=cpu_wio&vl=%25&ti=CPU+wio [21:21:30] maybe slow it down just a bit more, just to be safe? [21:22:12] (03PS1) 10Faidon Liambotis: Disable LRO on lvs40xx boxes [operations/puppet] - 10https://gerrit.wikimedia.org/r/90211 [21:22:19] (03Abandoned) 10Reedy: Move/redirect arbcom.XX wikis to arbcom-XX [operations/apache-config] - 10https://gerrit.wikimedia.org/r/67351 (owner: 10Reedy) [21:22:37] bblack: Sure. 
I should make a driver script change to skip the wikis we've already done too. No use replaying all that up to commons. [21:22:44] (03CR) 10Faidon Liambotis: [C: 032] Disable LRO on lvs40xx boxes [operations/puppet] - 10https://gerrit.wikimedia.org/r/90211 (owner: 10Faidon Liambotis) [21:22:59] on the upside, those are cheap now since they've probably mostly not been re-cached yet [21:23:10] bblack: 10+100ms seem like enough? [21:24:07] bd808: yeah that's ... 100/s, and the average on normal traffic is ~270/s. At least we're not nearly-doubling it at that rate [21:24:53] (03PS13) 10Reedy: Simplify wikimania apache confs, reuse wikimedia.org docroot. [operations/apache-config] - 10https://gerrit.wikimedia.org/r/84707 [21:25:13] bd808: probably watching vhtcpd queue size would give a good indicator if something's going off the rails. If varnish bogs down on the purges, that's where it would backlog at. [21:27:13] !log staggered reboot of lvs4003/lvs4001 [21:27:23] Logged the message, Master [21:27:27] bblack: running again [21:27:31] oh, except we fixed that by making it not block, so it's possible even queue size wouldn't show it. overall site perf might :) [21:27:38] skipped everything up to commons [21:27:42] ok [21:27:45] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:01] let's just let it run and see if people complain. at the end of the day that's the real metric [21:28:56] bblack: Heh. 
Not the most proactive approach, but pragmatic [21:30:05] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 76.32 ms [21:31:00] * bd808 is remembering why he likes writing code more that running servers [21:31:45] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [21:32:25] (03PS1) 10Reedy: Remove misconfigured tenwiki vhost, ten.wikipedia.org works fine [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90216 [21:32:58] (03PS1) 10Reedy: Remove tenwiki docroot [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90217 [21:33:06] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 75.41 ms [21:33:53] We have some weird stuff in our configs [21:35:34] "some" [21:35:39] bd808: if everyone thought in terms of what happens at runtime on a real server when they wrote code, we might not need people to run servers :) [21:36:22] DocumentRoot "/usr/local/apache/common/docroot/tenwiki" [21:36:22] ServerName movementroles.wikimedia.org [21:38:04] bblack: Heh. I try to think about the operational support costs when designing, but it's not always easy to reason about. Sometimes you just have try things to see what really happens. [21:39:04] I agree that the batch vs individual delay change was a non-trivial departure from spec [21:40:09] commons apparently had quite a few changes in this time range. [21:40:29] bd808: someone should write a php-on-xen vm. You could do it as a port of hiphop and call it hhhvmvm or something. Then you wouldn't need a Linux layer to be mis-managed under the app code :) [21:40:34] like the erlang one here: http://erlangonxen.org/ [21:43:25] yikes. [21:44:14] OSV [21:44:32] some prominent linux/kvm people started a new OS [21:44:46] http://lwn.net/Articles/567222/ [21:46:50] Ha. 
Quote in the article matches my initial reaction: "the OSv developers have managed to reimplement MS-DOS" [21:48:31] paravoid: but they targetted Java, "patches welcome" for other VMs [21:49:30] but I tend to agree with this whole general direction. For stateless appservers, you really don't need much of an OS, just the barest of hardware abstractions. It's a very different set of needs than a general-purpose multi-user machine. [21:54:00] but they have a .io domain! they're going somewhere! [21:54:07] paravoid: maybe there's room for something else that's halfway there. e.g. a Linux fork that just tosses out user/kernel-space distinctions and all concepts of local userids and local security, but still has the whole generic posix API and can run arbitrary processes from disk, etc. [21:55:10] just enough that you can run one application, deploy it on a disk, have it be something arbitrary like a php or perl executable, still run some separate things for e.g. gmond for monitoring, etc [21:55:26] but it's a given that the host's purpose in life is to run one app and local security is meaningless [21:56:38] (03PS1) 10Dzahn: add LVS monitoring for ULSFO [operations/puppet] - 10https://gerrit.wikimedia.org/r/90257 [21:57:58] (03CR) 10Dzahn: "note: i commented 2 checks of bits.ulsfo.wikimedia.org because that doesn't resolve (yet)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90257 (owner: 10Dzahn) [22:03:31] (03CR) 10Swalling: [C: 031] "Please for to do this by Thursday." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90045 (owner: 10Mattflaschen) [22:04:49] 202K commons purges and counting [22:14:31] (03PS2) 10Dzahn: add LVS monitoring for ULSFO [operations/puppet] - 10https://gerrit.wikimedia.org/r/90257 [22:37:01] (03CR) 10Dzahn: [C: 032] "ack, wrong ServerName, and tested ten.wp still works fine without this. 
thanks for all the remnant cleanup" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90216 (owner: 10Reedy) [22:38:56] (03CR) 10Dzahn: "http://movementroles.wikimedia.org" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90216 (owner: 10Reedy) [22:48:10] (03PS1) 10Reedy: Remove en2.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90264 [22:48:11] (03CR) 10jenkins-bot: [V: 04-1] Remove en2.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90264 (owner: 10Reedy) [22:48:20] ls [22:48:57] grr [22:49:47] wtf [22:50:09] (03PS2) 10Reedy: Remove en2.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90264 [22:50:36] ... [22:50:41] Add an ! to the commit message and send it again [22:51:44] Reedy: ah, so not even move it to redirects [22:51:49] (03CR) 10Reedy: "From redirects.coonf" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90264 (owner: 10Reedy) [22:52:27] heh [22:53:10] 1 # A host en2.wikipedia.org was used in 2003 or so for primitive load [22:53:26] balancing between our two (!sic) servers. :) [22:53:38] (!sic by me) [22:55:44] i wonder who that predates... [23:00:21] (03CR) 10Dzahn: [C: 032] "that's true. let's get rid of this too:)" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90264 (owner: 10Reedy) [23:02:35] (03CR) 10Dzahn: "quote: "was used in 2003 for primivite load balancing between our 2 servers" :)" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90264 (owner: 10Reedy) [23:03:35] Improving Wikipedia, deleting one config at a time [23:05:37] :) !log deleting www.wikipedia.conf, gracefull'ing.. <- that might have looked scary but accurate:) [23:06:00] I get a certificate warning when I try to go to https://deployment.wikimedia.beta.wmflabs.org/ [23:06:12] should I file an RT ticket for that? [23:06:24] The certificate is only valid for *.wmflabs.org [23:06:25] No... 
[23:06:36] can't have multi-level wildcard [23:07:27] kaldari: in this case BZ i'd say [23:07:35] OK [23:07:39] thanks [23:07:44] np [23:08:35] !log ddsh -g apaches -F20 "rm /etc/apache2/wmf/en2.conf" [23:08:51] Logged the message, Master [23:11:45] kaldari: if you didn't make it yet, wait, just comment here https://bugzilla.wikimedia.org/show_bug.cgi?id=48501 [23:12:04] ah even better [23:17:13] (03PS1) 10Aaron Schulz: Added some purge/thumbnail rate limits [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90265 [23:35:00] !log re-enabling puppet on zinc [23:35:16] Logged the message, Master [23:36:48] !log maxsem synchronized php-1.22wmf21/extensions/MobileFrontend/ [23:37:02] Logged the message, Master [23:38:15] !log maxsem synchronized php-1.22wmf20/extensions/MobileFrontend/ [23:38:31] Logged the message, Master [23:53:08] Aaron|home: I'm currently working on the job runner adjustments for the new parsoid job types and am wondering where the exclusions from the default queue are configured [23:53:51] $wgJobTypesExcludedFromDefaultQueue in common settings (and extension config) [23:54:09] ahh yes, thanks! [23:56:51] (03PS1) 10GWicke: Add job runners for new Parsoid job types [operations/puppet] - 10https://gerrit.wikimedia.org/r/90268