[00:00:57] hoo, I thought those three dots is it's attempt to get to the next hop and is failing. I am fairly certain that it's not masking 64 hops. [00:01:54] let me search a nice expl. online [00:02:00] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 6.872 second response time [00:02:30] Cyberpower678: https://serverfault.com/a/334039 [00:02:56] the Wikimedia nodes coming after that simply aren't responding [00:04:49] hoo, so good thing or bad thing? [00:05:09] Cyberpower678: As told, that's totally normal behavior [00:05:15] oh. [00:05:34] you're traceroute looks totally fine [00:05:57] Ok. So trace route is ok. So why is it taking milkipedia sites to load significantly longer. :/ [00:06:13] milkipedia? [00:06:21] what's that? Wikia? [00:06:23] :D [00:06:29] Wikimedia [00:06:37] LMFAO [00:06:46] By far the best autocorrect ever. [00:07:02] hilarious :P [00:07:30] Anyway... given the description of your problems before, that seemed more like you have problems with bits than with the actual html pages [00:08:22] It took 5 minutes to load the Application Management page in preferences. [00:08:38] Firefox said it was transferring data from bits.wikimedia.org [00:08:46] for 5 minutes [00:08:52] icky [00:09:19] (CR) Reedy: "166 or so apaches apparently online in tampa (based on API, non api and bits). Ganglia suggests that's out of 243 machines online." [operations/puppet] - https://gerrit.wikimedia.org/r/108070 (owner: Chad) [00:10:10] Block logs take 3-6 seconds. [00:10:24] Usually it's a half of a second to load. [00:10:32] Cyberpower678: ping bits-lb.eqiad.wikimedia.org and see whether you have packet losses [00:10:40] hoo, how? 
[00:11:00] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 229854 bytes in 7.035 second response time [00:11:04] nvm [00:11:57] Alright, lighttning deploy time [00:12:50] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [00:13:18] hoo, How many pings will it do? [00:13:42] Cyberpower678: Unless you stop it using ctrl + c (usually) [00:14:40] hoo, http://pastebin.com/mzXCY5mw [00:15:37] "0.5% packet loss " that's not as good as it can get, but I doubt that's what you were hitting [00:16:54] hoo, any other ideas? [00:17:35] Try turning the internet off and on again [00:18:04] -.- [00:18:13] I'm not that clueless. :p [00:18:20] I've already tried that. [00:18:29] As well as reboot the computer. [00:18:49] Connection speeds as pointed out, are at peak levels. [00:21:02] !log catrope synchronized php-1.23wmf16/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWMediaInsertDialog.js 'https://gerrit.wikimedia.org/r/#/c/116211/' [00:21:10] Logged the message, Master [00:21:38] (CR) Catrope: [C: 2] Fix VisualEditor/Parsoid on private wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116662 (owner: Catrope) [00:21:47] (Merged) jenkins-bot: Fix VisualEditor/Parsoid on private wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116662 (owner: Catrope) [00:25:03] (CR) GWicke: Fix VisualEditor/Parsoid on private wikis (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116662 (owner: Catrope) [00:25:48] !log catrope synchronized wmf-config/CommonSettings.php 'Fix VE on officewiki' [00:25:51] hoo, I just received an incomplete page through a timeout after 5 minutes of attempting to load my watchlist [00:25:57] Logged the message, Master [00:27:57] https://en.wikipedia.org/w/index.php?title=Saint_Joseph%27s_University&action=history 11 seconds [00:28:09] what exactly takes 11s [00:28:14] I 
mean which http request [00:28:18] all of them [00:28:24] simultaneously [00:28:41] https://en.wikipedia.org/w/index.php?title=Encyclopedia_Dramatica&action=history 23 seconds [00:29:09] Reedy: Even Wikipedia would ahve been a better answer :D [00:29:33] computers [00:29:37] hoo, From the time I click that link to the time something shows up on my screen. [00:29:52] till something shows? :| [00:30:11] Reedy, I can't be more specific atm [00:30:56] it's quite difficult to help you without better information [00:31:19] Reedy, how do I get better information. It helps to know where to look. [00:31:26] What browser are you using? [00:31:31] Safari on Mac [00:32:26] https://developer.apple.com/library/safari/documentation/AppleApplications/Conceptual/Safari_Developer_Guide/Instruments/Instruments.html#//apple_ref/doc/uid/TP40007874-CH4-SW1 [00:33:41] i'd just try another browser before i tried troubleshooting deeply within a single browser personally... [00:34:03] (but usually its crazy addons that cause issues, safari isnt known for that) [00:34:38] RobH, Other browsers are having the same issues. Firefox on windows is slow to load wikipedia too [00:34:45] ahh, then no clue [00:34:49] do what reedy says =] [00:35:10] Reedy, is the web inspector an add-on? [00:35:30] I have nfi [00:35:47] No [00:35:49] "Web Inspector is an open source web development tool built into Safari" [00:36:59] Reedy, except for I can't find the damn thing. 
[00:37:08] I don't use a mac [00:37:11] Or safari [00:39:56] Reedy, foun it [00:40:38] You've got to enable it in advanced preferences because apparently no one would ever want development tools in their browser [00:42:20] index.php en.wikipedia.org resource-type-document 200 false 99945 15803 3.183137893676758 6.03571891784668 1393893692.505842 [00:42:34] 6.03571891784668s [00:44:03] a bit of context would help [00:44:04] (CR) Dzahn: [C: 2] remove 'zhen' from site.pp,dsh,dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/116659 (owner: Dzahn) [00:44:12] I can guess most, but well :P [00:44:33] load.php bits.wikimedia.org resource-type-stylesheet 200 false 140985 34677 0.6567199230194092 25.39751410484314 1393893839.557539 [00:45:43] Name Domain Type Status Cached Size Transferred Latency Duration Timeline [00:45:50] load.php bits.wikimedia.org resource-type-stylesheet 200 false 140985 34677 0.6567199230194092 25.39751410484314 1393893839.557539 [00:46:16] the duration is... wow [00:47:50] hoo, does that help? [00:48:10] Do you have a load of useless crap enabled? [00:48:16] gadgets, user js... [00:48:30] does it change being logged in vs logged out? [00:48:39] Reedy, just what I've had enabled for the past year. [00:48:40] mutante: that's the question ;) [00:48:49] Hasn't affected performance. [00:49:21] mutante, I can't access preferences logged out. [00:49:34] That load.php entry comes from preferences. [00:49:53] !log zhen - disable puppet,revoke puppet cert,delete salt key,delete stored configs, disable monitoring... 
[00:50:02] Logged the message, Master [00:50:09] Cyberpower678: ah, understand [00:50:49] Special:Watchlist en.wikipedia.org resource-type-document 200 false 230319 34417 9.361343145370483 10.17725396156311 1393894224.499453 [00:51:28] Cyberpower678: next step would be comparing to the same computer and user, but connected via a different provider (if that is possible) [00:51:37] Cyberbot_II en.wikipedia.org resource-type-document 200 false 99589 15749 1.6356449127197266 6.37595796585083 1393894275.096077 [00:52:16] load.php bits.wikimedia.org resource-type-stylesheet 200 false 150438 37766 1.9066860675811768 3.514871120452881 1393894318.734478 [00:52:34] Another computer on the same network might work also, if you can't just hop onto another connection [00:52:44] Not much of a significant difference. Time until something appears is still the same. Logged out [00:53:40] I can't login anymore. [00:54:02] nvm [00:54:04] Cyberbot_II en.wikipedia.org resource-type-document 200 false 102051 15950 27.839799165725708 1.7688369750976562 1393894424.443726 [00:54:20] Latency of 27.839799165725708 seconds when logging in. [00:54:56] hoo, mutante ^ [00:55:05] Cyberpower678: If you have another computer or a tablet or whatever around, you might want to try that [00:55:41] Takes forever on my iPhone [00:56:00] Just to tell me that I'm not even logged in. [00:56:27] Login page loaded in 19 seconds [00:56:51] Logs in slightly faster though. [00:58:09] Cyberpower678: Is anyone besides you having an issue? [00:58:10] Cyberpower678: Is that using your local internet connection or going via some mobile network? [00:58:20] I'm not sure shitty local wifi is an operations issue. [00:58:20] Both [00:58:46] No one else is on Wikipedia. [00:59:50] hoo, On my LTE connection it took 10 seconds. That was slightly faster, but my LTE is a faster connection in general to the WiFi I have. [01:02:31] Well I have to go. 
[01:10:08] !log shutting down 'zhen' permanently [01:10:19] Logged the message, Master [01:20:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [01:32:45] !log catrope synchronized php-1.23wmf16/resources/oojs-ui/oojs-ui.js 'oojs-ui fixes' [01:32:56] Logged the message, Master [02:25:05] !log schema change bug 31397 afl_namespace, slave by slave [02:25:15] Logged the message, Master [02:28:16] !log LocalisationUpdate completed (1.23wmf15) at 2014-03-04 02:28:16+00:00 [02:28:24] Logged the message, Master [02:41:35] (CR) Hoo man: [C: 2] "Trivial one" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115992 (owner: BryanDavis) [02:41:52] (Merged) jenkins-bot: Fix documentation of `--home` option for activeMWVersions.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115992 (owner: BryanDavis) [02:43:06] hoo: make sure you at least pull it onto tin [02:43:16] Reedy: ;) [02:43:17] !log LocalisationUpdate completed (1.23wmf16) at 2014-03-04 02:43:17+00:00 [02:43:24] Logged the message, Master [02:43:42] !log hoo synchronized multiversion/activeMWVersions.php 'Comment only change {{Gerrit' [02:43:49] Logged the message, Master [02:43:55] well :P [02:45:05] You don't want RoanKattouw_away chasing you about it [02:45:51] ok, there we go... :P [02:46:01] !log hoo synchronized multiversion/activeMWVersions.php 'Comment only change {{Gerrit|I5e68518}}' [02:46:08] Logged the message, Master [02:46:12] if anyone tries to chase hoo, he can simply lock the chaser xD [02:46:16] I knew it was smart to start of with a file outside the normal tree :P [02:46:45] Time to open a betting pool [02:47:09] for what? [02:47:36] Till you cause your first outage ;) [02:47:47] /downtime [02:48:06] :P [02:48:14] Shall I delete the first log entry? [02:48:17] The broken one? 
[02:48:29] I wouldn't worry about it [02:48:39] wikitech won't run out of space :p [02:48:47] might not :P [02:49:26] Reedy: What's the procedure to know that nobody else is deploying (except of scanning tins processlist like crazy)? [02:50:04] Scheduled stuff is https://wikitech.wikimedia.org/wiki/Deployments [02:50:12] Yeah, I'm not that bad :P [02:50:32] But you don't schedule every "Allow foowiki sysops to destroy the world" [02:50:33] Beyond that, there isn't really [02:51:27] Common sense mostly [02:51:50] Being in here [02:52:20] I mostly looked that nobody merged any deployment stuff recently and that nobody was doing suspicious stuff on tin [02:53:27] I wonder if gerrit should report action on gerrit on wmf/* branches in here [02:53:46] Probably not needed in most cases. solved by being in -dev [02:54:12] that sounds like a somewhat smart idea... sometimes people flood -dev so hard ... :P [02:54:32] Like that guy adding COPYING to everything :D [02:56:08] If you go back to living on CET rather than PST, you shouldn't conflict ;) [02:58:51] Reedy: I'm on CET (physically), just not living it... I have to find a rhythm again, I guess... 
[02:59:42] ;) [02:59:54] To which extent, I really shouldn't be (still) here [03:01:56] I guess this is a good moment to say good night [03:13:38] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [03:29:53] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-03-04 03:29:53+00:00 [03:30:01] Logged the message, Master [04:18:50] (PS1) Greg Grossmeier: Log length of l10nupdate [operations/puppet] - https://gerrit.wikimedia.org/r/116718 [04:21:25] (CR) Greg Grossmeier: "Haven't tested on production, but I did test my syntax locally with a stub shell script: http://paste.debian.net/85186/" [operations/puppet] - https://gerrit.wikimedia.org/r/116718 (owner: Greg Grossmeier) [04:21:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [04:22:36] bd808: ^^ step 1 of my yet-to-be-report bug report [04:22:42] +ed [04:23:15] and.. sleep time [04:34:35] (CR) BryanDavis: "You should also add:" [operations/puppet] - https://gerrit.wikimedia.org/r/116718 (owner: Greg Grossmeier) [05:24:22] (PS1) Ori.livneh: Enable GeoIP cookie on Labs [operations/puppet] - https://gerrit.wikimedia.org/r/116719 [05:26:06] (PS2) Ori.livneh: Enable GeoIP cookie on Labs [operations/puppet] - https://gerrit.wikimedia.org/r/116719 [05:27:20] (CR) Ori.livneh: "hashar: please +1 if this is OK." [operations/puppet] - https://gerrit.wikimedia.org/r/116719 (owner: Ori.livneh) [06:14:38] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [06:27:18] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[06:29:18] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active [07:22:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [07:23:17] (CR) Matanya: [C: 1] remove "zhen" public IP, decom [operations/dns] - https://gerrit.wikimedia.org/r/116658 (owner: Dzahn) [08:04:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [08:48:51] (PS3) Ori.livneh: geoip.inc.vcl: don't increment loop counter twice [operations/puppet] - https://gerrit.wikimedia.org/r/116469 [08:57:52] (CR) Faidon Liambotis: [C: 2] geoip.inc.vcl: don't increment loop counter twice [operations/puppet] - https://gerrit.wikimedia.org/r/116469 (owner: Ori.livneh) [08:58:16] (CR) Faidon Liambotis: [C: 2] Enable GeoIP cookie on Labs [operations/puppet] - https://gerrit.wikimedia.org/r/116719 (owner: Ori.livneh) [09:01:31] paravoid: thanks! [09:02:58] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [09:15:38] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [09:22:30] (CR) Alexandros Kosiaris: [C: 2] contint: slaves now have openjdk-{6,7}-jdk [operations/puppet] - https://gerrit.wikimedia.org/r/114619 (owner: Hashar) [09:37:15] (CR) Faidon Liambotis: "What's the rationale for this?" [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man) [09:51:30] paravoid: around? https://gerrit.wikimedia.org/r/116019 = [09:51:31] *? [09:51:53] yes? [09:52:21] I think the rational is pretty clear: Let's only give everyone the access they need [09:52:37] eg. these people don't need access to the database and don't need to be able to run maint. 
scripts [09:52:46] so I don't see why they should be able to do that [09:54:00] we shouldn't had given these people access to the bastion in the first place [09:55:07] well, that's obviously to late [09:56:44] Of course the bastions are already rather high risks as they've access to the internal network and stuff... but that's by far lower than just giving *everybody* full access to the Database and logs and whatever [09:57:10] it's not about these two users, but a more general thing [09:57:39] I only took them because I was sure they really don't need any other machines than bastion and their release server [09:59:11] our access control sucks [09:59:28] so no disagreement there [09:59:38] however, a group for bastion sounds wrong to me [09:59:47] bastion is the means to accessing some resource, not the end [10:00:22] paravoid: It's not the end, the group is meant for people who need only (few) specific machines [10:00:44] so that we let them access the bastions and the boxes they need (via site.pp) [10:03:01] no, that's not the right way to handle access control [10:03:12] create a class admins::releases if you want [10:03:15] and add them there [10:03:41] that way, we know why they have access and why a machine has admins::releases in it [10:03:53] and if you're feeling like it, merge that into manifests/role/releases.pp as well [10:04:11] which has the same information of who is a member of the release group there [10:04:31] the way we should do this is having a different group per *purpose*, not per e.g. server [10:04:49] yeah, that idea is sane [10:04:50] btw, https://gerrit.wikimedia.org/r/#/c/107848/ :) [10:05:04] but I'm not a fan of putting users into roles [10:05:53] why not? [10:06:35] cause I like to have the *real* list of who has access to what at a single (or multiple) well known places [10:06:56] why? 
[10:07:13] you will have that information, but it will grouped into roles [10:07:26] with your changeset, how can I know why markus has access to the bastion? [10:07:41] Ok, my changeset is crap, agreed [10:07:45] what if we change the way we upload releases, for instance (e.g. jenkins does it, as it has been floated before) [10:07:52] no, it's a step in the right direction [10:07:58] I just don't agree fully with it :) [10:08:26] so, if we remove the access to caesium, will we remove mglaser from admins::bastion? [10:08:45] the RT could help there, but what if in the meantime mglaser gets access to another box [10:08:51] I just would like to have all these lists in one place... so a admin::releases in admins.pp is fine with me, also using that group in the releases role sounds sane [10:09:11] while having mglaser in the releases group, and then saying that the releases group has access to a) caesium, b) bastion [10:09:13] but specifying a list of users with access *in* the releases role sounds scary [10:09:14] is the right way to do this [10:09:23] oh, yes, I agree with that [10:09:32] I was suggesting the opposite actually [10:09:40] the releases role reusing admins::releases, not the other way around [10:09:50] Oh, I got that wrong then [10:10:25] yeah I wasn't clear enough, sorry about that [10:10:51] the duplication right now is a bit scary [10:11:16] btw, note from my admins.pp overhaul this bit too: [10:11:17] - Tie Unix groups with our arbitrary class groupings; grouping users together in a class now means grouping them in a Unix group too. [10:11:59] as to have multiple levels of protection [10:12:15] right now we only have 'wikidev' as unix group right? 
[10:12:17] so even if a mistake happens and a user gets access to a box they shouldn't, unix permissions shouldn't let them run code as wikidev [10:12:23] Despite of mwdeploy and stuff like that [10:12:25] or read the database password [10:12:44] wikidev is the primary gid for all users, yes :(( [10:13:08] what is wikidev? [10:13:10] yeah, I thought about restricting some of the password scripts, but as long as everyone from analytics intern to root is a wikidev, that's hardly possible [10:13:11] you seem interested in this; I'd love if you could review my WIP changeset and see if it makes sense to you [10:13:36] I like security stuffs, yeah :P [10:13:50] But I guess I should really go and do some of my real work now... might be back later [10:14:46] no worries, the patchset is there for almost a month now [10:16:38] :) [10:23:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [10:29:30] and it is time to start thinking about merging it paravoid :) [10:29:39] it's not ready yet [10:30:01] agreed, but you don't seem to have time to work on it [11:03:11] (PS2) Nemo bis: Enable autopatrolled group on itwikiquote [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [11:04:32] (CR) Nemo bis: [C: 1] "Ouch, I missed the question there: you should have filed a bug (if you file one now, better late than never). Anyway, consensus is clear a" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [11:12:06] !log Jenkins web service unavailable, investigating. Builds should not be affected though since they dont use the web service (but gearman) [11:12:14] Logged the message, Master [11:12:48] (CR) Nemo bis: "I'm told (on IRC) that I should comment on gerrit..." 
(2 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114656 (owner: Hoo man) [11:19:03] !log Restarting Jenkins, it is stalled .... [11:19:10] Logged the message, Master [11:22:08] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [11:24:40] (CR) Ricordisamoa: ""Why file a bug when you can file a patch?"" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [11:30:29] (CR) Nemo bis: "Says who?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [11:32:21] Do we really need bugs if there's consensus etc? [11:32:48] Reedy: You once said we do :P [11:54:20] (Restored) Hoo man: Simplify the AbuseFilter configuration a little [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114656 (owner: Hoo man) [12:04:21] (PS2) Hoo man: Simplify the AbuseFilter configuration a little [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114656 [12:05:07] (CR) Hoo man: "Addressed Nemo bis's comments" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114656 (owner: Hoo man) [12:13:49] (CR) Ricordisamoa: "Me." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [12:16:11] (CR) Hoo man: [C: -1] "Please open a bug, there really should be one for each change." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [12:16:38] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [12:54:32] apergos: https://commons.wikimedia.org/wiki/File:Happiness_Wrapped_in_a_Blanket_13.jpg ? 
[12:54:43] there are quite a bit of those [12:54:52] around 6 recently [12:55:23] matanya: https://bugzilla.wikimedia.org/show_bug.cgi?id=32551 [12:56:25] thanks hoo [13:24:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [13:27:17] * hoo looks manifests/role/releases.pp [13:27:21] :/ [13:31:21] (PS3) Ricordisamoa: Enable autopatrolled group on itwikiquote [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 [13:34:13] yay for dynamic scoping all over the place... but ok :P [13:49:56] (PS4) Hoo man: Create an autopatrolled group on itwikiquote [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [13:50:21] (CR) Hoo man: [C: 2] Create an autopatrolled group on itwikiquote [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [13:50:29] (Merged) jenkins-bot: Create an autopatrolled group on itwikiquote [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114945 (owner: Ricordisamoa) [13:51:52] !log hoo synchronized wmf-config/InitialiseSettings.php '{{Gerrit|I62b3288}}' [13:52:00] Logged the message, Master [14:16:54] (PS1) Manybubbles: Require config file for staring Elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/116743 [14:23:09] (CR) Matanya: [C: 1] Require config file for staring Elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/116743 (owner: Manybubbles) [14:42:32] (CR) Ottomata: [C: 2 V: 2] Require config file for staring Elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/116743 (owner: Manybubbles) [14:52:53] !log stopping puppet on gallium to play with apache configuration [14:53:02] Logged the message, Master [14:53:43] !log switched wikitech to allow eqiad access, turned off pmtpa instance creation [14:53:51] Logged the message, Master [14:55:36] andrewbogott: eqiad labs 
looks really good [14:55:49] i even logged in to there yesterday [14:55:49] good! Hope it holds up today [14:56:11] what is happening today? [14:56:13] matanya: starting today you should only access it via wikitech. Things will get a bit out of sync otherwise. [14:56:23] matanya, switched on in wikitech so everyone can create instances there. [14:56:59] so anything i currently do via tampa will break? or anything it do that needs to change? [14:59:15] um... [14:59:34] matanya: wikitech is in tampa. [14:59:42] (PS2) Greg Grossmeier: Log length of l10nupdate to SAL and Graphite [operations/puppet] - https://gerrit.wikimedia.org/r/116718 [15:00:00] But, for a while some of us were using virt1000 (in eqiad) as the eqiad labs gui. That's all I was referring to. [15:00:28] ok, but in regards for user stuff, any change projected? [15:01:57] matanya: https://wikitech.wikimedia.org/wiki/Labs_Eqiad_Migration_Howto [15:02:30] thanks [15:03:20] matanya, speaking of which, are you attached to the site-testing instance in the 'puppet' project? [15:03:31] If you're using it currently then I'll move it to eqiad; if not, I'll just delete it. [15:03:55] since akosiaris merged the site.pp change it can now go :) [15:04:05] great [15:04:55] andrewbogott: same for puppet-svn [15:04:59] was merged too [15:05:14] ok, I think you deleted that one already anyway [15:05:21] i have? [15:05:33] !log reenabling puppet on gallium [15:05:35] hm… 'puppet' project looks empty to me. Are you seeing an instance there still? [15:05:41] Logged the message, Master [15:05:56] yes, i see an instance, but it is empty [15:06:40] i see two in fact, but both are marked deleted, i wonder why they show up [15:06:53] what url are you looking at? [15:07:23] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000009e0.pmtpa.wmflabs [15:07:31] and https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000009b8.pmtpa.wmflabs [15:08:28] Oh, ok. 
Just so long as you don't see them on the 'manage instances' page I'm not too worried. [15:09:38] but i do need the etherpad instances moved [15:15:38] matanya: OK, best to create bugzilla tickets for that. [15:15:44] i will [15:15:53] thx [15:16:00] (PS1) Hashar: contint: do not cache api/json calls [operations/puppet] - https://gerrit.wikimedia.org/r/116748 [15:16:46] andrewbogott: under toollabs? [15:17:08] there's a link on the above page that will file a bug properly. We have a tracking but for migration tasks. [15:17:38] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [15:18:27] (CR) Hashar: [C: 1 V: 2] "This solve jenkins job builder throwing error:" [operations/puppet] - https://gerrit.wikimedia.org/r/116748 (owner: Hashar) [15:19:30] https://bugzilla.wikimedia.org/show_bug.cgi?id=62207 [15:28:27] as I understand it, the daily backups of Wikidata have failed for several days [15:28:58] can someone tell me what the issue is ? [15:29:14] matanya: do you happen to know what if anything is in the /data/projects dir in etherpad? Do you want to do a search and rescue or should I copy over everything? [15:29:46] let me take a look [15:32:23] apergos: ^^ see at XX:28 [15:33:05] andrewbogott: it wasn't created by me [15:33:18] what daily backups? [15:33:22] matanya: ok [15:33:46] i only need the puppet/webserver of that [15:33:47] apergos: /me shrugs [15:34:17] matanya: do you know if anyone is using/cares about that project? [15:34:35] GerardM-: what backups, specifically? [15:34:51] greg-g: Dumps [15:35:07] andrewbogott: i know ^d did in the past [15:35:13] need to check with him [15:35:32] what dumps? sorry but we don't have daily xml dumps of anything and that's what I would be aware of [15:35:49] matanya: ok. Mind making a note to that effect on the progress page? 
https://wikitech.wikimedia.org/wiki/Labs_Eqiad_Migration/Progress [15:36:05] i geuss a short mail to members will reveal the real status [15:36:18] * greg-g steps out of this conversation and let's GerardM- and apergos figure it out :) [15:37:26] done [15:37:42] GerardM-: is there a bug report? [15:38:13] matanya: ty [15:38:55] Last dump was 1 hour and 20 minutes ago for Wikidata. [15:39:08] greg-g: the daily backups have failed several times (this was confirmed by Lydia) [15:39:43] GerardM-: please direct your questions/statements to apergos :) [15:39:58] what daily backups? I'm trying to understand which these are [15:40:53] since when do we have backups? [15:41:29] mark; database dumps [15:41:47] !log upgraded php5 packages, php5-wmerrors package and libmemcached11 on mw1017. This will make puppet and the corresponding icinga check unhappy. [15:41:52] mysql snapshots? [15:41:55] Logged the message, Master [15:41:57] mark: https://dumps.wikimedia.org/ fwiw :p [15:42:09] those are not daily [15:42:30] blurg not https for that [15:43:08] we don't do any daily dumps; as a side projct we do experimental adds/changes of the wikis but those are not guaranteed to run or even have reasonable content [15:43:32] (huh re no tls for dumps.wikimedia) [15:44:05] but the regular dump runs are not daily and never have been [15:44:07] akosiaris: Can we check in about the state of your labs project? Either I failed to migrate some instances to eqiad or you deleted some there...? [15:44:19] i deleted [15:44:45] andrewbogott: in general the migration was rather uneventful [15:44:52] ok. So are you happy with the state of eqiad there? Shall I mark that project as finished? [15:44:58] they also have run successfully for as far back as we keep them... [15:45:02] yes :-) [15:45:08] (That means that the pmtpa instances may get deleted sometime) [15:45:20] ok, great! 
two down [15:45:21] the sooner the better [15:46:28] Well, I'll delete them right now if you don't mind, still having some space issues in pmtpa [15:46:29] andrewbogott: i think you can mark the puppet one too [15:46:36] matanya: yep, done. [15:46:42] thanks [15:46:53] matanya: there's also the 'puppet-cleanup' project which I'm unsure about [15:47:04] it is mostly abaonaded [15:47:34] I thought so, but… still rude to delete people's stuff [15:48:02] ori: how did the cookie stuff work after the inc fix? seems to have been the source of the segfault? [15:48:38] andrewbogott: jzerebecki is the only user there i think [15:48:52] (PS5) BBlack: Set ZeroOpts=tls cookie for HTTPS-enabled Zero clients [operations/puppet] - https://gerrit.wikimedia.org/r/115669 [15:49:06] the protorel thing was for relative protocol fixes, can't remember by who [15:49:35] those from 2012 can go for sure [15:49:40] ok [15:51:26] if mutante comes around when i'm absent, i'd apprerciate of someone can tell him i would like to take care of wikibugs [15:51:28] andre__: ^ [15:51:51] okay. whatever "taking care" means :P [15:52:04] move it from "misc" [15:53:08] PROBLEM - zuul_service_running on gallium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:53:28] PROBLEM - SSH on gallium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:59] PROBLEM - zuul_gearman_service on gallium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:53:59] PROBLEM - HTTP on gallium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:54:58] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730 [15:54:58] RECOVERY - HTTP on gallium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 563 bytes in 0.690 second response time [15:55:00] swap death [15:55:08] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [15:55:13] https://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=gallium.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [15:55:48] matanya: take care or "take care"? :> [15:55:57] :) [15:56:18] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:56:33] MatmaRex: take care == rm -rf [15:57:11] matanya: on a possibly related note, it seems that the wikibugs that is running right now is not the same version that is in the repository [15:57:27] means? [15:57:28] matanya: the code in the repo writes to #wikimedia-dev, while wikibugs sits on #mediwiki right now [15:57:41] and everyone is afraid to touch that [15:57:49] it's been this way since, like, september [15:58:15] i'll look at it, and try to do it the wiki way: be bold [15:58:42] there's a bug for that [15:58:43] or two [15:59:27] there is a bug that the bugs aren't bugging us correctly? :D [15:59:28] PROBLEM - SSH on gallium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:50] any opsen wants to help gallium? [16:00:08] PROBLEM - zuul_service_running on gallium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:18] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:00:38] * matanya points @ apergos  [16:00:59] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [16:01:33] <^d> matanya: backups? huh? 
[16:02:11] ^d: no, etherpad [16:02:17] I'm almost on the host (waiting for prompt) [16:02:28] <^d> matanya: What about etherpad? [16:03:20] i remember you had something to do with https://wikitech.wikimedia.org/wiki/Nova_Resource:Etherpad [16:03:28] PROBLEM - SSH on gallium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:03:35] <^d> I did? News to me :) [16:04:01] ^d: maybe i have a memory leak [16:04:38] gallium might have to be powercycled; not getting a prompt after motd [16:04:46] <^d> Happens to the best of us. [16:05:04] hashar, have you been working on labs migration at all? If so, how are things going? [16:05:07] ^d: no, i'm still sane : https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Etherpad&action=history [16:05:39] andrewbogott: not at all beside creating a bunch of instances on virt1000 [16:05:52] andrewbogott: thinking about it, I might well recreate beta entirely on eqiad [16:05:55] <^d> matanya: Heh, removing myself from a project I never used counts as using it? :p [16:06:10] hashar: that was going to be my question… planning to rebuild things afresh? [16:06:12] well, kinda :) [16:06:23] hashar, btw, don't use virt1000 anymore. You can create eqiad instances right on wikitech. [16:06:40] awesome [16:06:58] PROBLEM - zuul_gearman_service on gallium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:59] PROBLEM - HTTP on gallium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:05] arf [16:07:09] PROBLEM - zuul_service_running on gallium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:07:14] that is not me [16:07:24] apergos: i'm for it [16:07:50] !log gallium has been sent to swap [16:07:57] Logged the message, Master [16:08:14] hashar: actually I'm going to disable instance creation on virt1000 [16:08:28] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:08:32] hashar: that isn't new, it has been there for some time: https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=Miscellaneous+eqiad&h=gallium.wikimedia.org&tab=m&vn=&hide-hf=false&metric_group= [16:08:45] a while being 30 minutes ? :D [16:08:48] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730 [16:08:48] RECOVERY - HTTP on gallium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 563 bytes in 0.261 second response time [16:08:58] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [16:09:08] I'm o it [16:09:16] and it's recovered, I bet something just got shot [16:09:43] yeah java taking 6GB memory :/ [16:09:59] maybe a stronger machine is needed [16:10:05] say 16 GB [16:10:15] though java takes what ever you give it [16:10:21] na I need 3 or 4 weeks of uninterrupted work to phase out jenkins entirely :D [16:10:34] and replace it by what? [16:10:38] travis? [16:12:31] I shot one test job over there that had been running for a little while and was the last big hog [16:13:00] matanya: so if ^d wants nothing to do with etherpad, the question about shared storage comes back to you... 
[16:13:08] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [16:13:09] !log gallium : killed leftover jenkins instance {{bug|51817}} "Jenkins init script is crap" [16:13:18] Logged the message, Master [16:13:43] I should get pages/emails for those [16:13:49] andrewbogott: might be any other member, but for all i care it can go [16:14:11] matanya: ok, nothing in there you care about at all? [16:14:21] puppet config :) [16:14:31] looks normal-ish now [16:14:39] but i can take it from git [16:15:03] so, if you prefer, you can delete it and i'll create the instance in eqiad [16:15:48] np, I can migrate. [16:17:00] thanks, i'm out for now [16:17:42] (CR) CSteipp: [C: 1] "Just reading through it, that looks right to me." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114656 (owner: Hoo man) [16:25:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [16:33:24] (PS6) BBlack: Set ZeroOpts=tls cookie for HTTPS-enabled Zero clients [operations/puppet] - https://gerrit.wikimedia.org/r/115669 [16:38:43] (CR) BBlack: [C: 2 V: 2] "Validated manually on deployment-cache-mobile01.pmtpa.wmflabs, seems to work correctly in all cases." [operations/puppet] - https://gerrit.wikimedia.org/r/115669 (owner: BBlack) [17:07:06] greg-g: ok, can I have the conch shell of deployment? [17:10:08] manybubbles: you may speak.. er, deploy [17:10:47] thanks!
[17:18:04] !log manybubbles synchronized php-1.23wmf16/extensions/CirrusSearch/ '3 small fixes including search timeouts' [17:18:12] Logged the message, Master [17:19:39] * manybubbles gives greg-g back the conch [17:19:52] everything still looks good [17:21:07] manybubbles: awesome [17:21:16] !log rebuilding the search index on commons after failing yesterday now that I've deployed a fix to the timeout issue [17:21:24] Logged the message, Master [17:28:47] bblack: yes, seems to work [17:28:55] bblack: curl -Is http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page | grep Set-Cookie [17:30:12] ori: awesome. let's try putting it back in prod again later today (like, 3-4 hours from now) [17:30:56] bblack: cool, yeah! [17:31:10] the inc bug being the segfault makes sense in general, probably depended on random factors like string length, but would eventually have written to unallocated memory (or at least read it) [17:31:15] sorry about that :) [17:32:10] notjhing to be sorry about, thanks for all your help with this [17:32:25] np! [17:43:26] (03CR) 10Ottomata: "Sure. Ok to merge?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116140 (owner: 10BryanDavis) [17:43:43] bd808 ^ [17:44:09] ottomata: I think it should be safe to merge yes [17:44:47] ok thank you for your confidence! 
:p [17:44:55] (03PS2) 10BryanDavis: Remove use of deprecated --extended argument to mwversionsinuse [operations/puppet] - 10https://gerrit.wikimedia.org/r/116140 [17:45:11] (03CR) 10Ottomata: [C: 032 V: 032] Remove use of deprecated --extended argument to mwversionsinuse [operations/puppet] - 10https://gerrit.wikimedia.org/r/116140 (owner: 10BryanDavis) [17:48:12] (03PS1) 10Dzahn: add Icinga contactgroup for contint [operations/puppet] - 10https://gerrit.wikimedia.org/r/116765 [17:51:04] (03CR) 10Ottomata: Config changes for Elasticsearch (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/116498 (owner: 10Manybubbles) [17:51:16] (03CR) 10Dzahn: [C: 032] add Icinga contactgroup for contint [operations/puppet] - 10https://gerrit.wikimedia.org/r/116765 (owner: 10Dzahn) [17:53:24] (03CR) 10Manybubbles: Config changes for Elasticsearch (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/116498 (owner: 10Manybubbles) [17:57:06] (03CR) 10Ottomata: Config changes for Elasticsearch (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/116498 (owner: 10Manybubbles) [17:57:25] bblack, thx for post-processed cookie merge. hey, did you happen to see my question about removal of the vcl_deliver part in the gerrit comments? any feedback there? [17:57:54] akosiaris: still around? [17:58:04] getting ready for a meeting [17:58:09] you got 2 mins :P [17:58:10] aye, real quick [17:58:17] if copyright years are mismatched [17:58:20] but everything else is the same [17:58:26] can I include files under the same Files section? [17:58:29] like [17:58:31] (03PS1) 10Andrew Bogott: Should be --os-region not --region [operations/puppet] - 10https://gerrit.wikimedia.org/r/116767 [17:58:39] (03CR) 10Manybubbles: Config changes for Elasticsearch (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/116498 (owner: 10Manybubbles) [17:58:45] Apache 2006-2010 [17:58:45] vs Apache 2008-2012 or something [17:58:47] does it matter? 
[17:58:52] can I just take the span and say [17:59:05] Copyright: Apache 1908 - 2532? :p [17:59:07] heheh [17:59:51] In the year 2525 if the webserver's still alive ... :P [18:00:13] oh god [18:00:20] (03CR) 10Ottomata: Config changes for Elasticsearch (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/116498 (owner: 10Manybubbles) [18:00:28] hmmm. I would say just put the union there [18:00:35] you are a good man. [18:00:39] way too nitpick to be of importance [18:00:40] (03CR) 10Andrew Bogott: [C: 032] Should be --os-region not --region [operations/puppet] - 10https://gerrit.wikimedia.org/r/116767 (owner: 10Andrew Bogott) [18:00:42] that is the answer I wanted to hear [18:01:05] if we ever get really legal trouble over this I .... [18:01:17] you know what? I ain't gonna finish that sentence [18:01:20] hahah [18:01:35] law surprises me way too often [18:02:14] dr0ptp4kt: no, I didn't quite follow all of that yet, maybe it will make sense if I read it 3 more times. but in any case, what we have now works. being able to kill the vcl_deliver bit is just an optimization. [18:04:50] oh, akosiaris, also, I assume 'Files: *' by default just apply to anything not explicitly listed? [18:04:55] right? [18:05:01] bblack, cool. and sorry. if we don't have to remove it, that will be best. i think eventually in an esi world we can come up with some way to delete the vcl_deliver for the optimiazation, but i suppose that will be a way off. i also recall some discussion about aging the cookies; the cookie and as i recall http rfcs allow for this sort of thing. [18:05:25] Urgh, I am getting lost in wikitech, what's the bastion to use for stat1 in eqiad? [18:06:09] Er...pmtpa [18:06:11] (03PS1) 10Andrew Bogott: Revert "Should be --os-region not --region" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116769 [18:06:12] I'm all turned around [18:07:02] Maybe there's a doc page with example SSH configs? 
[18:07:06] (03CR) 10Andrew Bogott: [C: 032] Revert "Should be --os-region not --region" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116769 (owner: 10Andrew Bogott) [18:11:55] ottomata: yes [18:13:36] danke [18:17:42] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [18:19:55] (03PS6) 10Ottomata: Initial 2.0.0-1 debian release [operations/debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/115323 [18:20:04] (03CR) 10Andrew Bogott: [C: 032] Add a line specifying the nova api rate limits. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115614 (owner: 10Andrew Bogott) [18:21:47] bblack, you know if that change was deployed out to all relevant prod varnishes? it seems there's no Set-Cookie header. if i just need to be patient, understood! [18:24:33] (03CR) 10Andrew Bogott: [C: 032] Turn rate limits WAY up for nova api. [operations/puppet] - 10https://gerrit.wikimedia.org/r/115615 (owner: 10Andrew Bogott) [18:24:39] dr0ptp4kt: I don't have the ability/time at the moment to debug it in prod, but are you saying you get no cookie, but you do get a TEST carrier banner? [18:24:57] (and you're not already sending the cookie from an earlier set-cookie?) [18:30:17] (03PS1) 10Ottomata: Giving ottomata (myself) ops icinga access [operations/puppet] - 10https://gerrit.wikimedia.org/r/116771 [18:30:47] bblack, no worries, it can wait until later. to answer your question, starting with a browser completely clear of cookies, cache, localstorage, etc. and sending an X-CS header of 646-02 (and not sending any cookies) in the request to http://en.m.wikipedia.org/wiki/Computers results in a response without the Set-Cookie response header. the banner and other normal page rewriting is showing as normal, though. 
seems like forcing a [18:30:48] deliberately cachebusted page like http://en.m.wikipedia.org/wiki/Computers?cb=87543892349075 has the same effect of no set-cookie response header [18:32:05] dr0ptp4kt: I don't know if it makes any difference, but I've never tested via setting my own X-CS header [18:32:33] usually I add myself to the TEST carrier, there's some existing networks there now, but donno if you can test from them now [18:34:19] bblack, yeah, i think in prod without new ips being pulled and refreshed via the cronjob, the netmapper tagging wouldn't do what we expect. which url are you using? [18:34:32] ...then i'll leave you alone as i understand you're busy [18:34:56] I tested via http://en.m.wikipedia.beta.wmflabs.org/wiki/Supreme_Court_of_the_United_States [18:35:10] (which still has puppet disabled and a manual copy of the same patch) [18:36:03] you mean has puppet 'enabled'? [18:36:17] no, disabled, from testing before merge [18:36:30] I'm turning it back on though, to see if there was any real diff in my actual merge and what was tested [18:37:13] looks the same after puppet run [18:37:33] do you get your banner+cookie there on beta? [18:38:30] bblack, banner, but no cookies [18:38:38] (starting from fresh browser) [18:38:50] well, works for me, but I'm manually in the TEST carrier there (by editing zero.json directly) [18:38:59] if you send me your IP, I can add you to test that way [18:39:37] I'm not 100% sure about the effects of manually setting X-CS. It sort of seems like it works, but there's no real guard against re-setting it based on XFF, etc [18:41:08] bblack, yeah, i just tried forging both 646-02 and TEST as X-CS, but got the same: banner, but no set-cookie that i see. 
[18:41:21] (oh, actually, that's almost certainly the problem (testing via setting your own X-CS) - because it's the internal X-CS-2-setting code in zero.inc.vcl that also sets the ZeroOpts stuff up) [18:41:40] (^ so what I'm saying is, forging X-CS definitely doesn't work for testing the ZeroOpts Set-Cookie) [18:43:46] bblack, LOL, okay, i get it [18:44:02] try now, your addr is added to TEST (assuming you end up hitting beta via v6) [18:44:04] * dr0ptp4kt  [18:46:26] !log bringing analytics1007 and analytics1013-1016 into elasticsearch cluster [18:46:34] Logged the message, Master [18:46:34] bah [18:46:40] i always type that [18:46:40] hah [18:46:53] !log s/analytics/elastic in previous log message [18:47:02] Logged the message, Master [18:47:27] bblack i tried http://en.m.wikipedia.beta.wmflabs.org/wiki/Supreme_Court_of_the_United_States, but it didn't tag it at all. no banners, no outbound x-analytics or x-cs headers, etc. [18:47:42] probably that IP isn't what you're actually hitting beta with [18:51:15] (03CR) 10Ottomata: "This is labs/ldap name, right? If so, 'ottomata' should do it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/116771 (owner: 10Ottomata) [18:56:33] bblack: that latest change is the ticket. i'm pretty confident that it works fine in production, but would it be possible to get my ipv6 and ipv4 addresses added to the json file for the 646-02 part of the config on prod to do some prod testing? i have a couple other ipv6 addresses for two smartphones on home wifi here as well. i can email them to you for dealing with later if it's doable [18:57:03] ok so you got set-cookie on beta now? [18:57:11] bblack, yes! :) [18:57:20] I have a patch prepped to give you guys an interim solution for editing the JSON again [18:57:21] looks like beta does ipv4 [18:57:28] oh sweet [18:57:40] (basically, you guys will have access to scp new copies of the JSON file somewhere secure that prod pulls from) [18:57:52] right on. 
i like scp [18:58:04] but I need to get back home and deal with AT&T U-verse issues as soon as this meeting is over, so it will be later this afternoon before I merge that in [18:58:16] i've been having similar fun...good luck [18:58:29] (the perils of being in sf where they roll out new stuff every 5 days) [18:58:48] and thanks again [19:02:22] !log bd808 updated /a/common to {{Gerrit|I62b32888b}}: Create an autopatrolled group on itwikiquote [19:02:30] Logged the message, Master [19:02:35] lies [19:02:42] (03PS1) 10BryanDavis: Group1 wikis to 1.23wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/116774 [19:04:13] bd808|deploy: what? I already synced that hours ago [19:04:52] hoo: It's the git trigger lying when I added a commit directly on tin [19:05:08] rdwrer: ..so .. SSH config [19:05:18] Right [19:05:21] rdwrer: Host gallium [19:05:29] I see .P [19:05:40] mutante: hostname gallium.wikimedia.org [19:05:42] ProxyCommand ssh -W %h:%p mholmquist@bast1001.wikimedia.org [19:05:49] User mholmquist [19:05:58] --- [19:05:59] manybubbles: Elastica is angry about 10.2.2.30 [19:06:02] ssh gallium [19:06:05] ^d: ^^ [19:06:14] bd808|deploy: checking [19:06:45] manybubbles: Thanks. ~5000 retries in last 15 minutes [19:06:55] rdwrer: your key is there, watching auth.log if you wanna try now [19:07:02] ottomata: could you revert the lvs change you made for me regarding ^^^ [19:07:03] K, sec [19:07:19] Hm, that works fine [19:07:28] But trying to do it with wildcards for some reason doesn't [19:07:33] manybubbles: all of them? [19:07:34] just new ones? [19:07:43] ottomata: not really sure [19:07:49] all I imagine [19:07:49] greg-g: I'm ready to roll for group1 to 1.23wmf16 but will wait to see if manybubbles can make Elastica happier [19:07:51] rdwrer: cool, yep, session opened:) [19:07:52] ok [19:08:03] rdwrer: sounds vaguely familiar problem somebody else had [19:08:35] bd808|deploy: which log file contains these? 
[19:08:47] manybubbles: fatal.log [19:09:19] !log depooled elastic1007 and elastic1013-1016 per request from manybubbles [19:09:20] manybubbles: PHP Warning: Retrying connection to 10.2.2.30 after 1 attempts. [Called [19:09:20] from {closure} in /usr/local/apache/common-local/php-1.23wmf15/extensions/Elast [19:09:20] ica/ElasticaConnection.php at line 92] in /usr/local/apache/common-local/php-1.2 [19:09:20] 3wmf15/includes/debug/Debug.php on line 303 [19:09:27] Logged the message, Master [19:10:05] mutante: So what, I just specify the bastion manually for each one? :/ [19:10:58] manybubbles, ottomata: depool dropped the new error rate to zero [19:11:27] bd808|deploy: cool. we'll figure out which server is being bad and kick it [19:11:41] elastica looks like it did its job and retried when the server was down [19:11:46] rdwrer: you mean "Host gallium stat1" ? yea [19:11:51] manybubbles: I'm looking at https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor. Last 5 minutes view shows the trend nicely at the moment. [19:12:04] ...oh that's brilliant [19:12:12] I didn't know I could do that :3 [19:12:16] I'll watch that [19:12:29] rdwrer: heh, i thought you asked for even more wildcard than that [19:12:37] but that's what i do , yea [19:12:46] mutante: But stat1...is stat1 in the same datacenter as tin and gallium et al.? [19:13:02] rdwrer: no, it's not, i hope it will be soon [19:13:19] greg-g: Ready when you say it's ok. [19:13:28] mutante: Should I go through a different bastion 'til then? [19:13:39] rdwrer: either it's still in tampa, then you dont need to proxy, or it will be moved then it will be just like tin and gallium [19:14:02] rdwrer: ..or ..use fenari [19:14:03] mutante: But I tried not-proxying and that didn't work at all [19:14:23] It was a key error I guess [19:14:56] rdwrer: session opened for user mholmquist [19:15:00] * ^d had just stepped away for a sec. [19:15:08] Accepted publickey for mholmquist ... 
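A minimal sketch of the SSH setup being discussed here, assuming Python is at hand: it renders one ssh_config stanza per host, each reached through the bastion with ProxyCommand ssh -W %h:%p as in the snippet pasted above. The helper name render_ssh_config and the idea of generating the file are illustrative assumptions, not something from the channel; the host, bastion, and user values are the ones quoted in the conversation.

```python
def render_ssh_config(hosts, bastion, user):
    """Return ssh_config text with one stanza per (alias, hostname) pair.

    Every stanza tunnels through `bastion` using `ssh -W`, which forwards
    stdin/stdout to the target host and port (ssh expands %h and %p).
    """
    stanzas = []
    for alias, hostname in hosts:
        stanzas.append(
            f"Host {alias}\n"
            f"    HostName {hostname}\n"
            f"    User {user}\n"
            f"    ProxyCommand ssh -W %h:%p {user}@{bastion}\n"
        )
    # A blank line between stanzas keeps the generated file readable.
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Values quoted in the channel; the helper itself is hypothetical.
    print(render_ssh_config(
        hosts=[("gallium", "gallium.wikimedia.org")],
        bastion="bast1001.wikimedia.org",
        user="mholmquist",
    ))
```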
[19:15:21] I just did for stat1 [19:15:25] But through bast1001 [19:15:29] Which seems mighty silly [19:15:42] rdwrer: for direct connect you'll need to make it stat1.wikimedia.org [19:15:48] <^d> The heck happened? [19:15:50] But that's when...OK sec. [19:16:33] ...huh [19:16:33] (03PS2) 10John F. Lewis: Apache config for legalteamwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/116219 (owner: 10Jalexander) [19:16:36] It worked that time [19:16:38] Graw damn it [19:17:11] ^d: one or more non-responsive hosts added to the search pool apparently [19:17:29] Oh, that's the problem, I must be giving it the wrong key somehow [19:17:36] <^d> bd808|deploy: Yeah I knew they were pooled. [19:17:42] <^d> Wonder why they aren't responding :\ [19:17:55] Probably just tired :) [19:18:14] * bd808|deploy pings greg-g one more time for luck [19:19:07] ottomata: any chance something about lvs isn't working on them? they respond to :9200 [19:19:13] like, via their names [19:19:18] (03CR) 10BryanDavis: [C: 032] Group1 wikis to 1.23wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/116774 (owner: 10BryanDavis) [19:20:05] (03Merged) 10jenkins-bot: Group1 wikis to 1.23wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/116774 (owner: 10BryanDavis) [19:21:11] bd808|deploy: hey, sorry, was in-transit and it took longer than expected [19:21:25] bd808|deploy: god speed and such. [19:21:32] greg-g: No worries. [19:22:33] !log bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.23wmf16 [19:22:41] Logged the message, Master [19:22:59] ^d: sync-wikiversions worked with the batch options added :) [19:23:10] <^d> yay [19:23:18] greg-g: How do I/you/we test? [19:23:32] go to special:version [19:23:35] tail logs [19:23:52] <^d> Have Paula Pray 9 times. 
[19:24:00] s/tail/logstash/ ;) [19:24:08] commons looks sane [19:24:13] manybubbles: everything looks fine in pybal logs [19:24:59] manybubbles: https://gist.github.com/ottomata/9353723 [19:25:02] Randomly selected fr.wiktionary.org at 1.23wmf16 as expected [19:25:08] also https://gdash.wikimedia.org/dashboards/reqerror/deploys [19:25:08] that's from the new nodes when we pooled them [19:25:10] ... [19:25:36] (looks fine, except the deploys tick not working grumble) [19:25:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [19:26:29] ^d: that brings me back [19:26:29] oh, and ignore all puppet warnings ;) [19:26:33] * bd808|deploy is getting a rendering error from graphite.wikimedia.org for that dashboard [19:26:42] yeah, untick the deploys checkbox [19:26:47] oh, graphite [19:26:51] ah [19:27:41] <^d> manybubbles: One of my favorite games ever :) [19:27:53] greg-g: Hmm… that seems to confirm the suspicion that something is wring with the metricd relay for metrics sent from tin [19:28:00] I was replaying it a year ago but my wife got bored [19:28:49] <^d> Aw. [19:29:16] ottomata: so I'm stumped - do we punt an email the ops list for help? [19:29:54] bd808|deploy: when you're done, if you have 2 sentences of what you think/don't know, that'd be nice to send to Ops/engineering (flip a coin) [19:31:31] greg-g: I can telnet to statsd.eqiad.wmnet:2003 which is where we are sending the statsd packets [19:32:07] * bd808|deploy doesn't have access rights to log into tungsten.eqiad.wmnet to check further [19:32:25] yeah, will need to punt to ops then [19:33:02] manybubbles: i have to run really soon, i teach on tuesdays [19:33:09] (03PS5) 10John F. Lewis: Initial setup for legalteamwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112850 (owner: 10TTO) [19:33:36] greg-g: 10 minutes after version change and fatalmonitor looks positively boring. When do we declare success and move on? 
[19:33:52] (03CR) 10Alexandros Kosiaris: [C: 032] Giving ottomata (myself) ops icinga access [operations/puppet] - 10https://gerrit.wikimedia.org/r/116771 (owner: 10Ottomata) [19:34:09] bd808|deploy: a definition of success other than the opposite of "things have gone bad"? [19:34:17] :) [19:34:31] the lack of "things have gone bad", actually [19:34:48] * bd808|deploy really wishes someone would fix https://bugzilla.wikimedia.org/show_bug.cgi?id=54193 [19:34:50] so, yeah, you're good, bd808|deploy, maintenance script away [19:35:02] bd808|deploy: you feel ok about that part? [19:35:30] greg-g: I'll give it a shot. I got the list of wikis but don't know how to verify the change [19:35:42] ragesoss: around? [19:35:53] yes, for a short time. [19:35:54] ragesoss: we're about ready to run that maint script [19:35:59] wanna test for us? [19:36:04] sure. [19:36:10] ty [19:36:16] bd808|deploy: fire when ready [19:37:06] first try not so successful [19:37:09] The MediaWiki script file "/usr/local/apache/common-local/php-1.23wmf16/maintenance/fixSummaryData.php" does not exist. [19:37:25] AndyRussG: ^ [19:37:26] * bd808|deploy goes to read the patch again [19:38:09] Ah. It's in the extension not core [19:39:29] ragesoss: I ran it against test2wiki. No error output; no output at all [19:40:50] AndyRussG, is that the expected behavior? [19:41:20] ragesoss: I think so. There are no output statements in the script [19:41:42] How do we verify that summary data has been updated? [19:41:49] I'm search for the instances of the problem it is supposed to correct, but I cannot find any. [19:42:01] which either means it worked, or there weren't any in the first place on that wiki. [19:42:06] :) [19:42:12] en.wiki is the only place I know there is bad data. [19:42:36] Should be try it on another wiki where you know there is bad data before doing them all? 
[19:42:41] s/be/we/ [19:43:11] for example, this page lists "0" courses and "17" students, but there is one course lists, and it has 16 students: https://en.wikipedia.org/wiki/Education_Program:Case_Western_Reserve_University [19:43:18] yeah, go ahead and run it on en.wiki. [19:43:39] (nothing terrible happened on test2 that I can see.) [19:43:51] ragesoss: {{done}} [19:44:00] For enwiki [19:44:01] huzzah! [19:44:04] yay [19:44:06] it seems to have worked. [19:44:07] bd808|deploy: ragesoss yes that's the expected behavior [19:44:24] the test case linked above now shows the correct data. [19:44:38] Cool. I'll go ahead and run it on the rest of the wikis [19:45:04] bd808|deploy: greg-g thanks a ton [19:45:15] np [19:45:25] excellent, thanks. after purging cache, all the obvious errors are gone from [[Special:Institutions]] as well. [19:45:31] ragesoss, AndyRussG: greg-g: {{done}} [19:45:31] thanks all! [19:45:36] Much of the data that was bad was getting corrected bit by bit--it's corrected any time a user edits any of those Education program pages [19:46:12] That may be why some but not all the errors seemed to have corrected themselves [19:46:30] The script was necessary to just go through everything [19:49:02] bd808|deploy: alright, you all done/feel good? [19:49:48] greg-g: I don't see anything more broken than normal [19:50:08] good enough for me (sadly) [19:50:24] :p [19:51:40] (03PS2) 10Ori.livneh: reprepro: import Facebook's HHVM pkgs, for Labs / Vagrant usage only [operations/puppet] - 10https://gerrit.wikimedia.org/r/112314 [19:51:50] (03PS3) 10Ori.livneh: reprepro: import Facebook's HHVM pkgs, for Labs / Vagrant usage only [operations/puppet] - 10https://gerrit.wikimedia.org/r/112314 [19:52:44] Any ops here have a moment to look at the statsd service on tungsten.eqiad.wmnet to see if there is an obvious reason that the metrics sent from scap and other deploy scripts are not showing up in graphite? 
[20:28:09] (PS1) Hashar: beta: natfixup for eqiad public IP addresses [operations/puppet] - https://gerrit.wikimedia.org/r/116787 [20:28:50] (PS1) Hashar: beta: sent HTCP purges to eqiad varnishes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116788 [20:34:44] (PS1) Catrope: Really actually fix VE on private wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116792 [20:37:57] (CR) Catrope: [C: 2] Really actually fix VE on private wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116792 (owner: Catrope) [20:38:04] (Merged) jenkins-bot: Really actually fix VE on private wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116792 (owner: Catrope) [20:38:33] !log catrope updated /a/common to {{Gerrit|I7ff564e4a}}: Really actually fix VE on private wikis [20:38:41] Logged the message, Master [20:39:05] !log catrope synchronized wmf-config/CommonSettings.php 'Actually fix VE on officewiki this time' [20:39:13] Logged the message, Master [20:41:04] (CR) Hashar: "The varnish instances are not serving traffic yet though :]" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116788 (owner: Hashar) [20:45:12] !log operations/apache-config.git now has a "betacluster" branch to host the beta cluster apache configuration files {{bug|56395}} [20:45:19] Logged the message, Master [21:14:58] (PS1) Hashar: Import .gitignore from master branch [operations/apache-config] (betacluster) - https://gerrit.wikimedia.org/r/116859 [21:15:00] (PS1) Hashar: .gitreview with defaultbranch=betacluster [operations/apache-config] (betacluster) - https://gerrit.wikimedia.org/r/116860 [21:15:15] lovely jenkins [21:15:23] (CR) Hashar: [C: 2] Import .gitignore from master branch [operations/apache-config] (betacluster) - https://gerrit.wikimedia.org/r/116859 (owner: Hashar) [21:15:25] (Merged) jenkins-bot: Import .gitignore from master branch
[operations/apache-config] (betacluster) - https://gerrit.wikimedia.org/r/116859 (owner: Hashar) [21:15:37] (CR) Hashar: [C: 2] .gitreview with defaultbranch=betacluster [operations/apache-config] (betacluster) - https://gerrit.wikimedia.org/r/116860 (owner: Hashar) [21:15:39] (Merged) jenkins-bot: .gitreview with defaultbranch=betacluster [operations/apache-config] (betacluster) - https://gerrit.wikimedia.org/r/116860 (owner: Hashar) [21:15:52] should send that to #wikimedia-qa I guess [21:18:42] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 28 Feb 2014 08:45:44 AM UTC [22:08:48] (Abandoned) Chad: Create global /etc/gitconfig for deployment hosts [operations/puppet] - https://gerrit.wikimedia.org/r/114385 (owner: Chad) [22:10:28] (PS1) Ori.livneh: Labs: set $wgCentralGeoScriptURL to false for GeoIP cookie testing [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116873 [22:10:44] ^d: grr, did i drop the ball on that? [22:10:58] <^d> I dunno who did :p [22:11:21] (CR) Ori.livneh: [C: 2] Labs: set $wgCentralGeoScriptURL to false for GeoIP cookie testing [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116873 (owner: Ori.livneh) [22:11:57] ^d: i think it's OK. would you like me to restore and review it?
[22:12:18] (Restored) Chad: Create global /etc/gitconfig for deployment hosts [operations/puppet] - https://gerrit.wikimedia.org/r/114385 (owner: Chad) [22:12:20] (Merged) jenkins-bot: Labs: set $wgCentralGeoScriptURL to false for GeoIP cookie testing [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116873 (owner: Ori.livneh) [22:12:29] thanks [22:12:34] !log ori updated /a/common to {{Gerrit|I5ca0adf39}}: Labs: set $wgCentralGeoScriptURL to false for GeoIP cookie testing [22:12:41] Logged the message, Master [22:13:56] !log ori synchronized wmf-config/CommonSettings-labs.php 'Syncing lab config change I5ca0adf39 to prod for cluster consistency' [22:14:03] Logged the message, Master [22:18:01] (PS1) Ori.livneh: Set $text_enable_geo to true for cp1066 [operations/puppet] - https://gerrit.wikimedia.org/r/116877 [22:18:07] ^ bblack [22:18:23] (PS1) BBlack: Update netmapper JSON from noc.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/116878 [22:18:48] (CR) jenkins-bot: [V: -1] Set $text_enable_geo to true for cp1066 [operations/puppet] - https://gerrit.wikimedia.org/r/116877 (owner: Ori.livneh) [22:19:40] grrr. [22:20:07] (PS2) Ori.livneh: Set $text_enable_geo to true for cp1066 [operations/puppet] - https://gerrit.wikimedia.org/r/116877 [22:21:48] (CR) BBlack: [C: 2 V: 2] Update netmapper JSON from noc.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/116878 (owner: BBlack) [22:24:23] ori: I'm watching 1066 too [22:24:29] shall I merge, or? [22:24:54] bblack: yeah, let's try it [22:25:14] bblack: remember that you'll have to do a service varnish-frontend restart; the vcl loading thing doesn't work because of the compiler args issue [22:25:22] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1372: active_shards: 4019: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [22:25:22] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1372: active_shards: 4019: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [22:25:24] ok [22:25:45] (03PS3) 10Ori.livneh: Set $text_enable_geo to true for cp1066 [operations/puppet] - 10https://gerrit.wikimedia.org/r/116877 [22:25:50] (03CR) 10BBlack: [C: 032 V: 032] Set $text_enable_geo to true for cp1066 [operations/puppet] - 10https://gerrit.wikimedia.org/r/116877 (owner: 10Ori.livneh) [22:26:22] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1374: active_shards: 4021: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [22:26:22] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1374: active_shards: 4021: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [22:26:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Last successful Puppet run was Tue 25 Feb 2014 06:33:37 PM UTC [22:27:13] (03PS1) 10Bsitu: Update flow cache version number [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/116887 [22:27:44] ori: it's restarted [22:27:52] no segfault yet :) [22:28:01] cookies look to be set correctly too [22:28:39] given the long complicated history of this patch so far, let's just let it go on this one host for tonight, and then we can do the rest tomorrow [22:28:42] just in case :) [22:29:08] bblack: yes, perfectly fine by me -- that sounds sane [22:31:39] !log GeoIP cookie enabled on text frontend varnish on cp1066. Appears to work well. If it causes issues, roll back by reverting I7e5ca8e54 and running 'service varnish-frontend restart'. 
[22:31:46] Logged the message, Master
[22:55:28] !log ebernhardson synchronized php-1.23wmf16/extensions/Flow/
[22:55:37] Logged the message, Master
[22:56:35] * greg-g heads home to dinner
[22:58:24] !log ebernhardson synchronized php-1.23wmf15/extensions/Flow/
[22:58:32] Logged the message, Master
[23:02:52] (CR) Bsitu: [C: 2] Update flow cache version number [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116887 (owner: Bsitu)
[23:03:00] (Merged) jenkins-bot: Update flow cache version number [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116887 (owner: Bsitu)
[23:06:01] !log bsitu updated /a/common to {{Gerrit|I566e00bcc}}: Update flow cache version number
[23:06:10] Logged the message, Master
[23:06:58] !log bsitu synchronized wmf-config/CommonSettings.php 'Update Flow cache key to 3.0'
[23:07:06] Logged the message, Master
[23:08:04] (CR) Dzahn: [C: 2] remove "zhen" public IP, decom [operations/dns] - https://gerrit.wikimedia.org/r/116658 (owner: Dzahn)
[23:09:41] !log DNS update - removing 'zhen'
[23:09:49] Logged the message, Master
[23:16:07] bblack: would you mind if we enabled geo_cookie on one other host (or disabled it on cp1066 and enabled it elsewhere?) one thing i forgot to do is continuously sample time-to-first-byte before/after enabling
[23:16:28] sure
[23:16:38] maybe one in ulsfo?
[23:34:22] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 907.133362
[23:41:55] (PS1) Ori.livneh: Enable geo_cookie on cp1055 as well [operations/puppet] - https://gerrit.wikimedia.org/r/116905
[23:43:21] (CR) Ori.livneh: [C: 2 V: 2] "ok'd by bblack" [operations/puppet] - https://gerrit.wikimedia.org/r/116905 (owner: Ori.livneh)
[23:46:22] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[23:53:02] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 104.800003
[23:53:22] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 457.166656
[23:54:22] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 414.133331