[00:14:48] (03PS1) 10TTO: Change favicon for angwiktionary to ['w] icon [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97458 [00:24:04] (03PS1) 10Springle: repool db1050 at LB 0 for dumps & QueryPage::recache. depool db1037 for upgrade & schema change [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97459 [00:25:06] (03CR) 10Springle: [C: 032] repool db1050 at LB 0 for dumps & QueryPage::recache. depool db1037 for upgrade & schema change [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97459 (owner: 10Springle) [00:26:15] !log springle synchronized wmf-config/db-eqiad.php 'repool db1050. depool db1037' [00:26:30] Logged the message, Master [00:32:51] (03CR) 10Tim Starling: Normalise the path part of URLs in the text frontend (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 (owner: 10Tim Starling) [00:55:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [01:15:40] !log springle synchronized wmf-config/db-eqiad.php 'warm up db1037' [01:15:56] Logged the message, Master [01:20:45] !log mariadb 5.5.34 live-fire test on db1037 [01:21:01] Logged the message, Master [01:39:48] (03PS1) 10Springle: repool db1037 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97461 [01:40:15] (03CR) 10Springle: [C: 032] repool db1037 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97461 (owner: 10Springle) [01:41:15] !log springle synchronized wmf-config/db-eqiad.php [01:41:33] Logged the message, Master [02:14:09] !log LocalisationUpdate completed (1.23wmf4) at Mon Nov 25 02:14:09 UTC 2013 [02:14:25] Logged the message, Master [02:25:55] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 2d 0h 7m 53s [02:26:17] !log LocalisationUpdate completed (1.23wmf5) at Mon Nov 25 02:26:16 UTC 2013 [02:26:33] Logged the message, Master [02:37:27] (03CR) 10Tim Starling: Generate redirects.conf (031 comment) [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling) [02:37:48] (03PS2) 10Tim Starling: Generate redirects.conf [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 [02:39:00] (03CR) 10Tim Starling: "PS2:" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling) [02:40:11] webhostingwikipedia.com [02:40:20] * Aaron|home chuckles [02:42:34] A few domains have clearly been donated. 
[03:05:06] (03PS2) 10Tim Starling: Normalise the path part of URLs in the text frontend [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 [03:05:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 25 03:05:19 UTC 2013 [03:05:35] Logged the message, Master [03:05:36] (03CR) 10Tim Starling: "PS2: use memcpy()" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 (owner: 10Tim Starling) [03:56:49] (03PS1) 10Springle: depool db1045 for uprade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97463 [03:57:09] (03CR) 10Springle: [C: 032] depool db1045 for uprade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97463 (owner: 10Springle) [03:57:59] (03PS1) 10Andrew Bogott: Move the proxy API to a port 5668 [operations/puppet] - 10https://gerrit.wikimedia.org/r/97464 [03:58:09] !log springle synchronized wmf-config/db-eqiad.php 'depool db1045 for upgrade' [03:58:25] Logged the message, Master [03:59:45] (03CR) 10Andrew Bogott: [C: 032] Move the proxy API to a port 5668 [operations/puppet] - 10https://gerrit.wikimedia.org/r/97464 (owner: 10Andrew Bogott) [04:09:08] !log springle synchronized wmf-config/db-eqiad.php 'warm up db1045' [04:09:24] Logged the message, Master [04:10:18] !log mariadb 5.5.34 live-fire test on db1045 [04:10:34] Logged the message, Master [04:41:05] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [04:42:35] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.52 ms [04:43:45] (03PS1) 10Springle: repool db1045 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97469 [04:44:04] (03CR) 10Springle: [C: 032] repool db1045 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97469 (owner: 10Springle) [04:45:27] !log springle synchronized wmf-config/db-eqiad.php [04:45:42] Logged the message, Master [04:56:55] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [05:01:55] RECOVERY - NTP on mw31 is OK: NTP OK: Offset -0.001426935196 secs [05:14:20] (03CR) 10Ori.livneh: [C: 031] "Simple and elegant; I like it. For future projects, consider implementing configuration DSLs by extending Puppet with custom Ruby code ins" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling) [05:20:26] (03CR) 10Faidon Liambotis: [C: 032] "Thanks Tim. Good to go from my side, shall I deploy or will you?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling) [05:23:54] hey TimStarling, can you give me a pointer or two for investigating the issue that Sean reported (a sudden crunch of parser cache writes on pc1001)? I extracted the queries from cp1001's slow query log and the specific keys referenced in those queries. At this point I'm not stuck, but I'm sort of flailing around without a real plan. [05:25:09] can I see the list of keys? [05:26:03] TimStarling: iron:/home/ori/keys [05:26:19] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 2d 3h 8m 17s [05:26:35] note that it coincides exactly with a fundraising banner test run [05:26:46] no idea if it's related, but it's a nice coincidence [05:28:39] ori-l: it's quite broadly distributed then [05:28:57] yep [05:29:19] these are from write queries? [05:29:31] or from select queries as well? 
[05:29:52] only SqlBagOStuff::set [05:32:55] there were a lot of deletes as well [05:33:43] 50k deletes, 110k replaces [05:38:40] pity this shows nothing useful: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=MySQL+eqiad&h=pc1001.eqiad.wmnet&jr=&js=&v=1512&m=mysql_com_delete&vl=stmts&ti=mysql_com_delete [05:39:10] that metric script really should be fixed [05:40:43] this is quite nice: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=pc1001.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1385357718&v=21403783&m=mysql_innodb_rows_deleted&vl=rows&ti=mysql_innodb_rows_deleted&z=large [05:41:09] although it would be nicer as a derivative, if counter overflows could be filtered out [05:41:59] well, overflows and server restarts [05:42:36] was the restart a cause or an effect? [05:43:35] an effect [05:45:31] 15:34 springle: bounced pc1001 mysqld. massive spike of writes exhausted innodb txn log slots and wouldn't be killed [05:45:43] bah, sorry for the ping [05:46:12] so did the query rate definitely increase? [05:46:34] maybe I should click these CSV links... [05:48:35] 21M rows deleted? [05:48:40] https://graphite.wikimedia.org/render?from=-1weeks&until=now&width=500&height=380&target=query.DELETE.FROM_pcN_WHERE_keyname_X.count&target=query.REPLACE.INTO_pcN_keyname_value_exptime_VALUES_X.count&uniq=0.04967071581631899 [05:49:09] per hour? [05:49:34] no, it's a counter; it indicates the number of deletes since the server started [05:50:01] ah [05:52:43] I'm not sure how to interpret the Graphite graphs -- is it a real increase in the query rate, or was the count depressed by queries being backed up [05:54:03] the last one seems to show a normal-ish trendline [05:54:16] cept that spike on the 23rd [05:55:52] I've got the ganglia CSVs, there's no spike in the 14:42 or 15:24 samples [05:56:30] in either com_replace or com_delete [05:56:41] Hi bougyman. [05:57:01] hello Elsie [05:59:11] TimStarling: mysql_bytes_received spiked: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=MySQL+eqiad&h=pc1001.eqiad.wmnet&jr=&js=&v=37695292&m=mysql_bytes_received&vl=bytes&ti=mysql_bytes_received [05:59:46] that's one heck of a spike [06:00:15] * ori-l looks to see what sort of size limitations are enforced [06:01:11] ori-l: but no spike in bytes_in [06:01:26] which makes me think it is another metric bug [06:02:43] yes, you're right [06:02:48] on both counts [06:05:38] http://tstarling.com/stuff/pc1001-query-rate-2013-11-21.png [06:06:28] once you discount the bad sample, it looks like the server slowed down under normal traffic, then went back to normal after a restart [06:12:54] why did it slow down? well, cpu_wio was high starting from the 20th [06:20:43] hrm [06:20:44] http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=MySQL+eqiad&h=pc1001.eqiad.wmnet&jr=&js=&v=0.4&m=cpu_wio&vl=%25&ti=CPU+wio [06:21:17] that's quite nice [06:21:58] https://bugs.launchpad.net/ubuntu/+source/apt-xapian-index/+bug/363695 [06:22:07] "update-apt-xapian-index uses too much CPU and memory" [06:22:11] that's in cron.weekly [06:23:06] that's max: 5.83% in the graph [06:27:27] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
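The "nicer as a derivative" idea above maps onto a standard Graphite function, assuming the raw InnoDB counter were shipped to Graphite alongside the per-query statsd metrics: nonNegativeDerivative() turns a monotonically increasing counter into per-sample deltas and silently drops the negative jumps caused by counter overflows and server restarts. The metric path below is purely hypothetical, chosen only to mirror the Ganglia graph being discussed.

    # hypothetical metric path; nonNegativeDerivative() discards negative deltas from resets/overflows
    target=nonNegativeDerivative(servers.pc1001.mysql.innodb_rows_deleted)
    # scaleToSeconds(), where available, additionally normalises the deltas to a per-second rate
    target=scaleToSeconds(nonNegativeDerivative(servers.pc1001.mysql.innodb_rows_deleted),1)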
[06:27:49] yes, but that's a percentage of 24 cores [06:28:37] iowait is really the most awful disk metric [06:29:24] gtg [06:29:27] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [06:34:20] thanks for investigating [06:35:06] weekly cron runs on sundays anyways [06:37:15] (03CR) 10MZMcBride: "I don't think this is a "confirmed" bug, per se. I think the underlying idea here needs further consideration." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/97190 (owner: 10Faidon Liambotis) [06:49:10] (03CR) 10Aaron Schulz: [C: 031] Normalise the path part of URLs in the text frontend [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 (owner: 10Tim Starling) [06:50:03] (03CR) 10Peachey88: "> Dzahn: didn't you mean https://twitter.com/wikimediatech instead of https://twitter.com/wikimedia ? ...snip..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/97190 (owner: 10Faidon Liambotis) [06:51:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [06:58:36] ori-l: http://ori.scs.stanford.edu/ [06:59:00] HAHAHAHA :D [06:59:42] nice! [07:00:07] Fits nicely with https://github.com/cobrateam/roan :-) [07:00:34] there's also https://en.wikipedia.org/wiki/Ori_(Stargate) [07:00:59] The Ori are "a group of 'ascended' beings who use their advanced technology and knowledge of the universe to attempt to trick non-ascended humans into worshipping them as gods." [07:01:21] basically they were evil gods [07:02:07] i'm still disappointed we never meet the furlings in stargate [07:02:29] i've never watched it; krinkle pointed it out to me [07:02:51] * paravoid is a huge stargate fan [07:02:54] seasons 1-5 were good and should be watched [07:03:00] up to 7 was great [07:03:13] the ori story line was pretty meh tbh [07:03:21] s/meh/awful/ [07:04:05] no offence ori-l [07:06:45] yurik: around? [07:07:04] paravoid, yep [07:07:08] hey [07:07:14] hi [07:07:19] adam was mentioning a BDD script that you run periodically [07:07:28] bdd? [07:07:51] that's how he called it [07:08:05] what does it do? :) [07:08:11] wiki returns Body dysmorphic disorder [07:08:13] zero testing against production [07:08:33] ? you mean the script that tests that zero prod is correct? [07:08:52] yes [07:09:00] i sometimes run a small script that checks that IP ranges don't overlap [07:09:14] https://github.com/wikimedia/mediawiki-extensions-ZeroRatedMobileAccess/tree/master/maintenance/phantom [07:09:16] the prod testing script is all adam - he wrote it and is running it on daily [07:09:22] oh, okay [07:09:49] yeah, that's adam [07:10:47] paravoid, so we are postponing all depl this week? Or is it just the apache? [07:11:01] all large deployments [07:11:13] is landing page a large depl? [07:11:18] tech ops obviously continues, but no deployments of ours that would end up in greg's calendar [07:11:22] dunno, you have to ask greg [07:11:31] i thought he is gone this week [07:12:05] oh, right, should have asked earlier :) [07:12:06] besides, you will need to do most of that depl - i wouldn't want to mess up apache configs :) [07:12:19] I think robla is in his stead [07:12:36] heh, well, its all ready to go - even the minor patch to remove relative redirects [07:12:54] let me know if you want to poke it later this week [07:13:32] or we can even poke at it now - the redirect can be made temporarily in the mobileportal.php script [07:13:54] or i will head to bed [07:14:05] nah, not a great time now [07:14:09] also very late for you, isn't it? 
[07:14:19] 2am,not a biggi [07:14:29] i'm an owl (as russians call it) [07:15:43] https://en.wikipedia.org/wiki/Night_owl_(person) [07:16:19] thx ori-l , i thought it was a common expression mostly in ru [07:16:40] nah, english too [07:16:45] is there an always awake word for ori? :) [07:16:59] i think you are around more than anyone i know :) [07:17:26] oh, btw, ori, I have been poking at the zero config stuff, had some basic thoughts [07:17:28] i hope not, that would be a dubious distinction [07:17:33] hehe [07:17:55] what would you say about ... wait for it... Config:SubSpace:BlahBlah structure? [07:18:09] the SubSpace would be defined by an extension [07:18:17] it would describe it with a json schema [07:18:40] this way we will avoid meta community ripping us to shreds for creating new namespaces for each new config type [07:18:42] doesn't sound bad [07:18:52] have you seen https://www.mediawiki.org/wiki/Requests_for_comment/Configuration_database_2 ? [07:19:05] no, i'm affraid of going to RFC page... [07:19:08] readidng... [07:20:00] it may or may not make sense to expand the scope of the problem to mediawiki configuration generally [07:20:15] it would certainly be nice to have a single framework for both, but the requirements may be too different [07:20:31] Yes. [07:22:07] part of the rationale cited by the RFC for making all configuration editable on-wiki is that it's a nightmare to get set up with gerrit [07:22:11] ori-l, not sure about the security/storage [07:22:21] how does he propose to store it? [07:22:23] which is true, but it's a bigger problem, and it should be fixed at the root [07:22:37] i.e., by making it not be a nightmare to get set up with gerrit [07:23:10] there's a storage section but it's not tightly specced [07:23:18] I commented on the talk page a few minutes ago about the rationale for a graphical configuration interface. [07:23:19] legoktm might be around [07:23:25] Gerrit has nothing to do with it, I don't think. [07:23:40] legoktm said good night in another channel a few mins ago [07:23:43] so he might not be [07:23:52] I *was* about to sleep.... [07:24:06] well, gerrit requires very complex tools to use it - whereas wiki requires a browser [07:24:20] Right, but the point is that setting your wiki logo shouldn't require editing PHP. [07:24:24] For anyone. [07:24:36] hi legoktm, could you explain what you meant in the storage section of that rfc? [07:24:58] what is the storage engine/editing/etc? [07:25:05] well I was planning to just use json as a storage mechanism [07:25:14] but in the rfc review, Tim suggested using MySQL (or any db I guess) as the backend [07:25:40] Elsie: $wgLogo belongs to a very small set of config vars that really do beg for a web interface [07:25:48] I wouldn't generalize from that to all configuration vars [07:26:06] ori-l: Why not? [07:26:27] Adding namespaces, configuring user groups can all use web interfaces [07:26:30] legoktm, but what about all the other goodies of the wiki, namelly: history, diffs, monitoring, email notifications, etc? [07:27:01] yurik: Those come with using a MySQL backend, I think. ;-) [07:27:31] well, all of wiki uses mysql backend, but that doesn't solve the stated security problem :) [07:27:38] yurik: you mean storing in the page text? yes there are a lot of advantages to that, but problems with that is a) security: can't have private settings stored in page text, and b) accessing another wiki in a farm's settings becomes non-trivial [07:28:15] legoktm, security - reading, or security - editing? 
[07:28:33] there are very few really private settings that we have [07:28:54] reading [07:29:10] and i don't feel its a good tradeoff to trade the regular wiki abilities to the few keys that should be hidden [07:29:14] I'm in favor of ploughing ahead with an extension that provides a generic on-wiki configuration facility for other extensions rather than starting from core [07:29:32] so am i [07:30:11] why do we use #wikimedia-operations [07:30:13] and all external access can easily be done through api - it has all functionality for that [07:30:32] because ops are asleep and mice are having a field day? [07:30:37] heh [07:30:40] mice or owls? [07:30:41] * apergos peeks in [07:30:53] hi apergos [07:30:56] awake since 7:30 (for some value of 'awake') [07:31:06] for some value of TZ [07:31:06] yurik: using the API sounds like a good idea, I hadn't really considered that. [07:31:20] legoktm, that's how we do it internally for zero [07:31:26] ori-l: If there's a generic on-wiki configuration facility, I'm not sure what sense it makes to start with extensions. [07:31:47] I have to sleep though, I'll update the page tomorrow [07:31:51] it's a smaller problem [07:32:01] night [07:32:02] http://meta.wikimedia.org/w/api.php?action=help&modules=zeroconfig [07:32:04] night lego [07:32:24] night [07:32:41] ori-l: Well, focusing on all core configuration variables isn't exactly the alternative. We could focus on any handful. [07:32:42] ori-l, what's the better channel? [07:32:48] #wikimedia-dev [07:32:51] Or #mediawiki. [07:33:02] -dev [07:33:16] -tech is basically dead [07:33:46] If you say so. [07:34:01] -staff-cabal [07:43:20] !log imported de-debianized mariadb 5.5.34 debs into rerepro on brewster [07:43:34] Logged the message, Master [08:00:55] paravoid: any further thoughts re: https://gerrit.wikimedia.org/r/#/c/96961/ ? [08:01:24] ori-l: nope, looks good from my side, but I thought of letting Ryan have a final look [08:01:34] (03CR) 10Faidon Liambotis: [C: 031] "LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961 (owner: 10Ori.livneh) [08:03:11] kk [08:03:21] he okayed it in principle [08:03:42] ori-l: I don't love 'donotify' though, maybe 'managed' on the nginx class? [08:03:42] ori-l: class { 'nginx': managed => false, } [08:03:42] Ryan_Lane: that works for me [08:03:44] Ryan_Lane: I don't care too much what the variable is :) [08:04:07] but if i press you to merge it murphy's law dictates that the cluster explodes horribly [08:05:17] probably a good idea to let Ryan look first [08:26:47] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 2d 6h 8m 45s [08:37:20] (03CR) 10Odder: [C: 031] Change favicon for angwiktionary to ['w] icon [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97458 (owner: 10TTO) [09:07:09] (03PS1) 10Akosiaris: Raise splaylimit from 45 to 60 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/97486 [09:10:57] (03CR) 10Akosiaris: [C: 032] Raise splaylimit from 45 to 60 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/97486 (owner: 10Akosiaris) [10:07:19] (03CR) 10Ori.livneh: [C: 04-1] role and module structure for ishmael (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn) [10:10:17] (03CR) 10Daniel Kinzler: "somebody give a +2 then, i can't :)" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [10:14:10] so, who half-killed streber? 
[10:14:32] no SAL entries, ticket was last updated by apergos last week but for a seemingly unrelated reason [10:14:40] and now we're left with no smokeping to debug a site outage [10:14:42] great [10:18:27] yeah I've done nothing over there [10:39:26] (03CR) 10Hashar: "Daniel: that can only be deployed by ops." [operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [11:06:50] hey, are you guys holding some courses sometimes ? [11:06:54] or some tutorials [11:07:02] like "this is how you do X with puppet" [11:07:14] or "this is how you do this and that with the X and the Y to get Z" [11:12:56] paravoid: Hallowed are the Ori! [11:26:48] LeslieCarr: do we have something like https://monitor.archive.org/weathermap/weathermap.html ? [11:27:05] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 2d 9h 9m 3s [11:27:36] i.e. compact representation of bandwidth capacity & usage for all links [11:29:31] no [11:30:29] paravoid: no, to Krinkle or to Nemo_bis? [11:30:35] or no to both at once? [11:31:48] no to Nemo_bis [11:32:09] oki thanks [11:32:26] average: At hackathons and community events (such as the annual Wikimedia Hackathon in Europe, and at Wikimania) there are usually several hands-on workshops and talks about the various tools we use and how to use them. [11:32:35] Some of them are also recorded and/or documented on-wiki. [11:39:31] so when's the next one in Europe ? [11:39:52] paravoid: can I knock on your door and ask you puppet questions ? [11:40:13] https://www.mediawiki.org/wiki/Berlin_Hackathon_2012 [11:40:14] https://www.mediawiki.org/wiki/Amsterdam_Hackathon_2013 [11:40:17] https://www.mediawiki.org/wiki/Z%C3%BCrich_Hackathon_2014 [12:00:53] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [12:06:50] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:34:45] https://gdash.wikimedia.org/dashboards/reqerror/ [12:39:08] heh, I read that as rageerror [12:39:30] not raging yet [12:39:35] YET!! [12:40:54] exactly! [12:47:28] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [12:48:18] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [13:00:07] (03CR) 10Edenhill: "Overall looks good, but some smaller issues here and there." (037 comments) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [13:01:31] (03CR) 10Edenhill: Writing JSON statistics to log file rather than syslog or stderr (031 comment) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [13:16:40] about half of today's 503s are Special:CentralAutoLogin either .../start or .../createSession [13:17:02] Spezial:Zentrale_automatische_Anmeldung for de which is where most of em are [13:20:50] apergos: related to https://bugzilla.wikimedia.org/show_bug.cgi?id=54195 ? 
[13:21:41] maybe [13:21:46] if those URLs are 60 % of apache traffic, maybe being 50 % of 503s doesn't mean anything [13:22:10] (unless I'm comparing apple and oranges) [13:27:49] looks like there's still an outstanding patch: https://gerrit.wikimedia.org/r/#/c/96317/ [13:34:22] there are a bunch (at least enough not to be buried in the rest of the noise) of these: GET http://en.wikipedia.org\ [13:36:46] (03Abandoned) 10Hashar: contint: firewall out ssh access (restrict to bastion) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96040 (owner: 10Hashar) [13:38:06] !log Jenkins: upgrading openjdk 6 on gallium and lanthanum [13:38:22] Logged the message, Master [13:38:50] !log jenkins : restarted slave daemon on lanthanum.eqiad.wmnet [13:39:05] Logged the message, Master [13:39:58] !log jenkins : restarted slave daemon on gallium.wikimedia.org [13:40:13] Logged the message, Master [13:52:07] so we are getting peaks where there's a lot of these centralautologins [13:52:11] root@oxygen:/a/log/webrequest# tail -2000 5xx.tsv.log | egrep '(Special:CentralAutoLogin|Spezial:Zentrale_automatische_Anmeldung)' | wc -l [13:52:11] 1431 [13:52:23] 1431 out of 2000, rather a lot [14:27:13] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 2d 12h 9m 11s [14:36:43] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 (owner: 10Hashar) [15:02:22] (03CR) 10Cmcmahon: "It would be nice to do this soon. This is blocking some testing work in beta labs and will block release very soon." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96161 (owner: 10EBernhardson) [15:12:43] tail -4000 5xx.tsv.log | egrep '(Special:CentralAutoLogin|Spezial:Zentrale_automatisch)' | wc -l [15:12:43] 3390 [15:12:58] and really no idea what can be done about it right now :-( [15:13:07] https://gdash.wikimedia.org/dashboards/reqerror/ [15:18:21] (03CR) 10Anomie: Normalise the path part of URLs in the text frontend (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 (owner: 10Tim Starling) [15:33:25] apergos: Don't suppose you've any more information about the error than that? [15:33:55] fatal and exception logs are quiet [15:33:57] typical [15:35:07] well nemo has been bugzilla watching and pointed out this with rather a long discussion: [15:35:21] https://bugzilla.wikimedia.org/show_bug.cgi?id=54195 [15:35:35] and the vast majority of those are indeed /start [15:42:26] (03PS2) 10Addshore: Start wikidata puppet module for builder [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 [15:42:51] (03CR) 10Addshore: "Covered most if not all of the initial comments from PS1" (037 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore) [15:45:17] apergos, what DC is this? [15:46:08] (03PS3) 10Addshore: Start wikidata puppet module for builder [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 [15:46:36] seem nicely split across cp100x and amssqx [15:46:42] MaxSem: [15:46:45] if anyone around fancies reviewing some puppet stuff for me I would be very gratefull, [= (see above) nice and small ;p [15:49:23] addshore, is this for labs? [15:49:29] yus [15:54:40] addshore, have you tested it? [15:58:16] this is my first time touching puppet, So I'm not sure what the standard process is :P is the jenkins validation not enough? 
;p [15:59:49] addshore: it just do very basic tests [16:00:28] addshore, you're supposed to test it on labs [16:00:59] from the looks of it, exec { 'npm_install' would fail due to lack of path [16:01:02] Nemo_bis: nope [16:03:13] ah ha... can I ask you about the pop list links we want for The Matrix? ( LeslieCarr ) [16:03:52] apergos: sure [16:03:59] (03CR) 10Jeroen De Dauw: Start wikidata puppet module for builder (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore) [16:04:08] MaxSem: is the process documented anywhere? :/ [16:04:09] i'm about to head out for a little bit though... so you have like 5 minutes :) go! [16:04:22] so...erm... links from where? :-D [16:05:36] (03PS4) 10Addshore: Start wikidata puppet module for builder [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 [16:05:36] is this the lit of peers if the dc happens to be in a facility listed in peeringdb? or some other thing? [16:05:38] *list [16:05:51] (03CR) 10Addshore: Start wikidata puppet module for builder (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore) [16:05:53] addshore, https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [16:07:43] i'ma little confused ? [16:08:06] let me look [16:08:07] me too [16:08:21] there's two columns I haven't touched, [16:08:34] ok, i'm not actually sure what to put in that column specifically [16:08:34] "POP List Link" [16:08:48] "POPs at Location (and type)" [16:08:57] someone mentioned there may be lists, but they are usually in bid [16:08:58] not sure what we want in there [16:09:01] so that column may be pointlesss [16:09:14] i preferred to put too much on and parse down ;] [16:09:15] the pops at location and type would beif we know that XYZ carrier has a pop at the location and if it's major or just backhauled [16:09:22] but yeah, i'd get rid of the pop list link column [16:09:38] ok [16:09:52] I was really scratching my head over it [16:09:54] thanks [16:10:10] ok [16:10:12] i do have an idea though [16:10:19] yes? [16:10:37] we don't have a column for cost of 10G waves/dark fiber [16:10:49] the info is sorta all over [16:11:07] we could replace that with cost [16:11:18] yeah, there's a little in a few replies to tickets but mostly I think we don't have that pricing info [16:11:24] (and 10G vs dark fiber is quite important, since we can put a lot of 10g's per dark fiber) [16:11:28] a few bids had it [16:11:36] and then some tickets [16:11:56] lemme change that column title then [16:12:00] coo [16:12:05] mm wrong location, I'll move it also [16:12:22] did anyone respond to the sites i wanted to throw out of this round email ? [16:12:29] because we can also start doing theoretical pricing [16:13:03] digital realty contegix etc? [16:13:40] I saw no replies, I looked at them all and agree, we can easily afford to be choosy about connectivity, lots of bids that qualify [16:14:21] apergos: no names in this channel [16:14:41] sorry [16:15:01] well, we can start doing pricing theoreticals on the others that we have info for ? [16:15:24] oh shit i gotta run 5 minutes ago! 
[16:15:26] bye [16:15:27] go go go [16:15:29] thanks [16:19:44] (03PS1) 10Hashar: contint: package curl on slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/97526 [16:20:15] !log jenkins : installed curl on lanthanum.eqiad.wmnet puppet change is {{gerrit|97526}} [16:20:28] ty MaxSem :) (in our new office and the internet here currently leaves a bit to be desired) [16:20:31] Logged the message, Master [16:20:58] (03CR) 10Hashar: "manually installed on lanthanum.eqiad.wmnet" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97526 (owner: 10Hashar) [16:31:40] !jenkins mediawiki-core-qunit [16:31:41] https://integration.wikimedia.org/ci/job/ [16:31:46] !jenkins mediawiki-core-qunit [16:31:46] https://integration.wikimedia.org/ci/job/mediawiki-core-qunit [16:50:53] (03CR) 10Ottomata: Writing JSON statistics to log file rather than syslog or stderr (031 comment) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [18:18:21] labstore[34] [18:19:17] that is not eqiad [18:19:18] that is tampa [18:19:19] that was the example i was given. [18:19:19] labstore100[3-4] [18:19:19] RobH: labstore100x [18:19:20] thats not what you pasted! [18:19:20] =p [18:19:22] RobH: Doh! Braino. [18:19:25] So labs no longer requires the labstore servers or shelves? [18:19:25] in eqiad? [18:19:44] let's start this conversation over [18:19:44] because if its not moving, it goes to me [18:20:15] indeed. [18:20:16] Coren: so... [18:20:18] whaaa? [18:20:20] Coren: no need or procurement [18:20:27] They are the same as labstore100[34], which are the ones I won't need for NFS. The shelves could be added to 100[12] for great NFS justice; the servers could be used for the OSM postgres DB [18:20:53] sigh [18:20:54] the systems are already purchased and are in eqiad [18:20:54] they are already on the mgmt lan [18:20:54] you just need to install them [18:20:55] RobH: I was asking the question wondering "if I don't use them, how do we go about making them available for other things" [18:20:55] ok [18:20:55] RobH: But that's all in eqiad. :-) [18:20:55] got it [18:20:55] so you have a couple eqiad servers you may not need anymore [18:20:55] Right. [18:21:12] So the process is decommission for reclaim on lifecycle doc [18:21:19] hold up [18:21:20] with a very specific note to not skip the step ' If system is reclaimed into spares, ticket should be assigned to the HW Allocation Tech so he can update spares lists for allocation. ' [18:21:31] Ryan_Lane: holding. [18:21:38] Coren: why would you think we don't need them? [18:21:47] we just had to scrape back hardware from analytics [18:21:54] well, regardless of what labs decides, i answered the question ;] [18:22:00] yep [18:22:15] but i thought it was odd that you would have spare hw [18:22:19] Ryan_Lane: Well, I was *hoping* they could be repurposed as new virts. :-) [18:22:31] they are not even close to spec of virts. [18:22:35] what is the hardware in question? [18:22:46] Ryan_Lane: Dells with lotsa storage. [18:22:54] the labstore100x boxes? [18:23:05] 100[34]. I only need 100[12] for NFS [18:23:14] ah [18:23:18] So I was thinking since we can't use them for virt... [18:23:38] Move the shelves to 100[12] to double space, and use the servers for the postgres DB [18:23:45] they'd be shitty for it no? 
[18:23:47] I thought we were going to use them for another set of NFS [18:23:55] for public datasets and such [18:23:57] to split the IO [18:24:05] * apergos looks greedily at them [18:24:10] public datasets, om nom nom nom [18:24:13] Ryan_Lane: That could also work. But that's a lot of TB for just the datasets? [18:24:23] how many T are we talking? [18:24:26] Whereas general labs storage we can always use more of. [18:24:29] 25, I think [18:24:44] a lot of spare at least initially, that's true [18:25:12] apergos: 30T, give or take. [18:25:58] we could also use this for postgres dbs [18:26:09] which tools is still asking for [18:26:19] [13:23:37] Move the shelves to 100[12] to double space, and use the servers for the postgres DB [18:26:19] (for OSM dbs) [18:26:20] :-) [18:26:34] this isn't giving hardware away, then [18:26:47] it's just renaming them ;) [18:26:54] so yea if you guys are just moving disk shelves and renaming within labs [18:26:59] Ryan_Lane: I know, that's an alternative I've just considered earlier rather than releasing it. :-) [18:27:00] its all local datacenter tickets for the hands on stuff [18:27:03] but what that doesn't solve (if it's an issue) is splitting th i/o [18:27:05] and core-ops and network for the otehr stuff [18:27:23] if it was going 'spare' then you would also assign me a ticket [18:27:25] but its not, so you dont [18:27:26] apergos: It's not an issue yet; right now the NFS are pretty much idle. [18:27:36] hm ok [18:28:38] (It doesn't harm that I'm in a two controller setup with raid over the different shelves so disk IO is higher than the net can give in the first place) [18:31:38] that's pretty sweet [18:31:51] http://ganglia.wikimedia.org/latest/graph.php?r=1hr&z=xlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Labs+NFS+cluster+pmtpa [18:39:53] Right [18:39:55] So [18:40:04] I'm gonna deploy a cherry-pick of a file rename that fixes some fatals [18:40:09] Going to wmf4 and wmf5 [18:41:00] Just FYI [18:46:32] !log mholmquist synchronized php-1.23wmf4/extensions/UploadWizard/resources/ext.uploadWizard.uploadCampaign.list.css [18:46:47] Logged the message, Master [18:46:54] (03PS1) 10Yurik: Added relative redirect workaround until its fixed ext [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97540 [18:47:42] paravoid, ^ [18:48:43] !log mholmquist synchronized php-1.23wmf5/extensions/UploadWizard/resources/ext.uploadWizard.uploadCampaign.list.css [18:48:58] Logged the message, Master [18:49:50] (03CR) 10jenkins-bot: [V: 04-1] Added relative redirect workaround until its fixed ext [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97540 (owner: 10Yurik) [18:49:53] 'kay, seems like that's working [18:50:44] I'm done, y'all can carry on [18:50:45] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:53:55] (03PS2) 10Yurik: Added relative redirect workaround until its fixed ext [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97540 [18:55:10] robla, are you the new greg? :) [18:55:35] Ryan_Lane: Opinons on how the new labs DBs should be named? Simply labsdb100[45], or do we want to name them differently since they're not "true" production replicas? [18:56:06] Or, I suppose, not project replicas? [18:56:36] (Also, as a reminder of a long-gone conversation, the postgres slave will be the new physical DB for labs user master and vice versa) [18:59:02] yurik: So says an email I got [18:59:19] He's working on the beard I think [18:59:40] Coren: labs user master? 
[18:59:59] oh, the place users will write data? [19:00:08] The tools-db replacement to stop having a DB for user stuff on a VM (evuuuul!) [19:01:00] ah. right [19:01:01] yes [19:01:02] please :) [19:01:07] that's horrible [19:02:55] yurik: yeah, I'll do my best to make sure you all wish for Greg's speedy return :-) [19:03:31] Ryan_Lane: So just labsb100[45]? [19:03:40] no clue :) [19:03:48] ummm [19:03:59] I guess so, yeah [19:07:28] (03CR) 10Edenhill: [C: 031] Writing JSON statistics to log file rather than syslog or stderr (033 comments) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [19:10:06] (03PS7) 10Ottomata: Writing JSON statistics to log file rather than syslog or stderr [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 [19:12:01] (03CR) 10Edenhill: [C: 031] Writing JSON statistics to log file rather than syslog or stderr [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [19:12:56] it's like ping pong [19:14:53] heheh [19:35:31] (03CR) 10Ottomata: [C: 032 V: 032] "Coool, let's do it." [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [19:35:32] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:37:11] (03PS1) 10Ottomata: Adding --always flag to git describe. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/97545 [19:37:20] Snaps: check that real quick too ^ [19:42:22] RobH: Want to coach me about labs hardware stuff now, or ping me after lunch? [19:44:39] (03CR) 10Edenhill: [C: 031] Adding --always flag to git describe. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/97545 (owner: 10Ottomata) [19:45:13] lemme ping ya later [19:45:17] 'k [19:45:23] cuz i wanna wrap up row d vendor replies before i forget again [19:45:29] I'll be around until about 3pm PST [19:45:40] if we don't touch base today [19:45:46] we'll do so my AM tomorrow =] [19:45:56] sounds good [19:46:13] * RobH may fall back into rfp rabbit hole later [19:47:31] (03CR) 10Ottomata: [C: 032 V: 032] Adding --always flag to git describe. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/97545 (owner: 10Ottomata) [19:47:51] marktraceur: we should sync your fix [19:48:27] oh, you already did [19:50:14] yo paravoid, do you know the datetime periods where we were seeing amsterdam packet loss? [19:50:24] I want to see if they correlate with when I have varnishkafka produce errors [19:50:37] as I was saying at the meeting, smokeping is currently broken, so... no :/ [19:50:42] rats [19:50:55] this is what i've got: [19:50:56] well [19:50:57] Nov 18 18:26:51 - Nov 18 20:30:38 [19:50:57] Nov 24 17:44:54 - Nov 24 20:58:23 [19:50:57] Nov 25 09:45:24 - Nov 25 10:19:23 [19:51:21] smokeping's web page is broken, it probably still collected [19:51:23] let's check our mailbox [19:51:44] Ryan_Lane: hey, are you about? [19:52:17] ori-l: yep. what's up? [19:52:40] I'd like to merge the nginx change but would only do it with someone around, in case it blows up [19:52:47] hm, times of smokeping emails I have don't seem to correlate [19:52:52] I really don't think it will, but I'm nervous, and nginx is kind of important [19:54:03] is there a good time for you? [19:57:36] maybe 30 mins from now [19:57:45] WFM, thanks! 
[19:57:52] I'll ping you to confirm before I do anything
[19:57:55] one easy way to do this is to disable puppet on all the ssl hosts
[19:57:57] using salt
[19:58:06] ahhhh that sounds very useful
[19:58:07] salt 'ssl*' cmd.run 'puppetd --disable'
[19:58:15] nice!
[19:58:16] then depool one of the ssl servers
[19:58:26] enable puppet on it, run puppet, see what happens
[19:58:28] how do you depool?
[19:58:41] on fenari as root, go to: /home/w/conf/pybal
[19:58:46] ori-l: are you working on spdy?
[19:58:48] cd
[19:58:56] vi https (or is it ssl?)
[19:59:08] gwicke: I was planning to, yeah
[19:59:09] pick the one you want to disable and switch True to False
[19:59:17] ori-l: awesome!
[20:00:07] YuviPanda is enabling it for labs protoproxy today, I think (or else he did yesterday), was going to see how it worked out for him
[20:00:26] analytics is already using it :D
[20:00:33] he has been running it on the labs instance proxy for a while
[20:00:46] gwicke: yeah but it's 'rolled out' for wider use now.
[20:00:57] gwicke: there's a self service interface, so no need to find me :)
[20:00:57] sweet
[20:01:24] the build we have in apt doesn't have SPDY, I haven't checked yet if the Raring or Quantal packages do
[20:01:57] ori-l: the current build on labs is from andrewbogott, I think he based it off raring
[20:02:01] but the original maintainer of the debian Nginx package is a member of the WMF language engineering team
[20:02:01] * gwicke is looking forward to fine-grained caching of API end points
[20:02:25] gwicke, how does SPDY help with that?
[20:02:43] it lowers the overhead per request
[20:03:08] right, but fine-grained caching?
[20:03:16] so you don't have to work around http issues as much on the application layer
[20:03:28] we're not planning on enabling SPDY without quite a bit of testing/investigation
[20:03:31] (in production)
[20:03:54] we wanted to get anon by default https done first
[20:03:55] Ryan_Lane: I wasn't suggesting that we do
[20:04:30] * Ryan_Lane nods
[20:05:22] ori-l: added you to the proxy project, just in case you want to check something out.
[20:05:24] ori-l: with low per-request overhead and now head-of-line blocking it becomes feasible to expose independent bits of information as separately cacheable API end points
[20:05:32] *no head-of-line blocking
[20:05:47] gwicke: right
[20:05:57] YuviPanda: thanks!
[20:06:04] ori-l: yw! :)
[20:06:30] gwicke: was dealing with the same issue with module storage
[20:07:14] gwicke: i.e., ResourceLoader concatenates modules, so whenever the version of any single module bumps, we change the URL and thus throw out a cached response that is still largely current
[20:07:15] I wonder how RL would look in a pure SPDY / HTTP 2.0 world
[20:07:26] there might not be much left there
[20:07:44] yeah
[20:10:23] !log text-lb.esams, bits-lb.esams and more (amslvs1/3) are now load shared amongst amslvs1 and amslvs3 instead of just amslvs1
[20:10:39] Logged the message, Mistress of the network gear.
[20:16:21] paravoid: do you know if there is a machine in eqiad from which we can run a kafka throughput test?
[20:16:28] sorry
[20:16:30] in esams
[20:16:41] want to test throughput between esams and eqiad
[20:17:35] why do you want to do that?
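A condensed sketch of the drain procedure Ryan describes above. The salt invocation is quoted verbatim from the log; the pybal file name and the exact shape of a pool entry ('enabled': True/False per host) are assumptions about the format, inferred from the "switch True to False" instruction.

    # 1. keep puppet from touching the ssl hosts mid-change (command quoted from the log)
    salt 'ssl*' cmd.run 'puppetd --disable'
    # 2. on fenari, as root, edit the pybal pool definition
    cd /home/w/conf/pybal
    vi https        # or 'ssl'; the file name was uncertain even in the discussion
    #    flip the host being drained, e.g. (assumed entry format):
    #    {'host': 'ssl1008.wikimedia.org', 'weight': 10, 'enabled': True }
    #                                                    ^ change True to False
    # 3. let connections drain, re-enable and run puppet on the host, restart nginx,
    #    then set 'enabled' back to True to repool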
[20:18:07] Snaps' idea, but we're trying to isolate problems, i guess [20:18:25] during those hours that I pasted above, there was about 20-30% loss of messages between esams and eqiad [20:18:48] varnishkafka times out talking to brokers, message buffers get full [20:18:52] and then it starts dropping messages [20:18:59] it's not congestion [20:19:13] well, not 1gbps congestion [20:19:19] it may be somewhere on their path, who knows [20:20:18] hm [20:20:36] yes, it sucks [20:20:38] I know [20:21:58] yeah hm, actually, doh, i think smokeping email timestamps do correlate [20:21:58] hm [20:21:58] ok [20:23:28] hm, paravoid, does varnish itself handle this? or does it drop requests? [20:23:44] does it fall back to internet routing when link is sketchy? or just if link dies? [20:25:31] tcp adapts and it slows down, eventually 503ing requests [20:25:52] but no, nothing fall backs to transits automatically [20:26:01] and is in fact difficult to do so even manually now [20:27:58] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 2d 18h 9m 56s [20:28:39] ottomata: but we saw errors lasting for 500 seconds (possibly), does that correlate to smokeping? [20:30:40] I have smokeping loss alerts (either bigloss or someloss) in my inbox for [20:30:40] Mon Nov 25 09:48:28 2013 [20:30:40] Mon Nov 25 09:53:28 2013 [20:30:40] Mon Nov 25 11:18:28 2013 [20:30:41] Mon Nov 25 19:48:28 2013 [20:30:41] Mon Nov 25 19:18:28 2013 [20:30:42] today [20:30:47] In case people haven't see it, people on enwiki are reporting getting 503 errors from Varnish when saving pages. If it's just timeouts waiting for apaches it should improve in 1.23wmf5, but confirmation that that's what's going on would be good. https://en.wikipedia.org/wiki/Wikipedia:VPT#Error_message [20:31:20] anomie: yeah, graphs are showing the same all day... [20:31:25] ottomata, join me in #wikimedia-labs? [20:31:37] http://gdash.wikimedia.org/dashboards/reqerror/ [20:31:39] :/ [20:33:45] ugh [20:34:25] sorry, Snaps, there are more than that [20:34:26] umm [20:35:18] mostly this morning during, 9:48-11:03 [20:35:36] then intermittently for other hours today [20:35:46] (03PS1) 10coren: Labs DB: views for archive table [operations/software] - 10https://gerrit.wikimedia.org/r/97557 [20:37:10] how does smokeping do its test? ICMP? shortlived TCP connections? [20:37:19] paravoid? [20:37:28] icmp [20:38:08] Ryan_Lane: OK, so I'm going to do what you suggested [20:38:18] disable puppet on ssl*, depool one host [20:38:22] huhm, okay [20:39:17] gwicke: ori-l SPDY for our public *production* parsoid cluster :) https://parsoid-prod.wmflabs.org/ [20:39:41] TCP will start acting up at about 2% packet loss [20:40:43] YuviPanda: sweet! [20:41:17] Snaps: i'm not exactly sure how to read these smokealert emails [20:41:19] but [20:41:20] loss: 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 15%, 15%, 20% [20:41:31] that's the bigloss 09:53 alert [20:41:37] close to when our problems today started [20:41:54] !log Disabling Puppet on ssl* in preparation for merging https://gerrit.wikimedia.org/r/#/c/96961/ [20:42:07] ssl* is not enough btw [20:42:09] Logged the message, Master [20:42:18] we now have nginx with localssl to some varnish boxes [20:42:23] I think it's limited to ulsfo now [20:42:36] ottomata: at 15% packet loss our poor kafka connections wont be very productive. But I'm wondering if the 0% periods actually has a higher percentage in practice. [20:43:07] paravoid: cp4*? 
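On the esams-to-eqiad throughput question raised above: nothing Kafka-specific is needed for a first-order answer. A single-stream TCP test between one host in each site, run both inside and outside a loss window, would show how hard the observed 15-20% packet loss throttles an individual connection, which is what the varnishkafka producer ultimately rides on. A sketch using iperf; the hostname is a placeholder:

    # eqiad end (receiver)
    iperf -s -p 5001
    # esams end (sender): one TCP stream for 60 seconds, reporting every 10 seconds
    iperf -c some-host.eqiad.wmnet -p 5001 -t 60 -i 10

Comparing a run during a smokeping alert period with one in a quiet period would help separate "the link is congested" from "loss somewhere on the path is collapsing the TCP window".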
[20:43:13] ori-l: sounds good [20:43:44] !log Also disabling puppet on cp400* [20:43:47] aye [20:43:59] Logged the message, Master [20:44:02] ori-l: yup [20:44:10] ori-l: that would be role::cache::ssl iirc [20:44:40] any host in particular that would be a better candidate for depooling than others? [20:45:45] i'll go with ssl3002.esams.wikimedia.org because traffic in europe is waning [20:45:56] paravoid, so what's your prefs for working on the new landing page - so i can schedule it. Would you like it this week (i added a patch to do a frontend fix), or some time next week? [20:47:16] yurik: next week is fine too [20:47:32] ok [20:48:04] actually, maybe eqiad because there's more redundancy [20:48:36] what is zookeeper module good for? [20:48:42] sorry for being chatty, just making sure you have a chance to holler if i'm doing something moronic [20:48:51] no worries [20:49:44] !log Depooling ssl1008 to test {{Gerrit|96961}} [20:49:58] Logged the message, Master [20:50:09] (03PS4) 10Ori.livneh: rewrite nginx module [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961 [20:50:47] (rebase) [20:52:03] (03CR) 10Ori.livneh: [C: 032] rewrite nginx module [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961 (owner: 10Ori.livneh) [20:55:31] (03PS1) 10Ori.livneh: Fix typo in parameter name (enable -> enabled) [operations/puppet] - 10https://gerrit.wikimedia.org/r/97560 [20:55:34] shhhh [20:55:53] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix typo in parameter name (enable -> enabled) [operations/puppet] - 10https://gerrit.wikimedia.org/r/97560 (owner: 10Ori.livneh) [20:56:33] so, we are 503ing a lot [20:56:35] it's multiple issues [20:56:57] 1124 RxURL c /w/index.php?title=2013%E2%80%9314_Chelsea_F.C._season&action=submit [20:57:00] 2152 RxURL c /w/index.php?title=2013%E2%80%9314_Chelsea_F.C._season&action=submit [20:57:36] 1748 RxURL c /w/index.php?title=Template:Syrian_civil_war_detailed_map&action=submit [20:57:39] 1700 RxURL c /w/index.php?title=Template:Syrian_civil_war_detailed_map&action=submit [20:57:42] 546 RxURL c /w/index.php?title=Template:Syrian_civil_war_detailed_map&action=submit [20:57:45] 1477 RxURL c /w/index.php?title=Template:Syrian_civil_war_detailed_map&action=submit [20:58:13] does that correspond with me doing things or did it start earlier? [20:58:18] no, earlier [20:59:15] paravoid: are we still triple-parsing wikitext? [20:59:25] there was a bug about that, dunno if it was fixed [20:59:44] no idea [21:01:04] paravoid: https://bugzilla.wikimedia.org/show_bug.cgi?id=57026 [21:01:55] (all you guys really should watch bugzilla a little bit more) [21:02:13] watch what, all bugs? [21:02:30] why not? there are not that many [21:02:33] no performance tag, no ops tag [21:02:54] there are saved searches 'bugs filed today' and 'bugs filed yesterday' :> [21:03:10] i assumed aaron was handling that [21:03:12] * ^d hides shared saved searches, they clutter his sidebar [21:03:15] Ryan_Lane: is it cool if i restart nginx on the depooled host just to be totally sure? 
[21:03:19] yep [21:03:27] make sure to use restart and not reload [21:03:31] nginx has an issue with reload [21:03:44] k [21:05:01] i included those by default [21:05:13] - config file name changed from 'localhost.conf' to 'localhost' (omitting the extension) [21:05:22] which is consistent with what the package does [21:05:40] but the diff is clean [21:06:01] !log Restarting nginx on ssl1008 [21:06:17] Logged the message, Master [21:07:39] all 503s that I see so far are edits [21:07:40] ottomata: need a new motherboard on analytics1012....sorry :-( [21:07:57] I don't think these two fixes from 57026 are deployed [21:09:07] # grep -l '57026' */includes/WikiPage.php [21:09:07] php-1.23wmf5/includes/WikiPage.php [21:09:45] and enwiki is on wmf4 [21:10:04] okay, so that's one 503 cause, but it probably doesn't explain the spike in reqerror -- too high of a number to be editors [21:11:22] ori-l: do you think these two patches from 57026 can/should be backported? [21:12:12] sorry, the nginx dance is new to me so i'm focussed on that, will look in a moment [21:12:27] (03PS1) 10Springle: expose archive to labs, suitably redacted [operations/puppet] - 10https://gerrit.wikimedia.org/r/97563 [21:12:47] verified with openssl client and curl -k -H 'Host: en.wikipedia.org' https://208.80.154.224:443/wiki/Main_Page [21:14:29] paravoid: (if my opinion counts, please backportthem; i though they already were backported. this is also user-visible (see WP:VPT on en.wp), and wmf5 is coming later than usual) [21:17:26] (03CR) 10Springle: [C: 032] expose archive to labs, suitably redacted [operations/puppet] - 10https://gerrit.wikimedia.org/r/97563 (owner: 10Springle) [21:21:09] (03PS1) 10Yuvipanda: dynamicproxy: Increase filesize limit for uploads. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97625 [21:21:22] Coren: andrewbogott anyone else, quick small +2? ^ [21:21:27] paravoid: still haven't looked, but MatmaRex is usually on point, so if he's in favor I tentatively am too [21:22:15] YuviPanda: Not even rebasable. Also, tl;dr. :-) [21:22:22] gah rebase [21:22:27] !log restarting sanitarium mysqld processes, db1053, db1054, db1057 [21:22:40] Coren: 'tis one line patch + 22 line comments [21:22:43] Logged the message, Master [21:24:37] Ryan_Lane: ssl1004 uhoh: [21:24:39] notice: /Stage[main]/Protoproxy::Ganglia/Nginx::Site[localhost]/File[/etc/nginx/sites-available/localhost]/ensure: defined content as '{md5}0b0dd9f0fea31789659717608065486e' [21:24:39] notice: /File[/etc/nginx/sites-enabled/gerrit]/ensure: removed [21:24:50] (03PS2) 10Yuvipanda: dynamicproxy: Increase filesize limit for uploads. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97625 [21:24:55] :D [21:25:25] no clue why that would occur [21:26:15] maybe puppet hadn't run in forever? [21:26:20] (03CR) 10coren: [C: 032] "tl;dr" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97625 (owner: 10Yuvipanda) [21:26:30] heh, ty Coren [21:26:40] I guess? shouldn't gerrit be there, though? [21:26:44] site.pp assigns role::protoproxy::ssl to /ssl100[1-9]\.wikimedia\.org/ [21:27:15] springle: Want to double check me? https://gerrit.wikimedia.org/r/#/c/97557/ [21:27:17] (03CR) 10Ryan Lane: [C: 032] dynamicproxy: Increase filesize limit for uploads. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/97625 (owner: 10Yuvipanda) [21:27:41] Ryan_Lane: it wasn't on ssl1008 and ssl1007 [21:27:52] and i don't see the puppet config for gerrit being served by that box [21:28:16] (03CR) 10Springle: [C: 031] Labs DB: views for archive table [operations/software] - 10https://gerrit.wikimedia.org/r/97557 (owner: 10coren) [21:28:22] Coren: looks sane [21:28:24] no getti defined in the role [21:29:30] ori-l: I'd imagine it's fine [21:29:38] yeah, the dir wasn't managed recursively before [21:29:42] so it's just leftover, i think [21:29:45] gerrit is obviously still up :) [21:29:53] well, it doesn't restart nginx, remember [21:29:59] Coren: archive tables and triggers now in place on sanitarium [21:30:01] ah, right [21:30:02] but it might be good to restart ssl1004 just in case [21:30:07] do i need to depool it first? [21:30:24] misc might be running localssl [21:30:33] (03CR) 10coren: [C: 032 V: 032] "WFM" [operations/software] - 10https://gerrit.wikimedia.org/r/97557 (owner: 10coren) [21:30:41] I like to depool a few hosts at a time, let them drain [21:30:46] then restart them and repool them [21:30:52] then do a few more [21:30:55] how many is a few? [21:31:09] look at the network utilization [21:31:18] make sure you don't saturate links [21:31:27] * ori-l nods [21:31:29] depool as many as the hardware can take :) [21:33:37] well, actually [21:33:47] do i really need to restart nginx if no configuration changes occurred on the host? [21:33:54] that looks to be the common case; ssl1004 was unusual [21:37:09] !log Depooling ssl1004 to restart nginx [21:37:25] Logged the message, Master [21:38:06] I'm so annoyed that we don't have 5xx counts per varnish backend [21:40:26] Ryan_Lane: notice: /File[/etc/nginx/sites-available/.svn]/ensure: removed :D [21:41:43] ori-l: yes [21:42:09] because you made some changes [21:42:21] and we want to make sure the system is in a working state after a change [21:42:57] OK [21:45:38] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202941) [21:47:09] (03CR) 10jenkins-bot: [V: 04-1] dynamicproxy: Increase filesize limit for uploads. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97625 (owner: 10Yuvipanda) [21:47:21] took you long enough! [21:47:24] wait [21:47:25] wat [21:47:26] -1 [21:47:40] oh, for PS1 [21:47:56] look at PS2! [21:49:31] ori-l: Yeah, that was first on my list today [21:49:40] heads up--fundraising is doing a 100% test now [21:49:57] since when? [21:51:34] paravoid: stared about 11 min ago [21:51:49] started. typing is hrad [21:59:39] !log finished updating / restarting eqiad ssls [21:59:52] Logged the message, Master [22:00:26] (03PS1) 10Yuvipanda: tools: Remove redundant declration [operations/puppet] - 10https://gerrit.wikimedia.org/r/97629 [22:00:31] andrewbogott: ^ [22:00:35] merge? 
[22:05:31] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [22:05:32] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [22:05:41] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection refused [22:05:41] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [22:05:41] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [22:05:45] PROBLEM - HTTPS on ssl1 is CRITICAL: Connection refused [22:05:51] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection refused [22:06:01] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection refused [22:06:02] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [22:06:02] ori-l: hey [22:06:12] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [22:06:12] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection refused [22:06:17] argh [22:07:01] nginx: [emerg] bind() to [2620:0:860:ed1a::c]:443 failed (99: Cannot assign requested address) [22:07:02] nginx: configuration file /etc/nginx/nginx.conf test failed [22:07:03] on ssl1 [22:07:10] hey [22:07:12] YuviPanda: That's redundant because it's included in the inherited class? [22:07:21] It's not the cause of the duplicate definition problem... [22:07:35] ori-l so what's up ? [22:07:37] ori-l: sooo.... [22:07:42] looks like that change didn't work so well :D [22:07:44] nginx: [emerg] bind() to [2620:0:860:ed1a::c]:443 failed (99: Cannot assign requested address) [22:07:45] nginx: configuration file /etc/nginx/nginx.conf test failed [22:07:49] on ssl1 [22:07:56] eqiad went fine [22:07:56] depool ssl1 [22:08:00] oh [22:08:01] hrm, why can't assign requested address... [22:08:08] pmtpa shouldn't be an issue [22:08:20] depool it anyway [22:08:22] done argh [22:08:26] we still have a little traffic going to pmtpa [22:08:29] yeah, i didn't think it was getting any traffic [22:08:32] or had alerting set up [22:08:36] super sorry :/ [22:08:41] nah, no worries [22:08:51] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69263 bytes in 0.294 second response time [22:08:54] ssl1 is the least of our problems [22:09:00] oh? [22:09:01] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69263 bytes in 0.300 second response time [22:09:01] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69263 bytes in 0.348 second response time [22:09:11] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69263 bytes in 0.296 second response time [22:09:11] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69263 bytes in 0.302 second response time [22:09:13] hehe, it's ok , just "eep! dinging!" 
and yeah, tampa is sorta barely limping along there [22:09:22] ok, back to the interview i'm giving [22:09:22] Ryan_Lane: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1day&from=-1%20day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22) [22:09:31] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69263 bytes in 0.304 second response time [22:09:32] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69263 bytes in 0.309 second response time [22:09:38] ah [22:09:39] Ryan_Lane: also https://graphite.wikimedia.org/render/?title=HTTP%204xx%20Responses%20-1day&from=-1%20day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color%28cactiStyle%28alias%28reqstats.4xx,%224xx%20resp/min%22%29%29,%22blue%22%29 although that probably correlates with fundraising's test [22:09:41] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69263 bytes in 0.306 second response time [22:09:42] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69261 bytes in 0.312 second response time [22:09:42] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1020 bytes in 0.188 second response time [22:09:49] Ryan_Lane: also, packet loss [22:10:02] ah. right. so those are likely unrelated to this [22:10:15] paravoid: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception>ype=stack&glegend=show&aggregate=1&embed=1 [22:10:21] so it's mediawiki, probably [22:10:29] fluorine:/a/mw-log/fatal.log should have the details [22:10:32] paravoid: does that follow our esams traffic curve ? [22:10:35] gotta duck into a meeting [22:10:55] ori-l: there's that bug I mentioned before that's certainly part of the problem [22:11:04] ori-l: that MatmaRex pointed out, to be fair :) [22:11:08] andrewbogott: i thought it was from the inherited class since the other roles don't have it, but doesn't seem to be the case [22:11:34] ori-l: I don't feel sufficiently experienced with mediawiki to perform two backports into all wikis [22:11:50] paravoid: happy do it in 50 mins [22:11:53] paravoid: don't be worried, it's mediawiki, it's rock solid [22:11:56] you can't hurt it [22:12:14] well, i can do it in the background [22:17:09] (03PS1) 10Andrew Bogott: Replace some weird class syntax. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97631 [22:17:13] YuviPanda: ^ seems to help, I know not why. [22:17:24] * andrewbogott can only hope that… someone can explain [22:18:59] (03CR) 10Andrew Bogott: [C: 032] Replace some weird class syntax. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97631 (owner: 10Andrew Bogott) [22:19:24] andrewbogott: oh wow that's... weird [22:19:57] Does anyone know what the timeout is for the frontend varnishes? 
[22:20:00] andrewbogott: jenkins is being super slow today [22:20:48] (03PS1) 10Dzahn: fix the SSL cert chain for *.planet.wm [operations/puppet] - 10https://gerrit.wikimedia.org/r/97633 [22:22:57] !log upload replacement arbcom election config [22:23:12] Logged the message, Master [22:23:42] (03CR) 10Dzahn: [C: 032] fix the SSL cert chain for *.planet.wm [operations/puppet] - 10https://gerrit.wikimedia.org/r/97633 (owner: 10Dzahn) [22:24:51] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [22:25:58] mhoover, so… if you can get to bastion but not to virt1000 then... [22:26:07] are you forwarding your key? [22:26:21] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms [22:26:34] it appears as if it's being forwarded, but i'll double check. [22:27:07] just wanted to make sure the keys were in place before i started troubleshooting :) [22:27:09] thx man [22:27:25] Working now? [22:28:09] mhoover: I'm pretty sure the keys are there, but I haven't done per-user keys before so I'm not especially confident. [22:29:32] ori-l: Any idea why https://gerrit.wikimedia.org/r/#/c/97631/ fixed an actual bug rather than being a no-op? [22:29:59] andrewbogott: ok, cool. i'll dump the agent (might have to many keys loaded and the wrong statements in my ssh config) - lemme run through it real quick [22:30:36] you don't, for example, have a homedir on virt1000 which seems like a bad sign [22:34:08] hrmmm... [22:35:06] (03CR) 10Dzahn: "this fixed "incomplete" and "extra certs", "contains anchor" remains. but that's a MAY in RFC 2119 e.g." [operations/puppet] - 10https://gerrit.wikimedia.org/r/97633 (owner: 10Dzahn) [22:36:00] (03PS1) 10Andrew Bogott: Give mike an actual account to use his sudo from. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97636 [22:36:05] mhoover: I predict that ^ will help [22:36:22] hehe [22:36:54] it'll be 5 or so before I can apply that [22:37:05] um… 5 minutes I mean, not 5:00 [22:37:43] (03CR) 10Andrew Bogott: [C: 032] Give mike an actual account to use his sudo from. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97636 (owner: 10Andrew Bogott) [22:40:37] (03PS1) 10Yuvipanda: dynamicproxy: Don't include API by default in class [operations/puppet] - 10https://gerrit.wikimedia.org/r/97637 [22:40:37] andrewbogott: ^ [22:41:14] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: Don't include API by default in class [operations/puppet] - 10https://gerrit.wikimedia.org/r/97637 (owner: 10Yuvipanda) [22:43:28] YuviPanda: Keep in mind that without the API installed you have no way of recovering proxy config in case of a crash [22:43:46] andrewbogott: that's fine. This is for tools, and so is transient [22:43:53] ...? [22:44:45] andrewbogott: as in, tools are supposed to register for URL / port / host when they start [22:45:20] andrewbogott: so if there is a crash big enough to kill redis data, then yeah, we just restart the jobs and things are peachy [22:45:21] ah, I see. OK. [22:45:25] Yep, makes sense. [22:45:33] andrewbogott: :) [22:46:09] andrewbogott: if you have class { 'foo': } and class { 'foo': }, there's a multiple def'n error iirc [22:46:18] if you have class { 'foo': } and include foo, there isn't one [22:46:20] furthermore, [22:46:29] if you have class { 'foo': param => 'buzz', } and include foo [22:46:40] the class is parametrized [22:47:23] What we saw was sicker than that… class {'foo':} was causing a multiple-definition error even though the class that declared it was /not/ included. 
[22:47:40] so the 'include foo' amounts to "instantiate 'foo' if it isn't declared anywhere, or if a declaration already exists, use it -- parameters and all [22:47:48] so we were getting a multiple-definition failure that pointed to a line in a class that wasn't included. [22:48:21] Which makes me think that puppet simply can't parse class {'foo':} and loses its mind [22:48:29] no, it definitely can [22:48:34] we use it in lots of places [22:48:39] yeah :9 [22:48:39] :( [22:48:41] you probably just overlooked a tortuous dependency relation [22:48:59] Possible but we looked for a long time! [22:49:13] It was https://dpaste.de/o9WB [22:49:26] and we made a related change (killed API!) that wouldn't have worked if there was a dependency between those two [22:49:53] (03PS1) 10Andrew Bogott: Create an admin::labs class with one member: mike. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97639 [22:50:20] mwalker: ping [22:50:31] paravoid: what's up? [22:50:40] ... or down [22:50:47] two things [22:51:05] first, HideBanners is still on the top 5 of URLs to backend [22:51:11] (03CR) 10Andrew Bogott: [C: 032] Create an admin::labs class with one member: mike. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97639 (owner: 10Andrew Bogott) [22:51:15] i.e. your caching change hasn't probably been deployed [22:51:20] that's true [22:51:34] I can deploy it today; but only if greg-g releases me [22:51:41] well, you're running a FR test today [22:51:47] (greg-g is on vacation) [22:51:55] yes; and we'll be running them all week [22:51:55] so either don't run your FR test, or fix it :) [22:52:18] second [22:52:23] https://graphite.wikimedia.org/render/?title=HTTP 4xx Responses -1day&from=-1 day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color(cactiStyle(alias(reqstats.4xx,"4xx resp/min")),"blue") [22:52:51] I haven't troubleshot this yet at all, but seems very aligned to the start of your test so far [22:52:55] * greg-g delegated to rob-la, and paravoid is also a good sane person ;) [22:53:01] * greg-g goes back to being on vacation [22:53:10] * greg-g grumbles at unified irc account ;) [22:53:13] hehe [22:53:15] greg-g: GO AWAY [22:53:32] /away [22:53:37] :) [22:53:57] now's the part of the show where I apply six tiny patches in quick succession because each one is 'trivial' and 'sure to work' [22:54:11] RobH: as greg's official delegation; paravoid has an issue; I have a fix -- may I deploy: https://gerrit.wikimedia.org/r/#/c/97150/ [22:54:31] K4-713: ^ [22:54:33] mwalker: wrong rob [22:54:38] heh [22:54:50] oh; right... robla! ^ [22:55:02] buh? [22:55:18] (03PS1) 10Andrew Bogott: I guess we need a group for mike to belong to... [operations/puppet] - 10https://gerrit.wikimedia.org/r/97640 [22:55:21] so, without it, no tests, right mwalker ? [22:55:21] Oh, CN? [22:55:24] no FR tests, that is [22:55:35] greg-g: seems like it [22:55:36] (03PS1) 10Ottomata: Conditionally setting content attribute for net-topology.sh file. [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/97641 [22:55:51] ah, caching, good stuff [22:55:53] (03CR) 10Ottomata: [C: 032 V: 032] Conditionally setting content attribute for net-topology.sh file. [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/97641 (owner: 10Ottomata) [22:55:55] it's not causing huge operational problems [22:55:59] marktraceur: did you get your stuff deployed? 
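For reference on the duplicate-definition exchange between andrewbogott and ori-l above: the behaviour being described comes down to Puppet's two ways of declaring a class. A minimal sketch, using a hypothetical class name rather than anything from operations/puppet:

class foo($param = 'default') {
  notify { "foo declared with ${param}": }
}

# Safe: 'include' is idempotent, so repeated includes coexist happily.
include foo
include foo

# Not safe (left commented out): a second resource-like declaration of the
# same class is a compile error ("Duplicate declaration: Class[Foo] is
# already declared"), and so is a resource-like declaration that happens to
# be evaluated *after* an include of the same class -- which is how the
# failure can end up pointing at a manifest you did not think was involved.
#
#   class { 'foo': param => 'buzz' }   # duplicate, given the includes above
#
# The working combination is the other way around: one parametrised
# class { 'foo': param => 'buzz' } declared first, with any later
# 'include foo' reusing that declaration, parameters and all.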
[22:56:06] since it's not multiplied by 10 this time and hitting the poor small mobile cluster [22:56:43] so it /can/ wait [22:56:56] the 40x error count on the other hand is probably more serious [22:56:56] (03PS1) 10Ottomata: Updating cdh4 module to fix puppet error in labs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97642 [22:57:04] mwalker: how many more tests and when do you have? [22:57:08] (03CR) 10Ottomata: [C: 032 V: 032] Updating cdh4 module to fix puppet error in labs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97642 (owner: 10Ottomata) [22:57:13] (03CR) 10Andrew Bogott: [C: 032] I guess we need a group for mike to belong to... [operations/puppet] - 10https://gerrit.wikimedia.org/r/97640 (owner: 10Andrew Bogott) [22:57:13] robla: Yeah, it's set [22:57:20] I started debugging that and hit HideBanners [22:57:48] ottomata: go ahead and merge my one-liner along with your patch [22:57:51] mwalker: go for it [22:58:04] greg-g: probably every day from now until we make all our money -- we've another full test today that I know of [22:58:14] mwalker: yeah, then go [22:58:18] * greg-g waves to robla  [22:58:26] * greg-g goes back to ignoring -operations [22:58:54] ah, thanks andrewbogott got it [22:59:17] greg-g: I can kickban you if it helps [23:00:19] (03PS1) 10Ottomata: Ah! Need == 'present' for the net-toplogy.sh content check. [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/97643 [23:00:34] (03CR) 10Ottomata: [C: 032 V: 032] Ah! Need == 'present' for the net-toplogy.sh content check. [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/97643 (owner: 10Ottomata) [23:01:10] (03PS1) 10Ottomata: Updating cdh4 module again with fix for net-topology.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/97644 [23:01:19] (03PS2) 10Ottomata: Updating cdh4 module again with fix for net-topology.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/97644 [23:01:23] (03CR) 10Ottomata: [C: 032 V: 032] Updating cdh4 module again with fix for net-topology.sh [operations/puppet] - 10https://gerrit.wikimedia.org/r/97644 (owner: 10Ottomata) [23:01:49] mhoover: ok, /now/ try it? [23:01:58] robla: the patch we were talking about is already staged; along with https://gerrit.wikimedia.org/r/#/c/96646/ and https://gerrit.wikimedia.org/r/#/c/93602/ [23:01:59] virt1000 again, that's the only host I've changed so far [23:02:12] checking [23:02:30] may I deploy those as well; or would you prefer I cherry pick? [23:02:56] *staged in CentralNotice's deploy branch [23:03:01] * robla looks [23:03:05] hmm, nope [23:03:19] can you tell me the last few digits of the key being used? [23:03:51] 641gXdkB mhoover@wikimedia.org [23:04:12] That's in ls /home/mhoover/.ssh/authorized_keys [23:04:28] ok, i'll double check one sec... [23:04:48] looks correct, but i'll verify [23:07:00] paravoid: re the 4XX issue; do you want me to tell you when megan launches another 100% test? [23:07:19] or ... do you want us to hold off on that? [23:07:30] mwalker: does deploying those two turn this from a syncfile into a scap? [23:07:35] didn't she just did a few hours ago? [23:07:52] robla: shouldn't... /me looks at the patches again [23:08:05] about 21:30 UTC? [23:08:29] paravoid: yep yep; but we turned it off some minutes ago [23:09:08] how many minutes? [23:09:11] robla: I would probably do a sync-dir on the centralnotice-extension; no need for a scap [23:09:17] paravoid: looking [23:09:40] mwalker: ok, go for it. 
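On the per-user access being wired up for mhoover above: a rough sketch of the account, group and key resources involved, using Puppet's built-in types. This is a generic illustration, not the actual account/admin classes in operations/puppet; the names and key material are placeholders.

group { 'labsadmin':
  ensure => present,
}

user { 'mhoover':
  ensure     => present,
  home       => '/home/mhoover',
  managehome => true,   # without a home directory there is nowhere for authorized_keys to live
  shell      => '/bin/bash',
  groups     => ['labsadmin'],
  require    => Group['labsadmin'],
}

ssh_authorized_key { 'mhoover@wikimedia.org':
  ensure => present,
  user   => 'mhoover',
  type   => 'ssh-rsa',
  key    => 'AAAAB3...placeholder-public-key...',
}

ssh_authorized_key manages ~/.ssh/authorized_keys for the target user, which is why the account (and its home directory) has to exist first -- hence the require and managehome above.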
[23:10:22] paravoid: 2251 is when it came down [23:10:34] anyone have a puppet manifest that they're particularly fond of that I can use as a template? (package install, config files, db schema tie in?) [23:11:00] Coren: I need to leave in five… can I leave you with the task of figuring out mhoover's access? If you look at node "virt1000.wikimedia.org" it should be obvious what I'm trying to do (albeit not with fine results so far) [23:11:03] !log ori synchronized php-1.23wmf4/includes/WikiPage.php 'Avoid extra parsing in prepareContentForEdit() I2c34baaf8' [23:11:17] Logged the message, Master [23:11:20] robla: ok; prepping [23:11:22] !log ori synchronized php-1.23wmf4/tests/phpunit/includes/ArticleTablesTest.php 'Avoid extra parsing in prepareContentForEdit() I2c34baaf8' [23:11:34] mwalker: https://graphite.wikimedia.org/render/?title=HTTP%204xx%20Responses%20-1day&from=-1%20day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color%28cactiStyle%28alias%28reqstats.4xx,%224xx%20resp/min%22%29%29,%22blue%22%29 [23:11:36] Logged the message, Master [23:11:42] !log ori synchronized php-1.23wmf4/extensions/TemplateData 'Update TemplateData to master for I0782ea669' [23:11:45] still problems. thx andrew, i'll keep messing with it [23:11:56] Logged the message, Master [23:12:17] (done) [23:12:33] ori-l: thank you! [23:12:38] paravoid: ya, I agree that looks like us -- we went full throttle at 2142 [23:12:43] mhoover: can you try right now while I watch the log? [23:12:45] whee backports, thanks ori-l [23:13:16] yes, trying [23:13:34] ok, tried [23:14:23] welp, I don't know enough to interpret this. It says Connection closed by [23:15:01] the only thing i can spot with ssh -vvv is [23:15:02] Roaming not allowed by server [23:15:18] how many keys do you have in your agent? [23:15:22] :( I must be missing something. [23:15:26] just one at the moment [23:15:31] previously it was saying invalid user, but no longer. [23:15:48] setting that, then promptly going away to run errands! [23:15:49] mhoover: you try with -i /path/to/your/key ? [23:15:50] win [23:16:21] yeah, the same key operation works to the bast host [23:16:33] mhoover: sorry, I really have to run. Hopefully another op will take up the cause, otherwise we'll have to try more tomorrow. [23:16:54] no prob man, thanks for trying. ttyl [23:17:05] * andrewbogott hates leaving in the middle of a puzzle [23:18:41] sorry for not announcing my syncs to the channel earlier [23:19:01] too many context-switches, got distracted. [23:19:12] Coren, for reference, I'm trying to copy the pattern that allows hashar access to Jenkins boxes. But for mhoover + labs boxes. [23:19:44] could be a perms thing on my home dir or ssh dir [23:20:04] or the authorized_keys file. [23:20:07] your perms look ok, and it doesn't seem to be rejecting you for public key [23:20:17] mwalker: 2013-11-25 22:24:36 mw1022 metawiki: [fcc383fc] /wiki/Special:CentralNoticeBanners/edit/2009_Notice32 Exception from line 1298 of /usr/local/apache/common-local/php-1.23wmf4/extensions/CentralNotice/includes/Banner.php: Banner doesn't exist! [23:20:51] you're comin in through bast1001 via proxycommand right? [23:22:28] hm oh well gone [23:22:36] and so am I, toooo sleepy [23:23:20] ori-l: I don't have awesome error handling there -- so that's "somewhat" expected [23:23:31] if someone tries to edit a banner that... doesn't exist :p [23:23:49] was that something you triggered; or just something you saw? 
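On the "anyone have a puppet manifest ... I can use as a template?" question at 23:10: the usual starting point is the package -> config file -> service pattern, plus a guarded exec for one-off steps such as a schema load. Everything below (module name, paths, commands) is a placeholder sketch, not an existing operations/puppet module.

class mything {
  package { 'mything':
    ensure => present,
  }

  file { '/etc/mything/mything.conf':
    ensure  => file,
    owner   => 'root',
    group   => 'root',
    mode    => '0444',
    # a real module would use template('mything/mything.conf.erb') or a
    # puppet:/// file source; inline content keeps the sketch self-contained
    content => "db_host = localhost\ndb_name = mything\n",
    require => Package['mything'],
  }

  service { 'mything':
    ensure    => running,
    enable    => true,
    subscribe => File['/etc/mything/mything.conf'],
  }

  # one-off schema load, guarded by a stamp file so it only runs once
  exec { 'mything-load-schema':
    command => '/bin/sh -c "/usr/bin/mysql mything < /usr/share/mything/schema.sql && touch /var/lib/mything/.schema-loaded"',
    creates => '/var/lib/mything/.schema-loaded',
    require => Package['mything'],
  }
}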
[23:25:17] saw [23:25:55] only one such exception today [23:26:05] just happened tho [23:26:35] (03PS1) 10GWicke: WIP Bug 56282: Gzip all Parsoid HTML before storing it in Varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/97647 [23:27:25] *nods* I'll take a wander; it's interesting because I'm curious where that link came from [23:27:59] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 2d 21h 9m 57s [23:28:01] gwicke: what do you mean "make sure that [23:28:03] we don't vary on Accept-Encoding." [23:28:04] ? [23:28:21] (03PS1) 10Matanya: wikidata_singlenode : lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97648 [23:28:31] paravoid: we have clients that don't send the proper accept, and we want to only have a single gzipped copy in varnish [23:28:56] so we don't want to vary [23:29:04] varnish overrides the vary anyway [23:29:06] varnish will decompress for clients that need it [23:29:09] overrides/ignores [23:29:26] I am not so sure about that [23:29:32] we had to roll back a change that relied on that [23:29:37] (03CR) 10jenkins-bot: [V: 04-1] wikidata_singlenode : lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97648 (owner: 10Matanya) [23:30:05] } else if (params->http_gzip_support && [23:30:05] !strcasecmp(H_Accept_Encoding, (const char*)((*v1)+2))) { [23:30:08] /* [23:30:11] * If we do gzip processing, we do not vary on Accept-Encoding, [23:30:14] * because we want everybody to get the gzip'ed object, and [23:30:18] that's from [23:30:19] static int [23:30:19] vry_cmp(const uint8_t * const *v1, uint8_t * const *v2) [23:30:34] !log mwalker synchronized php-1.23wmf4/extensions/CentralNotice 'Updating CentralNotice to master (mostly for caching reasons' [23:30:49] Logged the message, Master [23:30:56] mwalker: thanks :) [23:31:16] can someone merge https://gerrit.wikimedia.org/r/#/c/97625/ [23:31:17] paravoid: that assumes that a gzipped object is in cache? [23:31:20] andrew +2'd it [23:31:24] but forgot to hit submit? [23:31:35] gwicke: yes [23:31:42] paravoid: wmf5 was already on master; so I didn't do that one [23:31:49] that's what I am trying to enforce [23:31:53] but you should be seeing a decrease in my traffic [23:32:19] YuviPanda: k [23:32:32] previously I added gzip support to Parsoid, but this was conditional on the accept headers (HTTP compliant) [23:32:35] YuviPanda: done [23:32:47] mwalker: well, I will next time you'll run a banner test I guess [23:32:48] if the non-gzip request came first, this would result in a non-gzip response to varnish [23:32:53] and thus two copies in cache [23:32:59] ori-l: thanke! [23:36:04] gwicke: varnish will always set Accept-Encoding: gzip when talking to the backend [23:36:25] gwicke: irrespective of whether the client accepts it or not [23:36:51] gwicke: so if a non-gzip request comes, varnish will fetch a gzipped response from the backend, store it in cache, uncompress it and serve it to the client [23:37:30] hmm, interesting- I did not perform a tcp dump when I tested this last, so that might be possible [23:38:14] it did not seem to ignore the Vary: Accept-Encoding the express framework sent out though [23:38:45] or maybe there was another reason for unexpected cache misses [23:43:10] paravoid: do you think that ensuring gzip compression in the Parsoid varnishes is a good idea? [23:43:30] you mean instead of doing it with Node? 
[23:43:35] the alternative would be to always return gzip-encoded content without the vary header from Parsoid [23:43:52] no, vary shouldn't matter [23:44:22] the alternative would be to make node (express?) obey accept-encoding and serve gzipped content when asked to [23:44:41] that's what we had earlier [23:44:53] and why didn't it work? [23:45:37] I can try to re-enable it, it is well possible that there was another bug that made gzip look like the likely culprit when it was not [23:46:13] we've had at least gunzip bug with varnish that we know is still outstanding [23:46:17] it doesn't happen always thought [23:46:48] and in most cases varnish doesn't do gunzip at all, since most user agents accept gzip content [23:47:02] yeah, VE sadly doesn't yet [23:47:04] (an exception is Range requests, so we have these disabled for text-lb/mobile-lb right now) [23:47:28] so it's quite possible you were hitting that bug [23:48:48] the symptoms looked as if varnish was varying on Accept-Encoding (cache miss except from identical client), but we didn't investigate it too thoroughly as we were dealing with other issues too [23:49:36] I'll push a patch to parsoid master and test that on betalabs [23:50:03] or some other vm [23:50:12] (03CR) 10Dzahn: "eh yea, role::bugzilla already exists, including the misc:: classes, i'd change that to actually make the switch" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [23:58:03] okay
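A sketch of where the gzip-in-Varnish discussion above points. Gerrit change 97647 is marked WIP, so this is not its actual content: with Varnish 3, forcing beresp.do_gzip in vcl_fetch stores a single gzipped copy of each Parsoid response, and Varnish gunzips on delivery for the rare client that does not send Accept-Encoding: gzip, which matches the behaviour paravoid describes. The file path and the way the fragment would be wired into the cache role are assumptions.

file { '/etc/varnish/parsoid-gzip.inc.vcl':
  ensure  => file,
  owner   => 'root',
  group   => 'root',
  mode    => '0444',
  content => "sub vcl_fetch {
  // gzip uncompressed HTML from the backend before it enters the cache,
  // so only one (compressed) copy is stored per URL
  if (beresp.http.Content-Type ~ \"text/html\" && !beresp.http.Content-Encoding) {
    set beresp.do_gzip = true;
  }
}
",
  # in practice this would notify the varnish service defined elsewhere
  # in the cache role; omitted here to keep the sketch standalone
}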