[00:00:10] AaronSchulz: still need https://gerrit.wikimedia.org/r/#/c/213010/ deployed? [00:00:19] bblack: uselang=user (default) disables caching for logged in users since it varies per user. uselang=content forces any localization to use the content language, which can be cached [00:00:22] sure [00:00:24] Merged. [00:00:33] is it a bad idea to do that now? [00:00:40] Or +2'd at least. [00:01:21] well it might help paper over the request rate from logged-in users? [00:01:32] but it doesn't tell us what broke recently or why these requests sometimes hang forever [00:02:47] yeah, just trying to make it less worse [00:04:55] I see a lot of "CAS update failed on user_touched for user ID '9974988';the version of the user to be saved is older than the current version." - can this cause slowdowns? [00:05:39] (CR) 20after4: [C: 2] Fixed totally broken runner JSON response code [mediawiki-config] - https://gerrit.wikimedia.org/r/213010 (owner: Aaron Schulz) [00:05:46] (Merged) jenkins-bot: Fixed totally broken runner JSON response code [mediawiki-config] - https://gerrit.wikimedia.org/r/213010 (owner: Aaron Schulz) [00:05:49] sounds like old news that mostly affects certain users [00:07:27] so [00:07:32] back to: http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1432768192.869&target=MediaWiki.xhprof.ApiQueryFileRepoInfo.execute.calls.rate [00:07:38] operations, Wikimedia-Logstash, Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1316247 (bd808) 1.3.6 was the version I installed on the Jessie boxes for Logstash because that matches the version that is installed on the Pr... [00:07:48] what could possibly have changed to massively increase the rate of hits on repoinfo? 
[00:07:50] !log twentyafterfour Synchronized rpc/RunJobs.php: deploy I98b8a4ddbcdd58d1f2f23e4b1bf154f10b6b279e (duration: 00m 17s) [00:07:55] Logged the message, Master [00:07:57] (03PS1) 10MaxSem: Fix exceptionmonitor [puppet] - 10https://gerrit.wikimedia.org/r/214271 [00:08:12] btw, can someone review ^^^ [00:08:22] inda delayed my investigations today [00:08:30] bblack: it's hard to tell but is it maybe starting to taper off now? [00:08:31] s/inda/kinda/ [00:08:33] (and given timing, whatever changed went out with the 21:00 train) [00:08:53] we know we haven't fixed whatever it was [00:08:55] twentyafterfour, we're still getting hanging requests [00:09:36] bblack, is the rate increase consistent with retries? [00:09:36] unless the revert fixed it but we're still caching the effects of it [00:09:41] but the problem with repoinfo probably already existed, right? it's just that it's getting hit harder now? I was thinking the 'hit harder' should be tapering off at least slightly now [00:10:02] not enough [00:10:36] MaxSem: the thing is, most of them eventually succeed. The 503's are relatively-rare, there's just a lot of requests [00:10:46] the hangs are relatively-rare too [00:11:01] but it doesn't take many to screw things up [00:11:03] (03CR) 1020after4: [C: 031] Fix exceptionmonitor [puppet] - 10https://gerrit.wikimedia.org/r/214271 (owner: 10MaxSem) [00:11:24] uhh [00:11:26] twentyafterfour: the graph linked above does show it stabilizing some, but still way higher than before the train [00:11:36] <3 shotgun debugging [00:12:21] so it must be something in cache referencing it, right? there doesn't seem to be any other explanation [00:12:27] whatever client-side issues might have there been, they shoudn't last longer than 15 minutes [00:12:44] ? 
[00:12:56] RL cache [00:13:00] I could be more helpful if I even knew what repoinfo does or what might be referincing calls to it [00:13:47] repoinfo just dumps the information about file repositories (commons + local wiki) [00:13:59] only callers I could find in extensions are VE and MMV [00:14:07] wait let's rewind a bit [00:14:21] the problems spike off circa 21:00. That's not when the train update was in the deploy cal [00:14:28] what actually happened around that time [00:14:46] output looks like http://fpaste.org/226328/43277207/raw/ [00:14:50] (on mw.o) [00:15:58] !sal [00:15:58] https://labsconsole.wikimedia.org/wiki/Server_Admin_Log see it and you will know all you need [00:16:03] bblack: if you include RL caching, that means anything 15m before too [00:16:34] I've got this from SAL in that window-ish: [00:16:35] 21:10 logmsgbot: twentyafterfour Synchronized php-1.26wmf8: Fix ConfirmEdit fatal Change-Id: I22353669a85391c3d9760a5253cac1263e895cf9 (duration: 01m 08s) [00:16:37] 20:45 logmsgbot: twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf8 [00:16:37] 20:41 logmsgbot: twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.26wmf7 [00:16:38] 20:46 logmsgbot: twentyafterfour Purged l10n cache for 1.26wmf6 [00:16:41] 20:45 logmsgbot: twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf8 [00:16:44] 20:41 logmsgbot: twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.26wmf7 [00:16:47] 20:36 logmsgbot: twentyafterfour Finished scap: testwiki to php-1.26wmf8 and rebuild l10n cache (duration: 67m 53s) [00:16:50] so this was the train move, it was just delayed? 
[00:17:54] yep, frontend changes start kicking in immediately, but it takes 15 minutes to fully propagate [00:18:18] that's a huge pile of changes in each of those version bumps, jeez [00:18:22] 20:41+15 mins = ~21:00 where we see the spike [00:22:43] what I don't understand is this: the problem must have been present in wmf7 right? but wmf7 went out yesterday to group1 [00:22:53] why we didn't see any spike before? [00:23:04] because group1 is nothing [00:23:17] nothing? as in, not enough traffic? [00:23:29] apparently [00:23:31] yup [00:23:37] enwiki takes it all [00:23:54] https://gerrit.wikimedia.org/r/#/c/207661/ <- just trawling through wmf7 changes for things that affect caching in some way. this one looks kinda interesting? [00:23:57] RL though [00:25:02] bblack, I already pointed at it. but it doesn't look like it could break anything in the api [00:26:06] so, where does the code that handles that repoinfo request actually live? [00:26:24] includes/api/ApiQueryFilerepoinfo.php [00:26:31] hasn't changed in a while [00:26:44] of course, there's a bunch of dependencies [00:27:44] the api hasn't been touched substantially in a while, while OutputPage, another usual suspect in caching issues, was touched only by the above change [00:28:53] API requests shouldn't use OutputPage though [00:30:05] nah, ApiMain touches it [00:30:41] well we've got the long execution time in repoinfo, but it could have always been that way and just been hit at a much lower rate (or not) [00:31:15] we've got a higher hitrate on that URL, which could be new or could be due to client or varnish retries due to timeouts/failures (maybe?) 
[00:31:31] so solving that would be good even if the long execution isn't a new problem [00:32:00] we've got the caches that serve that URL trying to fall to pieces, but it's hard to say if that's because of repoinfourl, or perhaps some other unrelated cacheability issue is hurting those caches, which is causing more uncacheable repoinfo hits to lead to the above [00:32:28] bblack I wonder if varnish would still pipeline an anon api request that's hanging forever and there are 200 more requests to this url [00:33:44] s/pipeline/coalesce/? [00:33:53] err, yes:P [00:34:22] the only time it doesn't coalesce a cacheable response is if req.hash_ignore_busy is set, and only do that for a few special things in the text VCL (restbase, CentralAutoLogin) [00:34:53] so... the anonymous hits on that supposedly cacheable API URL should be coalescing [00:35:10] mmm, can it be a new cookie that busted cache for anons? [00:35:11] and in general, we do see a ton of hits on it that are cached [00:35:31] anything cookie related that changed could impact this from a couple of different angles [00:35:32] or that would've caused an apocalypse? [00:35:44] depends how common it ended up being [00:36:11] (meaning that there are things way more popular than filerepoinfo requests) [00:36:29] but logged-in users' hits on this would never be cacheable. so if there was some generic change that lead to increased calls, the hits from anons might get cache-absorbed but not the ones from logged-in [00:36:56] ditto for increased execution time I guess [00:37:39] it's an important point that I'd expect there to be many things more popular than filerepoinfo requests, and we're still getting a lot of those [00:38:44] cookie related there was https://gerrit.wikimedia.org/r/#/c/176948/ [00:39:08] * mw_hidetoc -> {wikiprefix} hidetoc * mediaWiki.user.sessionId -> {wikiprefix} mwuser-session * mediaWiki.user.bucket -> {wikiprefix} mwuser-bucket [00:39:31] oh fuck really? 
[00:39:33] Make [00:39:33] use of this opportunity to rename some cookie names. [00:39:34] * mw_hidetoc -> {wikiprefix} hidetoc [00:39:34] * mediaWiki.user.sessionId -> {wikiprefix} mwuser-session [00:39:34] * mediaWiki.user.bucket -> {wikiprefix} mwuser-bucket [00:40:03] is that bad? [00:41:17] I thought we adopted a policy that all cookie changes had to be vetted by ops? [00:41:21] well I don't know, but I was totally unaware of it and those are pretty potentially big changes for us [00:41:37] I'm still looking at exactly what the effect was there [00:42:50] if (req.http.Cookie ~ "([sS]ession|Token)=") { [00:42:51] set req.hash_ignore_busy = true; [00:42:51] } else { [00:43:27] mtherfuck [00:43:31] ...? [00:43:33] wow [00:43:49] but the cookie already had session in its name? [00:43:49] changed cookie name started matching this regex [00:44:06] well, to match it had to end with "session" [00:44:10] oh. [00:44:11] fuck. [00:44:17] yes.... [00:44:37] which is exactly the case we hit before when we instituted the rule about ops reviewing cookies heh [00:44:49] and here we have a req.hash_ignore_busy = true; that could've multiplied slow requests [00:45:12] more importantly, anyone with a cookie name ending in session|token has all their hits go uncached [00:45:21] https://gerrit.wikimedia.org/r/#/c/214277/ <-- revert [00:45:46] but we've already undone this with the earlier revert right? 
[00:45:47] https://gerrit.wikimedia.org/r/#/c/214278/ <-- too [00:45:49] :P [00:46:02] fwiw phabricator would allow us to enforce that certain changes get vetted by certain people/projects [00:46:07] probably, if this is the source of the issue, we need to do something to kill those cookies as well [00:46:14] once we kill gerrit ;) [00:46:30] let me poke at actual traffic/cookies a bit and see what it looks like [00:48:37] twentyafterfour, this is something that ideally would have been caught at the wikimedia train deployment process, not something for a mediawiki commit with proper release notes to worry about [00:48:45] yeah this has to be it, or at least a big part of it [00:49:09] I've got live anon request traffic that looks like: [00:49:10] 707 RxHeader c Cookie: GeoIP=CA:Vancouver:49.2499:-123.0556:v4; WMF-Last-Access=28-May-2015; uls-previous-languages=%5B%22sl%22%5D; slwikimwuser-session=1501e7f9bac07df0 [00:49:31] (no real logged-in cookie, but has the new cookie from that change, so this is now uncacheable whereas it would have cached before) [00:49:34] unfortunately this commit didn't have such notes [00:50:05] so to confirm, we've already reverted this code by reverting wmf7 right? [00:50:18] Krenair: what I mean is that herald can add reviewers from ops on changes involving cookies [00:50:19] bblack: wmf7 is still live on group1 sites I think [00:50:26] well sure but I mean in the big cases [00:50:29] yes [00:50:30] no wmf7 is not live anywhere [00:50:37] oh? [00:50:42] I rolled ALL back to wmf6 [00:50:43] probably the easiest fix now would be to destroy that cookie with some Set-Cookie hacking in varnish [00:50:55] to get rid of the extant ones still out there [00:50:55] twentyafterfour, you could, but it wouldn't require them to sign off, it'd just alert them, right? [00:51:02] ok, well I have backports to wmf7 and wmf8 [00:51:15] bblack: could we just stick something in common.js to delete any "enwikimwuser-session" cookies? 
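For anyone reading this log later, the accidental match discussed above can be reproduced outside Varnish. The following is only a rough sketch in Python, not the VCL itself: the pattern is the one quoted in the log, while the helper name `bypasses_cache` and the sample Cookie headers (shortened, partly made up) are assumptions for illustration.

```python
import re

# Pattern as quoted in the log from the text VCL. Python's `re` stands in
# for Varnish's regex engine here, which is an approximation.
UNCACHEABLE_COOKIE = re.compile(r"([sS]ession|Token)=")


def bypasses_cache(cookie_header: str) -> bool:
    """Hypothetical helper: True if the header would trip the session/token rule."""
    return UNCACHEABLE_COOKIE.search(cookie_header) is not None


# Illustrative headers, not captured traffic.
old_anon = "GeoIP=CA:Vancouver; mediaWiki.user.sessionId=1501e7f9bac07df0"
new_anon = "GeoIP=CA:Vancouver; slwikimwuser-session=1501e7f9bac07df0"
logged_in = "enwikiSession=abc123; centralauth_Token=def456"

for label, header in [("old anon", old_anon), ("new anon", new_anon), ("logged in", logged_in)]:
    print(f"{label}: bypasses cache = {bypasses_cache(header)}")
```

The old name is followed by `Id=`, so `session=` never appears in it; the renamed cookie ends in `session=`, which is why anonymous requests carrying it stopped being cacheable.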
[00:51:15] Krenair: it could be either way [00:51:18] either that or make another temporary exception to that session|token logic, that might be better [00:51:23] or that [00:51:25] it'd be quite shitty to non-wikimedia users if suddenly wmf ops got to veto mediawiki changes and their own ops teams did not [00:52:02] I really don't think now is the time or place for that conversation. [00:52:13] yeah [00:52:32] we'll need a negative lookahead in the regex to fix it that way, but it's cleaner [00:52:35] sorry didn't mean to start a discussion in the middle of fixing this.. [00:52:56] Krenair, would be nice if non-wmf changes not kill the website of an org that's funding a significant part of this project's development [00:54:11] bblack: how long do varnish changes take to deploy? [00:54:15] problem with the deployment process, not mediawiki [00:55:36] legoktm: not long at all [00:55:51] (03PS1) 10BBlack: Negative lookahead for mwuser-session cookies [puppet] - 10https://gerrit.wikimedia.org/r/214281 [00:55:56] sanity-check that? [00:57:43] (03CR) 10MaxSem: [C: 04-1] "That's a lookahead assertion, you need lookbehind: ? oh doh [00:58:16] but more important, I'm trying to wrap my ahead around all the forgotten details of how session|token cookies worked before all this [00:58:19] I swear, I had to look it up myself:) [00:58:26] !log legoktm Synchronized php-1.26wmf7/resources/: Revert "Convert mediawiki.toc and mediawiki.user to using mw.cookie" (duration: 00m 13s) [00:58:29] Logged the message, Master [00:58:48] (03PS2) 10BBlack: Negative lookahead for mwuser-session cookies [puppet] - 10https://gerrit.wikimedia.org/r/214281 [00:59:08] !log legoktm Synchronized php-1.26wmf8/resources/: Revert "Convert mediawiki.toc and mediawiki.user to using mw.cookie" (duration: 00m 17s) [00:59:12] Logged the message, Master [00:59:24] oh right we're using the centralauth token cookies to do real logins [00:59:28] ok [00:59:53] that and e.g. 
enwikiSession= [01:00:21] this thing from the javascript in question is something else entirely, I'm not sure what, but it doesn't match up 1:1 with real login sessions I don't think [01:00:23] (03CR) 10MaxSem: Negative lookahead for mwuser-session cookies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/214281 (owner: 10BBlack) [01:00:55] I think we have both session and sessionId cookies [01:00:58] or something [01:01:27] right but we only made session= uncacheable in the past, not sessionId [01:01:40] sessionId says "Get an automatically generated random ID (stored in a session cookie)" [01:01:48] > This ID is ephemeral for everyone, staying in their browser only until they close their browser [01:01:55] logged in users get them too [01:01:59] er, logged out* [01:02:41] I have an enwikiSession and mediaWiki.user.sessionId [01:03:08] right [01:03:20] MaxSem: what was "wrong assertion"? [01:03:21] so enwikiSession is PHP session, mw.user.sessionId is some frontend thingie [01:03:32] bblack, it still says lookahead:) [01:04:27] oh right :) [01:05:16] and there is centralauth_Session [01:05:57] yeah [01:06:11] (03PS3) 10BBlack: Negative lookahead for mwuser-session cookies [puppet] - 10https://gerrit.wikimedia.org/r/214281 [01:06:31] anyways I could stumble around trying to better document/plan what we'll do about removing the cookie/code later, but let's put this in now? 
[01:06:44] I have to take care of some irl stuff now, going offline [01:06:49] ok [01:07:00] lookbehind still, commitmsg heh [01:07:08] heh [01:07:15] (PS4) BBlack: Negative lookbehind for mwuser-session cookies [puppet] - https://gerrit.wikimedia.org/r/214281 [01:07:37] (CR) 20after4: [C: 1] Negative lookbehind for mwuser-session cookies [puppet] - https://gerrit.wikimedia.org/r/214281 (owner: BBlack) [01:08:05] (CR) BBlack: [C: 2 V: 2] Negative lookbehind for mwuser-session cookies [puppet] - https://gerrit.wikimedia.org/r/214281 (owner: BBlack) [01:09:36] hopefully salt will actually work :P [01:09:47] * MaxSem waits for that change to destroy the cluster because nobody can read regexes [01:10:16] man fuck salt [01:10:42] "No manual entry for fuck" [01:11:38] lol [01:12:34] haha [01:12:41] :D [01:13:10] What if we replaced puppet with salt? [01:13:44] Negative24, s/monstrosity/monstrosity/ [01:14:13] "now you've got two problems!" [01:15:55] health indicators are slowly getting better [01:16:17] I'm re-running salt over and over and it's applying in more places, but in general it would get there in ~20 minutes anyways [01:16:48] call rate is not there either http://graphite.wikimedia.org/render/?width=590&height=315&_salt=1432775781.143&target=MediaWiki.xhprof.ApiQueryFileRepoInfo.execute.calls.rate&from=-8hours [01:17:44] ok the backend healths I've been watching all this time just fully recovered on a few test hosts I'm looking at [01:17:59] so at least, this seems to have been the right fix for the right thing at some level [01:19:09] so yes, whatever other general issues may have been long dormant in repoinfo, it was just a prominent victim, not the cause [01:21:50] bblack, are you in the US? 
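The lookahead/lookbehind confusion in the review above is easier to see with a concrete pattern. The log does not show the regex that was actually merged in change 214281, so the `NEW_RULE` below is an assumed shape of such a fix, written as a Python sketch; `is_uncacheable` and the cookie values are likewise hypothetical.

```python
import re

# Rule as quoted earlier in the log.
OLD_RULE = re.compile(r"([sS]ession|Token)=")

# Assumed fix, NOT the merged VCL: a negative lookbehind that refuses to
# match "session" when it is immediately preceded by "mwuser-".
NEW_RULE = re.compile(r"((?<!mwuser-)[sS]ession|Token)=")


def is_uncacheable(cookie_header: str, rule: re.Pattern) -> bool:
    """Hypothetical helper: True if the rule treats the request as logged-in."""
    return rule.search(cookie_header) is not None


samples = {
    "frontend cookie only": "slwikimwuser-session=1501e7f9bac07df0",
    "PHP session": "enwikiSession=abc123",
    "CentralAuth token": "centralauth_Token=def456",
}

for label, header in samples.items():
    print(f"{label}: old={is_uncacheable(header, OLD_RULE)} new={is_uncacheable(header, NEW_RULE)}")
```

A lookahead would inspect what follows `session`, which is always `=` in both cases; the distinguishing text (`mwuser-`) sits before the match, hence the need for a lookbehind.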
[01:22:33] yeah [01:23:06] would've sucked to fix that on european time [01:23:49] Wow weird rabbit hole, thanks bblack [01:24:16] well, we had this rabbit hole before, which I think resulted in a "rule" about ops reviewing cookie changes [01:24:24] but I don't know that that rule was ever really documented well [01:24:40] outside of people remembering from the previous incident and the discussion then [01:25:38] Needs a big caps comment I guess [01:26:05] probably even that rule isn't the right solution [01:26:26] the right solution is to document what our contract is on the meaning of cookie names for caching purposes somewhere [01:26:41] so that the rule isn't "consult ops", but "do not name new cookies matching this pattern, unless you know what it means" [01:27:20] Right, this is easy to get wrong atm [01:27:25] bblack, then it would shift the blame on ops who didn't update this rule when changing vcl [01:27:49] well sure, but that seems less volatile than random featurework introducing new cookie names on the mw side [01:29:03] (or of course, we could get busy on a project to make most logged-in hits cacheable and get rid of the session|token exclusion) [01:29:49] which is all about finding other ways to communicate "hey I just edited this one thing, so don't cache it for me for the next few minutes" via headers/js/etc [01:30:06] then we wouldn't need the general-case exclusion for all the other logged-in hits that aren't on things that person just edited [01:30:50] (and then there's probably some other BS about whether some of the content actually varies per user, and whether it should) [01:31:58] that is, we need X-Vary-Options back? [01:32:14] I don't know that XVO is really the right thing here [01:34:03] who knows [01:34:22] I'm late for drinks and basketball. 
My team is trying to hurry and finish losing to your SF one :P [01:34:37] one big thing with logged-in pageviews is parser cache fragmentation [01:34:48] I'll write up an outage report type of thing when I get back later tonight [01:35:07] cool, I'm heading home [01:35:21] https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=rockets%20vs%20golden%20state [01:37:20] (03PS1) 10Andrew Bogott: Only include libmysql-ruby and libldap-ruby1.8 on precise. [puppet] - 10https://gerrit.wikimedia.org/r/214283 [01:39:06] so now what to do about wmf7 and 8 [01:39:30] do not want [01:39:39] :) [01:40:35] its safe to return them to where they were today, but not now [01:40:57] yeah [01:41:08] deploy the train tomorrow? [01:41:11] should just wait for tomorrow? I know everyone is worn out [01:41:26] we can pretend we started the new deployment schedule early [01:41:41] Guest42370, you got k-lined? :O [01:42:08] (03CR) 10Andrew Bogott: [C: 032] Only include libmysql-ruby and libldap-ruby1.8 on precise. [puppet] - 10https://gerrit.wikimedia.org/r/214283 (owner: 10Andrew Bogott) [01:43:35] MaxSem: nfi. I'm emailing freenode now [01:44:17] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [01:52:37] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [01:55:17] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 287 bytes in 1.381 second response time [01:58:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 21 data above and 8 below the confidence bounds [02:01:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [02:05:09] ahem, 40K db errors per second? dberrors & oom? 
[02:05:15] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [02:07:00] 2359 JobQueueGroup::__destruct: 1 buffered job(s) never inserted. in /srv/mediawiki/php-1.26wmf6/includes/jobqueue/JobQueueGroup.php on line [02:07:00] 419 [02:07:17] AaronSchulz, ^ [02:08:36] geez [02:09:09] those " Can't connect to MySQL server" aren't good [02:11:10] it doesn't look like there are any new ones coming through on fluorine [02:11:45] hmm [02:12:00] momentary hiccup? [02:12:15] perhaps [02:12:20] concerning nonetheless [02:12:38] springle, you there? [02:17:04] Krenair: aye see it [02:17:21] another error coming up is stuff like "The MariaDB server is running with the --read-only option so it cannot execute this statement (10.64.16.157)" [02:17:42] parsercache [02:19:12] that'll do it [02:19:14] this is all in logstash though - I don't understand why it's not showing up on fluorine logs [02:19:36] two of the three pc boxes were upgraded. read only is a bug [02:19:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [02:20:50] !log set global read_only=0 on pc1001 pc1002. 
this config broke in the recent upgrade [02:21:03] Logged the message, Master [02:22:00] I'm not sure that's the only thing though [02:23:35] "Can't connect to MySQL server on '10.64.48.20' (4)" for example, 2015-05-28T02:22:16.678Z [02:23:37] indeed, however the (4) EINTR connnection issues is known T98489 [02:24:40] ahh [02:26:55] (03PS1) 10Springle: parsercache has no master; every box needs to be writeable [puppet] - 10https://gerrit.wikimedia.org/r/214287 [02:36:06] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 06m 46s) [02:36:13] Logged the message, Master [02:36:35] (03PS1) 10Legoktm: Lint JSON files [tools/scap] - 10https://gerrit.wikimedia.org/r/214288 (https://phabricator.wikimedia.org/T100600) [02:40:17] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (100170s 100000s) [02:41:03] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-28 02:40:00+00:00 [02:41:07] Logged the message, Master [02:51:17] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 5 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1316466 (10Arlolra) [03:07:31] (03PS1) 10Andrew Bogott: Unclude trusty or jessie-appropriate packages for libldap and libldap-ruby. [puppet] - 10https://gerrit.wikimedia.org/r/214290 [03:08:10] (03PS2) 10Andrew Bogott: Include trusty or jessie-appropriate packages for libldap and libldap-ruby. [puppet] - 10https://gerrit.wikimedia.org/r/214290 [03:08:56] (03CR) 10jenkins-bot: [V: 04-1] Include trusty or jessie-appropriate packages for libldap and libldap-ruby. [puppet] - 10https://gerrit.wikimedia.org/r/214290 (owner: 10Andrew Bogott) [03:10:13] (03PS3) 10Andrew Bogott: Include trusty or jessie-appropriate packages for libldap and libldap-ruby. 
[puppet] - 10https://gerrit.wikimedia.org/r/214290 [03:18:38] (03PS4) 10Andrew Bogott: Include trusty or jessie-appropriate packages for libmysql-ruby. [puppet] - 10https://gerrit.wikimedia.org/r/214290 [03:34:25] aww, ExtensionDistributor updates got reverted in the train rollback :( [03:49:23] (03PS1) 10BryanDavis: Set HHVM mysql connection timeout to 3s on app and api servers [puppet] - 10https://gerrit.wikimedia.org/r/214295 (https://phabricator.wikimedia.org/T98489) [03:54:39] (03CR) 1020after4: [C: 031] Lint JSON files [tools/scap] - 10https://gerrit.wikimedia.org/r/214288 (https://phabricator.wikimedia.org/T100600) (owner: 10Legoktm) [04:10:46] 6operations: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1316572 (10MZMcBride) I've always been slightly annoyed at the domain name being doc.wikimedia.org instead of docs.wikimedia.org. I'd prefer a move, but a redirect is a decent alternative. [04:22:45] !log reload dbstore1002 s7 [04:22:50] Logged the message, Master [04:52:02] SF won :P [05:10:28] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (12462 100000s) [05:33:52] you mean oakland? 
[05:58:27] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [06:00:13] eh close enough :) [06:02:42] (03PS6) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [06:03:25] (03CR) 10KartikMistry: CX: Log to logstash (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: 10KartikMistry) [06:03:35] (03PS7) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [06:07:48] (03PS8) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [06:09:29] (03CR) 10Mobrovac: [C: 04-1] "One in-lined comment." [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: 10KartikMistry) [06:11:00] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 28 06:09:56 UTC 2015 (duration 9m 55s) [06:11:06] Logged the message, Master [06:11:12] (03CR) 10Mobrovac: CX: Log to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: 10KartikMistry) [06:12:03] preliminary incident rep from earlier: feel free to amend/improve it, I'm zonked out and offline till tomorrow: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150527-Cookie [06:22:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [06:25:59] (03CR) 10Giuseppe Lavagetto: [C: 031] "I'll merge this during the day if no one has anything against it." 
[puppet] - 10https://gerrit.wikimedia.org/r/214295 (https://phabricator.wikimedia.org/T98489) (owner: 10BryanDavis) [06:33:27] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures [06:33:38] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures [06:33:47] PROBLEM - puppet last run on mw1226 is CRITICAL Puppet has 1 failures [06:33:56] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures [06:34:07] PROBLEM - puppet last run on db2018 is CRITICAL Puppet has 1 failures [06:34:27] PROBLEM - puppet last run on analytics1030 is CRITICAL Puppet has 2 failures [06:34:36] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [06:35:17] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures [06:35:18] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:35:26] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures [06:35:26] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:35:26] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:35:56] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures [06:42:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [06:45:37] RECOVERY - puppet last run on mw1226 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:45:46] (03CR) 10Jcrespo: [C: 032] "Ops :-/" [puppet] - 10https://gerrit.wikimedia.org/r/214287 (owner: 10Springle) [06:46:17] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:46:57] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:07] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:47:07] RECOVERY - puppet last run on wtp2008 is OK 
Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:16] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:47:16] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:47:16] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:47:26] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:36] RECOVERY - puppet last run on db2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:46] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:06] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:57] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:39:59] (03CR) 10Filippo Giunchedi: [C: 031] Add diamond collector_module class [puppet] - 10https://gerrit.wikimedia.org/r/214053 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [07:49:13] 6operations, 7database: Document x1 DB requirements for new wikis - https://phabricator.wikimedia.org/T100527#1316926 (10jcrespo) This is not a database/deployment issue, but a mediawiki one- the installation scripts failed (which should create the database with no problem). [08:03:06] (03CR) 10Filippo Giunchedi: "minor nits but LGTM overall!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata) [08:07:41] 6operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1316959 (10jcrespo) Actionables: - Check `read_only` on master servers on icinga with a puppet rule - Create a query on kibana-logstash for db-related errors - Plot `SELECT sum(errors) as errors, sum(warnings) as w... [08:08:14] (03PS4) 10Filippo Giunchedi: Only return mostly fresh data for elasticsearch ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/212322 (owner: 10EBernhardson) [08:08:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Only return mostly fresh data for elasticsearch ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/212322 (owner: 10EBernhardson) [08:17:31] (03CR) 10Filippo Giunchedi: "merged and looks good so far, puppet restarted gmond by itself. Is there a related phab ticket?" [puppet] - 10https://gerrit.wikimedia.org/r/212322 (owner: 10EBernhardson) [08:23:06] (03PS1) 10Jcrespo: Upgrade pc1003 to MariaDB 10 and trusty [puppet] - 10https://gerrit.wikimedia.org/r/214307 [08:25:39] (03PS9) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [08:30:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [08:57:22] (03PS3) 10Filippo Giunchedi: es-tool: try harder to enable replication [puppet] - 10https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99005) [08:58:05] (03PS2) 10Mjbmr: Enable SandboxLink for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214247 (https://phabricator.wikimedia.org/T100513) [09:02:29] (03PS4) 10Filippo Giunchedi: es-tool: try harder to enable replication [puppet] - 10https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99005) [09:02:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] es-tool: 
try harder to enable replication [puppet] - https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99005) (owner: Filippo Giunchedi) [09:03:34] (CR) Filippo Giunchedi: "tested in beta via simulation while es-tool enabled replication again:" [puppet] - https://gerrit.wikimedia.org/r/211672 (https://phabricator.wikimedia.org/T99005) (owner: Filippo Giunchedi) [09:09:36] (PS2) Jcrespo: Upgrade pc1003 to MariaDB 10 and trusty [puppet] - https://gerrit.wikimedia.org/r/214307 [09:10:50] (CR) Jcrespo: [C: 2] Upgrade pc1003 to MariaDB 10 and trusty [puppet] - https://gerrit.wikimedia.org/r/214307 (owner: Jcrespo) [09:23:42] operations, Wikimedia-Apache-configuration, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#1317040 (hashar) Would be nice to get rid of those errors. They are quite spammy... [09:25:31] (PS1) Jcrespo: Depool pc1003 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/214312 [09:33:50] (PS1) ArielGlenn: salt master: wake up minions after master key rotation [puppet] - https://gerrit.wikimedia.org/r/214314 [09:34:43] (CR) ArielGlenn: [C: 2] salt master: wake up minions after master key rotation [puppet] - https://gerrit.wikimedia.org/r/214314 (owner: ArielGlenn) [09:35:28] (CR) Jcrespo: [C: 2] Depool pc1003 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/214312 (owner: Jcrespo) [09:35:37] <_joe_> !log pooling mw1152 into the imagescalers pool after fixes made in Lyon [09:35:44] Logged the message, Master [09:37:37] !log jynus Synchronized wmf-config/db-eqiad.php: Depool pc1003 (duration: 00m 15s) [09:37:41] Logged the message, Master [09:39:26] (CR) Filippo Giunchedi: Basic role for Sentry (7 comments) [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 
Gilles) [09:40:17] operations, Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317056 (yuvipanda) Yeah, I agree :) We could perhaps try switching it off after the move to designate. [09:42:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [09:43:08] (PS2) Filippo Giunchedi: initial debian packaging [debs/python-etcd] - https://gerrit.wikimedia.org/r/212528 (https://phabricator.wikimedia.org/T99771) [09:43:37] (CR) Filippo Giunchedi: [C: 2 V: 2] initial debian packaging [debs/python-etcd] - https://gerrit.wikimedia.org/r/212528 (https://phabricator.wikimedia.org/T99771) (owner: Filippo Giunchedi) [09:44:05] operations, Labs: salt does not run reliably for toollabs - https://phabricator.wikimedia.org/T99213#1317060 (ArielGlenn) this is now live on virt1000. https://gerrit.wikimedia.org/r/#/c/214314/ [09:48:41] <_joe_> !log depooling the HHVM appserver. 503s reduced slightly but still non-irrelevant [09:48:45] Logged the message, Master [09:51:36] operations, database: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501#1317066 (Ciencia_Al_Poder) [09:55:57] I think sync-file is responsible for this: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1hours&from=-1hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.500,%22500%20resp/min%22%29%29,%22red%22%29&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [09:57:15] operations: wrong partitioning scheme for copper (500GB of swap) - https://phabricator.wikimedia.org/T100636#1317078 (fgiunchedi) NEW [09:57:35] <_joe_> jynus: the spike in 500s? probably [09:57:55] is that considered "normal"? 
[09:58:02] <_joe_> not really [09:58:12] that is what I thought [09:58:32] <_joe_> the spike in 503s is me pooling the HHVM imagescaler again [10:15:32] (PS11) Giuseppe Lavagetto: confd: create module [puppet] - https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) [10:15:46] (CR) Giuseppe Lavagetto: [C: 2] "Tested in labs" [puppet] - https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974) (owner: Giuseppe Lavagetto) [10:17:31] operations, Patch-For-Review, discovery-system, services-tooling: Create a debian package of python-etcd - https://phabricator.wikimedia.org/T99771#1317125 (fgiunchedi) Open>Resolved 0.3.3-1 uploaded to trusty and jessie [10:22:14] <_joe_> godog: thanks a lot [10:22:32] <_joe_> I'll probably need to do a new package soon-ish, though :) [10:28:36] PROBLEM - mysqld processes on pc1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [10:29:07] ^my fault [10:29:21] not enough maintenance time [10:37:05] _joe_: np [10:41:17] (CR) Filippo Giunchedi: [C: -1] add varnishstatsd (3 comments) [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh) [10:51:23] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1317208 (Billinghurst) Just to note that steward don't have accounts on the wiki, so we are limited in our ability to assist. [10:59:27] RECOVERY - mysqld processes on pc1003 is OK: PROCS OK: 1 process with command name mysqld [10:59:27] (CR) Filippo Giunchedi: [C: 1] "LGTM, tested in labs already I assume?" [puppet] - https://gerrit.wikimedia.org/r/213530 (https://phabricator.wikimedia.org/T99564) (owner: Mobrovac) [11:07:37] operations, Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317231 (fgiunchedi) I'd say mostly to avoid roundtrips to ldap? 
``` $ sudo nscd -g | grep -e 'cache:$' -e 'rate' -e 'number of cached' passwd cache: 1% cache hit rate 38 current... [11:12:05] TIL: edits to existing phab comments don't show up here [11:12:41] (PS10) KartikMistry: CX: Log to logstash [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [11:13:26] akosiaris: review, https://gerrit.wikimedia.org/r/#/c/213840/10 please :) [11:17:05] (CR) Alexandros Kosiaris: [C: -1] Basic role for Sentry (6 comments) [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles) [11:19:51] akosiaris: question re: https://phabricator.wikimedia.org/T82698 would it matter having VMs with public IPs as opposed to internal-only? (for mailman) [11:21:47] godog: it is possible to have VMs with a public IP. There's some minor configuration changes needed on the switches but it is possible [11:22:13] godog: it's in the plan anyway [11:23:47] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [11:24:28] akosiaris: cool, thanks, so that'd be an option alright [11:31:12] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1317246 (Glaisher) Stewards have 'createaccount' right globally so signing up with Special:CreateAccount will work, I think. [11:35:40] hi, someone with powers over labs? can you check if LDAP is working? there seem to be some issues [11:35:57] * petan pokes paravoid in the eye [11:50:54] <_joe_> petan: what is your specific issue? 
[11:51:39] <_joe_> I can log onto labs instances, so I have some doubts ldap is broken [11:51:59] <_joe_> maybe something consuming from it is broken instead [11:52:17] _joe_: most specific issue is that tools-master on tool labs doesn't work with error error: commlib error: got select error (Connection refused) ERROR: unable to send message to qmaster using port 6444 on host "tools-master": got send error [11:52:32] I don't know why is that but syslog on both nodes is full of some ldap errors [11:52:42] May 28 11:33:14 tools-bastion-01 nslcd[1221]: [34aa76] error writing to client: Broken pipe [11:52:46] <_joe_> this seems to be more of a toollabs error [11:53:00] <_joe_> petan: that is pretty common (the nslcd error) [11:53:11] maybe! but not sure... I tried rebooting the master server and it didn't fix it [11:53:20] it sounds like some networking issue though [11:53:43] <_joe_> petan: I've seen YuviPanda|brb and moritzm discuss about some gridengine master issue earlier [11:53:55] yes but none of them are online [11:53:57] <_joe_> let's see if one of them chimes is [11:54:00] <_joe_> *in [11:54:05] hashar: ping? [11:54:13] paravoid: good morning [11:54:13] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1317310 (Billinghurst) Not a major concern, that was more a note. Doesn't autocreate, presumably due to fishbowl, and at that point o... [11:54:17] <_joe_> yeah, I know zero-to-nothing about gridengine or toollabs, though [11:54:25] hashar: hi! [11:54:30] <_joe_> I don't even think I have access [11:54:30] no clue about grid engine ever if that is the reason for the ping :/ [11:54:36] hashar: could you explain your review https://gerrit.wikimedia.org/r/#/c/196175/ more verbosely? 
[11:55:01] _joe_: that's the magic of ops; if you don't - you can give it yourself ;) [11:55:23] (PS3) Faidon Liambotis: gerrit: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196176 (https://phabricator.wikimedia.org/T92475) [11:55:25] (PS3) Faidon Liambotis: mha: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196177 (https://phabricator.wikimedia.org/T92475) [11:55:27] (PS3) Faidon Liambotis: datasets: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196178 (https://phabricator.wikimedia.org/T92475) [11:55:29] (PS3) Faidon Liambotis: logging: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196179 (https://phabricator.wikimedia.org/T92475) [11:55:31] (PS2) Faidon Liambotis: jenkins: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196175 (https://phabricator.wikimedia.org/T92475) [11:55:33] (PS5) Faidon Liambotis: admin: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) [11:55:40] <_joe_> JohnFLewis: yeah for sure, thanks for reminding me :) I was just stating how little I know about toollabs in practice [11:55:54] paravoid: the puppet compilation had some error apparently so I just copy pasted the error. [11:56:02] Okay :) [11:56:05] <_joe_> I know how the big picture works, but never operated into gridengine [11:56:15] hashar: "The reference (production) shows:" -- what does that mean? [11:56:43] paravoid: ah that is the run of the compiler against the tip of the production branch [11:56:44] (CR) Faidon Liambotis: [C: 2] mha: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196177 (https://phabricator.wikimedia.org/T92475) (owner: Faidon Liambotis) [11:56:54] let me run it again :) [11:56:57] _joe_: tbf Yuvi says 'the easiest fix for SGE is not to use it' so.. 
[11:57:10] (CR) Faidon Liambotis: [C: 2] datasets: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196178 (https://phabricator.wikimedia.org/T92475) (owner: Faidon Liambotis) [11:57:30] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/773/ [11:57:36] (CR) Faidon Liambotis: [C: 2] logging: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196179 (https://phabricator.wikimedia.org/T92475) (owner: Faidon Liambotis) [11:59:01] [ 05/28/2015 11:58:02 ] ERROR: Error in compilation on node lanthanum.eqiad.wmnet [11:59:04] [ 05/28/2015 11:58:02 ] ERROR: Exception was: Command '['/opt/wmf/software/compare-puppet-catalogs/shell/compile', '3', 'lanthanum.eqiad.wmnet', '/opt/wmf/software/compare-puppet-catalogs/output/773/change/196175/compiled', 'production']' returned non-zero exit status 30 [11:59:09] broken again [11:59:20] what a surprise :P [11:59:47] http://puppet-compiler.wmflabs.org/773/change/196175/compiled/puppet_catalogs_3_production/gallium.wikimedia.org.warnings [12:00:01] Error: Must pass enable to Class[Base::Remote_syslog] at modules/base/manifests/init.pp:63 on node gallium.wikimedia.org [12:00:04] <_joe_> yep the compiler is broken :( [12:00:20] <_joe_> I haven't had the time to fix it. [12:00:45] I don't see anything wrong with the jenkins change, I'm going to merge it and test it on prod [12:01:03] (CR) Faidon Liambotis: [C: 2] jenkins: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196175 (https://phabricator.wikimedia.org/T92475) (owner: Faidon Liambotis) [12:01:31] and of course it's broken :P [12:01:35] different error message, though [12:01:55] petan: I didn't look into this further, yuvi wanted to wait until coren appears, he knows the setup best [12:01:58] oh wait, it's not, probably something intermittent [12:02:51] anybody able to kick https://tools.wmflabs.org/copyvios/ back into life ? 
[12:03:06] how broken is toollabs right now? [12:03:20] paravoid: defunct [12:03:20] both coren and yuvi are in a european timezone, time to call them [12:03:52] (PS2) Alexandros Kosiaris: Introduce etherpad1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/214130 [12:03:57] paravoid: oh wait, it's not totally defunct, it's impossible to start new jobs and maintain current ones, even stop jobs and so on [12:04:13] paravoid: jobs that were started before outage and are still running are probably OK but there is no way to check [12:05:01] (CR) Alexandros Kosiaris: [C: -1] CX: Log to logstash (2 comments) [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: KartikMistry) [12:07:53] (CR) Faidon Liambotis: [C: 2] gerrit: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/196176 (https://phabricator.wikimedia.org/T92475) (owner: Faidon Liambotis) [12:12:40] petan: turn off nscd on the host? [12:12:53] I'm at a clinic about 30mins away from my laptop [12:13:12] petan: we had turned off puppet and nscd and I guess your restart brought nscd back to life [12:13:26] but it was broken even before restart [12:13:44] Yes I know [12:13:44] I checked syslog before I rebooted and errors from nscd were there [12:14:08] The basic answer is that me, valhallasw nor black had any idea wtf was wrong [12:14:16] We spent a while at it yesterday night [12:14:35] It was failing about 1 of 500 requests which was ok until coren came online [12:15:09] (PS11) KartikMistry: CX: Log to logstash [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [12:15:20] petan: and restart master again? [12:16:02] hmm I turned off nscd, it didn't help the error is still same. You think I should restart it once more? [12:16:28] Did you restart the master as well? 
[12:16:35] The service I mean [12:16:38] nope [12:16:47] (PS6) Faidon Liambotis: ssh: remove .ssh/authorized_keys support from prod [puppet] - https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) [12:16:50] Do it? [12:17:00] done [12:17:14] Any luck now? [12:17:25] now I got different error: [12:17:27] error: commlib error: access denied (server host resolves rdata host "tools-bastion-02.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)") [12:17:28] error: unable to contact qmaster using port 6444 on host "tools-master" [12:17:37] before: [12:17:47] error: commlib error: got select error (Connection refused) [12:17:48] error: unable to send message to qmaster using port 6444 on host "tools-master": got send error [12:17:54] Right that was the error I was getting intermittently earlier. [12:17:58] The new one [12:18:14] should I restart something on bastion as well? [12:18:18] (I should be back at my laptop in a while) [12:18:23] Hmm you shouldn't need to no [12:18:53] Can you put an entry for the bastion on tools-master /etc/hosts [12:21:09] yes I can but I believe it's not needed because I can ping it [12:21:34] petrb@tools-master:~$ ping tools-bastion-02 [12:21:35] PING tools-bastion-02.eqiad.wmflabs (10.68.16.44) 56(84) bytes of data << [12:21:46] Basically sge disagrees and we are kind of bumbling around now [12:23:08] Ideally - strace the master process and figure out wtf it is trying to so [12:23:09] Do [12:23:26] (Going to be a bit before I can be at a laptop) [12:23:42] I did it and it's still same [12:23:56] /etc/hosts I mean [12:24:19] Ok [12:24:26] I'm going to focus on getting to my laptop now [12:24:30] Brb [12:35:25] (CR) Alexandros Kosiaris: [C: 2] Introduce etherpad1001.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/214250 (owner: Alexandros Kosiaris) [12:35:47] PROBLEM - puppet last run on db2069 is CRITICAL puppet fail [12:36:15] (CR) Alexandros Kosiaris: [C: 2] Introduce etherpad1001.eqiad.wmnet 
[dns] - https://gerrit.wikimedia.org/r/214130 (owner: Alexandros Kosiaris) [12:36:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [12:48:55] akosiaris: moving services to ganeti? Awesome :) [12:49:15] (PS1) Jcrespo: Repool db1003 after upgrade and maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/214330 [12:51:27] RECOVERY - puppet last run on db2069 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:51:28] JohnFLewis: that's the plan. Let's see how it goes though [12:52:01] (CR) Jcrespo: [C: 2] Repool db1003 after upgrade and maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/214330 (owner: Jcrespo) [12:52:36] akosiaris: if that doesn't go well, shoving static-bugzilla.wikimedia.org to a VM may be an option as that's just static files but I don't know [12:53:12] JohnFLewis: I assume you mean it goes well, no ? [12:53:27] akosiaris: bah yes [12:53:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [12:54:03] JohnFLewis: yeah, it sure makes sense. [12:55:02] I'd assume moving zirconium to VMs would be the first milestone as that seems to be where everything misc is shoved rn [12:55:20] you assume correctly [12:55:28] etherpad is on zirconium anyway [12:55:49] and planet is next [12:55:56] which is on zirconium as well [12:57:04] May be best to leave Bugzilla on zirconium as the hassle likely won't be worth it considering its purpose is very short lived and will be killed hopefully in a month or so (just my thought anyway) [12:57:09] more 5XX after a merge but not deploy mmmm [13:03:42] operations, database: Document x1 DB requirements for new wikis - https://phabricator.wikimedia.org/T100527#1317414 (Krenair) Okay. 
Let's work out what needs to be added to the script then [13:06:01] (PS6) Faidon Liambotis: admin: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) [13:06:08] (CR) Faidon Liambotis: [C: 2] admin: transition to ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) (owner: Faidon Liambotis) [13:11:17] PROBLEM - puppet last run on mc2014 is CRITICAL puppet fail [13:11:17] PROBLEM - puppet last run on copper is CRITICAL puppet fail [13:11:24] and there we go [13:11:26] PROBLEM - puppet last run on mw1247 is CRITICAL puppet fail [13:11:27] PROBLEM - puppet last run on analytics1013 is CRITICAL puppet fail [13:11:27] PROBLEM - puppet last run on elastic1019 is CRITICAL puppet fail [13:11:37] PROBLEM - puppet last run on analytics1002 is CRITICAL puppet fail [13:11:37] PROBLEM - puppet last run on mw1151 is CRITICAL puppet fail [13:11:37] PROBLEM - puppet last run on db2057 is CRITICAL puppet fail [13:11:37] PROBLEM - puppet last run on mw1079 is CRITICAL puppet fail [13:11:53] !log killed ircecho service on neon [13:11:59] Logged the message, Master [13:12:17] (PS1) Faidon Liambotis: ssh: remove fail()s from ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/214332 [13:13:00] (CR) Faidon Liambotis: [C: 2] ssh: remove fail()s from ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/214332 (owner: Faidon Liambotis) [13:13:18] akosiaris: thx [13:14:29] paravoid: np. why was though content empty ? [13:14:40] content => join($ssh_keys, "\n"), [13:14:46] $ssh_keys was empty as well ? 
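A minimal sketch of the question just asked, written in Python rather than Puppet: joining an empty key list yields an empty string, so a file whose content comes from `join($ssh_keys, "\n")` would be rendered empty. The helper names and the present/absent rule below are illustrative assumptions, not the logic of the actual manifest.

```python
def render_authorized_keys(ssh_keys):
    """Mimic join($ssh_keys, "\n"): one key per line, no trailing newline."""
    return "\n".join(ssh_keys)


def key_file_state(ssh_keys):
    """Hypothetical rule: manage the key file only when there are keys."""
    return {
        # An empty list is falsy, so users without keys get "absent".
        "ensure": "present" if ssh_keys else "absent",
        "content": render_authorized_keys(ssh_keys),
    }


if __name__ == "__main__":
    print(key_file_state([]))
    print(key_file_state(["ssh-rsa AAAA alice", "ssh-rsa BBBB alice-laptop"]))
```

With an empty list the rendered content is the empty string, which is the case the question is probing.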
[13:15:21] operations, Architecture, MediaWiki-RfCs, RESTBase, and 5 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1317468 (BBlack) [13:17:36] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1317474 (Krenair) No, stewards can't create themselves accounts on fishbowls - they can't even log in. These sites aren't running Cent... [13:18:55] (PS1) Faidon Liambotis: admin: absent ssh keys & sudoers when empty [puppet] - https://gerrit.wikimedia.org/r/214333 [13:18:58] akosiaris: ^ :) [13:19:36] (CR) jenkins-bot: [V: -1] admin: absent ssh keys & sudoers when empty [puppet] - https://gerrit.wikimedia.org/r/214333 (owner: Faidon Liambotis) [13:21:39] (PS2) Faidon Liambotis: admin: absent ssh keys & sudoers when empty [puppet] - https://gerrit.wikimedia.org/r/214333 [13:25:45] !log jynus Synchronized wmf-config/db-eqiad.php: repool pc1003 (not to confuse with db1003) after warmup (duration: 00m 15s) [13:25:48] Logged the message, Master [13:29:59] paravoid: running it via the compiler just to be 100% sure, but it LGTM [13:30:16] puppet failed on host [13:30:23] which host? 
[13:30:30] paravoid: acamar [13:30:40] (PS1) BBlack: ipv6 token stuff: add flush on first set, update comments [puppet] - https://gerrit.wikimedia.org/r/214334 [13:30:42] (PS1) BBlack: ipv6 token stuff: re-enable cp1008 testing [puppet] - https://gerrit.wikimedia.org/r/214335 [13:30:51] paravoid: nope [13:30:54] acamar.wikimedia.org Failed to parse template sudo/sudoers.erb: [13:30:54] Filepath: /tmp/catalog-differ/akosiaris/214333/change/src/modules/sudo/templates/sudoers.erb [13:30:54] Line: 3 [13:30:54] Detail: undefined method `each' for nil:NilClass [13:30:55] at /tmp/catalog-differ/akosiaris/214333/change/src/modules/sudo/manifests/user.pp:40 on node acamar.wikimedia.org [13:31:14] template calls each on a nil value [13:31:20] now it works [13:31:22] akosiaris: while you’re at it, could you run 213543 through the compiler again? Hosts virt1000.wikimedia.org, labnet1001, labvirt1001 [13:31:26] paravoid, pc1003 [13:31:27] (CR) BBlack: [C: 2] ipv6 token stuff: add flush on first set, update comments [puppet] - https://gerrit.wikimedia.org/r/214334 (owner: BBlack) [13:31:33] Mjbmr: Re your SWAT patches, I'm not seeing any revision f27905e99d395f5ca0f6f8762ed777ee62750525 in ULS. Also, it's usually preferable to cherry-pick to the deployment branches, it looks like you're trying to update to master. [13:32:00] jynus: btw, I’m interested in moving designate dbs but I think it will have to wait until tomorrow or Monday, I have a scheduled maintenance window today that I’m going to need all of. [13:32:28] andrewbogott, not in a hurry, just let you know [13:32:39] andrewbogott: yeah, doing it now [13:32:44] thanks! [13:33:03] akosiaris: I'm confused about that error [13:33:05] btw, akosiaris, I note that our puppet masters are still running on precise, do you know if there’s a fundamental issue with Trusty or if they just haven’t been upgraded yet? 
[13:33:16] andrewbogott: yes there is [13:33:20] lemme find the ticket [13:33:27] uh-oh [13:34:58] andrewbogott: https://phabricator.wikimedia.org/T98129 [13:35:05] andrewbogott, homework for tomorrow also, I will ask you which data can be regenerated and which has to have recovery [13:35:16] so, some of our code reportedly has some ruby 1.8 stuff [13:35:32] and trusty has ruby 1.9 so parts of the source won't compile [13:35:58] akosiaris: ‘our code’ meaning e.g. facts? [13:36:15] andrewbogott: and ERBs [13:36:22] and puppet parser functions [13:36:31] and whatever else is pure ruby and not puppet [13:36:51] the good thing is that we can pinpoint them one by one [13:36:58] lengthy process but doable [13:37:00] crap. One more reason why i can’t upgrade virt1000 :( [13:37:08] catalog compiler to the rescue [13:37:28] (CR) BBlack: [C: 2] ipv6 token stuff: re-enable cp1008 testing [puppet] - https://gerrit.wikimedia.org/r/214335 (owner: BBlack) [13:37:29] except the catalog compiler is broken :) [13:37:38] I wonder how lengthy it can be, esp. for labs :) [13:37:59] * andrewbogott ’s stack overflows [13:38:06] :P [13:38:07] well, last time it was worse... at some point we had problem with unicode [13:38:17] but mutante fixed most of those for CI IIRC [13:39:02] operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1317532 (Andrew) [13:39:48] operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1260150 (Andrew) [13:39:50] operations: puppet-compiler has strange problems with some facts and/or hiera - https://phabricator.wikimedia.org/T96802#1317535 (Andrew) [13:41:33] akosiaris: you’re aware of ^, right? Is that something I should pester _joe_ about or do you know how to fix it? [13:42:25] andrewbogott: I am no really aware unfortunately... 
perhaps I can help though [13:42:32] s/no/not/ [13:43:34] akosiaris: thanks. The sum total of my current knowledge is “the catalog compiler is broken.” You built it originally, right? [13:43:50] andrewbogott: labvirt1001.eqiad.wmnet No matching value for selector param '(undefined)' at /tmp/catalog-differ/akosiaris/213543/production/src/manifests/role/nova.pp:14 on node labvirt1001.eqiad.wmnet [13:43:50] Could not find data item labs_keystone_host in any Hiera data file and no default supplied at /tmp/catalog-differ/akosiaris/213543/change/src/manifests/role/keystone.pp:26 on node labvirt1001.eqiad.wmnet [13:43:57] andrewbogott: no [13:44:03] I built a catalog compiler [13:44:11] then _joe_ built another one [13:44:17] that's the one you have there [13:44:21] Ah, ok. [13:44:36] funnily, the catalog compiler I built is not broken right now [13:44:51] anyway, labvirt1001 is not compiling [13:45:18] hm, that error suggests that hiera does not work the way I thought it worked. [13:45:28] (PS1) Yuvipanda: tools: Include labsdb aliases only in exec hosts [puppet] - https://gerrit.wikimedia.org/r/214338 (https://phabricator.wikimedia.org/T100554) [13:45:35] that value is defined in eqiad.yaml which I thought was applied everywhere in eqiad [13:45:48] labnet1001 is a noop [13:45:54] great [13:45:59] virt1000 does have some changes though [13:46:08] -auth_uri = http://virt1000.wikimedia.org:5000 [13:46:08] +auth_uri = virt1000.wikimedia.org:5000 [13:46:16] this does not look good, does it ? [13:46:27] huh, I saw that before, thought I fixed it. [13:46:49] !log upgrading Jenkins git plugin from 1.4.6+wmf1 to 1.7.1 {{bug|T100655}} and restarting Jenkins [13:46:57] Logged the message, Master [13:47:29] paravoid: any luck with https://gerrit.wikimedia.org/r/#/c/214333/2/modules/admin/manifests/user.pp,cm ? [13:47:41] labvirt1001 has its own definition in hieradata/hosts. Does that mean it does not use eqiad.yaml? 
I thought that hiera was… hierarchical [13:47:44] luck with what? [13:48:00] paravoid: it's erroring out [13:50:38] !log started ircecho (icinga-wm) on neon [13:50:41] Logged the message, Master [13:50:49] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet] - https://gerrit.wikimedia.org/r/163814 (owner: Hashar) [13:50:54] (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet] - https://gerrit.wikimedia.org/r/163814 [13:51:54] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet] - https://gerrit.wikimedia.org/r/163814 (owner: Hashar) [13:52:04] (CR) Yuvipanda: [C: 2] tools: Include labsdb aliases only in exec hosts [puppet] - https://gerrit.wikimedia.org/r/214338 (https://phabricator.wikimedia.org/T100554) (owner: Yuvipanda) [13:52:54] Restored? :) [13:53:14] I guess I never went back to see if there was such an option [13:53:37] bblack: yeah in Gerrit you can restore an abandoned patch :-D [13:54:01] akosiaris: ok, I found the source of that diff. I still don’t understand about the hiera error on labvirt1001 though. Am i mistaken that it should pull in eqiad.yaml? [13:55:41] andrewbogott: utils/hiera_lookup [13:55:52] good idea [13:55:55] it should help debug the issue. I am also looking at the change [13:56:11] you can also pass --debug [13:56:14] thx [13:56:22] it will tell you where it's looking for the value [13:56:58] (PS3) Faidon Liambotis: admin: absent ssh keys & sudoers when empty [puppet] - https://gerrit.wikimedia.org/r/214333 [13:57:01] akosiaris: ^ [14:00:29] andrewbogott: are you doing anything with the labs puppetmaster atm? [14:00:42] paravoid: compiles cleanly now. A lot of new resources of the type file[/etc/sudoers.d/] as expected [14:00:43] nope [14:01:00] akosiaris: that's not expected [14:01:11] oh [14:01:17] I guess it is [14:01:22] ;-) [14:01:29] or else we'd be toast ... 
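A rough sketch of how a hierarchical lookup is generally expected to behave, assuming a first-match-wins search from the most specific level to the most general one. The level names, keys, and values below are hypothetical and are not taken from the real hieradata; only the key name `labs_keystone_host` comes from the error quoted in the log.

```python
def hiera_lookup(key, hierarchy, default=None):
    """Return (value, level) from the first level that defines key."""
    for level, data in hierarchy:
        if key in data:
            return data[key], level
    # No level defines the key: fall back to the caller's default.
    return default, None


# Hypothetical data: a host-level file that defines only its own settings,
# followed by a site-level file and a global fallback.
hierarchy = [
    ("hosts/labvirt1001", {"example_host_setting": "host-only"}),
    ("eqiad", {"labs_keystone_host": "keystone.example.org"}),
    ("common", {"labs_keystone_host": "fallback.example.org"}),
]

if __name__ == "__main__":
    print(hiera_lookup("labs_keystone_host", hierarchy))
    print(hiera_lookup("not_defined_anywhere", hierarchy))
```

Under this model a host-level file does not hide the site-level file; a key missing at the host level falls through to the next level, which is the behaviour the question above assumes.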
[14:01:50] hm, hiera_lookup just tells me that everything is nil [14:01:58] hm, maybe that's not a great idea, that'd be slow [14:02:19] maybe https://gerrit.wikimedia.org/r/#/c/180513/ is better after all [14:03:44] oh my mistake [14:05:00] (PS1) Yuvipanda: tools: Remove exec_environ from gridengine master [puppet] - https://gerrit.wikimedia.org/r/214339 [14:05:15] (CR) Yuvipanda: [C: 2 V: 2] tools: Remove exec_environ from gridengine master [puppet] - https://gerrit.wikimedia.org/r/214339 (owner: Yuvipanda) [14:06:00] (PS5) Andrew Bogott: Use the new service names for labs puppetmasters. [puppet] - https://gerrit.wikimedia.org/r/214105 [14:06:02] (PS10) Andrew Bogott: Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - https://gerrit.wikimedia.org/r/213543 [14:07:43] akosiaris: I don’t see the issue with hiera_lookup; that variable resolves just fine. That means it’s time for me to eat breakfast — back in 15 [14:09:23] (PS3) BBlack: add_ip6_mapped: enable token-based SLAAC for all jessie/trusty [puppet] - https://gerrit.wikimedia.org/r/202725 (https://phabricator.wikimedia.org/T94417) [14:10:53] (Abandoned) BBlack: add_ip6_mapped: middle approach [puppet] - https://gerrit.wikimedia.org/r/203069 (owner: BBlack) [14:11:00] (Abandoned) BBlack: add_ip6_mapped: enable token-based SLAAC for all jessie/trusty [puppet] - https://gerrit.wikimedia.org/r/202725 (https://phabricator.wikimedia.org/T94417) (owner: BBlack) [14:11:42] akosiaris: http://bugs.debian.org/786460 [14:13:58] (PS1) BBlack: Deploy v6 token approach to some jessie/trusty hosts [puppet] - https://gerrit.wikimedia.org/r/214341 [14:14:49] paravoid: yay! [14:15:05] akosiaris: might want to get involved in there [14:15:15] https://packages.debian.org/search?keywords=kafka is interesting too [14:15:28] vincent bernat and I are comaintaining both librdkafka and kafkacat now [14:15:29] Gradle is broken too ? sigh... 
[14:15:43] and someone has uploaded python-kafka, plus rsyslog has gotten kafka support(!) [14:16:02] yeah, that last one caught my eye as well [14:16:30] petan: everything seems ok again [14:16:31] Magnus told me that syslog-ng has gotten kafka support upstream as well [14:16:45] NotASpy: what was the link you wanted restarting? it should have automatically been restarted now [14:17:22] paravoid: now all we want is someone to rewrite kafka in a saner language [14:17:34] that's librdkafka :) [14:17:34] like C [14:17:37] bblack: ? :P [14:17:39] !log deploying https://gerrit.wikimedia.org/r/214341 - keep in mind if ipv6-related issues arise! [14:17:43] Logged the message, Master [14:18:11] paravoid: I meant the daemon [14:18:21] operations, Collaboration-Team-Sprint-B-2015-05-20, database: Document x1 DB requirements for new wikis - https://phabricator.wikimedia.org/T100527#1317643 (Krenair) After speaking to Matt it seems like this is because we used the Echo extension and ran something like `mwscript createExtensionTables.ph... [14:19:07] oh the mw* servers don't even have add_ip6_mapped, so not as risky as I thought [14:19:28] (CR) John F. 
Lewis: Deploy v6 token approach to some jessie/trusty hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/214341 (owner: 10BBlack) [14:19:34] 6operations, 10Collaboration-Team-Sprint-B-2015-05-20, 7database: Document/fix scripts for echo x1 DB requirements - https://phabricator.wikimedia.org/T100527#1317648 (10Krenair) [14:19:40] bblack: ^ sorry, had to :) [14:20:00] lol [14:20:10] (03CR) 10BBlack: [C: 032] Deploy v6 token approach to some jessie/trusty hosts [puppet] - 10https://gerrit.wikimedia.org/r/214341 (owner: 10BBlack) [14:23:39] (03CR) 10Ottomata: "Two qs:" [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh) [14:25:16] (03PS1) 10Faidon Liambotis: admin: clean up removed/revoked SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/214343 [14:25:25] akosiaris: I'll go with this one for now [14:25:59] 6operations, 7Monitoring, 5Patch-For-Review: Job queue stats are broken - https://phabricator.wikimedia.org/T87594#1317655 (10fgiunchedi) 5Open>3Resolved jobq dashboard and alarms are back, resolving [14:26:09] (03PS2) 10Faidon Liambotis: admin: clean up removed/revoked SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/214343 [14:26:22] (03CR) 10Faidon Liambotis: [C: 032] admin: clean up removed/revoked SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/214343 (owner: 10Faidon Liambotis) [14:30:47] (03PS1) 10BBlack: deploy v6 token approach to all jessie/trusty [puppet] - 10https://gerrit.wikimedia.org/r/214345 [14:31:46] (03PS1) 10Faidon Liambotis: sudo: fix sudo::user/group's ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/214346 [14:32:28] (03CR) 10Faidon Liambotis: [C: 032] sudo: fix sudo::user/group's ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/214346 (owner: 10Faidon Liambotis) [14:33:40] (03PS4) 10Faidon Liambotis: admin: absent ssh keys & sudoers when empty [puppet] - 10https://gerrit.wikimedia.org/r/214333 [14:34:51] (03Abandoned) 10Faidon Liambotis: admin: absent ssh keys & sudoers 
when empty [puppet] - 10https://gerrit.wikimedia.org/r/214333 (owner: 10Faidon Liambotis) [14:35:30] ottomata: hey [14:35:38] ottomata: 17:11 <@paravoid> akosiaris: http://bugs.debian.org/786460 [14:36:14] akosiaris: I’ve uploaded a new version of 213543, can you retest? I’m ignoring that hiera error for now, blaming it on the compiler [14:36:29] (03PS4) 10Faidon Liambotis: ssh: fix completely broken host key collection [puppet] - 10https://gerrit.wikimedia.org/r/210926 [14:36:34] andrewbogott: are you blaming my compiler ? [14:36:42] akosiaris: um… maybe? [14:36:50] I welcome an alternate explanation [14:36:54] andrewbogott: you must be blaming my compiler, I dont see anyone else in the room [14:37:06] well that's not true but I needed a taxidriver quote [14:37:37] * andrewbogott doesn’t have enough hair for a mohawk [14:37:41] !log powering down dataset1001 to add disk array [14:37:48] Logged the message, Master [14:37:58] oh hm! [14:38:02] retesting it now [14:38:04] bblack: may I merge a potentially disrupting change to the puppetmaster? [14:38:15] disruptive* [14:38:39] sure why not [14:38:54] well if you needed to emergency revert that ipv6 change :) [14:39:01] so far so good :) [14:39:25] (03CR) 10Faidon Liambotis: [C: 032] "Let's try it and see :) I'll monitor the extra load." [puppet] - 10https://gerrit.wikimedia.org/r/210926 (owner: 10Faidon Liambotis) [14:39:38] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [14:39:46] lol [14:39:54] YuviPanda: it was https://tools.wmflabs.org/copyvios/ and yes, it's back. Many thanks :-) [14:40:08] someone working on dataset1001? [14:40:15] bblack: yes, cmjohnson1 just logged this [14:40:23] oh I see, I'm slow [14:40:40] yep..i forgot to add icinga alert [14:41:12] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'reason not specified'); [14:41:15] for rhodium [14:41:17] anyone knows what's up with that? 
[14:41:41] rhodium's the new 3rd puppetmaster, not yet in use? [14:41:49] reenabled [14:41:51] yes [14:42:27] RECOVERY - puppet last run on rhodium is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:42:36] (PS1) ArielGlenn: salt: small increase of workers on the master [puppet] - https://gerrit.wikimedia.org/r/214348 [14:42:37] I've salt rm'ed /home/faidon/.ssh/authorized_keys to dogfood my userkey change [14:42:47] so anything that hasn't run puppet today, won't allow me to login [14:43:27] PROBLEM - puppet last run on cp3013 is CRITICAL puppet fail [14:43:27] PROBLEM - puppet last run on mw2214 is CRITICAL puppet fail [14:43:37] PROBLEM - puppet last run on mw1218 is CRITICAL puppet fail [14:43:37] PROBLEM - puppet last run on mw1109 is CRITICAL puppet fail [14:43:38] works for me [14:43:46] PROBLEM - puppet last run on db2035 is CRITICAL puppet fail [14:43:47] PROBLEM - puppet last run on mw2197 is CRITICAL puppet fail [14:43:47] PROBLEM - puppet last run on mw1040 is CRITICAL puppet fail [14:43:52] puppet fail ? [14:43:58] PROBLEM - puppet last run on elastic1026 is CRITICAL puppet fail [14:44:05] looking [14:44:06] PROBLEM - puppet last run on mw1124 is CRITICAL puppet fail [14:44:07] PROBLEM - puppet last run on cp1043 is CRITICAL puppet fail [14:44:08] PROBLEM - puppet last run on ganeti2005 is CRITICAL puppet fail [14:44:15] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to execute generator /usr/local/bin/sshknowngen: Execution of '/usr/local/bin/sshknowngen' returned 1: at /etc/puppet/modules/ssh/manifests/client.pp:7 on node cp3013.esams.wmnet [14:44:18] PROBLEM - puppet last run on mw2102 is CRITICAL puppet fail [14:44:25] that's from logs though, right?
[14:44:27] PROBLEM - puppet last run on wtp1024 is CRITICAL puppet fail [14:44:27] PROBLEM - puppet last run on mw1130 is CRITICAL puppet fail [14:44:33] there was a tiny time window where that would fail [14:44:43] yes from logs [14:44:44] between change merged -> puppet running on puppetmasters [14:44:45] NotASpy: yw [14:44:47] PROBLEM - puppet last run on wtp1021 is CRITICAL puppet fail [14:44:51] trying manual [14:44:54] which I manually triggered, hence the rhodium question above :) [14:44:56] PROBLEM - puppet last run on analytics1036 is CRITICAL puppet fail [14:44:56] PROBLEM - puppet last run on elastic1028 is CRITICAL puppet fail [14:45:06] PROBLEM - puppet last run on mw1216 is CRITICAL puppet fail [14:45:17] yeah manually works on at least one of the hosts above [14:45:17] PROBLEM - puppet last run on mw2041 is CRITICAL puppet fail [14:45:33] paravoid: labs instances say “Failed to execute generator /usr/local/bin/sshknowngen: Execution of '/usr/local/bin/sshknowngen' returned 1: at /etc/puppet/modules/ssh/manifests/client.pp:7 on node i-00000bc9.eqiad.wmflabs” [14:45:43] andrewbogott: run puppet on the puppetmaster [14:45:47] ok [14:45:51] (03CR) 10ArielGlenn: [C: 032] salt: small increase of workers on the master [puppet] - 10https://gerrit.wikimedia.org/r/214348 (owner: 10ArielGlenn) [14:46:05] we could also two-stage changes like this, but then reverts/updates get hard to track too [14:46:13] yeah [14:46:20] andrewbogott: https://phabricator.wikimedia.org/P698 [14:46:25] that's why I didn't want to do it [14:46:35] in hindsight, should had just scp'ed sshknowngen :) [14:46:36] RECOVERY - puppet last run on analytics1036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:06] RECOVERY - puppet last run on mw1109 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:09] so, hosts that have not been running puppet [14:47:11] labcontrol1001.wikimedia.org, labvirt1005.eqiad.wmnet, 
labstore1002.eqiad.wmnet, berkelium.eqiad.wmnet, curium.eqiad.wmnet, graphite1002.eqiad.wmnet [14:47:20] godog: graphite1002 hasn't run puppet for ~1 month now, can you fix ASAP? [14:47:34] YuviPanda/andrewbogott: for those lab hosts [14:47:41] * YuviPanda looks [14:47:43] and jgage for berkelium/curium [14:47:44] the two without numbers are jgage ipsec tests [14:47:49] yup [14:47:51] let me do labvirt1005 [14:47:54] labcontrol1001 I’m working on right now, but I can enable for the moment. [14:47:55] wait, is it even up [14:48:02] I'm about to apply a change to the salt master on palladium, this will mean a minute of master unresponsiveness [14:48:03] andrewbogott: nah, it's fine [14:48:04] labvirt1005 is not up, that file system issue [14:48:06] yeah [14:48:07] that [14:48:28] labstore1002 is also not up, hopefully - it's connected to the shelves and is supposed to be powered off afaik [14:48:37] RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:55] paravoid: after a refresh on the puppetmaster, labs instances look like this now: https://dpaste.de/NajP [14:49:15] oops [14:49:20] no exported resources on labs? [14:49:23] nope [14:49:23] I guess so [14:50:42] akosiaris: thanks. I guess I missed another one [14:50:56] thcipriani, marktraceur: Who wants to SWAT this morning? So far it's all config patches. [14:51:09] * anomie would rather not, still catching up from the Hackathon [14:51:30] anomie: can swat [14:51:35] thcipriani: Ok! 
[14:51:41] akosiaris: can you merge, https://gerrit.wikimedia.org/r/#/c/214071 [14:51:49] or godog ^^ [14:51:59] (03CR) 10Tim Landscheidt: "On a Precise instance, "dpkg-query -L libmysql-ruby" and "apt-cache showpkg libmysql-ruby" suggest that libmysql-ruby is just an "alias" f" [puppet] - 10https://gerrit.wikimedia.org/r/214290 (owner: 10Andrew Bogott) [14:52:13] Dereckson: Mjbmr ping for SWAT in 8 minutes [14:52:23] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Add languages for deployment on 20150528 [puppet] - 10https://gerrit.wikimedia.org/r/214071 (https://phabricator.wikimedia.org/T99535) (owner: 10KartikMistry) [14:52:33] (03PS1) 10Filippo Giunchedi: graphite1002: include only standard [puppet] - 10https://gerrit.wikimedia.org/r/214352 (https://phabricator.wikimedia.org/T88994) [14:52:45] paravoid: sure, fixed in https://gerrit.wikimedia.org/r/#/c/214352/ [14:52:51] akosiaris: thanks! [14:52:59] thcipriani: no ping to me? :/ [14:53:01] (03PS2) 10Filippo Giunchedi: graphite1002: include only standard [puppet] - 10https://gerrit.wikimedia.org/r/214352 (https://phabricator.wikimedia.org/T88994) [14:53:01] :) [14:53:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite1002: include only standard [puppet] - 10https://gerrit.wikimedia.org/r/214352 (https://phabricator.wikimedia.org/T88994) (owner: 10Filippo Giunchedi) [14:53:36] kart_: assumed you'd be hanging around for another few minutes :) [14:53:47] (03PS1) 10Faidon Liambotis: ssh: guard ssh_known_hosts generation behind $::realm [puppet] - 10https://gerrit.wikimedia.org/r/214353 [14:53:58] andrewbogott: ^ [14:54:07] (03CR) 10Faidon Liambotis: [C: 032] ssh: guard ssh_known_hosts generation behind $::realm [puppet] - 10https://gerrit.wikimedia.org/r/214353 (owner: 10Faidon Liambotis) [14:54:15] thcipriani: yep [14:54:33] (03CR) 10Andrew Bogott: [C: 031] ssh: guard ssh_known_hosts generation behind $::realm [puppet] - 10https://gerrit.wikimedia.org/r/214353 (owner: 10Faidon Liambotis) [14:54:50] godog: thanks 
:) [14:55:30] why isn't rhodium in prod yet? [14:56:53] paravoid: all better, thank you! [14:57:02] paravoid: np, it's been too long alright [14:57:16] (03CR) 10Chad: "This implies root access to the box...is that what you want? Or do you just want admin access to Gerrit itself?" [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: 10Hashar) [14:57:26] yeah, and the ssh-userkey changes will break stuff for sure [14:57:31] eventually :) [14:58:26] akosiaris: why isn't rhodium in prod yet? [14:58:26] RECOVERY - puppet last run on graphite1002 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:58:39] (03CR) 10Hashar: "Just admin access via the group 'gerrit-admin'. Seems root is granted via 'gerrit-root'." [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: 10Hashar) [14:58:39] things calm now? can I go break ipv6? [14:58:40] paravoid: unpuppetized bits here and there [14:58:55] godog: btw, diamond fails to start on g1002 [14:59:00] bblack: yes :) [14:59:06] (03PS2) 10BBlack: deploy v6 token approach to all jessie/trusty [puppet] - 10https://gerrit.wikimedia.org/r/214345 [14:59:14] paravoid: like /srv/private /usr/share/ruby/rack, puppetmaster-passenger failing on install due to some race I am investigating [14:59:17] (03CR) 10BBlack: [C: 032 V: 032] deploy v6 token approach to all jessie/trusty [puppet] - 10https://gerrit.wikimedia.org/r/214345 (owner: 10BBlack) [14:59:31] seems like last time I solved most of these manually [14:59:34] paravoid: uhuh I'll take a look, standard jessie box so I'm surprised [14:59:39] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=palladium.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Miscellaneous+eqiad [14:59:43] that's interesting [14:59:45] I wonder what that is [15:00:00] !log merged up https://gerrit.wikimedia.org/r/214345 - look here if IPv6 problems! 
[15:00:03] (CR) Hashar: Add all Release-Engineering team as Gerrit admins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: Hashar) [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, Dereckson, Mjbmr: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150528T1500). Please do the needful. [15:00:05] Logged the message, Master [15:00:08] (PS2) Hashar: Add all Release-Engineering team as Gerrit admins [puppet] - https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) [15:00:11] paravoid: easy enough, lemme find the commit [15:00:25] (CR) Hashar: "Fix typo in comment: qhcris -> qchris" [puppet] - https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: Hashar) [15:00:26] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [15:00:40] (CR) Thcipriani: [C: 2 V: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/213992 (https://phabricator.wikimedia.org/T99535) (owner: KartikMistry) [15:01:01] (CR) Chad: "This doesn't grant admin access to gerrit, that's https://gerrit.wikimedia.org/r/#/admin/groups/1,members. 
I dunno what this group even do" [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: 10Hashar) [15:01:06] Merge "Assign weights to puppetmasters" into production [15:01:10] paravoid: https://gerrit.wikimedia.org/r/#/c/208933/ [15:01:10] that one probably [15:01:12] (03CR) 10Chad: [C: 04-1] Add all Release-Engineering team as Gerrit admins [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: 10Hashar) [15:01:15] yup [15:01:18] 6operations, 6Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317738 (10yuvipanda) It's back on on tools-master now, and doesn't really seem to be causing any issues atm. [15:01:19] right [15:01:26] RECOVERY - puppet last run on wtp1021 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:01:36] RECOVERY - puppet last run on elastic1028 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:01:43] (03Merged) 10jenkins-bot: CX: Add wikis for CX deployment on 20150528 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213992 (https://phabricator.wikimedia.org/T99535) (owner: 10KartikMistry) [15:01:45] RECOVERY - puppet last run on elastic1026 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:01:46] RECOVERY - puppet last run on mw1216 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:01:48] thcipriani: ping me when you sync :) [15:01:49] 6operations, 6Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317743 (10yuvipanda) @fgiunchedi but we also have nslcd which does that... 
[15:01:56] kart_: will do [15:01:58] paravoid: ssh knownhosts change + ipv6 change intersection already :) [15:02:00] paravoid: it actually bought us time to finish up properly puppetizing rhodium [15:02:01] -bast2001.wikimedia.org,bast2001,208.80.153.5,2620:0:860:1:92b1:1cff:fe00:9deb ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBHzoEEU4ymEV3hvsviSCPn4Fc4Sis7ysk9RtRvfXpKG/ZZykFju5mKTaeNjdHGXGfaYzr4/CDxiT4atZgdLNW7k= [15:02:05] +bast2001.wikimedia.org,bast2001,208.80.153.5,2620:0:860:1:208:80:153:5 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBHzoEEU4ymEV3hvsviSCPn4Fc4Sis7ysk9RtRvfXpKG/ZZykFju5mKTaeNjdHGXGfaYzr4/CDxiT4atZgdLNW7k= [15:02:05] RECOVERY - puppet last run on mw2214 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:02:06] RECOVERY - puppet last run on ganeti2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:02:06] RECOVERY - puppet last run on mw1218 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:02:08] haha :) [15:02:14] 6operations, 6Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317754 (10yuvipanda) [15:02:15] RECOVERY - puppet last run on cp1043 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:02:16] RECOVERY - puppet last run on db2035 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:02:16] awesome :) [15:02:25] RECOVERY - puppet last run on mw1040 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:02:25] RECOVERY - puppet last run on mw2197 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:02:26] RECOVERY - puppet last run on mw1124 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [15:02:28] nice [15:02:36] RECOVERY - puppet last run on mw2102 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:02:36] 
RECOVERY - puppet last run on wtp1024 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:03:26] RECOVERY - puppet last run on mw1130 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:03:37] RECOVERY - puppet last run on mw2041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:03:40] 2620:0:860:1:d6ae:52ff:feac:4dc8 Recursive DNS CRITICAL 2015-05-28 15:02:13 0d 0h 0m 32s 1/3 [15:03:43] bad check I assume [15:03:46] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:53] yup, probably [15:03:55] relying on the autoconf IP [15:04:00] lol [15:04:04] where is that defined at? [15:04:12] I think you committed that :) [15:04:36] ::dnsrecursor::monitor { [ $::ipaddress, $::ipaddress6_eth0 ]: } [15:04:39] manifests/role/dns.pp [15:05:14] ah yes [15:05:19] it will fix as it picks up facts I think [15:05:22] yup [15:05:29] kart_: syncing now [15:05:39] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Add wikis for CX deployment on 20150528 [[gerrit:213992]] (duration: 00m 15s) [15:05:44] Logged the message, Master [15:05:46] RECOVERY - configured eth on analytics1036 is OK - interfaces up [15:06:06] PROBLEM - puppet last run on mw2119 is CRITICAL Puppet has 1 failures [15:06:54] Dereckson: Mjbmr 2nd ping for SWAT [15:07:04] I'm here. [15:07:34] Mjbmr: cool, you're up next then once kart_ gives me that all clear. 
[15:08:36] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [15:09:00] thcipriani: thanks [15:09:09] kart_: yw :) [15:09:50] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/214247 (https://phabricator.wikimedia.org/T100513) (owner: Mjbmr) [15:10:20] (Merged) jenkins-bot: Enable SandboxLink for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214247 (https://phabricator.wikimedia.org/T100513) (owner: Mjbmr) [15:11:15] PROBLEM - NTP on dataset1001 is CRITICAL: NTP CRITICAL: Offset unknown [15:11:30] !log set operations/debs/txstatsd as hidden in gerrit -- deprecated [15:11:34] Logged the message, Master [15:11:42] thcipriani: do we have any code updates in SWAT? Should I go ahead with it if not? [15:12:21] kart_: everything seems to be config AFAICT [15:13:04] okay! [15:13:16] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable SandboxLink for cswiki [[gerrit:214247]] (duration: 00m 15s) [15:13:19] Logged the message, Master [15:13:34] operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1317806 (bd808) >>! In T100301#1316959, @jcrespo wrote: > Actionables: > > - Create a query on kibana-logstash for db-related errors I just made https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError... [15:13:57] Mjbmr: 214247 look good to you? [15:14:05] thcipriani: works. [15:14:10] Hi. [15:14:14] cool. [15:14:38] Dereckson: Hi, one more Mjbmr then yours are up [15:15:31] Okay. Thanks for your manual ping by the way, I didn't notice it was 15:00 UTC and irssi didn't react to jouncebot_. 
[15:15:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213841 (https://phabricator.wikimedia.org/T100431) (owner: 10Shanmugamp7) [15:15:45] RECOVERY - puppet last run on hooft is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:16:05] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=77.60 Read Requests/Sec=181.90 Write Requests/Sec=21.70 KBytes Read/Sec=1470.40 KBytes_Written/Sec=255.05 [15:17:07] Dereckson: np :) [15:18:05] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: PING CRITICAL - Packet loss = 100% [15:18:33] Hmmmm ... irssi don't hl on "" it seems. [15:19:09] (03PS1) 10Alexandros Kosiaris: Fix typo in dhcpd.conf [puppet] - 10https://gerrit.wikimedia.org/r/214359 [15:20:05] RECOVERY - NTP on dataset1001 is OK: NTP OK: Offset -0.002357959747 secs [15:20:16] RECOVERY - puppet last run on mw2119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:18] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo in dhcpd.conf [puppet] - 10https://gerrit.wikimedia.org/r/214359 (owner: 10Alexandros Kosiaris) [15:21:13] andrewbogott: labvirt1001 compiles fine now as well [15:21:24] akosiaris: interesting [15:22:14] andrewbogott: it did not have up to date facts :-( [15:22:25] which could be biting the other catalog compiler as well ? [15:22:36] (03Merged) 10jenkins-bot: Enable Extension:NewUserMessage on ta.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213841 (https://phabricator.wikimedia.org/T100431) (owner: 10Shanmugamp7) [15:22:42] seems possible [15:24:40] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Extension:NewUserMessage on ta.wikipedia [[gerrit:213841]] (duration: 00m 12s) [15:24:41] Mjbmr: ^ check please [15:24:44] Logged the message, Master [15:25:17] thcipriani: look ok. 
[15:25:26] kk [15:25:29] RECOVERY - NTP on analytics1036 is OK: NTP OK: Offset -0.06739878654 secs [15:26:02] thcipriani: ping me when swat is done :) [15:26:06] kart_: will do [15:26:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212724 (https://phabricator.wikimedia.org/T99879) (owner: 10Dereckson) [15:26:29] (03PS1) 10Faidon Liambotis: Remove IPv6 SLAAC addresses from network.pp etc. [puppet] - 10https://gerrit.wikimedia.org/r/214360 [15:26:31] (03CR) 10jenkins-bot: [V: 04-1] Enable NewUserMessage on sa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212724 (https://phabricator.wikimedia.org/T99879) (owner: 10Dereckson) [15:26:46] bblack: ^ [15:26:59] dangit, Dereckson can you rebase that one, sorry. [15:27:07] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=75.70 Read Requests/Sec=34.20 Write Requests/Sec=30.00 KBytes Read/Sec=1088.80 KBytes_Written/Sec=332.20 [15:27:42] paravoid: I think not yet, because some of those are precise? looking [15:28:00] (03PS1) 10Dereckson: Removed comma as nick separator for deployment events [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/214361 [15:28:08] Rebasing. [15:28:12] thanks [15:28:47] !log set operations/debs/python-statsd as hidden in gerrit -- deprecated [15:28:51] Logged the message, Master [15:29:56] akosiaris: meanwhile logstash for cx is okay now? I would like to test a bit in beta and then to production sometime next week. [15:30:59] bblack: oh right -- probably.. [15:31:57] (03PS2) 10Dereckson: Enable NewUserMessage on sa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212724 (https://phabricator.wikimedia.org/T99879) [15:31:59] Done. [15:32:04] paravoid: best way to detect virtualization in d-i that you can think of ? 
[15:32:14] paravoid: I think they're all precise except hooft, those changes [15:32:17] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=226.40 Read Requests/Sec=247.10 Write Requests/Sec=17.90 KBytes Read/Sec=4136.40 KBytes_Written/Sec=170.25 [15:33:03] paravoid: I am amending d-i preseed/include_command string for have some stuff like shutdown on successful install, the rootdevice different in virtualized envs [15:33:11] (03CR) 10Thcipriani: [C: 032] "SWAT, Take II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212724 (https://phabricator.wikimedia.org/T99879) (owner: 10Dereckson) [15:33:18] (03Merged) 10jenkins-bot: Enable NewUserMessage on sa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212724 (https://phabricator.wikimedia.org/T99879) (owner: 10Dereckson) [15:33:35] (03CR) 10Alexandros Kosiaris: [C: 032] qualify hadoop erb variables [puppet] - 10https://gerrit.wikimedia.org/r/214081 (owner: 10Alexandros Kosiaris) [15:34:22] oh hooft too for now heh [15:34:50] (03CR) 10BBlack: [C: 04-1] "All of these SLAAC exist on precise hosts, which don't have the token fixup." [puppet] - 10https://gerrit.wikimedia.org/r/214360 (owner: 10Faidon Liambotis) [15:35:12] 6operations, 6Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1317857 (10Andrew) [15:35:19] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable NewUserMessage on sa.wikipedia [[gerrit:212724]] (duration: 00m 13s) [15:35:23] Logged the message, Master [15:35:41] Dereckson: ^ look good to you? [15:36:07] 6operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1317860 (10jcrespo) Thanks, @bd808, I would add a couple of panels (group by error message & count, for example) and add it to the main page, if everyone is ok with that (but obviously, //I// can do that myself). 
[15:36:22] (03CR) 10Ori.livneh: add varnishstatsd (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh) [15:36:46] Nothing broken. We'll have to wait some minutes to catch our first welcomed user. So yes looks good. [15:37:11] 6operations, 6Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1317865 (10Andrew) p:5Triage>3Normal [15:37:51] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210680 (https://phabricator.wikimedia.org/T98926) (owner: 10Glaisher) [15:38:44] (03Merged) 10jenkins-bot: Prevent indexing of User: namespace on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210680 (https://phabricator.wikimedia.org/T98926) (owner: 10Glaisher) [15:39:08] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=93.60 Read Requests/Sec=33.40 Write Requests/Sec=37.10 KBytes Read/Sec=1217.20 KBytes_Written/Sec=442.60 [15:39:11] Dereckson: oh, thanks. [15:39:23] Yw. [15:40:36] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Prevent indexing of User: namespace on ukwiki [[gerrit:210680]] (duration: 00m 14s) [15:40:40] Logged the message, Master [15:42:21] Looks good. [15:43:00] Dereckson: u speak js? 
[15:43:37] (03PS2) 10Ori.livneh: add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 [15:44:17] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206510 (https://phabricator.wikimedia.org/T96669) (owner: 10Glaisher) [15:44:41] (03Merged) 10jenkins-bot: Modify AbuseFilter block configuration on eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206510 (https://phabricator.wikimedia.org/T96669) (owner: 10Glaisher) [15:44:49] (03PS12) 10Alexandros Kosiaris: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: 10KartikMistry) [15:45:00] RECOVERY - Disk space on dataset1001 is OK: DISK OK [15:45:06] kart_: now it is OK ^ [15:45:51] (03CR) 10Ori.livneh: "@Ottomata:" [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh) [15:45:55] akosiaris: thanks! [15:46:12] akosiaris: will ask for merge sometime later. [15:47:17] !log thcipriani Synchronized wmf-config/abusefilter.php: SWAT: Modify AbuseFilter block configuration on eswikibooks [[gerrit:206510]] (duration: 00m 15s) [15:47:20] (03PS13) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [15:47:22] Logged the message, Master [15:47:34] No idea about how to test abuse filter block duration (excepted with a rule), but es.wikibooks.org is still alive. So looks good too. [15:47:45] Dereckson: gotcha. Thanks. [15:47:45] Steinsplitter: how can I help you? [15:47:55] kart_: SWAT is complete [15:47:57] Thanks for the deploy thcipriani. 
[15:48:11] Dereckson: thanks for the assist :) [15:48:30] operations, Labs, hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1317888 (Andrew) p:Triage>Normal [15:48:33] thcipriani: cool [15:48:35] (PS1) BBlack: Revert v6 tokens for interface::tagged [puppet] - https://gerrit.wikimedia.org/r/214372 [15:48:36] thcipriani: thanks! [15:48:37] (PS1) BBlack: remove dysfunctional wmflabs lookups from prod dns caches [puppet] - https://gerrit.wikimedia.org/r/214373 [15:48:45] Dereckson: i don't speak js, but i get some deprecation warnings in my console. Looks like some scripts in MW: namespace are outdated. It would be highly utter <3 nice if you could take a look at it :) [15:49:01] tawiki confirmed, got a new user welcomed. [15:49:41] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=79.50 Read Requests/Sec=197.50 Write Requests/Sec=9.10 KBytes Read/Sec=1205.60 KBytes_Written/Sec=1134.35 [15:49:59] Steinsplitter: okay, let's take that on #wikimedia-tech or #wikimedia-commons. [15:50:17] (PS14) Alexandros Kosiaris: CX: Log to logstash [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: KartikMistry) [15:50:19] (CR) BBlack: [C: 2] Revert v6 tokens for interface::tagged [puppet] - https://gerrit.wikimedia.org/r/214372 (owner: BBlack) [15:50:39] (CR) BBlack: [C: 2] remove dysfunctional wmflabs lookups from prod dns caches [puppet] - https://gerrit.wikimedia.org/r/214373 (owner: BBlack) [15:50:51] !log kartik Started scap: Update ContentTranslation [15:50:56] Logged the message, Master [15:52:39] akosiaris: good catch! :/ [15:54:11] !log kartik Finished scap: Update ContentTranslation (duration: 03m 19s) [15:54:14] Logged the message, Master [15:54:28] wth. scap finished in 3 minutes? [15:55:44] thcipriani: bd808 any idea? 
[15:56:35] well the only version currently deployed is 1.26wmf6 since there were a bunch of rollbacks yesterday [15:56:52] oops. [15:57:03] I updated code in wmf7/wmf8 [15:57:03] was there anything to update for that version? [15:57:10] nope [15:57:22] should I update wmf6 too then? [15:57:29] thcipriani: ^^ [15:58:10] those versions will be "redeployed" this afternoon with the train. [15:58:11] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=52.00 Read Requests/Sec=35.00 Write Requests/Sec=42.50 KBytes Read/Sec=333.60 KBytes_Written/Sec=317.45 [15:58:20] was there any email about it? [15:58:34] uh...I don't know...I read it in scrollback here this morning [16:00:00] operations, Labs, hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1317913 (mark) So PowerDNS supports multiple instances I think. Wouldn't this be easy to do on a separate IP? From a resources perspective, it doesn't need another full server at all... [16:00:04] kart_: Dear anthropoid, the time has come. Please deploy Content Translation Deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150528T1600). [16:00:12] thcipriani: that's sad. [16:00:19] will those versions still contain the bad cookie change? [16:00:23] jouncebot_: yes sir, but we are done. [16:00:55] https://wikitech.wikimedia.org/wiki/Incident_documentation/20150527-Cookie [16:01:04] ^ incident report from the scrollback being referred to above [16:01:16] I don't think anyone had made a call last night on what to do about moving forward with the train [16:01:23] greg-g: ping? ^ [16:02:20] I think the bad commit in question was reverted on master, not sure beyond that [16:02:35] bblack: kk, one second, in a meeting [16:02:38] (we have protection against it anyways, but it would be better if we didn't push out that cookie name even more) [16:03:44] bblack: thanks! [16:03:49] right. There was also talk of a patch to remove that cookie. 
[16:04:37] (PS3) Ottomata: Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261 [16:04:46] I think if we need to try to wipe it from browsers, we could do that at the varnish level as a temp hack [16:04:53] (CR) Ottomata: Add varnishlog python module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata) [16:04:59] but most likely, assuming they had valid expiries, we'll just wait for them to expire before pulling out the workaround [16:05:01] (CR) Ottomata: Add varnishlog python module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata) [16:05:11] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=157.30 Read Requests/Sec=212.40 Write Requests/Sec=5.40 KBytes Read/Sec=1986.00 KBytes_Written/Sec=44.45 [16:05:29] (CR) Ottomata: Add varnishlog python module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata) [16:05:44] thcipriani: I hope I don't need to do anything now, right? wmf7/wmf8 will be deployed when cookie is eaten. [16:07:34] kart_: I think you're good: your content translation stuff will get pushed out with the next train. We just need to backport the reversion of the bad commit to 7 and 8, I think. [16:07:47] operations, Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1317947 (fgiunchedi) nslcd AFAIK doesn't cache responses, so querying nss will still result in a ldap lookup (through nslcd) if nscd isn't running [16:08:23] operations, Labs, hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1317948 (Andrew) that's probably fine, I will try. [16:08:40] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=130.20 Read Requests/Sec=104.80 Write Requests/Sec=10.70 KBytes Read/Sec=1051.60 KBytes_Written/Sec=80.30 [16:09:00] thcipriani: cool. 
I can run tomorrow \0/ [16:09:01] a random volunteer was denied making a new wiki project in two different tasks [16:09:07] so now they have taken to sending PMs to folks [16:09:09] including me. [16:09:15] =P [16:09:19] (phabricator PMs) [16:09:26] robh: you're blessed :) [16:09:28] who knew you could pm in phab ;D [16:16:54] (PS1) Alexandros Kosiaris: install-server: Accomodate virtualization [puppet] - https://gerrit.wikimedia.org/r/214377 [16:17:11] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=179.40 Read Requests/Sec=147.90 Write Requests/Sec=40.20 KBytes Read/Sec=2223.20 KBytes_Written/Sec=422.55 [16:18:54] this looks like bacula ^ [16:19:41] as to why sodium has problem with such little IO... probably bad check ? [16:20:50] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=209.70 Read Requests/Sec=202.00 Write Requests/Sec=4.20 KBytes Read/Sec=4410.40 KBytes_Written/Sec=34.70 [16:21:21] (CR) QChris: [C: 1] "Chad> I dunno what this group even does if it's not root." [puppet] - https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: Hashar) [16:24:09] (CR) Alexandros Kosiaris: [C: 1] "I am myself a bit skeptical on the check. Perhaps checking for the presence of /dev/vda would be better (or worse) ? 
What do you guys thin" [puppet] - https://gerrit.wikimedia.org/r/214377 (owner: Alexandros Kosiaris) [16:24:36] (CR) Filippo Giunchedi: Add varnishlog python module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata) [16:25:44] (CR) Filippo Giunchedi: [C: 1] Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata) [16:26:01] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=82.70 Read Requests/Sec=39.10 Write Requests/Sec=17.90 KBytes Read/Sec=243.20 KBytes_Written/Sec=226.25 [16:26:48] (PS6) Andrew Bogott: Use the new service names for labs puppetmasters. [puppet] - https://gerrit.wikimedia.org/r/214105 [16:26:50] (PS11) Andrew Bogott: Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - https://gerrit.wikimedia.org/r/213543 [16:29:55] akosiaris: can I get one more compiler run? Fixed up some DNS things which should shrink the diff yet more. [16:31:11] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=314.00 Read Requests/Sec=211.20 Write Requests/Sec=18.70 KBytes Read/Sec=2184.40 KBytes_Written/Sec=153.00 [16:33:01] greg-g: https://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Modifications_r%C3%A9centes&namespace=&tagfilter=visualeditor by way of example, BTW. [16:34:43] operations, Labs, hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1318026 (RobH) a:RobH>Andrew [16:35:17] operations, Labs, hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1285990 (RobH) Since this is now going to be a test with a second IP, I'm pulling the #hardware-request project for now. If this goes back to needing new metal, just re-append it back on. 
[16:35:27] operations, Labs: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1318029 (RobH) [16:35:54] (CR) Filippo Giunchedi: add varnishstatsd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh) [16:36:20] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=143.40 Read Requests/Sec=85.60 Write Requests/Sec=78.40 KBytes Read/Sec=1016.40 KBytes_Written/Sec=3149.30 [16:37:08] bblack: thanks for that writeup, so, summary on plan to move forward: Ok for me to ask mukunda to catch the train back up now? [16:37:43] yes, do whatever to catch up train and proceed, but preferably get the bad cookie commit out of wmf7/8 before redeploying them [16:37:50] * greg-g nods [16:37:52] twentyafterfour: ^ [16:38:07] greg-g: :) [16:38:18] (CR) Faidon Liambotis: [C: -1] "1) The setting is overrideable. We override it for ms-be, for example, to boot from sdm/sdn." [puppet] - https://gerrit.wikimedia.org/r/214360 (owner: Faidon Liambotis) [16:38:32] do we have a patch for the "bad cookie" ? [16:38:43] yeah, and it was already reverted in at least master [16:39:09] https://gerrit.wikimedia.org/r/#/c/176948/ was the bad one [16:39:18] bblack: btw, is the incident report @ wikitech done? [16:39:25] you should definitely mail ops@ with it :) [16:39:39] it's there, but the category isn't updated [16:39:55] and yes, I should, but I really have to run out now, will later :) [16:40:06] k [16:40:06] * greg-g is adding actionables :) [16:40:37] twentyafterfour: so yeah, backport that to wmf7/8 and we should be good [16:41:01] greg-g: backport a revert of that, right? 
[16:41:17] * YuviPanda is adding actionables for https://wikitech.wikimedia.org/wiki/Incident_documentation/20150527-GridEngine [16:41:51] (PS1) Faidon Liambotis: Remove commented-out frack list from site.pp [puppet] - https://gerrit.wikimedia.org/r/214378 [16:42:14] (CR) Faidon Liambotis: [C: 2] Remove commented-out frack list from site.pp [puppet] - https://gerrit.wikimedia.org/r/214378 (owner: Faidon Liambotis) [16:42:18] twentyafterfour: right right [16:42:22] it's already been cherry-picked to wmf7 and 8 [16:42:27] https://gerrit.wikimedia.org/r/#/q/I40f0c2141e25ad37af0babfa95421915adce496b,n,z [16:42:31] ah, looks https://gerrit.wikimedia.org/r/#/q/I40f0c2141e25ad37af0babfa95421915adce496b,n,z [16:42:32] (CR) Faidon Liambotis: [V: 2] Remove commented-out frack list from site.pp [puppet] - https://gerrit.wikimedia.org/r/214378 (owner: Faidon Liambotis) [16:42:34] yeah [16:42:36] :) [16:43:21] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=338.20 Read Requests/Sec=282.90 Write Requests/Sec=29.90 KBytes Read/Sec=1982.80 KBytes_Written/Sec=379.60 [16:43:31] Ok I'll go ahead with wikiversions changes [16:43:42] nobody deploying now, right? [16:43:45] can anyone look at sodium? mutante maybe? [16:44:10] jynus, kart_: ^ [16:44:29] ? [16:44:31] it looks like kart_ might be, cxserver deployment on the calendar [16:44:33] sodium? [16:44:41] no, see twentyafterfour's message [16:44:45] paravoid: the params for sodium are off [16:44:55] ah, mailman [16:45:01] twentyafterfour: I'm finished. [16:45:01] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=113.50 Read Requests/Sec=21.10 Write Requests/Sec=80.40 KBytes Read/Sec=135.20 KBytes_Written/Sec=3027.55 [16:45:17] Krenair: done with it and found out wmf7/8 are not deployed :) [16:45:23] twentyafterfour: ^^ [16:45:44] kart_: wmf7 and 8 are about to be deployed :) [16:45:52] twentyafterfour: cool. [16:45:55] I'll wait. 
[16:45:57] :) [16:46:10] give me about 5 minutes, I want to do this carefully :) [16:46:49] no more cookies? :) [16:47:05] https://gerrit.wikimedia.org/r/#/q/I40f0c2141e25ad37af0babfa95421915adce496b,n,z [16:47:24] cookie change reverted on the branches [16:48:15] ACKNOWLEDGEMENT - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=145.20 Read Requests/Sec=104.30 Write Requests/Sec=36.40 KBytes Read/Sec=717.60 KBytes_Written/Sec=401.25 Jcrespo ack [16:48:27] jynus: ack why? [16:48:53] no more alerts while I check it? [16:49:10] no, don't do that -- ack is for more long-term stuff [16:49:17] ok, then [16:49:31] thx for looking though :) [16:49:39] (CR) Filippo Giunchedi: "agreed the check should be more robust, what about going explicit again? e.g" [puppet] - https://gerrit.wikimedia.org/r/214377 (owner: Alexandros Kosiaris) [16:50:11] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=216.70 Read Requests/Sec=250.90 Write Requests/Sec=18.20 KBytes Read/Sec=4098.40 KBytes_Written/Sec=567.60 [16:51:48] so group0 to 1.26wmf8 and everything else to 1.26wmf7 right? [16:51:57] greg-g: ^ [16:52:22] !log redirecting ns1 traffic to rubidium (= ns0) in preparation for baham upgrade [16:52:25] Logged the message, Master [16:52:46] it is bacula [16:52:47] (PS4) Ottomata: Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261 [16:53:16] ottomata: an1026 puppet failure for ~22h [16:53:29] ottomata: plus varnishkafka warnings [16:53:42] i looked at it earlier, thought it might have been some intermittent thing. 
has something to do with user puppet stuff [16:53:57] an1026 [16:54:03] don't see the vks atm [16:54:15] (PS1) 20after4: Train deploy, a day later: Group0 to 1.26wmf8, everything else to 1.26wmf7 [mediawiki-config] - https://gerrit.wikimedia.org/r/214380 [16:54:33] (CR) jenkins-bot: [V: -1] Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata) [16:54:36] (PS1) Faidon Liambotis: autoinstall: rubidium/baham to jessie [puppet] - https://gerrit.wikimedia.org/r/214381 [16:55:08] (PS5) Ottomata: Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261 [16:55:10] (CR) 20after4: [C: 2] Train deploy, a day later: Group0 to 1.26wmf8, everything else to 1.26wmf7 [mediawiki-config] - https://gerrit.wikimedia.org/r/214380 (owner: 20after4) [16:55:15] (CR) Faidon Liambotis: [C: 2 V: 2] autoinstall: rubidium/baham to jessie [puppet] - https://gerrit.wikimedia.org/r/214381 (owner: Faidon Liambotis) [16:55:18] (Merged) jenkins-bot: Train deploy, a day later: Group0 to 1.26wmf8, everything else to 1.26wmf7 [mediawiki-config] - https://gerrit.wikimedia.org/r/214380 (owner: 20after4) [16:55:31] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=118.30 Read Requests/Sec=132.80 Write Requests/Sec=5.30 KBytes Read/Sec=7805.20 KBytes_Written/Sec=58.85 [16:55:32] oh weird [16:55:47] (CR) jenkins-bot: [V: -1] Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata) [16:55:51] impala user on an26 is > LAST_SYSTEM_UID [16:55:52] hm [16:56:16] the recent change was not a functional change [16:56:19] oh, no its not [16:56:26] ID_BOUND(A)RY was 999, same as LAST_SYSTEM_UID [16:56:37] yeah, impala is 996 should be ok [16:56:56] root@analytics1026:~# enforce-users-groups --dry-run [16:56:56] /usr/local/sbin/enforce-users-groups removing user/id: llama/11985 [16:57:03] uid=11985(llama) gid=996(llama) 
groups=996(llama) [16:57:12] $ id impala [16:57:12] uid=996(impala) gid=995(impala) groups=995(impala),119(hdfs),121(hive) [16:57:17] llama, not impala [16:57:19] OH [16:57:21] doh [16:57:45] that's it, hm. probably the package is bad, installed it as non system user [16:58:17] i'll just add an exclude? [16:58:30] can't you just fix it? [16:58:36] create the user before the package creates it [16:58:43] also, I don't understand why it started failing now [16:58:46] Ops-Access-Requests, operations, Release-Engineering, Wikimedia-Git-or-Gerrit, Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1318095 (RobH) @Greg, can we get you to approve this as the manager of release engineering? [16:59:03] don't get that either [16:59:41] !log reimaging baham [16:59:45] Logged the message, Master [17:00:26] yep, nothing strange [17:00:34] either we tell bacula to go slower [17:00:52] or tell the check to allow higher read IO when doing backups [17:01:02] I will set a task [17:01:28] bwerf, ok will fix this soon [17:01:33] !log twentyafterfour Started scap: Group0 to 1.26wmf8, everything else to 1.26wmf7 [17:01:42] Logged the message, Master [17:01:43] and, if paravoid is ok, ack it now for real :-) [17:02:16] twentyafterfour: yeah [17:02:43] Ops-Access-Requests, operations, Release-Engineering, Wikimedia-Git-or-Gerrit, Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1318113 (greg) {F100025} [17:02:55] operations, Analytics-Cluster: Fix llama user id - https://phabricator.wikimedia.org/T100678#1318116 (Ottomata) NEW a:Ottomata [17:03:23] watching fatalmonitor, hopefully nothing blows up this time [17:03:48] funny thing is that there are like 2 commits trying to fix this [17:03:50] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=301.80 Read Requests/Sec=156.50 Write Requests/Sec=76.90 KBytes 
Read/Sec=939.20 KBytes_Written/Sec=4007.35 [17:05:57] I would also like sysstat on all hosts, but that is my preference [17:06:18] (PS6) Ottomata: Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261 [17:07:25] operations, ops-eqiad: analytics1028, Replace system board, raid card - Disks OK - https://phabricator.wikimedia.org/T99947#1318132 (Cmjohnson) The new DIMM is expected 5/29 [17:08:00] PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100% [17:08:01] operations, ops-eqiad: dataset1001: add new disk array - https://phabricator.wikimedia.org/T99808#1318134 (Cmjohnson) Open>Resolved This has been completed. [17:09:09] operations, Possible-Tech-Projects, Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#760100 (greg) [17:09:28] operations, Release-Engineering: Try out hack ( scap is syncing now [17:11:40] PROBLEM - puppet last run on baham is CRITICAL: Connection refused by host [17:12:00] PROBLEM - Auth DNS on baham is CRITICAL - Plugin timed out while executing system call [17:12:01] PROBLEM - RAID on baham is CRITICAL: Connection refused by host [17:12:02] PROBLEM - salt-minion processes on baham is CRITICAL: Connection refused by host [17:12:20] PROBLEM - configured eth on baham is CRITICAL: Connection refused by host [17:12:40] PROBLEM - dhclient process on baham is CRITICAL: Connection refused by host [17:12:51] PROBLEM - DPKG on baham is CRITICAL: Connection refused by host [17:13:11] PROBLEM - Disk space on baham is CRITICAL: Connection refused by host [17:14:35] operations, hardware-requests, Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1318159 (RobH) a:hashar Since the second server is on hold, rather than keep this unassigned (and end up with me checking it every... 
[17:14:57] I'm still seeing quite a lot of "cannot connect to mysql" in logstash [17:16:18] operations, ops-requests, Patch-For-Review: Monitor mailman - https://phabricator.wikimedia.org/T84150#1318163 (jcrespo) Resolved>Open @Dzhan, I/O monitoring is producing false positives when creating backups with bacula, with 7805.20 KB/s of reads and 4007.35 KB/s of writes. I have disabled tha... [17:16:25] <_joe_> twentyafterfour: lemme merge https://gerrit.wikimedia.org/r/#/c/214295/ [17:16:39] <_joe_> or, lemme verify those don't come from the canaries [17:17:27] twentyafterfour, it is a known state due to an HHVM config timeout [17:17:51] ok so I shouldn't be concerned? [17:18:38] <_joe_> twentyafterfour: what search you do specifically on logstash? [17:18:46] (PS7) Ottomata: Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261 [17:18:56] _joe_ https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [17:20:07] <_joe_> uhm I see different errors though [17:20:11] there are a lot of cannot connect to mysql, and it's up since a few minutes ago [17:20:32] <_joe_> "Lock wait timeout exceeded; try restarting transaction" [17:20:49] (CR) Jcrespo: "Please see my comment on T84150" [puppet] - https://gerrit.wikimedia.org/r/204179 (owner: Dzahn) [17:20:57] ewe yeah and that too [17:21:09] so we need to roll back again? [17:21:30] <_joe_> twentyafterfour: these errors are serious [17:21:35] there is also this, in fatalmonitor on fluorine: JobQueueGroup::__destruct: 1 buffered job(s) never inserted. 
in /srv/mediawiki/php-1.26wmf6/includes/jobqueue/ [17:21:37] JobQueueGroup.php on line 419 [17:21:41] <_joe_> cannot connect is a different problem than before [17:21:52] <_joe_> so yeah something strange is happening [17:22:08] well I just sync'd the train [17:22:21] still syncing actually [17:22:21] <_joe_> I can't see which hosts they do come from [17:22:25] <_joe_> lemme see on disk [17:22:45] 80% of servers sync'd to 1.26wmf7 [17:22:56] which host? [17:23:40] Notice: Bad stat cache or race condition for file mwstore://global-swift-eqiad/captcha-render/3/d/d/image_e4189843_3dd09aa93453972c.png. in /srv/mediawiki/php-1.26wmf6/includes/filebackend/FileBackendStore.php on line 867 [17:23:51] that's from the old branch still hmm [17:25:28] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 52.43 ms [17:25:31] even though I just sync'd, most of the errors I've seen so far are still from 1.27wmf7 [17:25:34] greg-g: Do you know if anyone has taken lead on writing the incident report for T100248 / T100438 (change tagging breakage)? [17:25:54] I see max 7 db errors per second, which is on the threshold of the known problem [17:26:39] <_joe_> jynus: that problem manifests as "can't connect to mysql"? [17:26:45] <_joe_> do I remember correctly? [17:26:52] yes [17:26:54] James_F: not that I know of [17:26:58] <_joe_> twentyafterfour: I see a ton of those JobQueueGroup messages [17:27:22] <_joe_> that would really need to be fixed :) [17:27:26] or at least that has been happening for a long time [17:27:36] not new [17:28:07] why would we have an incident report for that? aren't those reserved for site outages? 
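(Editorial aside, not part of the channel log.) The participants above are eyeballing which deployed branch each error comes from ("that's from the old branch still"). A minimal sketch of doing that tally mechanically, assuming the branch name can be read out of the `/srv/mediawiki/php-<branch>/` path that appears in the quoted messages; the function name `errors_per_branch` and the "unknown" bucket are hypothetical, not anything the channel or fatalmonitor actually uses:

```python
import re
from collections import Counter

# Branch directory as it appears in the quoted log lines,
# e.g. /srv/mediawiki/php-1.26wmf6/includes/...
BRANCH_RE = re.compile(r"/srv/mediawiki/php-(\d+\.\d+wmf\d+)/")


def errors_per_branch(lines):
    """Count log lines per MediaWiki branch found in their file paths.

    Lines without a recognisable branch path (for example a bare
    database error message) are counted under "unknown".
    """
    counts = Counter()
    for line in lines:
        match = BRANCH_RE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

Feeding it a window of log lines during a sync would show whether errors are shifting from the old branch to the new one, which is the question being asked in the channel.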
[17:28:08] (!= not something that should be fixed) [17:28:37] operations, Phabricator, database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1318183 (JAufrecht) Currently stuck on one issue while preparing a dump script on phab-01: 1) What database credentials (username, password if any, database name) should I use to c... [17:28:41] yeah those jobqueue messages are probably not serious though? [17:28:59] I mean, they need to be fixed but probably not rollback worthy right? [17:29:00] Krenair: Breaking the #1 priority A/B test with a Friday deployment? [17:29:50] !log twentyafterfour Finished scap: Group0 to 1.26wmf8, everything else to 1.26wmf7 (duration: 28m 16s) [17:29:55] Logged the message, Master [17:29:57] ACKNOWLEDGEMENT - puppet last run on analytics1026 is CRITICAL Puppet has 1 failures daniel_zahn T100678 [17:30:08] yeah it's a high priority software regression, but not a site outage... [17:30:30] unbreak now things don't just get incident documentation.. [17:30:56] Krenair: Sometimes they do. 
[17:31:10] yes, but the point is not necessarily [17:31:17] https://gerrit.wikimedia.org/r/#/c/212485/ <-- that is supposed to fix the jobqueuegroup errors [17:31:19] ACKNOWLEDGEMENT - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error daniel_zahn T98173 [17:31:35] <_joe_> twentyafterfour: I agree [17:33:16] ewwe now lots of this: [17:33:18] OutputPage::getModuleStyles: style module should define its position explicitly: ext.globalCssJs.user Resource [17:33:20] LoaderGlobalUserModule [Called from OutputPage::getModuleStyles in /srv/mediawiki/php-1.26wmf8/includes/OutputPage.php [17:33:22] at line 603] [17:33:27] twentyafterfour: yes known [17:33:33] it's overrunning the logs [17:34:14] (CR) Giuseppe Lavagetto: "Please note that this didn't work, as we already set hhvm::fcgi_settings in mediawiki::hhvm, I'll write a fixup." [puppet] - https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) (owner: BryanDavis) [17:34:21] operations, Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1318213 (faidon) Update, on the Debian front: - gdnsd 2.1.2-1 is in Debian unstable/testing. 2.1.2-1~deb8u1 is in stable-proposed-updates and will be part of Debian 8.1 in a few days. It's also, tempora... [17:35:04] (PS12) Andrew Bogott: Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - https://gerrit.wikimedia.org/r/213543 [17:35:19] what the hell [17:35:30] 419GB of swap on a newly-provisioned box [17:35:31] seriously? [17:35:51] (CR) Giuseppe Lavagetto: [C: -1] "Should be hhvm::extra::cli and hhvm::extra::fcgi which are used by hiera" [puppet] - https://gerrit.wikimedia.org/r/214295 (https://phabricator.wikimedia.org/T98489) (owner: BryanDavis) [17:36:15] <_joe_> paravoid: someone maybe changed something in partman? 
[17:36:21] paravoid, lol [17:36:26] <_joe_> paravoid: it happened to copper as well I think [17:37:47] LVM fail, I think [17:38:29] (CR) Andrew Bogott: [C: 2] Replace many references to virt1000 and labcontrol2001 with hiera lookups [puppet] - https://gerrit.wikimedia.org/r/213543 (owner: Andrew Bogott) [17:39:44] no, the recipe is the same [17:39:48] raid1-lvm.cfg [17:39:54] last touched Oct 2013 [17:40:03] 1000 1000 1000 linux-swap \ [17:40:05] this bit is ignored [17:40:10] <_joe_> nice [17:40:11] and the whole space gets allocated to swap [17:40:19] <_joe_> partman bug then? [17:40:23] possibly [17:40:51] (PS1) Giuseppe Lavagetto: mediawiki: raise the mysql timeout to 3 seconds [puppet] - https://gerrit.wikimedia.org/r/214392 (https://phabricator.wikimedia.org/T98489) [17:41:34] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: raise the mysql timeout to 3 seconds [puppet] - https://gerrit.wikimedia.org/r/214392 (https://phabricator.wikimedia.org/T98489) (owner: Giuseppe Lavagetto) [17:42:10] bblack: baham's facter: ipaddress6 => 2620:0:862:ed1a::e [17:42:52] <_joe_> twentyafterfour: a ton of May 28 17:42:43 mw1254: #012Warning: OutputPage::getModuleStyles: style module should define its position explicitly: ext.globalCssJs.site ResourceLoaderGlobalSiteModule [Called from OutputPage::getModuleStyles in /srv/mediawiki/php-1.26wmf8/includes/OutputPage.php at line 603] in /srv/mediawiki/php-1.26wmf8/includes/debug/MWDebug.php on line 300 [17:42:58] <_joe_> in the logs [17:43:00] yes [17:43:01] I know [17:43:02] <_joe_> like really a ton [17:43:07] I'm reviewing the patch for it now [17:43:16] it's a non-trivial fix [17:43:17] <_joe_> ok, all good then :) [17:43:20] <_joe_> oh [17:43:26] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL - Plugin timed out while executing system call [17:43:35] <_joe_> andrewbogott: ^^??? 
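(Editorial aside, not part of the channel log.) The recipe line quoted a little earlier, `1000 1000 1000 linux-swap \`, reads as the usual partman-auto triple of minimum size, priority and maximum size followed by the filesystem type. A rough sketch of checking an observed volume size against that cap, assuming plain integer megabyte values only (percentage or unlimited forms are not handled); both function names are hypothetical and this is not how the installer itself evaluates recipes:

```python
def parse_partman_size_spec(line):
    """Parse the leading '<min> <priority> <max> <fs>' fields of a recipe line.

    Assumes all three numbers are plain integers in megabytes and that a
    trailing backslash is only a line continuation marker.
    """
    fields = line.rstrip("\\ \t").split()
    min_mb, priority, max_mb = (int(value) for value in fields[:3])
    return {
        "min_mb": min_mb,
        "priority": priority,
        "max_mb": max_mb,
        "fs": fields[3],
    }


def overshoots_recipe(actual_mb, spec):
    """Return True when an observed volume is larger than the recipe's maximum."""
    return actual_mb > spec["max_mb"]
```

With the figures from the channel, a swap volume of roughly 419 GB against a 1000 MB maximum would be flagged, which matches the "this bit is ignored" observation.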
[17:44:02] operations, Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318238 (Dzahn) The nameservers for wmflabs.org are set to: whois wmflabs.org ... Name Server:LABS-NS0.WIKIMEDIA.ORG Name Server:LABS-NS1.WIKIMEDIA.ORG The IP addresses of these are (fro... [17:45:05] (CR) Giuseppe Lavagetto: "see https://gerrit.wikimedia.org/r/#/c/214392/" [puppet] - https://gerrit.wikimedia.org/r/214295 (https://phabricator.wikimedia.org/T98489) (owner: BryanDavis) [17:45:11] greg-g: Where are we at with production right now? Could we do an emergency VisualEditor fix for the core-broken revision tags? [17:45:39] <_joe_> James_F: we've got a shower of failures on the cluster at the moment [17:45:46] _joe_: Fun. :-( [17:45:50] <_joe_> I'd wait for those to be solved? [17:45:54] goddamit stupid puppet [17:45:58] <_joe_> James_F: non-critical, I hope [17:46:12] need me to do anything James_F? [17:46:16] _joe_: I'm >< close away from reverting the core patch that broke everything. [17:46:30] <_joe_> James_F: uh I don't mind :) [17:46:31] Krenair: Yeah, but only once production isn't completely broken. :-( [17:46:34] _joe_, you're not talking about the job warnings are you? [17:46:44] _joe_: The Performance Department might. :_0 [17:46:46] <_joe_> Krenair: what I posted a minute back [17:46:52] oh those [17:47:04] I thought those were gone last time I checked the logs a few hours ago [17:47:08] <_joe_> Krenair: kunal stated the fix is not trivial, I would advise to wait for him to be done [17:47:19] <_joe_> Krenair: nope, there is a ton going on just now [17:47:38] <_joe_> like several tens per second [17:47:38] gosh, yeah, they're everywhere [17:47:39] Krenair: Could you do core pull-throughs of https://gerrit.wikimedia.org/r/#/q/I2688356cd5f628dca395d1caaa82b9a5b21c025e,n,z and if greg-g says it's OK they need to go out ASAP to unbreak production. But as _joe_ says, let's not make things worse. 
[17:47:44] Krenair: the warnings weren't present on wmf6 [17:47:51] Krenair: (Sorry.) [17:47:58] legoktm, oh I see, because of the late train deployment [17:48:04] that makes sense [17:48:12] we have a couple of bugs sitting around for these warnings [17:48:37] (PS2) Jforrester: Enable A/B test of VisualEditor for new accounts on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/211697 (https://phabricator.wikimedia.org/T90666) [17:48:54] operations, Roadmap, Wikimedia-Mailing-lists, notice, user-notice: Mailing list maintenance window - 2015-05-19 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T99098#1318262 (Ladsgroup) If it's possible please rename pywikipedia-l to pywikibot (the framework itself is changed) [17:49:13] James_F, okay, problem [17:49:18] one of these is for wmf6 [17:49:20] we need wmf8 [17:49:35] Krenair: There are ones for each of wmf6,7,8. [17:49:42] (PS3) Ottomata: add varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh) [17:49:53] Krenair: Given that production keeps lurching between wmf7 and wmf6 as the heroic battle against instability continues… [17:49:54] I only see wmf6 and wmf7 on this list? am I missing one? [17:50:12] Krenair: Oh. I gave you the wrong link. [17:50:14] * James_F sighs. [17:50:20] ah [17:50:28] Krenair: https://gerrit.wikimedia.org/r/#/q/Ic4c7d8d89b8cfeee57eda867c0ff74fa9682ffc8,n,z [17:50:32] And now to meeting. 
[17:50:37] <_joe_> errors are in php-1.26wmf8 btw [17:50:37] (CR) Ottomata: "Just rebased this on the varnishlog.py change, and modified this to use varnishlog module instead of interacting with ctypes and varnishap" [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh) [17:50:44] okay [17:50:46] <_joe_> so it seems we have wmf8 in prod :) [17:50:59] only group0 [17:51:06] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL - Plugin timed out while executing system call [17:51:35] (CR) jenkins-bot: [V: -1] add varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh) [17:51:39] right, wmf8 = group0, wmf7 = everything else? [17:51:53] group0 is mediawiki.org, testwiki, test2wiki and zerowiki iirc [17:52:08] _joe_: the labs-ns0 thing has to do with the IP address of the server being changed, neon gets 208.80.152.33 but from external it's 208.80.154.19 for labs-ns0. i'll ping [17:52:14] right [17:52:15] (PS2) BryanDavis: Set HHVM mysql connection timeout to 3s on app and api servers [puppet] - https://gerrit.wikimedia.org/r/214295 (https://phabricator.wikimedia.org/T98489) [17:52:47] given the volume of errors coming from group0, I'd hate to see what happens if they aren't fixed before we push wmf8 to group 1 & 2 [17:52:47] <_joe_> bd808: I'd wait until tomorrow to merge it, after I'm sure the errors disappeared from the canaries [17:53:02] <_joe_> twentyafterfour: ahah, hell would break loose [17:53:05] yeah [17:53:14] operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1318278 (faidon) p:Triage>Unbreak! [17:53:17] _joe_: *nod* and thanks for catching the wrong naming there [17:53:41] <_joe_> bd808: eh, no problem. 
I know that you're a manager now [17:53:44] * _joe_ ducks [17:54:06] operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1317078 (faidon) This happened on a reinstall I did today as well, for baham. There was a swap LV, occupying all of the available space on the VG (420G). This needs to be fixed ASAP... [17:54:10] (PS4) Ottomata: add varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh) [17:55:11] <_joe_> paravoid: are you already looking into this? [17:55:21] <_joe_> (the partman recipe?) [17:55:26] <_joe_> oh god how late it is [17:55:26] any labs people up? [17:55:33] <_joe_> bbl [17:55:59] something went wrong there with the DNS work [17:56:04] users can't login [17:56:27] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [17:56:54] * matanya is users [17:58:01] James_F, wmf7 = https://gerrit.wikimedia.org/r/214395 + wmf8 = https://gerrit.wikimedia.org/r/214396 + do we want wmf6 as well? [17:58:07] confirmed can't connect to labs [17:58:11] alex@alex-laptop:~/Development/MediaWiki (wmf/1.26wmf8)$ ssh tools-login.wmflabs.org [17:58:11] ssh: Could not resolve hostname tools-login.wmflabs.org: Name or service not known [17:59:10] YuviPanda, [17:59:21] andrewbogott, [17:59:40] umm [17:59:41] [Labs-l] Possible mild puppet and wikitech breakages tomorrow [17:59:42] ? [17:59:50] when will wmf treat labs as a project and resource it as it deserves ... 
:/ [17:59:55] oddly enough, I'm still connected over mosh [18:00:00] "What definitely won't break: [18:00:02] so it's probably just dns [18:00:04] - Anything that a toollabs user would notice or care about [18:00:13] " [18:00:20] (PS5) Ori.livneh: add varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/214147 [18:00:25] [neon:~] $ host labs-ns0.wikimedia.org [18:00:25] labs-ns0.wikimedia.org has address 208.80.152.33 [18:00:32] not being able to connect to tools-login because dns is down is... pretty noticeable [18:00:33] @sphinx:~$ host labs-ns0.wikimedia.org [18:00:33] labs-ns0.wikimedia.org has address 208.80.154.19 [18:00:53] hold on I just saw something about labs-ns [18:00:53] godog: yt? [18:00:56] and then there is https://phabricator.wikimedia.org/T100665 but ?? [18:01:14] yes that was what I saw earlier [18:01:20] (CR) Ori.livneh: [C: 1] "Updated to use a buffer with a max size of 1472 bytes to avoid fragmentation. Ottomata's changes look good to me." [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh) [18:01:27] now is not a great time to go afk andrew.. [18:01:50] ori: i'm looking at adapting my host based reqstats now too [18:02:05] but, that is slightly more annoying using diamond and varnishlog, instead of just varnishncsa [18:02:08] ottomata: awesome work with varnishlog.py, just reviewing that [18:02:11] thanks! [18:02:21] using varnishncsa, i get the values already grouped by request [18:02:34] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.152.33) [18:02:41] using varnishlog.py, i will have to count totals based on...what, ReqEnd? [18:03:17] i think i can do it, instead of using an iterator i just augment my shared_dict one entry at a time, rather than a whole request with multiple status entries (method, status, etc.) 
at a time [18:03:37] ottomata: wow you added unit tests [18:04:08] ottomata: you will have to count totals based on ReqEnd where remote_party == 'client' [18:04:10] Krenair: i wonder if this is related https://gerrit.wikimedia.org/r/#/c/213543/12/manifests/role/dns.pp [18:04:20] twentyafterfour, jynus, kart_: Anyone deploying anything? If not I'd like to sync a VE change out [18:04:29] Krenair, not me [18:04:44] * twentyafterfour isn't deploying anything [18:04:55] hm, well, ori, probably to do this properly i will specify -c [18:04:57] (diconnecting from tin) [18:04:58] so all will be client [18:05:17] probably shouldn't change the counting logic for that, just filter if that is what is wanted [18:05:18] right [18:05:21] yeah [18:05:22] * twentyafterfour follows jynus' lead [18:05:37] although, i shoudl probably put 'client' or 'backend' in the metric key [18:05:40] (03CR) 10Ori.livneh: "very nice work, kudos" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata) [18:05:46] kart_'s window was over an hour ago so I'll assume he just forgot to log out of tin [18:06:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [18:06:58] (03PS8) 10Ottomata: Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 [18:13:19] 6operations, 6Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318330 (10BBlack) The Nameserver IPs that the registrar stores are separate from the ones you're looking up in e.g. dig or whatever. They're part of the whois system. In whois, they have t... 
[18:13:34] (03CR) 10Ori.livneh: [C: 031] Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata) [18:14:15] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:17:01] !log krenair Synchronized php-1.26wmf7/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/214395/ (duration: 00m 14s) [18:17:04] Logged the message, Master [18:19:08] Krenair: ping once you're done please [18:19:24] ok [18:19:29] hm, ori, i won't really be able to count invalids like i was [18:19:33] at least not very easily [18:19:33] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.263 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [18:19:40] ottomata: what do you mean by invalids? [18:19:41] since each callback I only have a single entry [18:19:53] well, sometimes method or status is malformed [18:20:02] !log krenair Synchronized php-1.26wmf8/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/214396/ (duration: 00m 13s) [18:20:06] Logged the message, Master [18:20:08] ok, folks using salt for the next while, try -b 20 and use wildcards ( * ) rather than grains for a little while and see how that goes [18:20:13] and, when using varnishncsa, i got both method and status in the same iteration [18:20:15] so i could say [18:20:24] if either method or status is not valid, then invalid++ [18:20:35] but, now, i only know method or status at any given time [18:20:43] i'd have to track transaction_ids like you do [18:20:53] ottomata: varnishncsa is itself a client of varnishapi, so everything it does, you can do from python. you could just do something like: if tid != last_tid: call_callback_with_all_records_of_request(records); last_tid = tid; else: records.append(record) [18:21:06] or something like that [18:21:09] you get the idea [18:21:14] https://phabricator.wikimedia.org/T83580#1301375 [18:21:31] yeah, sure, but are the tids all in order? 
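(Editorial sketch, not from the channel.) The grouping idea ori describes above can be written so that it does not depend on the tids arriving in order, which is the concern ottomata raises. This is a minimal sketch under stated assumptions: the `(tid, tag, value)` record shape, the function names, and the use of the Varnish 3 tag names `RxRequest`, `TxStatus` and `ReqEnd` for a client-side request are illustrative choices, not the interface of the varnishlog Python module under review.

```python
from collections import defaultdict

# Hypothetical set of methods treated as well-formed; adjust as needed.
VALID_METHODS = {"GET", "HEAD", "POST", "PUT", "DELETE", "OPTIONS", "PURGE"}


def group_by_transaction(records, on_request):
    """Buffer log entries per transaction id and flush on ReqEnd.

    `records` is an iterable of (tid, tag, value) tuples (assumed shape).
    Entries from different transactions may be interleaved; each one is
    buffered under its own tid until that tid's ReqEnd entry is seen.
    Returns whatever is still buffered (incomplete transactions).
    """
    pending = defaultdict(list)
    for tid, tag, value in records:
        pending[tid].append((tag, value))
        if tag == "ReqEnd":
            # The request is complete: hand over all of its entries at once.
            on_request(tid, pending.pop(tid))
    return dict(pending)


def is_valid_request(entries):
    """Return True if both the method and the status look well-formed."""
    fields = dict(entries)
    method = fields.get("RxRequest")
    status = fields.get("TxStatus", "")
    return method in VALID_METHODS and len(status) == 3 and status.isdigit()
```

With whole requests delivered to the callback, the "if either method or status is not valid, then invalid++" check becomes a single call to `is_valid_request` per request. The cost is that the buffer grows if ReqEnd entries never arrive, so a real collector would likely need to expire old tids.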
[18:21:53] i'm not saying it can't be done, i'm just saying it is more complicated than it was ;) [18:22:12] i'm not so sure it isn't better to just use varnishncsa [18:22:47] !log krenair Synchronized php-1.26wmf6/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/214397/ - in case we have to go back to wmf6 again for whatever reason (duration: 00m 15s) [18:22:50] Logged the message, Master [18:22:53] (wmf7 and wmf8 work, wmf6 is obviously not testable or particularly useful at the moment) [18:22:55] legoktm, all your's [18:23:13] ottomata: that's fine as well [18:23:17] 6operations, 6Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318348 (10Dzahn) but that was output from the "whois" command: Name Server:LABS-NS0.WIKIMEDIA.ORG Name Server:LABS-NS1.WIKIMEDIA.ORG [18:23:22] thanks [18:23:44] twentyafterfour: logging tasks for the new errors you've seen in the wmf8 deploy? [18:24:26] ottomata: i should abandon https://gerrit.wikimedia.org/r/#/c/213293/ , right? your patch supercedes it [18:24:29] !log legoktm Synchronized php-1.26wmf8/extensions/GlobalCssJs/: Explicitly define module position (duration: 00m 13s) [18:24:32] Logged the message, Master [18:24:40] yes, but mine doesn't have an iterator [18:25:22] (03Abandoned) 10Ori.livneh: varnish: add Python library for iterating on log records [puppet] - 10https://gerrit.wikimedia.org/r/213293 (owner: 10Ori.livneh) [18:25:47] 6operations, 6Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318367 (10BBlack) Whois (well really the registrar data that's sent to the TLD origin servers, but that's what whois reflects) also stores IP addresses for those hosts, independently of howe... 
[18:26:20] meh i think we don't need to count invalids, there are few enogh of them [18:26:22] ottomata: that's ok; the iterator implementation in my patch wasn't scalable anyway [18:26:23] we can reserve that for v2 [18:26:40] aye cool [18:26:52] /* [18:26:52] Problematic modules: {"ext.globalCssJs.user.styles":"missing"} [18:26:52] */ [18:26:53] hmm [18:27:06] ottomata / godog: are you up for merging the varnishlog and varnishstatsd patches? [18:27:16] they are not provisioned anywhere yet [18:27:31] oh because meta [18:27:35] and i tested them just now [18:27:44] 6operations, 6Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318383 (10BBlack) Heh - the last comment is incorrect! I re-ran the commands and pasted quickly without looking. The old bad data that was there earlier today was this: ``` bblack-mba:~ b... [18:29:10] 6operations, 6Labs: Change name servers for .wmflabs.org with our registrar - https://phabricator.wikimedia.org/T100665#1318405 (10Andrew) 5Open>3Resolved Yes, registrar emailed directly an hour ago and said it was done. [18:30:16] ori: might as well leave them at the moment while I code up the diamond thing now [18:30:21] might cause more patches, who knows [18:30:31] or, if you want to go ahead and get yours out there [18:30:33] go ahead :) [18:30:38] greg-g: parsoid would like to do a deploy, since we had to skip ours yesterday. [18:30:42] (03CR) 10Jcrespo: [C: 04-1] "We need to rebase this change, do not apply." [puppet] - 10https://gerrit.wikimedia.org/r/212302 (https://phabricator.wikimedia.org/T92693) (owner: 10Jcrespo) [18:30:46] maybe mutante is a a better person to ask? [18:31:01] looks like we're in a lull between content translation and wikidata at the moment, if i'm translating my time zones correctly. 
[18:31:05] ottomata: nah, i can wait [18:32:15] !log legoktm Synchronized php-1.26wmf7/extensions/GlobalCssJs/: Explicitly define module position (duration: 00m 12s) [18:32:20] Logged the message, Master [18:32:37] legoktm: are you swatting? [18:34:54] cscott: backporting stuff yes [18:35:16] legoktm: we'd like to do a parsoid deploy. is now a good time, or should we wait until you are done? [18:35:26] I need a few more minutes [18:35:32] hm, ori backend requests don't have ReqEnds? [18:36:13] legoktm: ok, let us know when you're ready for us. [18:36:22] ottomata: possibly not, i can't remember. it's tricky. i used 'varnishlog' a lot to reverse-engineer this. [18:36:29] we're trying to squeeze in before wikidata at 1pm PDT (if i'm reading the calendar right) [18:36:44] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL - Plugin timed out while executing system call [18:37:07] yeah [18:37:45] hm, maybe my thing should only workon client requests always? [18:38:23] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.106 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [18:39:28] cscott: ok, I'm done [18:39:39] YuviPanda: if I want to add a hiera setting for every labs host, where do I do that? [18:39:57] legoktm: ok, thanks! [18:43:12] bblack, you there? [18:43:31] _joe_: same question? [18:43:46] <_joe_> andrewbogott: which key? [18:44:22] andrewbogott: hieradata/labs.yaml, I'd think? [18:44:39] valhallasw: you’re totally right, that’s for sure where it goes [18:44:40] <_joe_> valhallasw: you'd think correctly [18:44:43] I was looking in the wrong dir :) [18:44:44] thanks [18:45:16] (03CR) 10Faidon Liambotis: "Ignore the above comment, was for a different patchset." 
[puppet] - 10https://gerrit.wikimedia.org/r/214360 (owner: 10Faidon Liambotis) [18:45:27] (03CR) 10Faidon Liambotis: [C: 04-1] "Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/214377 (owner: 10Alexandros Kosiaris) [18:45:30] ottomata: yes, but busy, what? [18:46:37] was going to ask you a couple of qs about varnish backend/frontends and varnishlog [18:46:41] but, s'ok if you are busy [18:46:52] (03PS7) 10Andrew Bogott: Use the new service names for labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/214105 [18:46:54] (03PS1) 10Andrew Bogott: Define labs_puppet_master for labs hiera [puppet] - 10https://gerrit.wikimedia.org/r/214405 [18:47:15] (03CR) 10Andrew Bogott: [C: 032] Define labs_puppet_master for labs hiera [puppet] - 10https://gerrit.wikimedia.org/r/214405 (owner: 10Andrew Bogott) [18:47:53] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL - Plugin timed out while executing system call [18:49:16] cscott: uh, ping me when you're done. I have more stuff to backport :/ [18:49:33] ok, working on the parsoid deploy. shouldn't take too long. [18:50:34] greg-g: yes I always put in tasks for them [18:53:38] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: Fix ipv6 autoconf issues - https://phabricator.wikimedia.org/T94417#1318487 (10BBlack) 5Open>3Resolved The token approach was deployed for all jessie/trusty nodes with add_ip6_mapped. [18:53:39] 6operations, 5Interdatacenter-IPsec: IPsec: roll-out plan - https://phabricator.wikimedia.org/T92604#1318491 (10BBlack) [18:53:57] woo [18:54:32] 6operations, 7network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1318494 (10BBlack) 3NEW [18:54:32] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.103 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.155.135 [18:57:43] (03PS1) 10ArielGlenn: salt minions: prefer 2048 bit key size [puppet] - 10https://gerrit.wikimedia.org/r/214409 [18:59:05] (03CR) 10ArielGlenn: [C: 032] salt minions: prefer 2048 bit key size [puppet] - 10https://gerrit.wikimedia.org/r/214409 (owner: 10ArielGlenn) [19:01:21] "(Cannot access the database: Can't connect to MySQL server on '10.64.48.15' (4) (10.64.48.15))" [19:01:24] (ENWP) [19:01:31] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.256 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [19:01:41] RECOVERY - Host labs-ns0.wikimedia.org is UPING OK - Packet loss = 0%, RTA = 3.21 ms [19:01:58] abartov: what were you doing? [19:02:09] "Edit source" on [[Ann Bannon]] [19:02:53] (is it still useful to paste these here when they happen, or is Ops infrastructure these days automated to the point it's redundant? [19:02:56] ) [19:03:19] (refresh worked, btw) [19:03:26] legoktm: hm, we're seeing some unexpected behavior of parsoid on beta. [19:03:46] legoktm: why don't you do another backport while we investigate that. [19:03:47] (03PS8) 10Andrew Bogott: Use the new service names for labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/214105 [19:03:49] (03PS1) 10Andrew Bogott: Removed use of labs_ldap_dns_host_secondary in 'local-address'. [puppet] - 10https://gerrit.wikimedia.org/r/214412 [19:04:20] cscott: ok, doing [19:05:53] !log legoktm Synchronized php-1.26wmf8/extensions/Gadgets/: Explicitly define module position (duration: 00m 14s) [19:05:57] Logged the message, Master [19:06:03] (03CR) 10Andrew Bogott: [C: 032] Removed use of labs_ldap_dns_host_secondary in 'local-address'. [puppet] - 10https://gerrit.wikimedia.org/r/214412 (owner: 10Andrew Bogott) [19:06:14] legoktm: ok, i think we're ready to go again, if you can give us a slot. 
[19:07:13] cscott: I'm done :) [19:08:31] (03PS3) 10Ottomata: Add diamond collector_module class [puppet] - 10https://gerrit.wikimedia.org/r/214053 (https://phabricator.wikimedia.org/T83580) [19:08:57] cscott: subbu: sure [19:09:04] 6operations, 10Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1318539 (10hashar) >>! In T98003#1318213, @faidon wrote: > Finally, there is also the matter of the Jenkins authdns-lint jobs — we should run tests on jessie, not Ubuntu, both to avoid building 2.2 for Ub... [19:09:04] twentyafterfour: :) [19:09:46] (03CR) 10Ottomata: [C: 032] Add diamond collector_module class [puppet] - 10https://gerrit.wikimedia.org/r/214053 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [19:13:18] twentyafterfour: GlobalCssJs/Gadgets are no longer spamming, new offenders are CodeReview and GeSHi [19:13:19] (03CR) 10Hashar: "My commit summary "Gerrit admins" refer to the admin shell access group "gerrit-admin". So that is solely to give us shell access on the " [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: 10Hashar) [19:13:23] (03PS1) 10Ottomata: Fix dependency after collector module change [puppet] - 10https://gerrit.wikimedia.org/r/214414 [19:13:23] legoktm: ok, thanks. [19:13:23] greg-g: starting parsoid deploy. [19:13:23] (03CR) 10Ottomata: [C: 032 V: 032] Fix dependency after collector module change [puppet] - 10https://gerrit.wikimedia.org/r/214414 (owner: 10Ottomata) [19:14:33] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1318550 (10hashar) Some more discussion happening on https://gerrit.wikimedia.org/r/#/c/214255/ . I have po... 
[19:15:25] PROBLEM - puppet last run on mw1032 is CRITICAL puppet fail [19:15:30] (03PS9) 10Ottomata: Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 [19:15:31] PROBLEM - puppet last run on mw1148 is CRITICAL puppet fail [19:15:31] PROBLEM - puppet last run on mw1243 is CRITICAL puppet fail [19:15:31] PROBLEM - puppet last run on mw1186 is CRITICAL puppet fail [19:15:40] PROBLEM - puppet last run on mw2026 is CRITICAL puppet fail [19:15:41] PROBLEM - puppet last run on mw2140 is CRITICAL puppet fail [19:15:41] PROBLEM - puppet last run on mw2193 is CRITICAL puppet fail [19:15:50] PROBLEM - puppet last run on mw2204 is CRITICAL puppet fail [19:15:51] PROBLEM - puppet last run on mw2155 is CRITICAL puppet fail [19:16:00] PROBLEM - puppet last run on mw1121 is CRITICAL puppet fail [19:16:00] PROBLEM - puppet last run on mw1185 is CRITICAL puppet fail [19:16:01] PROBLEM - puppet last run on mw2116 is CRITICAL puppet fail [19:16:01] PROBLEM - puppet last run on mw2068 is CRITICAL puppet fail [19:16:01] PROBLEM - puppet last run on mw2071 is CRITICAL puppet fail [19:16:01] PROBLEM - puppet last run on mw1239 is CRITICAL puppet fail [19:16:11] PROBLEM - puppet last run on mw2209 is CRITICAL puppet fail [19:16:11] PROBLEM - puppet last run on mw2186 is CRITICAL puppet fail [19:16:11] PROBLEM - puppet last run on mw1024 is CRITICAL puppet fail [19:16:20] PROBLEM - puppet last run on mw1201 is CRITICAL puppet fail [19:16:20] PROBLEM - puppet last run on mw2044 is CRITICAL puppet fail [19:16:21] PROBLEM - puppet last run on mw2159 is CRITICAL puppet fail [19:16:21] PROBLEM - puppet last run on mw2153 is CRITICAL puppet fail [19:16:30] PROBLEM - puppet last run on mw1077 is CRITICAL puppet fail [19:16:30] PROBLEM - puppet last run on mw1108 is CRITICAL puppet fail [19:16:30] PROBLEM - puppet last run on mw1091 is CRITICAL puppet fail [19:16:31] PROBLEM - puppet last run on mw1152 is CRITICAL puppet fail [19:16:31] PROBLEM - puppet last run on 
mw1043 is CRITICAL puppet fail [19:16:31] PROBLEM - puppet last run on mw2031 is CRITICAL puppet fail [19:16:31] PROBLEM - puppet last run on mw2020 is CRITICAL puppet fail [19:16:32] PROBLEM - puppet last run on mw2009 is CRITICAL puppet fail [19:16:32] PROBLEM - puppet last run on mw1122 is CRITICAL puppet fail [19:16:33] PROBLEM - puppet last run on mw1167 is CRITICAL puppet fail [19:16:33] PROBLEM - puppet last run on mw2205 is CRITICAL puppet fail [19:16:40] PROBLEM - puppet last run on mw1209 is CRITICAL puppet fail [19:16:41] PROBLEM - puppet last run on mw1033 is CRITICAL puppet fail [19:16:50] is possibly me? i made a puppet change, checking... [19:16:50] PROBLEM - puppet last run on mw2157 is CRITICAL puppet fail [19:16:50] PROBLEM - puppet last run on mw1093 is CRITICAL puppet fail [19:16:51] PROBLEM - puppet last run on mw1225 is CRITICAL puppet fail [19:16:51] ohai puppet [19:17:00] PROBLEM - puppet last run on mw1229 is CRITICAL puppet fail [19:17:00] RECOVERY - puppet last run on mw2201 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:17:01] PROBLEM - puppet last run on mw2115 is CRITICAL puppet fail [19:17:01] PROBLEM - puppet last run on mw2091 is CRITICAL puppet fail [19:17:01] PROBLEM - puppet last run on mw2148 is CRITICAL puppet fail [19:17:01] PROBLEM - puppet last run on mw2154 is CRITICAL puppet fail [19:17:05] I just checked one of those by hand (mw2201) and it ran fine [19:17:10] Although slow [19:17:10] PROBLEM - puppet last run on mw2175 is CRITICAL puppet fail [19:17:11] PROBLEM - puppet last run on mw1105 is CRITICAL puppet fail [19:17:12] PROBLEM - puppet last run on mw2185 is CRITICAL puppet fail [19:17:20] PROBLEM - puppet last run on mw1219 is CRITICAL puppet fail [19:17:21] PROBLEM - puppet last run on mw2064 is CRITICAL puppet fail [19:17:40] PROBLEM - puppet last run on mw1223 is CRITICAL puppet fail [19:17:41] PROBLEM - puppet last run on mw1022 is CRITICAL puppet fail [19:17:41] PROBLEM - puppet 
last run on mw1139 is CRITICAL puppet fail [19:17:41] PROBLEM - puppet last run on mw2141 is CRITICAL puppet fail [19:17:50] PROBLEM - puppet last run on mw2188 is CRITICAL puppet fail [19:17:51] PROBLEM - puppet last run on mw1231 is CRITICAL puppet fail [19:17:51] PROBLEM - puppet last run on mw2075 is CRITICAL puppet fail [19:18:01] PROBLEM - puppet last run on mw2019 is CRITICAL puppet fail [19:18:11] PROBLEM - puppet last run on mw1086 is CRITICAL puppet fail [19:18:11] PROBLEM - puppet last run on mw2010 is CRITICAL puppet fail [19:18:20] PROBLEM - puppet last run on mw2180 is CRITICAL puppet fail [19:18:25] ja mw1148 ran fine too [19:18:31] PROBLEM - puppet last run on mw1142 is CRITICAL puppet fail [19:18:31] PROBLEM - puppet last run on mw1027 is CRITICAL puppet fail [19:18:34] (03PS22) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [19:18:41] PROBLEM - puppet last run on mw2069 is CRITICAL puppet fail [19:18:51] RECOVERY - puppet last run on mw1148 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [19:19:08] brief mysql connection spike on db1049.eqiad.wmnet [19:19:10] 6operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1318558 (10ArielGlenn) batch.py is updated and with the updates, running salt -b 20 cmd.run blah seems to be ok. This goes for test.ping, uptime as well as other commands. yu may need to add a timeout if commands will take... [19:19:14] (03CR) 10jenkins-bot: [V: 04-1] Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [19:21:53] 6operations: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1318562 (10Dzahn) We also have doc.mediawiki.org, so it would have to be that as well. 
[19:22:12] (03CR) 10Ottomata: "I refactored this to use the new python varnishlog interface, rather than varnishncsa." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [19:22:56] RECOVERY - puppet last run on mw2141 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:27:30] (03PS1) 10Dzahn: add "docs" as alias for "doc" in wm.org/mw.org [dns] - 10https://gerrit.wikimedia.org/r/214416 (https://phabricator.wikimedia.org/T100349) [19:28:22] (03PS2) 10Dzahn: access: add dbrant to researchers [puppet] - 10https://gerrit.wikimedia.org/r/213970 (https://phabricator.wikimedia.org/T99798) [19:28:23] 6operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1318588 (10ArielGlenn) Note that the load on the puppet master is 15 right now and that's all puppet. It doesn't leave much cpu for salt given that the box is 2 quad cores. We should really consider moving salt off to a sepa... 
[19:31:15] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [19:31:22] (03PS2) 10Dzahn: add "docs" as alias for "doc" in wm.org/mw.org [dns] - 10https://gerrit.wikimedia.org/r/214416 (https://phabricator.wikimedia.org/T100349) [19:31:25] RECOVERY - puppet last run on mw1239 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:31:36] RECOVERY - puppet last run on mw1210 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:31:46] RECOVERY - puppet last run on rcs1002 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:31:46] RECOVERY - puppet last run on mw2038 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:32:06] RECOVERY - puppet last run on mw1248 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:32:07] RECOVERY - puppet last run on mw2193 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:32:15] RECOVERY - puppet last run on mw1032 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:32:16] RECOVERY - puppet last run on mw1023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:16] RECOVERY - puppet last run on mw1212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:35] RECOVERY - puppet last run on mw2068 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [19:32:35] RECOVERY - puppet last run on mw1043 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:32:36] RECOVERY - puppet last run on mw2140 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:32:45] RECOVERY - puppet last run on mw1022 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:32:46] RECOVERY - puppet last run on mw2139 is OK Puppet is currently enabled, last run 14 seconds ago with 0 
failures [19:32:46] RECOVERY - puppet last run on mw2167 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:32:46] RECOVERY - puppet last run on mw2116 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:32:55] RECOVERY - puppet last run on mw2186 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:33:05] RECOVERY - puppet last run on mw1053 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:33:05] RECOVERY - puppet last run on mw2155 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:33:06] RECOVERY - puppet last run on mw2187 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:33:06] RECOVERY - puppet last run on mw1121 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:33:06] RECOVERY - puppet last run on mw1033 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:33:06] RECOVERY - puppet last run on mw2026 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:33:15] RECOVERY - puppet last run on mw1105 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:33:16] RECOVERY - puppet last run on mw1243 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:33:17] RECOVERY - puppet last run on mw1225 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:33:26] RECOVERY - puppet last run on mw2209 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:33:35] RECOVERY - puppet last run on mw1229 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:33:36] RECOVERY - puppet last run on mw2058 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:33:36] RECOVERY - puppet last run on mw2091 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:33:36] RECOVERY - puppet 
last run on mw2175 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:33:47] !log temp. stopped icinga-wm [19:33:51] Logged the message, Master [19:34:07] snicker [19:36:21] ottomata: I’m pretty sure your patch broken puppet on a zillion labs instances. Duplicate declaration: Class[Diamond::Collector_module] is already declared in file /etc/puppet/modules/diamond/manifests/collector.pp:76; cannot redeclare at /etc/puppet/modules/diamond/manifests/collector.pp:76 [19:37:53] !!!! [19:38:02] uh oh. but it is a class [19:38:06] ACKNOWLEDGEMENT - IPsec on cp1008 is CRITICAL: Strongswan CRITICAL - 0 ESP transports installed, 2 problems (not-connected: 2) daniel_zahn test machine [19:38:10] that would be me if it is broken for sure. [19:38:20] for example tools-exec-1401 [19:38:55] hmmmmm oh right. right right right, dummy me [19:38:56] got it [19:41:24] (03CR) 10Tim Landscheidt: "This seems to cause errors:" [puppet] - 10https://gerrit.wikimedia.org/r/214053 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [19:41:34] yeha, my fault [19:41:37] on it [19:42:15] (03PS1) 10Dzahn: redirect "docs" to doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/214418 (https://phabricator.wikimedia.org/T100349) [19:42:36] (03PS1) 10Aude: Enable usage tracking on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214419 (https://phabricator.wikimedia.org/T98248) [19:44:37] (03PS2) 10Dzahn: redirect "docs" to doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/214418 (https://phabricator.wikimedia.org/T100349) [19:44:51] can someone help us inspect outgoing network traffic on wtp1001? [19:45:05] ew did a canary restart on wtp1001 after deploy and ew see a high network traffic in ganglia from that node. [19:45:18] we are trying to figure out what is going on there. 
[19:45:29] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=wtp1001.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1432841580&g=network_report&z=large&c=Parsoid%20eqiad [19:45:34] if you want to play along at home [19:45:37] (03PS1) 10Ottomata: Reverting some changes that broke diamond in labs [puppet] - 10https://gerrit.wikimedia.org/r/214420 [19:45:51] outgoing traffic jumped 5x. [19:46:03] (03PS1) 10Dzahn: retab redirects.dat [puppet] - 10https://gerrit.wikimedia.org/r/214421 [19:46:23] (03CR) 10Ottomata: [C: 032 V: 032] Reverting some changes that broke diamond in labs [puppet] - 10https://gerrit.wikimedia.org/r/214420 (owner: 10Ottomata) [19:46:32] bblack, andrewbogott YuviPanda ^^^ above [19:46:37] `sudo tcpdump -i eth0 -w somefilename` would be a good start, but neither subbu nor I have sudo permissions on wtp1001.eqiad [19:47:07] cscott: doing that [19:47:25] thanks [19:47:41] cscott: subbu: it's taking to graphite1001 a LOT [19:47:43] talking [19:47:58] 19:47:49.782902 IP wtp1001.eqiad.wmnet.37072 > graphite1001.eqiad.wmnet.8125: UDP, length 29 [19:47:59] oh. [19:48:03] a lot of that [19:48:07] Is that error/warning logging? [19:48:10] no. stats [19:48:19] andrewbogott: can you run puppet on a tool labs? [19:48:25] tell me if it is fixed [19:48:25] arlolra, maybe something broke in the ordering. [19:48:32] ottomata: yep, one second [19:48:56] mutante: any way you can give us a TCP/UDP breakdown? [19:49:23] mutante: if TCP traffic is at the expected pre-deploy level, then we can probably safely say the issue is excessive logging (which goes via UDP) [19:49:55] cscott: i think i can say it from just looking at the screen , kind of, it's all UDP that scrolls by [19:50:08] cscott, this is statsd mind you .. but, probably it uses udp as well. [19:50:15] ottomata: seems better. 
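(Editorial sketch, not from the channel.) The TCP/UDP breakdown cscott asks for above can be approximated from tcpdump's text output without opening a capture in a GUI. This is a rough sketch that assumes the default one-line-per-packet format shown in the paste above (`...: UDP, length 29`) and assumes TCP lines carry a `Flags [...]` field; the function names are made up for illustration.

```python
import re
from collections import Counter

# tcpdump prints the payload size as a trailing "length N" field.
LENGTH_RE = re.compile(r"length (\d+)")


def classify(line):
    """Guess the transport protocol of one tcpdump text line."""
    if ": UDP," in line:
        return "UDP"
    if ": Flags [" in line:
        return "TCP"
    return "other"


def breakdown(lines):
    """Count packets and payload bytes per protocol.

    Returns two Counters keyed by "UDP", "TCP" or "other":
    the packet counts and the summed "length" values.
    """
    packets = Counter()
    payload = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        proto = classify(line)
        packets[proto] += 1
        matches = LENGTH_RE.findall(line)
        if matches:
            # Use the last "length" on the line, which is the payload size.
            payload[proto] += int(matches[-1])
    return packets, payload
```

Feeding it the output of a text-mode capture (for example, lines read from standard input) gives per-protocol packet and byte totals that can be compared between a host running the new code and one running the old code. The byte totals count payload only, not headers, so they will be lower than what an interface graph reports.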
[19:50:25] phew ok, i need to fix that in a better way [19:50:38] need to decoule installation of diamond python modules, and their config files [19:53:14] mutante, just for sanity's sake .. what do you see on wpt1003? /cc cscott arlolra [19:53:27] that is still running old code. [19:55:34] (03CR) 10Giuseppe Lavagetto: [C: 031] "Seems sane. Since this is already used in a few places apparently, I'd test this in labs maybe." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/212784 (https://phabricator.wikimedia.org/T99833) (owner: 10Yuvipanda) [19:55:43] subbu: also a lot of UDP to graphite1001 .. hrmm [19:56:32] cscott, arlolra ^ [19:57:46] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [19:58:58] mutante, ok .. we didn't proceed with the deploy and restarted wtp1001 with old code. [19:59:38] subbu: ok, and this confirms it https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=wtp1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Parsoid+eqiad [19:59:40] well, the difference between "a lot" of stats UDP traffic on wtp1003 and "a crapload" of stats UDP traffic on wtp1001 might not be immediatey obvious on console. [20:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150528T2000). [20:00:14] mutante: yes, i just wish that graph split out UDP and TCP as separate numbers. [20:00:40] cscott: are you done deploying? [20:01:18] aude: yes, we ended up reverting. i'm in the middle of restarting all the parsoid servers to be safe (in case they picked up the deploy code in the window it was checked out) [20:01:29] cscott: ok :/ [20:01:33] aude: the restart should finish in a few minutes. 
[20:01:44] ok [20:01:53] cscott: subbu: probably too late now because reverted but i timed how long it takes to capture 100 packets "udp port 8125" [20:01:58] about 8 seconds [20:02:00] my thing will take 1-2 minutes [20:02:41] mutante, before the revert? [20:03:01] no, after and like 7 seconds on wtp1003 [20:03:10] ok. [20:03:26] !log updated Parsoid to version 497da30e ; canary restart of wtp1001; observed network TX spike (possibly UDP, possibly logging); reverted to 8ed6fd0b and restarted all parsoids. [20:03:29] Logged the message, Master [20:04:09] subbu: this code should be live on beta still -- do we have a ganglia looking at the beta machines? [20:04:24] that is what i was asking greg-g in #parsoid :) [20:04:31] of course, logging config could be different enough to obscure the bug, but it's possible we can still diagnosis this there. [20:05:36] when the restart finishes, it's still worth doing our usual post-deploy tests to double-check that I didn't screw things up [20:08:06] aude: ok, restart complete. the deploy window is yours. [20:08:11] thanks for your patience. [20:08:26] PROBLEM - Parsoid on wtp1016 is CRITICAL - Socket timeout after 10 seconds [20:09:10] cscott: ok, thanks [20:09:54] (03CR) 10Aude: [C: 032] Enable usage tracking on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214419 (https://phabricator.wikimedia.org/T98248) (owner: 10Aude) [20:09:55] PROBLEM - Parsoid on wtp1023 is CRITICAL - Socket timeout after 10 seconds [20:10:13] (03Merged) 10jenkins-bot: Enable usage tracking on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214419 (https://phabricator.wikimedia.org/T98248) (owner: 10Aude) [20:10:48] they usually recover .. but if not .. 
wtp1016 and wtp1023 might need service restarts [20:11:24] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable Wikibase usage tracking on Wikivoyage (duration: 00m 13s) [20:11:28] Logged the message, Master [20:11:37] done :) [20:11:44] subbu: cscott: i had made a dumpfile before the revert, it was about 200M of traffic, i loaded that in wireshark now [20:11:48] statistics: [20:11:56] 95.66% UDP [20:12:02] of packets [20:12:14] 69.4% of bytes [20:12:40] how does that compare to the current ratios [20:13:43] coming up.. [20:14:03] arlolra, cscott i found the problem :) [20:14:12] oh really [20:14:13] - }, HEAP_USAGE_SAMPLE_INTERVAL); [20:14:14] + }, parsoidConfig.headpUsageSampleInterval); [20:14:16] typo!! [20:14:24] oops [20:14:29] "headpUsageSampleInterval" vs "heapUsageSampleInterval" [20:14:48] heaped, alright [20:14:49] wtp1016 has a zombie parsoid that doesn't want to die. [20:14:53] haha :) [20:15:07] arlolra: 4.5 % / 1.06 % :p [20:15:19] after revert it's 95% TCP [20:15:21] mutante, so yes .. parsoid was spamming graphite with heap usage stats [20:15:38] all the time instead of every 5 mins :) [20:15:40] mutante: thanks [20:15:45] ok :) [20:16:21] crap man .. strong typing helps sometimes with these kind of errors .. [20:16:26] ok --> #parsoid [20:16:46] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [20:18:58] mutante: could i get you to sudo kill -9 a zombie parsoid process on wtp1023 and wtp1016? [20:19:22] mutante: on wtp1023, it's pid 21373 [20:20:11] on wtp1016 it's pid 3347 [20:20:32] both of those are resisting ordinary 'sudo service parsoid stop' [20:21:01] but i don't have permissions to 'sudo kill -9' [20:23:39] cscott: done [20:24:01] !log killed nodejs on wtp1023,wtp1016 [20:24:05] Logged the message, Master [20:24:20] mutante: thanks. 
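Editor's note on the typo found above: in JavaScript, reading a misspelled property such as `parsoidConfig.headpUsageSampleInterval` evaluates to `undefined` instead of raising an error, which is how the sampling timer ended up with no usable interval. The following is only a minimal sketch of the general remedy (fail loudly on unknown or invalid config keys), written in Python rather than Parsoid's own JavaScript. `StrictConfig` and `sample_interval_seconds` are hypothetical names, not Parsoid code, and the 300-second value is taken from the "every 5 mins" remark in the log.

```python
class StrictConfig:
    """Hypothetical config wrapper: unknown keys raise instead of yielding a silent default."""

    def __init__(self, **settings):
        self._settings = dict(settings)

    def get(self, name):
        if name not in self._settings:
            # A misspelled key ends up here rather than quietly becoming "no value".
            raise KeyError(f"unknown config key: {name!r}")
        return self._settings[name]


def sample_interval_seconds(config, key="heapUsageSampleInterval"):
    """Return a validated sampling interval, rejecting missing or non-positive values."""
    value = config.get(key)
    if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
        raise ValueError(f"{key} must be a positive number, got {value!r}")
    return value


# Assumed value: 300 seconds, i.e. the "every 5 mins" mentioned in the log.
parsoid_config = StrictConfig(heapUsageSampleInterval=300)
```

With this shape, the misspelled lookup `sample_interval_seconds(parsoid_config, key="headpUsageSampleInterval")` raises `KeyError` at startup instead of producing a timer that fires continuously.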
[20:24:33] np [20:24:56] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1476 bytes in 0.019 second response time [20:25:07] parsoid looks to be up everywhere, and back to its old self. [20:25:14] cool [20:25:15] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1476 bytes in 0.052 second response time [20:25:19] maybe one of these days we'll actually succeed with a deploy [20:28:50] (03PS1) 10Dzahn: add IPv6 for silver (wikitech web) [puppet] - 10https://gerrit.wikimedia.org/r/214430 [20:30:04] 6operations, 7network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1318722 (10Dzahn) ``` interface::add_ip6_mapped { 'main': interface => 'eth0', } ``` So if it's not eth0 don't we just change the interface parameter and that's it? [20:30:54] 6operations, 7network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1318724 (10Dzahn) or did you mean to automatically add this in base and stop doing it on individual nodes? [20:33:00] (03PS1) 10Dzahn: add IPv6 for antimony (git web,svn) [puppet] - 10https://gerrit.wikimedia.org/r/214432 [20:33:36] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:34:26] (03PS1) 10Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - 10https://gerrit.wikimedia.org/r/214434 [20:35:14] (03PS1) 10Dzahn: add IPv6 for caesium (releases) [puppet] - 10https://gerrit.wikimedia.org/r/214435 [20:38:21] (03PS1) 10Dzahn: add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 [20:38:35] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [20:39:02] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1318743 (10Ottomata) varnishlog python module here: https://gerrit.wikimedia.org/r/#/c/214261/9 diamond collector refactored to work using this. 
[20:40:33] 6operations, 7network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1318753 (10Dzahn) https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:add-IPv6,n,z https://gerrit.wikimedia.org/r/#/c/214430/ https://gerrit.wikimedi... [20:41:18] (03CR) 10Dzahn: [C: 032] Removed comma as nick separator for deployment events [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/214361 (owner: 10Dereckson) [20:47:54] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1318771 (10chasemp) Thanks for the summary @jaufrecht. In the act of reviewing teh schema it seems there were lots of edge cases @csteipp and I discovered. Old comment history we h... [20:49:36] 6operations, 10ops-requests, 5Patch-For-Review: Monitor mailman - https://phabricator.wikimedia.org/T84150#1318775 (10Dzahn) So the syntax is: --warning/critical = Total Transfers/sec,Read IO/Sec,Write IO/Sec,Bytes Read/Sec,Bytes Written/Sec and the current setting: -c 500,80,600,8000,11000 that would me... [20:52:49] (03CR) 10Andrew Bogott: [C: 032] Use the new service names for labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/214105 (owner: 10Andrew Bogott) [21:02:40] (03PS1) 10Dzahn: mailman I/O monitoring: adjust Read Requests/Sec [puppet] - 10https://gerrit.wikimedia.org/r/214486 (https://phabricator.wikimedia.org/T84150) [21:03:43] 6operations, 10ops-requests, 5Patch-For-Review: Monitor mailman - https://phabricator.wikimedia.org/T84150#1318818 (10Dzahn) nevermind that. per https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=sodium&service=mailman+I%2FO+stats it's also KB/s, i'd say the usage info is wrong. in the web ui it... 
[21:04:48] mutante: yeah I meant doing it in base or something like that [21:04:49] (03CR) 10Dzahn: [C: 032] mailman I/O monitoring: adjust Read Requests/Sec [puppet] - 10https://gerrit.wikimedia.org/r/214486 (https://phabricator.wikimedia.org/T84150) (owner: 10Dzahn) [21:05:06] and finding a way to automate the part about what the name of the primary interface is [21:05:24] bblack: ah, ok :) well. meanwhile i uploaded those patches to add it to some more that have public IPs in wikimedia.org but did not have Ipv6 yet [21:05:38] like, the releases server for example.. should be fine [21:06:05] another thing to sanitize along these lines is actually adding matching public or private IPv6 DNS for all hosts as appropriate. A lot of them are missing today. [21:06:21] eh, bad example. caesium does not have a public IP, but the others do [21:06:24] but it's not "everything in the zonefile" either, it's the actual hosts that are puppet clients and such. [21:06:50] yes, aware of that, they also need DNS changes [21:07:50] i'll add some [21:09:26] PROBLEM - puppet last run on cp3014 is CRITICAL puppet fail [21:09:36] 6operations, 10ops-requests, 5Patch-For-Review: Monitor mailman - https://phabricator.wikimedia.org/T84150#1318830 (10Dzahn) i hope it's good now, but i'll leave it open for a little while longer to remind me to check alert history in icinga in a few days. 
i did not re-enable notifications just yet [21:10:15] 6operations, 6Commons, 10MediaWiki-Database, 6Multimedia, 7Wikimedia-log-errors: internal_api_error_DBQueryError: Database query error - https://phabricator.wikimedia.org/T98706#1318831 (10Jdforrester-WMF) [21:16:39] (03PS1) 10Dzahn: add AAAA record for silver.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/214489 [21:17:15] (03CR) 10Dzahn: "related DNS change in https://gerrit.wikimedia.org/r/#/c/214489/" [puppet] - 10https://gerrit.wikimedia.org/r/214430 (owner: 10Dzahn) [21:21:26] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:21:35] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:23:03] legoktm, cscott: I'm still interested in the answer to my question about whether I should keep reporting in this way. [21:23:32] abartov: if it happens more than once yes [21:23:36] mutante: have a minute to help me sort out a puppetmaster/cert situation? 
I’m trying to rename the puppetmaster on virt1000 to ‘labs-puppetmaster-eqiad.’ That’s going fine except the client says “Server hostname 'labs-puppetmaster-eqiad.wikimedia.org' did not match server certificate; expected one of virt1000.wikimedia.org, DNS:puppet, DNS:puppet.wikimedia.org, DNS:virt1000.wikimedia.org” [21:23:43] Can’t tell if that’s a client or a master issue [21:26:15] 6operations, 10Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1318890 (10JohnLewis) 3NEW a:3RobH [21:26:26] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [21:26:27] i don't se the server cert stored under /var/lib/puppet/ssl on the client, so that sounds like a puppetmaster issue to me [21:26:41] (03PS2) 10Dzahn: add AAAA record for silver.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/214489 [21:26:47] robh: ^^ [21:27:05] yea i got the ping for the task bot ;D [21:27:13] Okay :p [21:27:26] andrewbogott: sounds like you need a new cert on the master [21:27:42] lemme see on virt1000 [21:27:48] mutante, jgage, seems likely but it’s not clear to me where those certs come from. [21:28:00] self-signed? pulled from puppet? generated by puppet? [21:28:20] andrewbogott: shouldn't the cert be revoked from puppet and then a new one made from puppet? [21:28:22] 6operations, 10Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1318907 (10RobH) [21:28:24] 6operations, 7Epic, 10Wikimedia-Mailing-lists: Rename all mailing lists with -l suffixes to get rid of that suffix - https://phabricator.wikimedia.org/T99138#1318908 (10RobH) [21:28:29] andrewbogott: created and signed by the WMF CA i think [21:28:50] I’ve been searching for a while for the logic that copies that cert over. Do you see it? [21:29:02] JohnFLewis: you aren't like going to every list asking if they want a rename right? 
cuz its not trivial, was more of a 'if they really want it we can' [21:29:14] oh, i see the one comment linked [21:29:19] robh: I dropped that when I closed the ticket about it :) [21:29:21] (still, just checkin ;) [21:29:32] i thought they were self signed from /var/lib/puppet/ssl/certs/ca.pem [21:29:34] /var/lib/puppet/server/ssl/certs/labs-puppetmaster-eqiad.wikimedia.org.pem [21:29:36] I'm just doing the processes for one that's ask ;) [21:29:37] where is this WMF CA cert? [21:29:37] yea, this is easy enough, i think next tuesday would work [21:29:59] jgage: /etc/ssl/certs [21:30:06] wmf_ca_2014_2017.pem [21:30:06] wmf-ca.pem [21:30:10] You say, /me waits for another 4 hour outage slot [21:30:53] thanks mutante [21:31:20] mutante: /var/lib/puppet/server/ssl/certs/labs-puppetmaster-eqiad.wikimedia.org.pem <- what about it? [21:31:23] That’s what I need? [21:32:27] Issuer: CN=Puppet CA: virt1000.wikimedia.org [21:32:32] JohnFLewis: well... i hope not that long [21:32:41] i also plan to fully stop and restart mailman BEFORE we do shit ;D [21:32:49] andrewbogott: that's the cert file used on virt1000 for the puppetmaster site [21:32:50] with a puppet run between the stop andstart [21:33:06] mutante: ok, but that file doesn’t exist does it? [21:33:18] andrewbogott: it does, but you need a new one [21:33:25] robh: while you're at it ask Chris to physically turn the server off and on to make sure :p [21:33:32] JohnFLewis: so a two hour window with a one hour expectation. we woudl have hit that before if the unexpected merge condition didnt happen =] [21:33:46] mutante: Sorry, I’m just getting more and more confused. 
[21:33:48] andrewbogott: DNS:puppet, DNS:puppet.wikimedia.org, DNS:virt1000.wikimedia.org [21:33:55] True true [21:33:55] andrewbogott: those names will work ^ but not others [21:34:09] I definitely don’t see /var/lib/puppet/server/ssl/certs/labs-puppetmaster-eqiad.wikimedia.org.pem on virt1000 [21:34:25] 6operations, 10Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1318923 (10RobH) [21:34:47] andrewbogott: ah, it's this instead: [21:35:05] :/var/lib/puppet/server/ssl/certs/virt1000.wikimedia.org.pem [21:35:28] the CN is virt1000 and the alt. names on it are: [21:35:35] puppet.wikimedia.org and pupppet [21:36:30] yes [21:36:33] andrewbogott: yea, you are right, the file that is reference in the Apache config doesnt exist there [21:36:47] right [21:37:00] but I also don’t see where the old vert virt1000.wikimedia.org.pem came from [21:37:11] I’m pretty sure it wasn’t just copied there by hand though. [21:37:15] So, my questions are: [21:37:21] - How did this work before? [21:37:28] - How do a make a new cert for the new server name? [21:38:25] andrewbogott: it looks like it was done manually [21:38:33] i dont see it being installed by puppet [21:38:42] ok, glad we agree on that at least :) [21:38:49] Hrmm, I want to schedule this the same way I did the last mailman outage [21:38:59] and it also then had an overlap with a mediawiki train deploy [21:39:03] So, I can install the new one manually as well. Once I have one. [21:39:05] but that week was a light week for that.... [21:39:19] so i think this is ok... [21:40:17] andrewbogott: so if we know which CA we use (i heard something like there were multiple "wmf" ones over time).. then making one is an openssl command ..something like [21:40:30] openssl req -x509 -new -nodes -key rootCA.key -days 1024 -out rootCA.pem [21:40:32] mutante: it’s definitely the 2014-2017 one. That’s the new one. 
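Editor's note: the client error quoted earlier lists the names the master's certificate is valid for, and the new service name is not among them. As a rough illustration of that check, here is a minimal Python sketch. It does exact, case-insensitive matching only (no wildcard handling), and `normalize_cert_names` and `hostname_matches_cert` are hypothetical helpers, not part of Puppet or OpenSSL. The name list is copied from the error message in the log.

```python
def normalize_cert_names(raw_names):
    """Strip an optional 'DNS:' prefix and lowercase each name."""
    names = set()
    for raw in raw_names:
        name = raw.strip()
        if name.upper().startswith("DNS:"):
            name = name[4:]
        names.add(name.lower())
    return names


def hostname_matches_cert(hostname, raw_names):
    """Exact-match check only; real TLS verification also handles wildcards."""
    return hostname.lower() in normalize_cert_names(raw_names)


# Names taken verbatim from the client error message in the log.
CERT_NAMES = [
    "virt1000.wikimedia.org",
    "DNS:puppet",
    "DNS:puppet.wikimedia.org",
    "DNS:virt1000.wikimedia.org",
]
```

Under these assumptions, `hostname_matches_cert("labs-puppetmaster-eqiad.wikimedia.org", CERT_NAMES)` is false, which is consistent with the suggestion that the master needs a certificate issued for the new name.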
[21:41:28] that command isn't right yet [21:41:39] odd how the wierd spikes are gone from https://gdash.wikimedia.org/dashboards/jobq/deploys after I98b8a4ddbc [21:41:47] * AaronSchulz isn't complaining [21:42:00] andrewbogott: vi +51 /var/lib/dpkg/info/puppetmaster-passenger.postinst [21:42:19] (03CR) 10Ori.livneh: [C: 031] "Name seems fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [21:42:30] hm nevermind that's for the webserver part [21:42:57] (03PS3) 10Ori.livneh: Set HHVM mysql connection timeout to 3s on app and api servers [puppet] - 10https://gerrit.wikimedia.org/r/214295 (https://phabricator.wikimedia.org/T98489) (owner: 10BryanDavis) [21:43:07] (03CR) 10Ori.livneh: [C: 032 V: 032] Set HHVM mysql connection timeout to 3s on app and api servers [puppet] - 10https://gerrit.wikimedia.org/r/214295 (https://phabricator.wikimedia.org/T98489) (owner: 10BryanDavis) [21:43:28] mutante: ok to merge dzahn: mailman I/O monitoring: adjust Read Requests/Sec ? [21:43:54] ori: yes, please [21:43:59] {{done}} [21:44:07] ori: I'm getting reports from nl.wikipedia that runtime is being aborted early. I suspect that the whole cookie revert stuff got done incorrectly. [21:44:24] It's a bit racy, but sometimes on nlwiki requests I see the following [21:44:27] mw.loader.moduleRegistry['mediawiki.toc'].dependencies [21:44:28] ["jquery.cookie"] [21:44:37] Where mw.loader.moduleRegistry['mediawiki.toc'].script contains mw.cookie.get references [21:44:46] resulting in a fatal .get from undefined error [21:45:01] Someone reverted the patch, but then re-reverted it with only syncing the js, withotu the dependencies? [21:45:06] It's not loading mw.cookie [21:45:16] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [21:45:18] Got it. Why ping me, though? (I don't mind, just curious if this has some connection to my work I'm not aware of) [21:45:18] I g2g unfortunately. 
Can you or someone else verify and fix? [21:45:25] Ah, ok. Yes, no problem. [21:45:25] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:45:28] legoktm: see above [21:45:28] 6operations, 3Roadmap, 7notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1318965 (10RobH) 3NEW a:3RobH [21:45:32] First user that showed up, no specifics. [21:45:33] Thx! [21:45:56] 6operations, 3Roadmap, 7notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1318974 (10RobH) [21:45:56] jgage: i thought we _are_ talking about the webserver part, so that would be it then? [21:46:06] Spotty connection and being moved around on this end. Will be better tomorrow when back home. [21:46:14] Krinkle: safe travels [21:47:29] 6operations, 3Roadmap, 7notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1318965 (10RobH) [21:47:44] 6operations, 10Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1318890 (10RobH) [21:47:46] 6operations, 3Roadmap, 7notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1318965 (10RobH) [21:48:23] hmm, looks like legoktm synced the entire resources/ dir, so it should be fine [21:48:32] yup [21:48:40] also didn't we scap since then? 
[21:48:55] when I deployed it was a no-op since everything was on wmf6 [21:49:00] yes [21:50:23] JohnFLewis: its now on the schedule =] (well, not hte wiki page yet since it doesnt have next week, but on the phab workboard for next week roadmap) [21:50:30] i'll touch mediawiki.toc and sync [21:50:33] just in case [21:50:43] so if there are other downtime mailman things we need doing, just keep appending them to the maint window task [21:50:56] robh: okay [21:51:36] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds [21:51:54] !log ori Synchronized php-1.26wmf8/resources/src/mediawiki/mediawiki.toc.js: Touching file on unconfirmed suspicion of stale cache (duration: 00m 15s) [21:52:01] Logged the message, Master [21:52:14] !log ori Synchronized php-1.26wmf7/resources/src/mediawiki/mediawiki.toc.js: Touching file on unconfirmed suspicion of stale cache (duration: 00m 16s) [21:52:21] Logged the message, Master [21:52:44] hopefully that resolves it. [21:53:15] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [21:53:40] it fixed silver! [21:53:41] (j/k) [21:56:57] what's wrong with silver now? [21:57:00] ori, hmm - I can still repro the discrepancy on enwiki but not the failures [21:57:42] mutante: isn’t it possible that virt1000.wikimedia.org.pem was generated locally with ‘puppet cert generate’? [21:57:48] Or can you tell that it was signed with our CA? [22:00:06] greg-g: You around and back to normal work? (I wanted to chat about deployment calendar, but its not pressing.) [22:03:35] andrewbogott: yes, virt1000.wikimedia.org.pem has been issued by "Puppet CA". that would be "ca_crt.pem" and that has a comment " Puppet Ruby/OpenSSL Internal Certificate" [22:03:59] ah, ok! So then I can just generate one right this second... 
[22:05:06] mutante: ok, so, now I have a /var/lib/puppet/server/ssl/certs/labs-puppetmaster-eqiad.wikimedia.org.pem [22:05:10] the client behavior is identical [22:05:19] ‘Server hostname 'labs-puppetmaster-eqiad.wikimedia.org' did not match server certificate; expected one of virt1000.wikimedia.org, DNS:puppet, DNS:puppet.wikimedia.org, DNS:virt1000.wikimedia.org’ [22:05:32] Feel like I’m missing something obvious here [22:05:55] andrewbogott: i dont see such a file on virt1000 yet [22:06:08] root@virt1000:/var/lib/puppet/server/ssl/certs# ls [22:06:08] ca.pem virt1000.wikimedia.org.pem [22:06:09] you’re right, it vanished. [22:06:11] huh [22:06:13] let me try this again [22:06:36] ok, see it now? [22:06:42] yes [22:06:55] forcing a puppet run on virt1000, let’s see if that deletes it [22:06:57] now you would have to restart apache [22:07:20] robh: what's up? [22:07:42] andrewbogott: it's gone :p [22:07:51] is the update of the wikitech deploymetns page you having to make a manual entry for everyting on the roadmap workboard? [22:07:54] yeah, puppet didn’t /say/ it was deleting it, but something clearly is [22:07:58] if so, that seems horrible for you. [22:08:23] virt1000 puppet-master[28293]: Removing file Puppet::SSL::CertificateRequest labs-puppetmaster-eqiad.wikimedia.org at '/var/lib/puppet/server/ssl/certificate_requests/labs-puppetmaster-eqiad.wikimedia.org.pem' [22:08:37] (i fear your answer is yes, cuz i recall you saying you had to generate it last week) [22:09:07] mutante: ok then. Any idea why it’s doing that? 
[22:09:16] robh: uh, not sure if I understand the question: the things in #roadmap normally come from the teams/devs themselves because they have some task they're trying to get done, either I or a team member puts it in the right column of #roadmap [22:09:32] right, but how does it get from roadmap, which is a weekly sort [22:09:33] I don’t think I even know how to tell puppet ‘delete everything in this dir but not quite everything’ [22:09:34] I copy/paste/edit the wikitech page [22:09:37] to the deployments wikipage [22:09:39] oh [22:09:45] oh, well, not all goes on there, but yeah, by hand [22:09:47] ok, so you are manually making it [22:09:48] :( [22:09:56] 6operations, 7Availability: Make redis/redisdb roles support multiple instances on the same servers - https://phabricator.wikimedia.org/T100714#1319029 (10aaron) 3NEW [22:10:01] dude, that sucks. [22:10:07] 6operations, 7Availability, 7Performance: Make redis/redisdb roles support multiple instances on the same servers - https://phabricator.wikimedia.org/T100714#1319036 (10aaron) [22:10:15] there has to be some kind of phabricator sorting workboard by date magic we can apply.... [22:10:28] i was really hoping you were running some hacked together script ;] [22:10:37] oh! mutante, I know what’s deleting that file because it’s a cron job that I wrote [22:10:38] the hardest part is enforcing distinct windows for things [22:10:43] like, a month ago [22:10:52] dang [22:10:57] andrewbogott: heh :p it said the puppet-master was doing it somehow [22:11:06] greg-g: I'm not sure we can enforce windows, since sometimes the overlap is quite intentional [22:11:14] like i start my 2 hour mailman window an hour before mediawiki train [22:11:22] since unless shit breaks, i'll be done in like 30 minutes... [22:11:27] (last time was 4 hours.... shit broke.) 
[22:11:52] but it would be nice if it then alerted the task as such (an overlap) [22:12:06] in RT you could set a start and end/due date/time [22:12:15] i wonder if phab has any such possibility. [22:12:29] well, you know what I mean, the wiki page makes overlaps clear, whereas $something_in_phab eitehr doesn't or doesn't exist ;) [22:12:29] (if we had those, we could generate a calendar output) [22:12:44] no, tasks don't have that [22:12:46] yea and all we can easily reference in phab is the subject. [22:12:54] right [22:13:42] hrmm... [22:13:48] what aobut the phab calendar app? [22:14:00] https://phabricator.wikimedia.org/calendar/ [22:14:17] robh: https://phabricator.wikimedia.org/T466 [22:14:19] every person taking a window has to make an entry and link to the task? [22:14:41] haha [22:14:45] mutante: nicely linked. [22:15:05] so yea, the calendar is in phabricator and isnt really used for shit it seems [22:15:06] I'm not sure if it actually helps [22:15:17] would show overlap of windows no? [22:15:29] and stop you having to put in a single user maintained wiki page of deployments [22:15:39] "Greg already made great points about why we should have it enabled. Additionally:" [22:15:44] "I want to test the app to see if it will be suitable for organizing swat deployments." [22:15:46] cuz right now, i have next week and a window ready, but im far too lazy to go and setup the next week template ;D [22:17:54] just add it to Upcoming on the page [22:18:06] but yes.... this is all a mess and I hate it and I want something better [22:18:11] lemme play with Phab's calendar [22:18:12] (03PS1) 10Andrew Bogott: Don't clean the puppetcert for the puppetmaster service name. 
[puppet] - 10https://gerrit.wikimedia.org/r/214499 [22:19:12] mutante: ^ [22:19:46] (03CR) 10Andrew Bogott: [C: 031] add IPv6 for silver (wikitech web) [puppet] - 10https://gerrit.wikimedia.org/r/214430 (owner: 10Dzahn) [22:20:23] (03CR) 10Andrew Bogott: [C: 031] add AAAA record for silver.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/214489 (owner: 10Dzahn) [22:20:41] (03CR) 10Andrew Bogott: [C: 032] Don't clean the puppetcert for the puppetmaster service name. [puppet] - 10https://gerrit.wikimedia.org/r/214499 (owner: 10Andrew Bogott) [22:20:46] grr, no way to make recurring events in there [22:21:37] i added mine to the cal [22:21:45] but indeed, non reoccurring is annoying [22:22:24] (03CR) 10Ori.livneh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [22:22:29] should 'roadmap' be a subsriber or invitee to any of these? [22:22:49] its not now, but if it was it would likely show on the roadmpa board right? [22:22:52] * robh tests [22:22:52] andrewbogott: ah! so if not re.match(r'^[a-zA-Z0-9_-]+\.eqiad\.wmflabs$ so it needed an exception to match itself [22:23:04] right [22:23:35] oh, workboard is task only. [22:23:35] and if hostname == socket.getfqdn(): worked before [22:23:38] got it [22:24:00] so each calendar event still has to have a manual tie to a task in its details [22:24:05] its not even a linked field =[ [22:24:39] that is sub-par [22:24:39] ok, thanks, legoktm. [22:25:38] and even in the details, the phab markupdoesnt work, cannot just put T# and have it link. [22:25:40] lame... [22:25:46] yeah, it's not made for this [22:26:07] it's not the point of the calendar app, it's to have you put in times you aren't available and it shows up as an icon next to your name on code review and tasks [22:26:26] yea [22:26:45] ok, i elect releng team to hire a dev to make a phab plugin. 
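Editor's note: the exchange above only quotes two fragments of the cleanup job (the `re.match` pattern and the `socket.getfqdn()` comparison), so the following is a guess at its shape, not the actual script. It assumes the job treats any cert whose name does not look like a labs instance as a cleanup candidate, with an explicit exemption list. `is_cleanup_candidate` and `default_protected_names` are hypothetical names.

```python
import re
import socket

# Pattern quoted in the log: names of labs instances in eqiad.
INSTANCE_NAME = re.compile(r'^[a-zA-Z0-9_-]+\.eqiad\.wmflabs$')


def default_protected_names():
    """Assumed exemptions: the master's own FQDN plus the new service name."""
    return {socket.getfqdn(), "labs-puppetmaster-eqiad.wikimedia.org"}


def is_cleanup_candidate(certname, protected_names):
    """Return True if the (assumed) cleanup job would remove this cert."""
    if certname in protected_names:
        return False
    # Assumption: anything that is not a labs instance name gets cleaned.
    return INSTANCE_NAME.match(certname) is None
```

Before the exemption was extended, only the host's own FQDN was protected, so a cert for the new service name would have matched neither rule and been removed on the next run.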
[22:26:47] ;] [22:27:05] deployment tracker plugin [22:27:20] :) [22:27:59] oh well, resume whatever you were doing before i distracted you, phab has no good solution =[ [22:31:32] (03PS1) 10Andrew Bogott: Switch labs instances to use the new puppetmaster service names [puppet] - 10https://gerrit.wikimedia.org/r/214502 [22:31:42] mutante: ^^ "phab has no good solution" they'll use it as a reason to keep Bugzilla soon! :p [22:32:04] (03PS23) 10Ori.livneh: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [22:32:28] (03CR) 10Andrew Bogott: [C: 032] Switch labs instances to use the new puppetmaster service names [puppet] - 10https://gerrit.wikimedia.org/r/214502 (owner: 10Andrew Bogott) [22:34:57] JohnFLewis: ""Basically whenever a> "Starts" or "Due" date is set in a ticket, it creates an entry in the> calendar for that user"" [22:35:33] JohnFLewis: from rt-users list :p [22:35:38] /window 9 [22:35:42] Huh? [22:35:47] gah [22:35:49] :) [22:36:36] aude, type your password next time! [22:36:46] lol [22:39:45] andrewbogott: is it possible to get newer versions of avconv to labs ? [22:40:01] when you say ‘labs’... [22:40:04] you mean tools? [22:40:15] (context: I don’t know what avconv is) [22:40:31] no, my video machine - avconv - ubuntu's fork of ffmpeg [22:40:43] my = project video [22:41:16] Can’t you just install it? [22:41:36] matanya, on your own instance you have sudo so you can do whatever [22:41:58] MaxSem: not in the repo, prefer not to build from source ... [22:41:59] unless it’s closed-source :) [22:42:26] you can just download a .deb build, can’t you? [22:42:31] Sorry, I still don’t quite understand the question [22:42:35] i guess [22:43:02] andrewbogott: https://libav.org/download.html [22:43:03] matanya: libav-tools [22:43:18] waaaah, what's wrong with wget && tar -xzf && ./configure && make && sudo make install ????? 
[22:43:39] matanya: apt-cache show ffmpeg [22:43:45] Users are advised to use avconv from [22:43:45] the libav-tools package instead of ffmpeg. [22:43:46] MaxSem: nothing wrong, i just prefer not to be a package manager [22:43:55] use the package [22:43:57] pfft :P [22:44:46] mutante: it is old [22:45:07] twentyafterfour: Do you mind if I touch some js files and re-sync wmf7 in order to rule out some things debugging Wikidata JS breakage [22:45:24] hoo: sure [22:45:38] hrm, https://packages.qa.debian.org/liba/libav.html is petrified mammoth shit [22:47:40] twentyafterfour: Thanks [22:48:27] !log hoo Synchronized php-1.26wmf7/: Touching some JS, re-syncing resource definitions to rule out causes for Wikidata JS problem. (duration: 01m 00s) [22:48:31] Logged the message, Master [23:00:04] RoanKattouw, ^d, rmoen, James_F, James_F, AaronSchulz: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150528T2300). Please do the needful. [23:02:38] Alright, let's roll [23:02:38] soo... [23:02:41] ah, ok [23:02:47] Sorry for the delay [23:03:05] (03CR) 10Catrope: [C: 032] Enable A/B test of VisualEditor for new accounts on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211697 (https://phabricator.wikimedia.org/T90666) (owner: 10Jforrester) [23:03:08] Good evening. Looking at loading up Parasoid on a load balanced and clustered setup, however, the documentation I've located isn't as clear as it needs to be for the type of environment we run. Is there anyone that can point me to a more detailed setup/configuration guide? We've got two front end web servers involved that are mirrored. I don't want to have our client inadvertently cause a conflict. 
[23:03:13] (03Merged) 10jenkins-bot: Enable A/B test of VisualEditor for new accounts on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211697 (https://phabricator.wikimedia.org/T90666) (owner: 10Jforrester) [23:03:43] halfak: ^^^ [23:03:48] BearlyDoug, #mediawiki-parsoid might be more helpful [23:03:52] Glad to see you're awake, RoanKattouw. subbu said you're one of the people I should ask this of. :) [23:04:04] !log catrope Synchronized wmf-config/InitialiseSettings.php: Enable A/B test of VE for new accounts on enwiki (duration: 00m 13s) [23:04:05] Thanks James_F [23:04:08] Logged the message, Master [23:04:13] BearlyDoug, he is in the middle of a deployment. :) [23:04:47] Krenair, subbu referred me to in here. I can be patient, haha! Just wanted to get the question in here. :) [23:04:49] (03PS1) 10Dzahn: add AAAA record for antimony [dns] - 10https://gerrit.wikimedia.org/r/214504 [23:04:57] * BearlyDoug hushes for now. :D [23:05:50] (03CR) 10Dzahn: "related DNS change in https://gerrit.wikimedia.org/r/#/c/214504/" [puppet] - 10https://gerrit.wikimedia.org/r/214432 (owner: 10Dzahn) [23:10:49] (03PS1) 10Dzahn: add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 [23:11:02] (03CR) 10Dzahn: "related DNS change at https://gerrit.wikimedia.org/r/214506" [puppet] - 10https://gerrit.wikimedia.org/r/214434 (owner: 10Dzahn) [23:11:06] (03PS2) 10Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - 10https://gerrit.wikimedia.org/r/214434 [23:15:41] (03PS1) 10Dzahn: add AAAA record for ytterbium (gerrit) [dns] - 10https://gerrit.wikimedia.org/r/214507 [23:15:52] (03CR) 10Dzahn: "related DNS change in https://gerrit.wikimedia.org/r/214507" [puppet] - 10https://gerrit.wikimedia.org/r/214437 (owner: 10Dzahn) [23:15:56] (03PS2) 10Dzahn: add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 [23:17:36] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 
minute ago with 0 failures [23:18:03] matanya: sorry, had to handle a delivery. I think that you’ll probably need to build the package yourself, or run from source. It seems weird that the upstream people don’t deliver a package, though. [23:18:36] andrewbogott: already built from source, deps hell solved [23:18:37] !log catrope Synchronized php-1.26wmf6/includes/EditPage.php: Fix regression with URL-specified edit tags (duration: 00m 13s) [23:18:41] Logged the message, Master [23:18:43] cool :) [23:18:50] !log catrope Synchronized php-1.26wmf7/includes/EditPage.php: Fix regression with URL-specified edit tags (duration: 00m 13s) [23:18:54] Logged the message, Master [23:22:36] 6operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1319195 (10Dzahn) The following servers use the raid1-lvm partman recipe: acamar|achernar|baham|cobalt|lead|lithium|polonium|rhodium argon|bast4001|copper|neon|ruthenium|subra|suhail|... [23:31:53] any roots around? nutcracker on mw1056 needs to be restarted. [23:32:12] WP:BOLD [23:32:18] bd808: hi [23:32:20] it has logged 4500 errors in the last 15 minutes [23:32:27] o/ jgage [23:32:52] bd808: done! [23:33:00] hoo: if I had the sudo rights I would have just !logged after :) [23:33:05] thanks jgage [23:33:13] !log restarted nutcracker on mw1056 due to errors, per bd808 [23:33:17] Logged the message, Master [23:33:19] bd808: Wow, they've taken that right from us :P [23:33:23] Wasn't aware of that [23:34:21] oh i see, it's one of those clever daemons [23:34:22] [2015-05-28 23:32:46.477] nc.c:189 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one [23:35:41] (03PS5) 10Andrew Bogott: Use ruby-mysql instead of libmysql-ruby. [puppet] - 10https://gerrit.wikimedia.org/r/214290 [23:36:41] (03CR) 10John F. 
Lewis: [C: 031] add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 (owner: 10Dzahn) [23:36:55] (03CR) 10Andrew Bogott: [C: 032] Use ruby-mysql instead of libmysql-ruby. [puppet] - 10https://gerrit.wikimedia.org/r/214290 (owner: 10Andrew Bogott) [23:37:31] (03CR) 10John F. Lewis: [C: 031] add IPv6 for argon (irc,mw-rc streams) [puppet] - 10https://gerrit.wikimedia.org/r/214434 (owner: 10Dzahn) [23:37:35] (03CR) 10John F. Lewis: [C: 031] add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (owner: 10Dzahn) [23:38:02] (03CR) 10John F. Lewis: [C: 031] add IPv6 for antimony (git web,svn) [puppet] - 10https://gerrit.wikimedia.org/r/214432 (owner: 10Dzahn) [23:38:23] (03CR) 10John F. Lewis: [C: 031] add IPv6 for silver (wikitech web) [puppet] - 10https://gerrit.wikimedia.org/r/214430 (owner: 10Dzahn) [23:38:36] (03CR) 10John F. Lewis: [C: 031] add IPv6 for caesium (releases) [puppet] - 10https://gerrit.wikimedia.org/r/214435 (owner: 10Dzahn) [23:38:58] mutante: something says your patches have been reviewed :[ [23:39:00] *:p [23:39:22] JohnFLewis: thanks. btw, i'm missing the DNS record for caesium, it's internal [23:39:37] (03PS2) 10Ori.livneh: Fixes for ori's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/214106 [23:39:54] (03CR) 10Ori.livneh: [C: 032 V: 032] Fixes for ori's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/214106 (owner: 10Ori.livneh) [23:39:58] Do really none of those have bugs? [23:40:10] Shouldn't https://gerrit.wikimedia.org/r/#/c/214430/ be T73218 ? [23:40:49] (03PS13) 10BBlack: sslcert: generate chained certs automatically [puppet] - 10https://gerrit.wikimedia.org/r/197341 (owner: 10Faidon Liambotis) [23:42:05] (03CR) 10Alex Monk: "Isn't this T73218 ? Also, doesn't merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/214430 (owner: 10Dzahn) [23:42:34] 6operations, 6Labs, 10Labs-Infrastructure, 10wikitech.wikimedia.org, 7Ipv6: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1319254 (10Krenair) [23:44:23] 6operations, 10Wikimedia-Blog, 7Ipv6, 7Upstream: Enable IPv6 on (tech)blog.wikimedia.org - https://phabricator.wikimedia.org/T73261#1319270 (10Krenair) @Tbayer: Any progress here? Do we at least have some sort of bug number so we can tag this #Upstream ? [23:44:40] 6operations, 10Wikimedia-Blog, 7Ipv6: Enable IPv6 on (tech)blog.wikimedia.org - https://phabricator.wikimedia.org/T73261#1319271 (10Krenair) [23:47:52] woah [23:48:08] ? [23:48:27] ... [23:48:44] just all the random ipv6-related traffic [23:49:34] I really meant, we should put something in base so that all the other (>precise-based) v6 addrs in our networks get sanitized. The goal is to get rid of the profusion of add_ip6_mapped, not add to it :) [23:53:11] (which will involve something silly like searching for a match between facter's $ipaddress and $ipaddress_IFACE sets to determine the primary interface) [23:54:07] (03PS2) 10Dzahn: add IPv6 for silver (wikitech web) [puppet] - 10https://gerrit.wikimedia.org/r/214430 (https://phabricator.wikimedia.org/T73218) [23:54:10] we could argue for mechanically adding them all to v6 revdns once that's done as well [23:54:17] (03PS3) 10Dzahn: add IPv6 for silver (wikitech web) [puppet] - 10https://gerrit.wikimedia.org/r/214430 (https://phabricator.wikimedia.org/T73218) [23:54:40] putting them in forward is a separate thing I think, at least for public machines and services, as there could be impact to something-or-other that doesn't work well with v6
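[editor's note] The 23:53:11 remark about matching facter's $ipaddress against the per-interface $ipaddress_IFACE facts could look roughly like the sketch below. It is only an illustration in Python, not the Puppet code that was eventually written: the function name primary_interface, the plain dict of facts, and the example addresses are all assumptions made up for this note.

```python
def primary_interface(facts):
    """Return the interface whose ipaddress_<iface> fact equals the
    top-level ipaddress fact, or None if nothing matches.

    `facts` is a plain dict standing in for facter output (an assumption
    of this sketch; it does not call facter itself).
    """
    primary_ip = facts.get("ipaddress")
    if primary_ip is None:
        return None
    prefix = "ipaddress_"
    # Sort the keys so the result is deterministic if two interfaces
    # happen to carry the same address.
    for key in sorted(facts):
        if key.startswith(prefix) and facts[key] == primary_ip:
            return key[len(prefix):]
    return None


# Hypothetical facts for a host with a loopback and two NICs.
example_facts = {
    "ipaddress": "10.0.0.5",
    "ipaddress_lo": "127.0.0.1",
    "ipaddress_eth0": "10.0.0.5",
    "ipaddress_eth1": "192.0.2.7",
}

print(primary_interface(example_facts))
```

With the interface name in hand, a base class could apply the v6 address mapping to that one interface instead of each role declaring it separately, which seems to be the intent expressed at 23:49:34.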