[00:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141126T0000). [00:01:34] (03PS1) 10BryanDavis: logstash: Rules for processing MW input via Redis [puppet] - 10https://gerrit.wikimedia.org/r/175896 [00:03:04] greg-g: howdy. LQT conversion on officewiki didn't take (API calls to private wiki), can we retry on Wednesday 26? It's only 8 pages, it's only on officewiki, What Could Go Wrong [00:03:10] whee nothing to deploy [00:03:48] ™ [00:04:18] !log restarted eventlogging mysql-m2-master consumer. It seems it could no longer write to the database. [00:04:20] Logged the message, Master [00:05:53] spagewmf: yessir [00:06:42] greg-g: thanks giving. Is 9:30am-10:30am OK to get out of the way of the train in time [00:07:13] spagewmf: perfect [00:31:05] greg-g: Thoughts about me renaming the "ve-deploy-2014-11-26 (MW 1.25wmf10)" projects to "WMF-Deploy-…" so others feel free to use them? [00:33:36] !log power down db2033 for reassignement to codfw frack [00:33:38] Logged the message, Master [00:35:28] (03CR) 10Dzahn: "how to add this to deploy schedule" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173083 (owner: 10JanZerebecki) [00:38:42] James_F: innnnnnnteresting [00:39:01] greg-g: I've already started doing it for OOUI tasks. [00:39:07] * greg-g nods [00:39:13] purpose? [00:39:18] greg-g: (Previously this wasn't possible because they were VE milestones, and OOUI was outside of that.) [00:39:35] Mostly it's so I can point to a "this is what changes went out that week" log. [00:39:45] I write the weekly changelog, after all. [00:40:41] (03PS2) 10BryanDavis: logstash: Rules for processing MW input via Redis [puppet] - 10https://gerrit.wikimedia.org/r/175896 [00:41:10] I don't see the harm, I'm just having a hard time coming up with when another team would use it [00:42:15] greg-g: Sure. 
[00:44:46] (CR) BryanDavis: "Cherry-picked to deployment-salt for testing. I expect there will be some adjustments needed here as I test out the firehose of Monolog ev" [puppet] - https://gerrit.wikimedia.org/r/175896 (owner: BryanDavis) [00:47:29] (CR) Aaron Schulz: [C: 2] Remove obsolete profiling settings [mediawiki-config] - https://gerrit.wikimedia.org/r/164011 (owner: PleaseStand) [00:47:38] (Merged) jenkins-bot: Remove obsolete profiling settings [mediawiki-config] - https://gerrit.wikimedia.org/r/164011 (owner: PleaseStand) [00:48:07] !log aaron Synchronized wmf-config/StartProfiler.php: Remove obsolete profiling settings (duration: 00m 06s) [00:48:10] Logged the message, Master [00:54:26] (PS1) Dzahn: wikistats: add cron to enabled wikia updates [puppet] - https://gerrit.wikimedia.org/r/175904 [00:55:03] (PS2) Dzahn: wikistats: add cron to enable wikia updates [puppet] - https://gerrit.wikimedia.org/r/175904 [00:55:50] (CR) Dzahn: [C: 2] wikistats: add cron to enable wikia updates [puppet] - https://gerrit.wikimedia.org/r/175904 (owner: Dzahn) [01:05:27] (CR) Springle: [C: 1] mha: replace pmtpa with codfw? 
[puppet] - 10https://gerrit.wikimedia.org/r/173464 (owner: 10Dzahn) [01:09:23] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures [01:16:28] (03PS5) 10Dzahn: mha: replace pmtpa with codfw [puppet] - 10https://gerrit.wikimedia.org/r/173464 [01:23:18] !log restarted logstash on logstash1001; no events from log2udp relay being recorded [01:23:21] Logged the message, Master [01:23:56] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [01:46:49] PROBLEM - HHVM busy threads on mw1235 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [115.2] [01:47:57] (03CR) 10Dzahn: [C: 032] mha: replace pmtpa with codfw [puppet] - 10https://gerrit.wikimedia.org/r/173464 (owner: 10Dzahn) [01:50:45] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:53:42] PROBLEM - HHVM queue size on mw1232 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [80.0] [01:53:43] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:53:53] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [115.2] [01:54:43] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:55:13] RECOVERY - HHVM busy threads on mw1235 is OK: OK: Less than 1.00% above the threshold [76.8] [01:56:33] RECOVERY - HHVM queue size on mw1232 is OK: OK: Less than 1.00% above the threshold [10.0] [01:57:04] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:57:53] PROBLEM - HHVM busy threads on mw1231 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [01:58:32] PROBLEM - HHVM busy threads on mw1226 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:58:42] 
PROBLEM - HHVM busy threads on mw1234 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:59:24] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [02:00:12] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 1.00% above the threshold [76.8] [02:01:43] RECOVERY - HHVM busy threads on mw1226 is OK: OK: Less than 1.00% above the threshold [76.8] [02:01:43] RECOVERY - HHVM busy threads on mw1234 is OK: OK: Less than 1.00% above the threshold [76.8] [02:02:24] RECOVERY - HHVM busy threads on mw1227 is OK: OK: Less than 1.00% above the threshold [76.8] [02:02:24] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 1.00% above the threshold [76.8] [02:02:32] RECOVERY - HHVM busy threads on mw1233 is OK: OK: Less than 1.00% above the threshold [76.8] [02:02:45] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 1.00% above the threshold [76.8] [02:03:23] RECOVERY - HHVM busy threads on mw1222 is OK: OK: Less than 1.00% above the threshold [76.8] [02:03:38] RECOVERY - HHVM busy threads on mw1231 is OK: OK: Less than 1.00% above the threshold [76.8] [02:18:25] !log l10nupdate Synchronized php-1.25wmf8/cache/l10n: (no message) (duration: 00m 03s) [02:18:29] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-26 02:18:29+00:00 [02:18:30] Logged the message, Master [02:18:34] Logged the message, Master [02:30:16] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s) [02:30:19] Logged the message, Master [02:30:20] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-26 02:30:20+00:00 [02:30:23] Logged the message, Master [03:13:13] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: puppet fail [03:31:59] (03PS1) 10GWicke: Move restbase config to regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/175939 [03:32:38] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 2 minutes 
ago with 0 failures [03:34:06] * gwicke looks around for opsens with merge rights [03:40:29] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [03:43:21] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 1.00% above the threshold [76.8] [03:52:46] (03PS2) 10GWicke: Move restbase config to regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/175939 [03:55:55] (03CR) 10Tim Starling: "I would still want this change, or something like it. It's nice to be able to profile individual requests without modifying the output, es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174372 (owner: 10Tim Starling) [04:18:19] Who broke OAuth logins? [04:24:14] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Nov 26 04:24:14 UTC 2014 (duration 24m 13s) [04:24:20] Logged the message, Master [04:28:03] James_F: do you know who I should ping about OAuth breakage? [04:28:25] Login seems to be completely broken for all the apps I tried. [04:28:58] I don't see a "cmjohnson" in the channel (per chan topic) [04:31:55] hi [04:32:13] rageoss, can you give us some specific examples to aid troubleshooting? [04:32:22] sorry, ragesoss [04:32:55] i haven't worked on that subsystem before but i'll take a look [04:32:58] jgage: some examples listed here: https://phabricator.wikimedia.org/T75968 [04:33:02] thanks [04:33:40] jgage: the wizard.wikiedu.org one has error stuff enabled, so there's lots of environment stuff you can see if you try that one. [04:33:50] great [04:34:19] jgage: there weren't any commits to the OAuth extension lately, so I guess the breakage is probably somewhere else. [04:34:28] hmm [04:37:22] the hello world works for me. can anyon confirm whether oauth is working or fails for them? [04:37:41] i used a regular nonprivileged account [04:38:03] jgage: did you try the "post to talk page" or "verify your identity"? [04:38:12] For me, that revealed that I was not in fact logged in. 
[04:38:13] i tried verify your identity [04:39:10] with the wikiedu wizard i was prompted to auth and now i'm at the assignment design wizard form [04:40:04] jgage: that's odd. [04:40:11] ragesoss have you tried a second browser? [04:40:22] with both privileged and unprivileged accounts, it's broken for me, on Chrome and Firefox. [04:40:26] hm ok [04:40:46] (also broken for my developer, on a different IP, etc) [04:41:01] hmmm [04:42:11] confirmed, https://phabricator.wikimedia.org/T75968 [04:43:05] maybe i'm having success because of cached credentials, because i'm logged in to phab [04:43:12] * jgage tries another browser [04:43:37] * ragesoss was also logged in to Phab [04:43:38] to be clear i believe that there's a problem i just need to be able to reproduce it for troubleshooting [04:44:00] I'm logged in with my session, but I tried to login with my other account (personal) and it didn't work [04:44:33] yeah... odd that you were able to get the wizard and the hello world app to work, jgage. [04:45:18] phab login worked for me in a clean browser [04:45:41] that's new... [04:45:41] weird [04:45:54] I got 503 error on phab login that time. [04:46:03] actually, not new... happened once to me earlier today. [04:46:35] jgage: ok, yeah, in an incognito chrome window it worked (loging into phab) [04:46:37] i don't see any oauth documentation on wikitech [04:46:48] it wouldn't be there :/ [04:46:49] (to be clear, that 503 happened before I got to the oauth page on the wiki) [04:47:04] (so a phab problem, not an oauth problem) [04:47:41] I just got the oauth_token error in an incognito window in chrome. [04:47:55] https://www.mediawiki.org/wiki/Extension:OAuth [04:48:03] thanks greg-g [04:49:42] TimStarling: I hate to bother you on this, but we're having intermittent oauth authentication issues. I got it once when loggin into phab via mw.org, ragesoss got it on other consumers. 
See: https://phabricator.wikimedia.org/T75968 [04:49:55] hi [04:50:00] * legoktm reads up [04:50:02] oh, it's a lego! [04:50:23] is it possible that this is related to hhvm changes today? nodes were merged from two pools to one or something. [04:52:05] uh, maybe [04:52:26] * duploktm grumbles about flaky internet [04:52:27] ori: around? [04:52:31] seems like the components involved are appservers, memcached, mysql [04:52:42] I remember magnus having an issue with OAuth that was hhvm related [04:53:09] jgage: why do you think those components? [04:53:18] just reading the oauth extension url you pasted [04:53:22] because i know nothing about it [04:53:27] * greg-g nods [04:53:33] what does this 503 look like? [04:54:50] TimStarling: the 503, which I think is not connected to the OAuth problem, looks like a normal wikimedia server 503 error. Let me see if I can find it in my history. [04:55:24] ragesoss what time (utc) did you first observe this problem? [04:55:34] TimStarling: Request: POST http://phabricator.wikimedia.org/auth/login/mediawiki:mediawiki/, from 10.64.0.172 via cp1044 cp1044 ([10.64.0.172]:80), Varnish XID 1588971278 [04:55:35] Forwarded for: 2601:8:b100:9c0:bdd1:40d:6643:e5d6, 10.64.0.172 [04:55:35] Error: 503, Service Unavailable at Wed, 26 Nov 2014 04:55:23 GMT [04:56:22] https://old-bugzilla.wikimedia.org/show_bug.cgi?id=72384 is what I'm thinking about, but doesn't seem related here [04:56:27] jgage: two hours ago, for the OAuth issue. [04:56:32] thank you [04:56:57] (That 503 error is fresh; I just repro it) [04:57:48] I'm not really sure how to debug oauth tbh... [04:59:01] I just logged into quarry fine. [04:59:10] via OAuth. [04:59:23] I'm... going to step away [04:59:50] I can authorize-ish with oauth-hello-world. [04:59:56] I just logged into quarry as well, after earlier fails. 
[05:00:03] legoktm: feel free to use the contacts page on officewiki to call whoever you need if it is deemed worth it (you and jgage and tim can decide) [05:00:31] do we know who is knowledgeable about oauth? [05:00:37] csteipp [05:00:49] aaron, anomie, and tim? [05:00:55] those too [05:01:28] so the 503 error is not related, and oauth is just plain "not working" [05:01:34] no further information? [05:01:55] TimStarling: the error I got from phab was: [05:01:55] Unhandled Exception ("Exception") [05:01:56] Expected 'oauth_token' in response! [05:02:35] what URL? [05:02:59] TimStarling: hitting https://tools.wmflabs.org/oauth-hello-world/index.php?action=identify after authorizing the application randomly works otherwise it gives Invalid identify response: {"error":"mwoauth-oauth-exception"} [05:03:35] (sent privately) [05:03:57] so... do you have the MW API response? [05:04:39] i get about 50% success/fail on that url out of 10 tries [05:05:47] * greg-g has to go, kid crying [05:05:57] TimStarling: if I'm reading the code right in anomie's tool, the API response is just {"error":"mwoauth-oauth-exception"} [05:06:15] Also, in https://old-bugzilla.wikimedia.org/show_bug.cgi?id=72384#c4 anomie said "That sounds like you hit a wiki using HHVM when the OAuth authorization was done using Zend, or vice versa. For some reason the OAuth stuff doesn't seem to be shared between the two." 
[05:06:27] oho [05:06:45] ruh roh [05:07:17] _joe_ should be awake in an hour or two [05:10:15] so if there is an exception in Special:OAuth, it should be logged to the OAuth log channel [05:11:20] and the full error message text should be in the output [05:12:07] the OAuth log channel goes to /dev/null [05:12:13] whee [05:12:16] >.> [05:12:36] that's fixable though [05:13:49] (PS1) Legoktm: Add debug log group for OAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/175945 [05:13:52] TimStarling: ^ [05:14:39] (CR) Tim Starling: [C: 2 V: 2] Add debug log group for OAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/175945 (owner: Legoktm) [05:15:41] !log tstarling Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s) [05:15:46] Logged the message, Master [05:16:16] ok, well it's a bit noisy, but better than nothing [05:17:39] I triggered a few exceptions on mediawikiwiki [05:18:05] 2014-11-26 05:16:00 mw1239 mediawikiwiki: MediaWiki\Extensions\OAuth\SpecialMWOAuth::execute: Exception Invalid consumer key [05:19:35] TimStarling: is there a list of which servers are running hhvm and which are zend? [05:20:11] not that I know of [05:20:33] i think ori showed me a way to see on ganglia once... /me looks through irc logs [05:20:45] hmm, I guess I can just login to them and see what version of php they have. [05:21:00] 2014-11-26 05:20:29 mw1188 mediawikiwiki: MediaWiki\Extensions\OAuth\SpecialMWOAuth::execute: Exception Sorry, something went wrong connecting this application. [05:21:29] there were 4 exceptions when I tried to log in, that was the text of three of them [05:23:52] TimStarling: that's the one I see a lot when I go back on my browser to the OAuth login page after already being logged in. 
[05:24:05] (when things are working normally) [05:24:38] it was mwoauthdatastore-request-token-not-found [05:25:09] It provides a nice URL: https://www.mediawiki.org/wiki/Help:OAuth/Errors#E004 [05:26:10] And says sorry. I mean, that's pretty nice. [05:26:51] so the consumer token is some kind of long hashy thing [05:26:59] does MW give it to the application at some point? [05:29:21] right, so it is fixed for a given consumer [05:31:34] I think applications have to be approved and access can be revoked. [05:31:53] how is oauth configured in phabricator? [05:32:08] where does it get the token from? is it sending the right token? [05:33:11] I'd guess the token is stored in the private puppet repo? [05:33:25] manifests/role/phabricator.pp? [05:33:43] Oh, there's a module as well. [05:34:29] auth to phab works sometimes, similar to the hello world app. so it seems to get the right token at least some of the time. [05:34:52] you mean giving the right token? [05:35:23] yes, sorry [05:35:42] If it's both Phabricator and the Hello World app having issues, it's probably MediaWiki.org's OAuth that's gone weird? [05:36:49] this all seems to match the behavior that anomie described regarding a cluster with both hhvm and zend hosts: https://old-bugzilla.wikimedia.org/show_bug.cgi?id=72384#c4 [05:37:01] Carmela: obviously, but that doesn't answer my question [05:39:13] I'm a little surprised we're still using memcached for sessions. [05:39:30] Carmela: That's not really helpful. [05:40:07] jgage: ok, but no isolation was done [05:40:27] right [05:41:40] 2014-11-25 20:39:23 mw1232 mediawikiwiki: Memcached error for key "mediawikiwiki:messages:en:status" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [05:41:54] yes, well [05:42:07] I guess that patch was never merged upstream [05:42:41] 11212.. nonstandard memcached port [05:44:08] so would memcached failure cause oauth failure? 
[05:46:46] this is going to take a while [05:47:46] Fwiw phab oauth config is all in d. And hasn't changed in awhile [05:47:55] In the db :) [05:48:19] ;) [05:48:28] is the consumer key in the DB? [05:49:50] In phab db yes, I think it was registered under mukunda's account [05:52:04] so you can confirm that? [05:53:00] TimStarling: I can provide keys for my app, if that will help. [05:53:29] no [05:54:04] we know OAuthServer::get_consumer() throws an exception [05:54:31] so most likely $request->get_parameter( "oauth_consumer_key" ) returns something that is false [05:58:47] $request is probably from MWOAuthRequest::fromRequest() [06:00:26] which fills in parameters from GET parameters, headers and post data [06:00:28] TimStarling: yes I can verify the consumer key, is that helpful? [06:00:38] so it is hard to see how it is false without some amount of client involvement [06:01:46] TimStarling: PM'd in case it's useful for you [06:07:07] !log tstarling Synchronized php-1.25wmf9/extensions/OAuth/lib/OAuth.php: (no message) (duration: 00m 06s) [06:07:11] Logged the message, Master [06:09:06] !log tstarling Synchronized php-1.25wmf9/extensions/OAuth/lib/OAuth.php: (no message) (duration: 00m 06s) [06:12:03] chasemp: What happened to OAuth? [06:12:27] csteipp: not sure, I saw https://phabricator.wikimedia.org/T75968 get logged. Seems to be a general MW oauth issue [06:12:57] possibly related to mixing hhvm and zend nodes? I don't have much insight into it tbh, just wanted to see if I could be of assistance on the phab as a client front [06:13:34] Hmm.. intermittent? WFM just now.. [06:15:42] csteipp: yeah, i'm getting about 50% error rate. TimStarling is debugging. according to https://old-bugzilla.wikimedia.org/show_bug.cgi?id=72384#c4 it's a problem of hitting a wiki with zend when oauth was done with hhvm or vice versa. 
[06:15:52] well, potentially [06:16:21] csteipp: we enabled the oauth debug log, it's at fluorine:/a/mw-log/oauth.log [06:16:58] also I just did a patch to log some extra data, if 21MB of logs in the last half hour is not enough [06:17:19] My first guess is hmac on hhvm might be slightly different. [06:19:07] is there documentation of the client/server request flow? [06:20:18] https://www.mediawiki.org/wiki/OAuth/For_Developers#mediaviewer/File:OAuth-basicSVG.svg I guess [06:20:45] TimStarling: https://www.mediawiki.org/wiki/Auth_systems/OAuth/Design or https://www.mediawiki.org/wiki/OAuth/For_Developers [06:21:17] OAuthServer::get_consumer() throws an exception "Invalid consumer key" [06:21:27] Do hhvm and zend share memcache? [06:21:42] yes [06:22:14] I am struggling to understand how this is possible since apparently the consumer key is persistently configured [06:22:28] I did a var_export of the request at that point [06:23:45] this is typical: http://paste.tstarling.com/p/DVRMXL.html [06:23:53] definitely no consumer key [06:27:33] Is it always the /token call that throws the exception? [06:27:59] no, we logged one at /identify [06:28:08] at 06:13:56 [06:28:30] for commonswiki, not mediawikiwiki [06:28:36] 'http_url' => 'https://commons.wikimedia.org/wiki/Special:OAuth/identify', [06:30:01] Both those come from the remote server, so it's pretty certain that they're not randomly leaving off their client token. And all the OAuth parameters are missing on that paste. There should be a token, signature, signature method, etc. [06:30:39] well, it could be a bug in MWOAuthRequest in getting the parameters from the environment [06:31:40] where is it meant to be? authorization header, post or get? [06:31:44] Yeah, that's possible. We also had an issue at one point that if one of the functions was called twice, it didn't have all the OAuth info the second time. 
[06:31:53] Yeah, Authorization header is the normal method [06:31:58] GET is allowed as a backup [06:32:41] <_joe_> jgage: searching for me? [06:33:43] <_joe_> oh just read the backlog [06:34:01] ok, logging that [06:34:02] !log tstarling Synchronized php-1.25wmf9/extensions/OAuth/lib/OAuth.php: (no message) (duration: 00m 05s) [06:34:05] Logged the message, Master [06:34:33] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: puppet fail [06:34:50] <_joe_> chasemp: do we have a confirmation that hhvm / zend mixing is the problem? [06:35:02] I wonder if WebRequest::ggetRawInput isn't happy on hhvm [06:35:10] yeah, blank [06:35:23] <_joe_> ugh [06:35:37] <_joe_> should we rollback to having the two separate pools? [06:35:48] (CR) Yuvipanda: kill facilities.pp, move to nagios_common (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173999 (owner: Dzahn) [06:36:03] <_joe_> In case, it will take me ~ 20 minutes to do so [06:36:16] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:36:16] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:17] <_joe_> (plus all the puppet runs) [06:36:18] (CR) Yuvipanda: [C: -2] "Moving -1 to -2, since I'm more strongly inclined now." [puppet] - https://gerrit.wikimedia.org/r/173999 (owner: Dzahn) [06:36:34] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:44] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:36:47] _joe_: we probably have enough information, so you may as well [06:36:51] TimStarling: We need access to the raw values there-- if there's another way to do that in hhvm, we can patch that. That's what the signature is checked against, so the values can't be touched before OAuth gets them. 
[06:37:12] <_joe_> TimStarling: well it's not like I'd do that if we are going to have a patch soon [06:37:13] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:14] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:36] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:49] let's do a small test case [06:37:53] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:53] <_joe_> I'll work towards that anyways [06:38:18] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:18] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:18] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:23] * _joe_ just got out of bed [06:38:43] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:44] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:14] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:25] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures [06:39:36] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:39] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:40:04] !log tstarling Synchronized live-1.5/oauth-headers.php: (no message) (duration: 00m 05s) [06:40:09] Logged the message, Master [06:42:04] why does it say file not found? it's not RA or something is it? 
[06:42:46] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:43:15] !log tstarling Synchronized w/oauth-headers.php: (no message) (duration: 00m 06s) [06:43:16] never mind, I'm too old [06:43:17] Logged the message, Master [06:44:21] (03PS1) 10Giuseppe Lavagetto: Revert "mediawiki: move most servers from the hhvm to the standard pool" [puppet] - 10https://gerrit.wikimedia.org/r/175950 [06:44:38] (03PS1) 10Giuseppe Lavagetto: Revert "varnish: remove redirection to the hhvm pool" [puppet] - 10https://gerrit.wikimedia.org/r/175951 [06:45:04] <_joe_> ok, whenever we decide, I'm ready to rollback and create the hhvm pool again [06:45:27] what's a zend server? [06:45:31] <_joe_> we should also revert mediawiki-config changes, btw [06:45:46] i.e. the hostname of one instance thereof [06:45:48] <_joe_> TimStarling: do you want a host? [06:45:53] <_joe_> mw1040 [06:45:58] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:11] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:11] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:47:00] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:27] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [06:47:32] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:51] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:52] Wonder if $HTTP_RAW_POST_DATA works on hhvm? [06:47:56] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:57] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:11] !log tstarling Synchronized w/oauth-headers.php: (no message) (duration: 00m 05s) [06:48:12] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:12] Logged the message, Master [06:48:15] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:42] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:48] botspam [06:48:55] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:56] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:57] maybe we should talk in #mediawiki-core? [06:49:02] <_joe_> ok [06:57:05] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:53] PROBLEM - puppet last run on db1006 is CRITICAL: CRITICAL: Puppet has 2 failures [07:05:13] PROBLEM - puppet last run on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:05:13] PROBLEM - check if dhclient is running on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:05:14] PROBLEM - ircecho_service_running on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:07:30] (PS1) Giuseppe Lavagetto: mediawiki: fix passing of the Authorization header in HAT [puppet] - https://gerrit.wikimedia.org/r/175952 [07:07:35] <_joe_> csteipp_afk: I think we nailed it ^^ [07:07:54] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [07:07:59] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [07:07:59] RECOVERY - check if dhclient is running on neon is OK: PROCS OK: 0 processes with command name dhclient [07:11:09] yay _joe_ [07:11:30] <_joe_> what was the bug again? [07:12:10] https://phabricator.wikimedia.org/T75968 [07:12:39] (CR) Giuseppe Lavagetto: [C: 2] "see https://phabricator.wikimedia.org/T75968" [puppet] - https://gerrit.wikimedia.org/r/175952 (owner: Giuseppe Lavagetto) [07:13:46] RECOVERY - puppet last run on db1006 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:14:23] <_joe_> jgage: next time if you get this is hhvm-related, feel free to phone me [07:14:38] <_joe_> well, if it's related to something I'm working on in general [07:14:45] <_joe_> I can usually wake up [07:15:30] <_joe_> I hope you don't speak italian, so that the swear words will sound obscure and funny to you, but it's really ok :P [07:15:57] <_joe_> ok in ~ 20 minutes, oauth should be unbroken [07:16:16] <_joe_> I don't feel like pushing the puppet change across the board [07:21:45] <_joe_> now I can take my morning shower I guess :) [07:22:26] * YuviPanda breaks something else for _joe_ to fix [07:23:30] <_joe_> YuviPanda: it really was tim doing all the important guesswork, I just wrote the apache fix [07:23:35] :) [07:30:32] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [07:33:16] RECOVERY - HHVM busy threads on mw1223 is OK: 
OK: Less than 1.00% above the threshold [76.8] [07:35:08] <_joe_> jgage: can you try oauth again? [08:10:01] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [08:12:39] RECOVERY - HHVM busy threads on mw1223 is OK: OK: Less than 1.00% above the threshold [76.8] [08:29:16] _joe_ looks good, i tried the hello world app 10 times, all successful [08:29:36] and i will gladly wake you up to hear cursing in italian if the need arises :D [08:30:44] first attempt was pretty slow, maybe 30s. but it didn't fail! [08:30:55] after that it was speedy [08:32:07] * jgage zzz [08:35:11] <_joe_> jgage: good night [08:50:57] jenkins seems to be stuck [08:52:16] Nemo_bis: I doubt that it was millions of users and I doubt it was init7-related :) [08:52:45] Nemo_bis: also, do not assume that routing is symmetric [08:53:01] the path to wikimedia vs. the path *from* wikimedia could be entirely different and the issue could be in either direction [08:53:40] so traceroutes, while very helpful, are not giving the whole picture; that's why we usually want an IP as well (feel free to mask it to a /24 for privacy reasons) [09:18:01] (PS5) Faidon Liambotis: realm: remove pmtpa, add codfw [puppet] - https://gerrit.wikimedia.org/r/173476 (owner: Dzahn) [09:19:01] (CR) Faidon Liambotis: [C: 2] realm: remove pmtpa, add codfw [puppet] - https://gerrit.wikimedia.org/r/173476 (owner: Dzahn) [09:20:44] (CR) Faidon Liambotis: "Ping?" 
[puppet] - 10https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [09:27:15] (03PS2) 10Faidon Liambotis: geoip: kill geoliteupdate in favor of geoipupdate [puppet] - 10https://gerrit.wikimedia.org/r/175571 [09:28:04] (03CR) 10Faidon Liambotis: [C: 032] geoip: kill geoliteupdate in favor of geoipupdate [puppet] - 10https://gerrit.wikimedia.org/r/175571 (owner: 10Faidon Liambotis) [09:43:44] (03PS1) 10Yuvipanda: shinken: Add checks for labs infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/175964 [09:48:00] (03PS1) 10Giuseppe Lavagetto: reimage: add a few configs, beautify output [puppet] - 10https://gerrit.wikimedia.org/r/175965 [09:49:43] (03PS2) 10Yuvipanda: shinken: Add checks for labs infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/175964 [09:52:56] (03PS2) 10Giuseppe Lavagetto: reimage: add a few configs, beautify output [puppet] - 10https://gerrit.wikimedia.org/r/175965 [09:53:30] (03PS3) 10Yuvipanda: shinken: Add checks for labs infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/175964 [09:54:12] _joe_: this looks like it could get a rewrite in python :) [09:54:53] <_joe_> paravoid: don't tempt me [09:55:02] <_joe_> (not that I didn't think of that) [09:55:10] I'm sure you did :) [09:55:16] we should think of ways to automate this even further [09:55:35] puppet has some new autosigning features that we could perhaps use [09:55:52] <_joe_> paravoid: actually, the best thing would be to run a script from iron, or having the install ssh key on palladium [09:55:58] then... 
IPMI + BIOS + iDRAC automation [09:56:43] <_joe_> so I can make the depool/clean/reboot to pxe/sign/enable and run puppet/sign salt/run puppet again/ cycle work from a single machine [09:57:36] <_joe_> yeah right now the more time-consuming thing of reimaging is going to be enabling hyperthreading [09:58:55] <_joe_> because well, you need to reboot into bios, enable it, get out of it, and I don't think we can automate that ATM; I should look into it [09:59:02] oh I've looked into it [09:59:06] wsman and all that glory [09:59:21] mjg59 has a new library that I should probably check though [09:59:59] paravoid: the IP was included [10:00:11] I just didn't make it explicit [10:00:19] ah sorry, didn't see that [10:00:56] <_joe_> https://github.com/jtallieu/dell-wsman-client-api-python/ looks quite unmaintained [10:01:13] https://github.com/nebula/firmware_config is mjg59's [10:01:21] depends on openwsman apparently [10:01:50] <_joe_> mmmh I could experiment with that [10:02:08] I've experimented extensively with wsman in the past [10:02:11] too messy/complicated [10:02:15] <_joe_> ok [10:02:19] not saying no [10:02:52] <_joe_> well the one you linked seems incredibly simple as an API [10:02:55] yes [10:03:01] I'm building openwsman now [10:03:02] let's see.. 
[10:03:06] <_joe_> oh ok :) [10:03:56] wsman-dispatcher.c:924:9: error: variable 'resUriMatch' set but not used [-Werror=unused-but-set-variable] [10:04:00] grumble [10:04:15] Werror sillyness [10:04:42] paravoid: anyway, this time they were much faster at resolving the issue; I don't know if the reason is that they learnt, or that I managed to tell several users they had to complain to the ISP, or that the blackout was really total this time (packet loss 100%) [10:04:58] and yes it's millions users for that ISP [10:05:03] if it was the same issue [10:06:49] (03PS4) 10Yuvipanda: shinken: Add checks for labs infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/175964 [10:07:40] _joe_: and that wouldn't do it for the HPs btw [10:07:48] but akosiaris had previously worked on automating some of that [10:07:56] not sure if it was BIOS too or just iLOs though [10:08:15] <_joe_> paravoid: right [10:08:38] and, well, mjg59 had worked on that too, not sure what happened with that [10:08:50] he had previously said he'd submit it as a kernel module [10:08:50] <_joe_> sorry I'm a bit slow today, I woke up to a UBN! bug on HHVM and two coffee later I'm still groggy [10:09:12] http://mjg59.dreamwidth.org/25686.html [10:09:36] https://lkml.org/lkml/2013/9/4/22 probably? [10:11:35] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [10:12:13] ...which isn't merged [10:16:23] (03PS5) 10Yuvipanda: shinken: Add checks for labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/175964 [10:21:25] akosiaris: can I kill the shinken-server instance you had set up a long time ago? 
[10:25:13] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 1 failures [10:25:56] (03PS6) 10Yuvipanda: shinken: Add checks for labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/175964 [10:30:13] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 1 failures [10:32:04] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [10:35:16] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 1 failures [10:35:26] YuviPanda: I think it was apergos that had set that up [10:35:38] ah, will poke apergos [10:35:51] yes you may stomp all over it [10:36:11] it was running a pre 2.0 release anyways [10:36:51] hmm, I hope to upgrade us to 2.x at some point. [10:36:57] they've made progress with packaging since I started [10:37:06] and we'll need custom webui fixes anyway, can't use current auth mechanisms [10:37:13] * YuviPanda stomps over the shinken instances [10:37:45] <_joe_> YuviPanda: who knows something more about our OAuth implementation? [10:37:59] nobody awak at this point, I think [10:38:08] anomie and csteipp, usually. [10:38:17] *awake [10:38:18] <_joe_> which are both sleeping I guess [10:38:28] yeah [10:39:03] <_joe_> I'm not sure that the report "most tools are failing" is correct btw [10:39:45] me neither. [10:40:08] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 1 failures [10:40:09] _joe_: the reporter also probably has some setup to disable all https, so that might have something to do with it perhaps? [10:40:57] <_joe_> no idea [10:41:14] apparently not. 
[10:45:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 1 failures [10:45:18] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.01 [10:46:47] In [6]: f.get_boot_options() [10:46:47] Out[6]: [10:46:47] {'BootOrderNone': {'current': ['Embedded NIC 1: MBA v6.0.11 Slot 0200 BootSeq', [10:46:50] 'Hard drive C: BootSeq', [10:46:52] 'Embedded SATA Port A Disk: Embedded SATA Port A HddSeq', [10:46:55] 'Embedded SATA Port B Disk: Embedded SATA Port B HddSeq', [10:46:57] 'Embedded NIC 1: Broadcom NetXtreme II Gigabit Ethernet (BCM5716C) UefiBootSeq'], [10:47:00] 'default': '', [10:47:03] 'dell_boot': True, [10:47:07] very very slow [10:47:17] and I took a few shortcuts as well with openwsman [10:47:22] but it kinda works :) [10:48:12] <_joe_> paravoid: so, time for a python rewrite? :P [10:50:17] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:50:21] nice [10:51:59] can't find a setting for HT though :) [10:52:53] <_joe_> isn't it "Logical Processors"? [10:53:48] BIOS.LogicalProc [10:53:51] 'possible': ['Enabled', 'Disabled']}, [10:54:05] is it called like that in the BIOS? [10:54:11] <_joe_> yes [10:54:29] <_joe_> it took me a while to figure out [10:54:41] ok then :) [10:55:05] <_joe_> https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20#Initial_System_Setup [10:55:31] <_joe_> but it seems like we have more OAUTH troubles [11:00:30] PROBLEM - HHVM busy threads on mw1189 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [11:01:19] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2] [11:04:15] (03CR) 10Alexandros Kosiaris: "Yeah, the premise is fine for now. 
See comment about the user's existence however" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/174161 (owner: 10Dzahn) [11:05:22] (03CR) 10Alexandros Kosiaris: [C: 031] apachesync - delete sync-apache script [puppet] - 10https://gerrit.wikimedia.org/r/175884 (owner: 10Dzahn) [11:06:18] RECOVERY - HHVM busy threads on mw1189 is OK: OK: Less than 1.00% above the threshold [76.8] [11:06:18] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0166112956811 [11:06:49] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 1.00% above the threshold [76.8] [11:25:14] PDF download seems broken entirely [11:26:04] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [11:34:08] broken in production, not beta [11:36:44] ganglia shows a drop in load for the prod ocg hosts [11:37:27] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ocg1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1417001721&g=cpu_report&z=large&c=PDF%20servers%20eqiad [11:37:40] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix content-type and content-length in fcgi [puppet] - 10https://gerrit.wikimedia.org/r/175975 [11:37:56] <_joe_> paravoid: ^^ [11:38:02] other ocg hosts show the same [11:38:47] _joe_: that's fine [11:39:17] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: fix content-type and content-length in fcgi [puppet] - 10https://gerrit.wikimedia.org/r/175975 (owner: 10Giuseppe Lavagetto) [11:39:27] not ugly at all :) [11:40:21] so... who wants to fix ocg? :) [11:42:55] hmm [11:43:32] YuviPanda, http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=ocg1001.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=PDF+servers+eqiad - know anything about this? [11:43:39] akosiaris, godog: could you take a look at OCG? 
[11:44:13] ouch, taking a look [11:44:17] thanks [11:44:31] Krenair: ^ :) [11:45:36] <_joe_> quite interestingly, we got no pages about that. I suspect a software failure upper in the chain [11:45:45] <_joe_> (that is, ocg has nothing to consume) [11:46:14] indeed, I take it those don't go through the job queue? [11:46:28] <_joe_> godog: they have their own queue [11:46:48] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0166112956811 [11:46:49] where does that queue live? [11:47:11] <_joe_> godog: I suspect a disk full instead http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=PDF%20servers%20eqiad&h=ocg1001.eqiad.wmnet&r=week&z=default&jr=&js=&st=1417002379&v=94&m=ocg_data_filesystem_utilization&vl=%25&ti=ocg_data_filesystem_utilization&z=large [11:47:30] _that_ should get paging [11:47:36] <_joe_> it should [11:47:45] <_joe_> mark: that queue is on redis [11:47:50] ok [11:47:58] <_joe_> but well, I'm still debugging the other problem [11:48:02] yup [11:48:21] on all three hosts? [11:48:26] simultaneously? [11:49:06] looks like it [11:49:16] and why not, if they have equal load [11:49:29] also trying to find where ocg logs [11:49:49] nevermind /srv/deployment/ocg/log [11:55:04] thoughts on removing pdfs older than 7d ? [11:56:10] yeah do it [11:56:28] ack, will start with 14 [11:56:56] !log removing pdf files older than 14d from ocg1001 [11:57:02] Logged the message, Master [12:00:09] !log removing pdf files older than 14d from ocg100* [12:00:11] Logged the message, Master [12:00:15] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0166666666667 [12:04:18] for the record, ocg seems to log (only to?) logstash [12:05:38] <_joe_> godog: yes I think cscott made that modification [12:06:43] yep seems to indicate redis failure? 
looking for mw-ocg-service on logstash [12:10:21] Nov 25 15:37:43 ocg1002 ganglia-ocg[15741]: ocg_job_status_queue 503449 [12:10:25] Nov 25 15:38:19 ocg1002 ganglia-ocg[25920]: ocg_job_status_queue 0 [12:12:33] (03CR) 10ArielGlenn: "well what I was hoping to do with this is have it be invoked only when the changeset being checked touches data.yaml. Is that even possib" [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [12:13:29] godog: did the strace point out anything ? [12:13:49] akosiaris: nothing meaningful to my eyes :( [12:14:55] <_joe_> I can try to help with that [12:14:57] (03PS3) 10Giuseppe Lavagetto: reimage: add a few configs, beautify output [puppet] - 10https://gerrit.wikimedia.org/r/175965 [12:15:10] <_joe_> godog: did you try to turn it off and on again? [12:15:15] <_joe_> (TM) [12:15:33] _joe_: on ocg1001 yes, doesn't seem to have worked though judging by the logstash logs [12:16:00] Error: send_command: stream not writeable. enable_offline_queue is false [12:16:30] <_joe_> looks like we need to find out were the problem is in redis? [12:17:09] I suppose the /srv/ocg/output directory having 400GBs of PDFs inside is expected right ? [12:17:10] <_joe_> btw the OCG health endpoint gives a 500 internal server error since 15 days on ocg1002 and it's a... warning [12:17:21] <_joe_> akosiaris: let's say it is [12:17:27] 500 a warning ? that is wrong [12:18:22] !log restarting ocg on ocg1001 [12:18:27] Logged the message, Master [12:18:37] <_joe_> so yes this _is_ a redis failure [12:18:49] why? [12:19:16] ls -l /srv/deployment/ocg/postmortem/p* [12:19:21] <_joe_> in /var/log/ocg/ocg.log on ocg1002 [12:19:22] that can not be good... [12:19:38] <_joe_> akosiaris: ocg servers are failing to talk to redis [12:19:39] <_joe_> Nov 26 12:18:06 ocg1002 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1002","pid":47728,"level":50,"channel":"frontend.error","err":{"message":"send_command: stream not writeable. 
enable_offline_queue is false","name":"Error","stack":"Error: send_command: stream not writeable. enable_offline_queue is false\n at RedisClient.send_command (/srv/deployment/ocg/ocg/node_modules/redis/index.js:802:39)\n at RedisClient.(anonymous [12:19:57] _joe_: run the command I pasted on any ocg server [12:20:00] <_joe_> so, may this be a config issue I created by moving things around yesterday? [12:20:08] and tell what that file is doing there... [12:20:15] those files.... [12:20:17] <_joe_> akosiaris: oh god [12:20:34] <_joe_> wtf [12:20:59] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [12:22:27] you mean in this commit? https://gerrit.wikimedia.org/r/#/c/175681/ [12:25:10] <_joe_> Krenair: no [12:25:14] !log stopped ocg on ocg1* [12:25:18] Logged the message, Master [12:36:51] no luck godog? [12:38:53] <_joe_> Krenair: let's say we fixed the underlying issue we had, but uncovered some other issues [12:39:22] Krenair: yep, what _joe_ said [12:40:11] it looks like you've moved discussion about it elsewhere [12:41:01] Ah. 
I see why :) [12:41:35] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00996677740864 [12:51:32] (03PS1) 10KartikMistry: WIP: Add ContentTranslation in wikishared DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175979 [13:02:15] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [13:15:49] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0167224080268 [13:23:19] Reedy: ping [13:31:06] (03CR) 10Yuvipanda: [C: 04-1] "Err, back to -1, since I only object to putting them in nagios_common, not getting rid of facilities.pp" [puppet] - 10https://gerrit.wikimedia.org/r/173999 (owner: 10Dzahn) [13:32:50] * YuviPanda heads back to PHPland [13:36:24] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00996677740864 [13:55:45] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [14:01:07] aude: here? [14:05:51] please someone decrease the ammout of executors on the wikidata-jenkins* nodes in jenkins to 3. since the weekend somehow regularly jobs time out. perhaps that will prevent some pain until we have some better solution. [14:08:16] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: Connection refused [14:09:55] PROBLEM - Host ms-be2014 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:18] godog: you I assume? :) [14:11:37] ah whoops, yes, the downtime must have expired [14:23:17] (03PS1) 10Faidon Liambotis: Assign IPs for pfw1-codfw/pfw2-codfw [dns] - 10https://gerrit.wikimedia.org/r/175989 [14:26:05] (03CR) 10Mark Bergsma: [C: 04-2] "Overlapping the LVS ips" [dns] - 10https://gerrit.wikimedia.org/r/175989 (owner: 10Faidon Liambotis) [14:34:10] (03CR) 10Nikerabbit: "Also, is this missing a change in InitializeSettings?" 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175979 (owner: 10KartikMistry) [14:38:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/175965 (owner: 10Giuseppe Lavagetto) [14:41:48] (03PS2) 10Faidon Liambotis: Assign IPs for pfw1-codfw/pfw2-codfw [dns] - 10https://gerrit.wikimedia.org/r/175989 [14:43:57] (03CR) 10Faidon Liambotis: [C: 032] Assign IPs for pfw1-codfw/pfw2-codfw [dns] - 10https://gerrit.wikimedia.org/r/175989 (owner: 10Faidon Liambotis) [14:44:56] hi there, can someone help me with account registration? [14:47:31] hi there, can someone help me with account registration?? [14:50:32] veluwse: Hi! [14:50:38] veluwse: What problem are you having? [14:50:44] hi [14:50:56] i'm from a school ip and can not register an account [14:51:05] veluwse: It may be that your school is blocked. [14:51:26] indeed, but can not contact anybody to help me [14:51:28] veluwse: That sometimes happens when one person uses the school IP to vandalise Wikipedia [14:51:28] :) [14:51:41] veluwse: You can request an account on English Wikipedia, though...one sec [14:52:08] veluwse: https://en.wikipedia.org/wiki/Wikipedia:Request_an_account [14:52:16] I think that's the right way to go [14:52:26] ok, and I can use that account to on the dutch wiki? [14:54:59] (03PS2) 10KartikMistry: WIP: Add ContentTranslation in wikishared DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175979 [14:55:38] veluwse: yes, an account is global between the different wikipedia sites [14:58:54] veluwse: in het nederlands behoor je via deze mogelijkheid: https://nl.wikipedia.org/wiki/Speciaal:Contactpagina ook een account aan te kunnen vragen. [14:59:05] thedj: Bless you. [14:59:25] Thedj helaas [14:59:25] * marktraceur thought he had escaped the Dutch!!! 
[14:59:29] ook die pagina is geblokkeerd [14:59:32] :) [14:59:36] back to english [14:59:42] No, please, continue [14:59:57] veluwse: lol, so your school has quite a few persistent vandals then :) [15:00:12] haha yes I quess so. We have 7 schools [15:00:23] On one IP address? [15:00:24] around 10.000 students [15:00:39] yes, one internet line with black fiber connected [15:00:42] Or maybe you have a small block and that was blocked. [15:00:43] Ah. [15:00:51] 10 students? :P [15:00:53] range block happens as well quite often [15:01:01] haha 10k [15:01:02] marktraceur: don't be so american :) [15:01:08] I can't halp iiiiit [15:05:20] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [15:15:51] Dibs SWAT [15:16:05] SWATraceur [15:19:33] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:19:40] James_F: "Enable VisualEditor as a Beta Feature on most remaining wikis" oh, good, nothing potentially controversial or difficult or anything [15:21:04] i don't recall that being discussed [15:23:25] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [15:25:43] mark: Nor do I, but maybe James_F has an answer for that [15:31:53] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:41:44] (03CR) 10Alexandros Kosiaris: [C: 032] Fix most annoying SVG bug, SVG path data number parsing issue [debs/librsvg] - 10https://gerrit.wikimedia.org/r/173639 (owner: 10Ebrahim) [15:41:51] (03CR) 10Alexandros Kosiaris: [V: 032] Fix most annoying SVG bug, SVG path data number parsing issue [debs/librsvg] - 10https://gerrit.wikimedia.org/r/173639 (owner: 10Ebrahim) [15:57:15] Arright, James_F, ebernhardson, are y'all here for your SWATs? [15:57:51] SWATs, SWATs, SWATs, SWATs. TURN DOWN FOR SWAT? 
[15:58:48] * ebernhardson waves [15:59:30] It's a start [15:59:40] ebernhardson: Yer first [15:59:57] Hey. [16:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141126T1600). Please do the needful. [16:00:04] Yay! [16:02:15] RECOVERY - Disk space on db2011 is OK: DISK OK [16:05:43] _joe_: just read scrollback on the oauth stuff, thanks man [16:05:55] <_joe_> greg-g: np [16:06:22] _joe_: so, I know there is another thing potentially taking your attention right now, but is it worth a report? [16:06:28] <_joe_> greg-g: me & Tim fixed part of the problem, the other part was fixed around 13:00Z [16:06:44] <_joe_> greg-g: maybe, yes [16:06:56] <_joe_> it was user-impacting after all [16:07:00] right [16:07:09] <_joe_> grr I hate that you're right :P [16:07:38] <_joe_> but I honestly have no other action points than "exterminate PHP from the face of earth" [16:08:16] <_joe_> one language that has the apache_request_headers() function should be burned down. seriously. [16:08:28] _joe_: it can be a short report with only one action item :) [16:10:06] ebernhardson: Doesn't need a scap for anything, right? [16:11:19] ....looks like no [16:11:23] markSWATteur: nope, should be safe to just sync the two files [16:11:26] KK [16:11:29] ebernhardson: Syncing [16:11:32] !log marktraceur Synchronized php-1.25wmf9/extensions/Flow/: [SWAT] [wmf9] 175941 "Provide user to local LQT api calls" for officewiki. 
(duration: 00m 08s) [16:11:34] Logged the message, Master [16:11:35] ebernhardson: And done, test plox [16:11:37] (03CR) 10Giuseppe Lavagetto: reimage: add a few configs, beautify output (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/175965 (owner: 10Giuseppe Lavagetto) [16:12:02] (03CR) 10MarkTraceur: [C: 032] Enable VisualEditor Beta Feature on other wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174793 (owner: 10Jforrester) [16:12:07] <_joe_> akosiaris: thanks for taking a look [16:12:10] Whee. [16:12:53] I never know whether to sync CommonSettings or InitialiseSettings first. [16:13:14] markSWATteur: Do the directory? [16:13:20] I guess. [16:13:26] !log Jenkins is displaying everything in French (both logged-in/logged-out users alike) [16:13:28] Logged the message, Master [16:13:31] markSWATteur: Initialise. [16:14:04] yea i usually do the directory and dont think about it :) [16:14:09] Krinkle: Wait, that was a bug? I thought it was just Jenkins misunderstanding my accept-language header. [16:14:24] (03Merged) 10jenkins-bot: Enable VisualEditor Beta Feature on other wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174793 (owner: 10Jforrester) [16:15:37] PROBLEM - Host ms-be2014 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:19] !log marktraceur Synchronized wmf-config/: [SWAT] [config] 174793 Enable VisualEditor as a Beta Feature on most remaining wikis (duration: 00m 06s) [16:16:20] Logged the message, Master [16:16:21] ebernhardson: Is Flow working? [16:16:24] James_F: Testy test [16:16:32] * James_F is doing os. [16:16:51] markSWATteur: Also sync the removal of visualeditor.dblist. [16:16:54] markSWATteur: No, I surely am not sending Accept-Language: fr [16:17:02] man, I couldn't work on the Flow team, I'd be making Dune jokes all the time, and I'd start annoying even myself [16:17:05] and in case the headers were cached, I tried url garbage, no luck. 
[16:17:21] I think someone somewhere with A-L: fr did some action that stuck in Jenkins [16:17:27] Probably Antoine. [16:17:33] Maybe it sticks to the language used during restart [16:17:35] James_F: How do I sync a file that isn't there? [16:17:39] I'm trying that theory at the moment. [16:17:49] markSWATteur: I don't know. I don't have deploy rights. :-) [16:17:54] markSWATteur: sync-dir should do it [16:17:54] Eff. [16:18:05] Krinkle: sync-dir on all of mediawiki-staging? :((( [16:18:14] markSWATteur: parent dir of to be deleted file [16:18:25] Krinkle: mediawiki-staging *is* the parent dir. [16:18:29] Then maybe not [16:18:35] I'm scapping in a little bit [16:18:35] Exactly. [16:18:36] Coudl leave it around [16:18:37] that'll remove it [16:18:39] I figured [16:18:47] Reedy: Thanks! [16:19:03] markSWATteur: OK, interesting. [16:19:04] markSWATteur: alternatively, we usuallly do clean up like this afterwards by using dsh directly with a rm command I think [16:19:23] markSWATteur: It's mostly working. I've found a couple of edge cases. Will do a follow-up patch. Consider it {{done}}. [16:19:25] one dblist doesn't seem worth the effort [16:19:51] * James_F needs to read the internals of CommonSettings.php carefully. [16:19:55] Cool beanss [16:20:27] man, that's messed up. I really want the compiled binary blobs o' doom that HHVM will give us :) [16:20:33] SWAT is closed! [16:20:45] You may return to your meaningless existence. [16:24:48] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 110, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw2-codfw:xe-6/0/0 {#10901} [10Gbps DF]BR [16:25:31] marktraceur: yea it works well thanks [16:26:11] Sweet [16:27:32] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 112, down: 0, dormant: 0, excluded: 0, unused: 0 [16:29:51] (03CR) 10Andrew Bogott: [C: 031] "This looks good! 
Of course, ideally, we will never test it :)" [puppet] - 10https://gerrit.wikimedia.org/r/175964 (owner: 10Yuvipanda) [16:34:16] i'm on line, ocg is not. [16:34:22] !log Changed Jenkins default language from "en_US" to "en" ("Ignore browser settings" was already enabled). Not sure why, but it's back to English now. [16:34:23] good morning, opsen [16:34:24] Logged the message, Master [16:41:48] <^d> Krinkle: When it was set to 'en' before it was giving everyone italian. [16:42:00] <^d> 'en_US' made it go away [16:42:42] (03PS1) 10Cscott: Temporarily remove the 'download as PDF' link from the sidebar. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176005 [16:43:19] <_joe_> d'oh [16:43:30] <_joe_> cscott: I could've thought of that :( [16:44:12] (03PS3) 10Giuseppe Lavagetto: Move restbase config to regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/175939 (owner: 10GWicke) [16:47:37] (03CR) 10GWicke: [C: 032] Temporarily remove the 'download as PDF' link from the sidebar. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176005 (owner: 10Cscott) [16:47:48] (03Merged) 10jenkins-bot: Temporarily remove the 'download as PDF' link from the sidebar. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176005 (owner: 10Cscott) [16:48:08] cscott: are you going to push this out? [16:48:50] gwicke: my understanding is that ocg is shut down at the moment, so there's no critical reason to deploy until we think we've fixed all the bugs. [16:48:56] _joe_: tell me if i'm wrong [16:49:06] I meant the config change [16:49:14] to disable the PDF link [16:49:31] gwicke: oh, i meant for that to be swatted. [16:49:43] ah, okay [16:49:50] gwicke: could you push that out, while i keep digging through the ocg sources? [16:50:16] it only removes the download links from production wikis, not labs, but i assume that's good enough. 
[16:50:45] (03CR) 10GWicke: [C: 031] Move restbase config to regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/175939 (owner: 10GWicke) [16:51:09] _joe_: would be great to merge ^^ [16:53:05] cscott: I can try [16:53:06] I don't particularly like this [16:53:20] we're copying (and have to maintain) the host->role mapping in two places [16:54:57] !log gwicke Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 05s) [16:55:01] Logged the message, Master [16:55:15] <_joe_> paravoid: me neither, but I still have to find a good solution for that [16:55:25] paravoid: I agree, but lets figure that out later [16:56:01] do we have any reason to use hiera for this right now? [16:56:20] no functional one, no [16:56:47] so let's not until hiera is better than what we have I'd say [16:56:54] in any case, lets please get something merged for now [16:57:54] I mostly care about having something that works so that I can test stuff [16:58:07] yeah that's fair :) [16:58:17] <_joe_> +1 [16:58:17] it's easy to reorganize the hiera config at any point [16:58:19] _joe_: what do you want to do? [16:59:26] <_joe_> paravoid: having a simple way to assign a "main role" to a node group, without having to set a string global variable in the node defs [16:59:41] I don't understand? [17:00:00] <_joe_> can we save this for another moment? 
[17:00:27] <_joe_> I'd prefer to discuss this with a few notes written down + I'm pretty tired [17:00:56] <_joe_> I was about to fill in the incident report for the oauth failure and call it a day [17:01:50] (03PS4) 10Giuseppe Lavagetto: Move restbase config to regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/175939 (owner: 10GWicke) [17:02:33] <_joe_> gwicke: merging now, if you want, but I won't be around for fixups [17:03:02] (03CR) 10Giuseppe Lavagetto: [C: 032] Move restbase config to regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/175939 (owner: 10GWicke) [17:03:05] _joe_: that's fine, if I'm lucky it'll at least let me start testing [17:03:54] before this patch the data wasn't in the hiera search path afaik, so I'm optimistic that this will help [17:04:20] <_joe_> I'm running puppet on one of the nodes now [17:05:09] looks good! [17:05:13] <_joe_> gwicke: seems fine [17:05:18] hosts: [xenon.eqiad.wmnet,cerium.eqiad.wmnet,praseodymium.eqiad.wmnet] [17:05:27] in /etc/restbase/config.yaml [17:05:34] thank you! [17:05:38] <_joe_> yw [17:08:44] PROBLEM - Host ms-be2014 is DOWN: CRITICAL - Plugin timed out after 15 seconds [17:09:12] <_joe_> greg-g: when were the oauth problems first reported? [17:10:04] RECOVERY - Host ms-be2014 is UP: PING OK - Packet loss = 0%, RTA = 43.04 ms [17:19:17] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: Puppet has 2 failures [17:20:08] _joe_: "23:18 < ragesoss> Who broke OAuth logins?" which is eastern us timezone [17:20:40] 8pm SF time, which is what 2am UTC? [17:21:17] <_joe_> no more like 5 am [17:21:22] <_joe_> sorry, 4 am [17:22:42] oh right, yeah [17:22:49] I'm in a 1:1 riht now :) [17:30:04] spagewmf, ebernhardson: Dear anthropoid, the time has come. Please deploy Flow retry (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141126T1730). 
[17:30:32] <_joe_> greg-g: https://wikitech.wikimedia.org/wiki/Incident_documentation/20141126-oauth (when you're done with the 1:1) [17:30:49] <_joe_> I am done with the day :) [17:32:54] _joe_: cool, thanks [17:43:38] (03Abandoned) 10Reedy: Make apple-touch-icon.png configurable via touch.php [puppet] - 10https://gerrit.wikimedia.org/r/147488 (owner: 10Reedy) [17:44:09] (03Abandoned) 10Reedy: Add robots.txt rewrite rule where wiki is public [puppet] - 10https://gerrit.wikimedia.org/r/147487 (owner: 10Reedy) [17:46:28] springle: still up? [18:00:55] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [18:09:18] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:10:00] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: Connection refused [18:16:36] (03CR) 10EBernhardson: [C: 032] Flow whitelist for pages converted from LQT on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175861 (owner: 10EBernhardson) [18:16:48] (03Merged) 10jenkins-bot: Flow whitelist for pages converted from LQT on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175861 (owner: 10EBernhardson) [18:18:00] !log ebernhardson Synchronized wmf-config/InitialiseSettings.php: Whitelist converted lqt pages on officewiki (duration: 00m 07s) [18:18:03] Logged the message, Master [18:29:55] (03CR) 10Dzahn: "thanks Faidon" [puppet] - 10https://gerrit.wikimedia.org/r/173476 (owner: 10Dzahn) [18:37:11] (03CR) 10Dzahn: CI: install private ssh key for Travis integration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/174161 (owner: 10Dzahn) [18:44:50] (03PS2) 10Dzahn: CI: install private ssh key for Travis integration [puppet] - 10https://gerrit.wikimedia.org/r/174161 [18:46:31] (03PS3) 10Dzahn: CI: install private ssh key for Travis integration [puppet] - 10https://gerrit.wikimedia.org/r/174161 [18:47:41] (03CR) 10Dzahn: [C: 032] "target systems are 
gallium and lanthanum" [puppet] - 10https://gerrit.wikimedia.org/r/174161 (owner: 10Dzahn) [18:50:24] typo.. sigh [18:52:08] (03PS1) 10Dzahn: ci: travis user, typo 'nmptravis' vs. 'npmtravis' [puppet] - 10https://gerrit.wikimedia.org/r/176024 [18:53:04] (03CR) 10Dzahn: [C: 032] ci: travis user, typo 'nmptravis' vs. 'npmtravis' [puppet] - 10https://gerrit.wikimedia.org/r/176024 (owner: 10Dzahn) [18:56:46] PROBLEM - puppet last run on lanthanum is CRITICAL: CRITICAL: Puppet has 1 failures [18:57:07] Does donatewiki have access to the LocalisationUpdate cache? [18:57:09] system users don't get .ssh dirs .. [18:57:28] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures [18:57:51] bd808: hi... can you help out with jenkins stuff? [18:58:08] hoo: Possibly. What's up? [18:58:35] Our jenkins instances are often overloaded beyond what's reasonable.... thus stuff times out and things fall apart [18:58:49] are you able to reduce the number of workers we have per instance? [18:58:52] awight: Is donate on the cluster like other stuff? [18:58:56] (number of jobs that can run in parallel) [18:59:01] hoo: Should be possible. Let me look [18:59:02] http://donate.wikimedia.org/wiki/Special:Version [18:59:07] 1.25wmf9 [18:59:11] * YuviPanda is back [18:59:13] I'm presuming so [18:59:37] RECOVERY - puppet last run on lanthanum is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:53] Reedy: yes, AFAIK. But I'm seeing outdated translations... [18:59:59] awight: What do you want to know/what are you trying to do? [19:00:02] Ah [19:00:04] to what? [19:00:04] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141126T1900). [19:00:27] (03PS7) 10Yuvipanda: shinken: Add checks for labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/175964 [19:00:38] hoo: wikidata-jenkins[1-3] have 5 workers each configured right now. 
WHat do you think is sane for the size of the instances? [19:00:53] 5 :D [19:01:04] But since late last week that no longer works [19:01:06] so... 3? [19:01:16] *nod* I'll change them [19:01:27] Reedy: so, for example https://donate.wikimedia.org/w/index.php?title=MediaWiki:Donate_interface-informationsharing/sv is a stale translation. I would have assumed that LocalisationUpdate would be providing new translations daily... [19:01:33] Perhaps it's not working for donatewiki [19:01:46] Hmm [19:01:49] Where is that message from? [19:02:00] from the DonationInterface extension [19:02:15] hrm. mediawiki.org doesn't have the new translation, either. [19:02:21] bd808: Thanks :) [19:02:26] Maybe... things don't work like I assumed. [19:02:35] Or it's been broken since I changed stuff [19:02:37] (03CR) 10Yuvipanda: [C: 032] shinken: Add checks for labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/175964 (owner: 10Yuvipanda) [19:02:37] Hopefully we can soonish abandon these... [19:02:44] And no one has noticed [19:02:49] Reedy: does LU only work for mediawiki-core, perhaps? [19:02:53] Nope [19:03:01] It doesn't work for skins, but it does for extensions [19:03:07] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:03:12] hoo: {{done}} [19:04:11] Reedy: do u have time to look into this? [19:04:18] (03PS2) 10Dr0ptp4kt: Vary mdot webroot on Accept-Language, X-Subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175797 [19:04:35] awight: Not now, as I'm just prepping for the big deploy... [19:04:45] Reedy: ok, no worries [19:05:00] Can you at least file a phab task and I'll poke it later or something? [19:05:08] Reedy: will do! [19:05:11] thanks [19:05:29] Ideally, we could do with checking if any messages are showing as updated [19:06:46] Nemo_bis: About? 
[19:07:40] (03PS1) 10Dzahn: ci/Travis: ensure .ssh dir is present [puppet] - 10https://gerrit.wikimedia.org/r/176026 [19:09:05] (03CR) 10Dzahn: [C: 032] "4.0K -r-------- 1 npmtravis jenkins 3.2K Nov 26 19:05 npmtravis_id_rsa" [puppet] - 10https://gerrit.wikimedia.org/r/176026 (owner: 10Dzahn) [19:09:08] reedy@tin:~$ ls -al /var/log/l10nupdatelog/l10nupdate.log [19:09:08] -rw-rw-r-- 1 l10nupdate l10nupdate 0 Nov 26 06:25 /var/log/l10nupdatelog/l10nupdate.log [19:09:11] Isn't that most useful? [19:09:30] oh, rotated out after being made? [19:10:36] Reedy: There was a bug with the scap integration of l10nupdate but I think I fixed it last week. [19:10:48] It at least worked once while I wathced :) [19:11:18] bd808: I think it's still broken [19:11:19] sync-common: 99% (ok: 0; fail: 265; left: 1) ^M^[[33m02:18:25 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--i$ [19:11:24] cscott: the npmtravis user and key now exists on CI slaves [19:12:21] Why are we getting netcat output too? [19:12:21] This is nc from the netcat-openbsd package. An alternative nc is available [19:12:21] in the netcat-traditional package. [19:12:31] tin:/tmp/l10nupdate.log-20141126 [19:12:32] Reedy: I filed a bug for that... [19:13:10] Reedy: Hi [19:13:12] Reedy: https://phabricator.wikimedia.org/T1387 [19:13:19] mutante: whoo. sorry, distracted with ocg stuff today. [19:13:29] aha [19:13:36] Reedy: sorry to distract you, we can definitely work around for a few days. I've filed as https://phabricator.wikimedia.org/T76061 -- the problem might also be with DonationInterface's i18n directory structure... 
[19:13:41] that's not so important then [19:13:58] awight: It looks like it's not syncing to servers properly [19:14:07] OK that's good to know [19:14:15] cscott: yep, just sayin' the requirement is done for later [19:14:18] It's not just me going crazy :) [19:14:30] Reedy: I bet sync-common is not happy with the new scap change to prefer the common ssh-agent somehow [19:14:57] mutante: how about that mw-ext-sync user for hashar's old sync stuff? [19:14:58] awight: if you need the updated messages, we could switch donatewiki to 1.25wmf10 today [19:15:00] Because l10nupdate user won't have access to the shared agent socket. I think I maybe filed a bug about that too. [19:15:24] cscott: ah, right, yea, not done yet [19:15:54] i will update ticket [19:15:58] mutante: thanks. [19:16:42] Reedy: nope, we can wait for a normal fix to LU. Thanks! [19:21:49] Reedy: This was the fix I made last week -- https://gerrit.wikimedia.org/r/#/c/174784/ [19:28:44] Anyone know why Jenkins is being slow? [19:28:57] I have a failing build that shows up in Gerrit but 404's in the web interface. [19:30:57] chrismcmalunch: Maybe? 
[19:34:43] (03PS1) 10Reedy: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176031 [19:34:45] (03PS1) 10Reedy: testwiki to 1.25wmf10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176032 [19:34:47] (03PS1) 10Reedy: wikipedias to 1.25wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176033 [19:34:49] (03PS1) 10Reedy: group0 to 1.25wmf10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176034 [19:35:11] (03CR) 10Reedy: [C: 032] Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176031 (owner: 10Reedy) [19:35:22] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176032 (owner: 10Reedy) [19:35:32] (03Merged) 10jenkins-bot: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176031 (owner: 10Reedy) [19:35:34] (03Merged) 10jenkins-bot: testwiki to 1.25wmf10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176032 (owner: 10Reedy) [19:35:46] Never mind, I just ran the tests locally. [19:35:47] !log reedy Started scap: testwiki to 1.25wmf10 and build l10n cache [19:35:49] Logged the message, Master [19:41:17] milimetric: heya, what was done to close these to bugs? https://phabricator.wikimedia.org/T75206 and https://phabricator.wikimedia.org/T67683 (empty resolution comment) [19:42:11] sorry in meeting [19:42:18] oh - nvm, that's important [19:42:31] um, I resolved them at Chris McMahon's request as part of the scrum of scrums [19:42:34] greg-g: ^ [19:42:54] chrismcmalunch: can you comment on those bugs with resolution reason? ^^ [19:54:01] Reedy: if you haven't deployed yet, can we get https://gerrit.wikimedia.org/r/#/c/176036/ backported? 
[19:56:38] legoktm: It's currently scapping [19:56:45] We can deploy it in a bit [19:57:00] ok, thanks [20:15:12] csteipp: andrewbogott has already +1ed https://gerrit.wikimedia.org/r/#/c/169830/ [20:15:27] csteipp: you asked for someone from ops to merge it, but it's a mediawiki changeset [20:16:04] and I think you're probably more qualified for that :) [20:16:08] in any case, andrewbogott ^^ :) [20:16:45] paravoid: Cool, we can do that. [20:17:24] needs manual rebase [20:18:39] I'm not Labs, but I think labs people would agree with me :) [20:19:11] bblack: just a heads up that yurikR will be doing the mdot webroot related stuff this afternoon. [20:19:18] +1 on getting more MW-people eyes on OSM [20:24:51] !log reedy Finished scap: testwiki to 1.25wmf10 and build l10n cache (duration: 49m 03s) [20:24:53] Logged the message, Master [20:25:14] Reedy: :( scap takes too long to run now [20:25:26] Still better than last week [20:25:40] 19:47:47 Finished mw-update-l10n (duration: 09m 10s) [20:25:44] 20:17:02 Finished sync-apaches (duration: 24m 58s) [20:25:49] 20:24:44 Finished scap-rebuild-cdbs (duration: 07m 41s) [20:26:18] greg-g: about those phab issues, I think that they should just not be marked with the "scrum of scrums" tag. they [20:26:23] (03CR) 10Reedy: [C: 032] wikipedias to 1.25wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176033 (owner: 10Reedy) [20:26:25] mw1070 took a pounding again -- https://ganglia.wikimedia.org/latest/?c=Application%20servers%20eqiad&h=mw1070.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [20:26:29] (03PS1) 10GWicke: Fix apparmor config syntax [puppet] - 10https://gerrit.wikimedia.org/r/176047 [20:26:36] (03Merged) 10jenkins-bot: wikipedias to 1.25wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176033 (owner: 10Reedy) [20:26:41] greg-g: I don't think they're hanging on any unknown issues at this point. 
[20:26:55] bd808: I guess just upping the number will hopefully have some benefit [20:27:02] chrismcmahon: I tagged the one that needs an opsen to update ldap with SOS so somebody would actually do it [20:27:47] chrismcmahon: Is SOS not for surfacing blocker bugs that need cross team attention? [20:28:11] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf9 [20:28:13] Logged the message, Master [20:28:20] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176034 (owner: 10Reedy) [20:28:29] (03Merged) 10jenkins-bot: group0 to 1.25wmf10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176034 (owner: 10Reedy) [20:28:36] bd808: which one are we talking about? [20:28:54] csteipp: https://gerrit.wikimedia.org/r/#/c/176047/ [20:29:08] chrismcmahon: https://phabricator.wikimedia.org/T75206 [20:29:13] (03CR) 10Cscott: [C: 031] Fix apparmor config syntax [puppet] - 10https://gerrit.wikimedia.org/r/176047 (owner: 10GWicke) [20:29:24] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf10 [20:29:28] Logged the message, Master [20:30:14] !log reedy Purged l10n cache for 1.25wmf6 [20:30:16] Logged the message, Master [20:31:19] greg-g: What versions can we delete? :P [20:31:35] still got php-1.25wmf2 [20:31:37] bd808: re-opened. that one is not easy to parse [20:32:01] I can make a dependent task just for the ldap change [20:32:16] (03CR) 10CSteipp: [C: 031] "Syntax looks good now. I think the profile can be simplified by adding one or two base profiles, but we can do that another time." [puppet] - 10https://gerrit.wikimedia.org/r/176047 (owner: 10GWicke) [20:32:40] (03CR) 10Dzahn: [C: 032] Fix apparmor config syntax [puppet] - 10https://gerrit.wikimedia.org/r/176047 (owner: 10GWicke) [20:33:31] bd808: I think if you just spelled out exactly what is required, it would be helpful. 
[20:34:36] chrismcmahon: then don't close them if they're not resolved and instead remove the SoS project? [20:35:37] greg-g: yeah, I mentioned that I thought all the parties involved were working on the issue, so it didn't belong in SoS board. I didn't realize they got closed-closed. [20:36:17] greg-g: this was the first time everyone saw the SoS phab board, it'll take a couple tries I guess [20:41:00] chrismcmahon: *nod* the change from copy-pasted tracking bugs to actual tasks is probably a big one for SOS triage. [20:41:54] bd808: yeah, also that particular one was unclear to me what is required [20:44:14] the last four comments are it, really [20:46:39] (03Abandoned) 10Dzahn: kill facilities.pp, move to nagios_common [puppet] - 10https://gerrit.wikimedia.org/r/173999 (owner: 10Dzahn) [20:48:07] greg-g, chrismcmahon: All fixed up with https://phabricator.wikimedia.org/T76086 [20:49:20] thanks bd808 [20:50:38] !log reedy Synchronized php-1.25wmf10: (no message) (duration: 00m 47s) [20:50:46] Logged the message, Master [20:50:48] (03PS2) 10Reedy: Remove enwiki's OTRS-member group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175872 (owner: 10Legoktm) [20:50:52] (03CR) 10Reedy: [C: 032] Remove enwiki's OTRS-member group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175872 (owner: 10Legoktm) [20:51:04] (03Merged) 10jenkins-bot: Remove enwiki's OTRS-member group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175872 (owner: 10Legoktm) [20:51:05] !log updated OCG to version 7d8f2b8bd496464041e3ef9c092732457cc8f7ef (did not restart ocg) [20:51:07] Logged the message, Master [20:51:09] dr0ptp4kt: you're deploying it today? [20:52:00] is logstash alive? [20:52:04] greg-g? [20:52:30] yurikR: looking [20:52:43] bd808, thx, can't open fatalmon [20:53:20] (03CR) 10BBlack: [C: 031] Vary mdot webroot on Accept-Language, X-Subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175797 (owner: 10Dr0ptp4kt) [20:54:48] yurikR: yikes! 
looks like today's index is not assigned to any shard when I look from logstash1001. That probably means a split brain [20:55:23] bd808, how could ppl deploy?!? [20:55:23] Reedy: You still doing the train? [20:55:34] James_F: It's done.. I think [20:55:48] Reedy: OK. Will schedule a SWAT window. [20:56:07] yurikR: what? [20:57:10] yurikR: in the future, please give me more context and preferably not ping just me for things like that. I'm not a SPOF and should never be one. I'm in 1:1s all afternoon. [20:57:49] !log All three elasticsearch nodes in the logstash cluster think logstash1003 is master but logstash-2014.11.26 is not allocated on any node [20:57:52] Logged the message, Master [21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141126T2100). Please do the needful. [21:00:08] bblack: yeah. cc yurikR [21:00:25] !log restarted elasticsearch on logstash1003 for OOM [21:00:31] Logged the message, Master [21:01:47] !log restarted elasticsearch on logstash1002 for OOM [21:01:49] Logged the message, Master [21:03:33] dr0ptp4kt: I don't think deploying things today is wise in the overall, given it's a sort of virtual friday in the US. Not that I feel strongly enough to stop you, but keep in mind my general level of annoyance if I have to log in for something related to this tomorrow will be high :P [21:03:35] jgage: That [21:03:52] yurikR: ^^ [21:04:08] jgage: That's two days in a row that we've had OOMs on logstash's elasticsearch nodes [21:04:25] bd808: Looks like we've got an ES point upgrade to install [21:04:33] Inst elasticsearch [1.3.4] (1.3.5 Wikimedia:12.04/precise-wikimedia [all]) [21:05:13] manybubbles: ^demon|lunch anything nice/good/interesting/other in ES 1.3.5? 
[21:05:42] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 14 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 109, uinitializing_shards: 3, unumber_of_data_nodes: 3} [21:05:54] bd808 :( this time i wasn't using it [21:05:57] Reedy: I'm kind of waiting on elastic 1.4 to get super stable [21:06:02] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 14 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 109, uinitializing_shards: 3, unumber_of_data_nodes: 3} [21:06:03] relatedly, i have wondered about upgrading the logstash nodes to trusty [21:06:06] bblack, dr0ptp4kt, agree [21:06:11] won't deploy anything today [21:06:59] jgage: Making them match the cirrus hosts would probably be good just for sanity sake [21:07:05] Reedy: https://github.com/elasticsearch/elasticsearch/pull/8062 would be nice but not a huge deal. [21:07:27] yurikR: maybe what would make sense is to add the stuff exclusive of the redirect (mediawiki-config and zerobanner). what do you think? 
cc bblack [21:07:40] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 14 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 109, uinitializing_shards: 3, unumber_of_data_nodes: 3} [21:07:50] manybubbles: I guess upgrading ES for logstash is a bit easier/quicker than production search [21:08:04] Reedy: prod takes two days for rolling restart..... [21:08:13] exactly [21:08:20] Reedy: You did it last. :) Didn't it take about 6 hours or so? [21:08:21] I've offered to work on fixing that but its thoroughly cookie-licked [21:08:25] yurikR: then swot the redirect things in mediawiki-config and zerobanner next week, or do it on wednesday as usual [21:08:29] cc bblack ^ [21:08:49] oh no i fear netsplit bblack and yurikR . i'm gonna disconnect and reconnect irc [21:09:13] bd808 the logstash nodes also have only 16gb ram, which brings us back to the question of whether to seek hardware spec'd for this purpose rather than reused misc nodes [21:09:20] gwicke, are you depl now? [21:09:33] dr0ptp4kt: it's really up to you guys re schedules and priorities, I'm just stating a related opinion as input :) [21:09:40] yurikR: bblack i'm back [21:10:09] dr0ptp4kt, which patches are MUST-HAVE? [21:10:17] yurikR, no gwicke is not [21:10:22] yurikR: i'm working on deploying parsoid [21:10:25] yurikR, with links [21:10:31] cscott, gotcha [21:10:31] yurikR: is that what you're asking about? [21:10:39] yep, thx [21:12:58] <^demon|lunch> manybubbles: Didn't 1.4 go final already? [21:13:38] ^demon|lunch: it did but I keep hearing things about it on the mailing list. People having all sorts of trouble. I haven't picked it up yet. I'm not sure if we'd have the same trouble though. 
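The "inactive shards 14" figure in the 21:05-21:07 alerts appears to be the sum of the unassigned and initializing shard counts in the quoted health data. The sketch below reproduces that arithmetic with the values from the alert. The function name and the way the percentage is computed are illustrative assumptions; the actual check script and its threshold logic are not shown in this log.

```python
def inactive_shard_summary(health):
    """Return (inactive_count, inactive_percent) from a cluster health dict.

    Assumption: 'inactive' means unassigned + initializing shards, and the
    percentage is taken over active + inactive shards.
    """
    inactive = health["unassigned_shards"] + health["initializing_shards"]
    total = health["active_shards"] + inactive
    percent = 100.0 * inactive / total if total else 0.0
    return inactive, percent


# Values copied from the alert text; the dict layout itself is a cleaned-up guess.
alert_health = {
    "status": "yellow",
    "number_of_nodes": 3,
    "unassigned_shards": 11,
    "initializing_shards": 3,
    "relocating_shards": 0,
    "active_shards": 109,
    "active_primary_shards": 41,
}

inactive, percent = inactive_shard_summary(alert_health)
print(inactive, round(percent, 1))
```

With the alert's numbers this gives 14 inactive shards, matching the count reported by the check.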
[21:14:15] <^demon|lunch> Yeah I hadn't had the time to play with it much outside of testing swift. [21:15:49] !log Zuul stuck, restarting Gearman client [21:15:52] Logged the message, Master [21:16:01] yurikR: i think the followin are what you'd want: https://gerrit.wikimedia.org/r/#/c/176023/ https://gerrit.wikimedia.org/r/#/c/173744/ https://gerrit.wikimedia.org/r/#/c/175881/ https://gerrit.wikimedia.org/r/#/c/176061/ https://gerrit.wikimedia.org/r/#/c/170483/. cc bblack [21:16:38] dr0ptp4kt, ouch! i was hoping for a shorter list... like 1! [21:17:09] yurikR: stated differently, though, i think (temporarily) reverting https://gerrit.wikimedia.org/r/#/c/169210/ would be easier. then not merge-deploying https://gerrit.wikimedia.org/r/#/c/175797/ until we're ready to put 169210 in next week [21:18:07] yurikR: i know there's an operator looking for a fix on the x close button thing. and of course there are all those translation updates (although maybe those get pulled in automatically into prod?) [21:18:53] (03CR) 10Yuvipanda: "Why abandoned?" [puppet] - 10https://gerrit.wikimedia.org/r/173999 (owner: 10Dzahn) [21:20:14] dr0ptp4kt, there is a probelm - mobile was just updated yesterday, and jhobs changed his code to accomodate that. I don't think i will be able to push out the zerobanner because it might not work without the changes to the MFE [21:20:56] yurikR: i see. you're saying the mfe stuff needs to be deployed to production in order for jeff's stuff to work, right? [21:21:03] jgage: I think we have proved the POC that folks want logstash. I'd be all for someone thinking hard about how to build out a more robust implementation. [21:21:10] dr0ptp4kt, i'm suspecting so [21:21:40] bd808, agreed. i'll bring it up at the next ops meeting. [21:21:46] sweet [21:21:57] MaxSem: ^^^^^^ able to shed any light on the current MobileFrontend stuff? [21:22:27] i'd like that [21:22:32] but "next quarter" [21:22:47] ok [21:23:05] do bring it up though :) [21:23:21] will do. 
maybe we can try upgrading one of them to trusty before then to see if it helps the OOM problem. [21:23:26] dr0ptp4kt, I don't follow frontend stuff that closely [21:24:04] MaxSem, okay [21:27:02] yurikR: MaxSem i'll see if i can get ahold of jeff about dependencies on mfe. yurikR, deploying zb at c585de146cdb7646136fd2dc1bbb397ac8632ed2 and zp at the tip of master would probably do the trick. c585 by definition wouldn't include the fix in 3be10086db5530bc6634bea2b377fe535dd87da5 for the undefined offset, though [21:27:29] mark: Next quarter will be the 1 year anniversary of the experiment so a nice time to say "yup let's really do this" :) [21:27:49] bd808, i think the log events are not coming in ( [21:30:55] (03PS1) 10Cscott: Update apparmor profile for OCG. [puppet] - 10https://gerrit.wikimedia.org/r/176092 [21:31:36] yurikR: I think you're right :( [21:33:34] !log restarted logstash on logstash1001; log2udp events not being received [21:33:36] Logged the message, Master [21:33:52] yurikR: looking better now. [21:34:07] * bd808 needs to finish up testing the redis input method in beta [21:34:34] the logstash-gelf jar i'm using for hadoop can talk to redis, i've love to convert to that rather than hardcoding logstash1002 [21:35:18] jgage: I'm testing monolog->redis->logstash in beta now. [21:35:27] cool [21:36:12] With one host it seems to work well. We could try setting it up for hadoop in prod to see how it scales [21:37:09] Is there a good way to front redis for HA pushes into a list? Can lvs make that work? [21:37:39] From the logstash side I was just planning on adding connections to each of the three redis nodes [21:37:40] (03PS1) 10Cscott: Correctly remove the 'Download as PDF' link from sidebar. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/176140 [21:40:49] cscott, just disable Collection, lol [21:41:12] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 9, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 3, number_of_data_nodes: 3 [21:41:15] MaxSem: no reason why people can't build and edit their books, just because we can't render them right now. [21:42:43] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 9, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 3, number_of_data_nodes: 3 [21:43:22] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 9, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 3, number_of_data_nodes: 3 [21:43:41] is there someone who could swat https://gerrit.wikimedia.org/r/176140 for me? [21:46:59] opsen: jenkins on trusty hosts seems to be down. [21:47:01] hashar: ^^ [21:47:18] cscott: busy in RFC meeting then bed time sorry :( [21:47:32] you can ask in #mediawiki-core maybe [21:48:35] cscott: also, log a bug [21:49:00] (03PS2) 10GWicke: Update apparmor profile for OCG. [puppet] - 10https://gerrit.wikimedia.org/r/176092 (owner: 10Cscott) [21:52:58] !log Restarting Gearman client. I am in a meeting, will cleanup later. [21:53:01] Logged the message, Master [21:54:53] hashar: Hey it looks like the job updating VisualEditor in beta labs is broken? 
http://en.wikipedia.beta.wmflabs.org/wiki/Special:Version says VE hasn't been updated there in 20 hours, and there are a number of things that are fixed in master but not in beta labs [21:55:06] cscott, what's the update? [21:55:22] yurikR: waiting for jenkins to merge the deploy commit [21:55:37] jenkins seems to have been hung, but the trusty ci slaves fixed themselves? [21:55:43] RoanKattouw: fill a task against Beta-Cluster please :] [21:55:44] that's what it looks like at least. seems to be running. [21:55:52] RoanKattouw: I am in a meeting and crashing to bed after [21:55:52] Will do [21:56:28] RoanKattouw: cc me on that bug, that job needs some updating in any case [21:57:03] greg-g: around? [21:57:03] dr0ptp4kt: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [21:57:16] (03CR) 10CSteipp: [C: 031] Update apparmor profile for OCG. [puppet] - 10https://gerrit.wikimedia.org/r/176092 (owner: 10Cscott) [21:57:32] greg-g: i was wondering if swat is available monday? [21:57:36] yurikR: ^ [21:58:14] greg-g: we're holding off on deployment giving the jenkins/logging stuff, but weren't sure if we can do monday, tuesday, wednesday of next week (latest being wednesday regular window) [21:59:12] dr0ptp4kt: yes, next week is a normal week [22:00:04] greg-g: should i just create a new table in https://wikitech.wikimedia.org/wiki/Deployments ? [22:00:04] yurik: Respected human, time to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141126T2200). Please do the needful. [22:00:36] greg-g ^^ sorry, shoulda used your handle [22:00:39] dr0ptp4kt: just add it to Upcoming [22:00:44] greg-g, cool [22:00:46] greg-g thx [22:00:48] np [22:00:54] RoanKattouw: The code update and scap jobs seem to be running and passing in beta. 
I'll look at the repo on deployment-bastion to see if something looks weird [22:01:32] (03CR) 10Gage: [C: 032] Update apparmor profile for OCG. [puppet] - 10https://gerrit.wikimedia.org/r/176092 (owner: 10Cscott) [22:02:41] !log updated Parsoid to version 67e2596c [22:02:45] Logged the message, Master [22:03:30] RoanKattouw: The beta repo matches what I see in github -- https://github.com/wikimedia/mediawiki-extensions-VisualEditor/commits/master [22:03:52] greg-g, dr0ptp4kt per previous conv, seems prod is too unstable at the moment, calling off our deployment until monday. [22:04:10] bd808, thx for helping with loging ) [22:04:23] in what way is prod unstable? [22:04:43] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.006 second response time [22:04:43] the /topic says Status: Up [22:05:31] greg-g, logging wasn't working, jenkins acting up, but most importantly - there were some un-anticipated changes recently to MFE, and we don't want to accidently miss-match which version of the Banner will work with which version of MFE [22:05:37] too bad that paged (luckily I'mnot yet in the bed) [22:05:40] * greg-g nods [22:05:55] yurikR: logging as in logstash or udp? [22:06:03] greg-g, logstash [22:06:13] bd808 fixed it [22:06:25] greg-g: Just logstash, not udp2log [22:06:27] <_joe_> defining prod unstable based on that [22:06:27] logstash isn't the log of record yet, just fyi :) [22:06:35] _joe_: yeah, that's my point [22:06:59] greg-g, i use logstash for fatalmonitor, is that the definitive one? 
[22:07:08] fatal.log [22:07:16] <_joe_> right when a bunch of grumpy opsens have been paged in the middle of the evening, that's adventurous :) [22:07:24] in any case, no depl, feels unsafe, :) [22:07:42] I'm fine with no deploy, just "prod is unstable" isn't fair nor correct [22:08:01] true, i'll be more careful with words ) [22:08:08] :) [22:08:43] the correct wording should have been: the prod is fine, but greg-g is too grumpy, and I'm scared of him, hence no depl. [22:08:50] !log investigating Zuul/Jenkins. Jenkins potentially has a deadlock [22:08:55] Logged the message, Master [22:08:57] yurikR: much better ;) [22:09:03] RoanKattouw: Is there some VE component besides the extension that is deployed in beta? [22:09:12] off i go, gnight ) [22:11:50] (03CR) 10Dzahn: "the recent SNI switch probably made much of this deprecated, right? also needs manual rebase that may be complicated" [puppet] - 10https://gerrit.wikimedia.org/r/171496 (owner: 10Dzahn) [22:13:55] (03PS1) 10GWicke: Allow read access to fonts in OCG apparmor profile [puppet] - 10https://gerrit.wikimedia.org/r/176153 [22:14:20] !log mediawiki/core postmerge changes are stuck because mediawiki-core-doxygen-publish refuses to start. Attempted to retrigger them by promoting a change: gallium$ zuul promote --pipeline postmerge --changes 175960,1 [22:14:23] Logged the message, Master [22:15:07] (03CR) 10Dzahn: "also: "Invalid resource type install_certificate"... 
i'll abandon for now i think, i may recreate in a second attempt later" [puppet] - 10https://gerrit.wikimedia.org/r/171496 (owner: 10Dzahn) [22:15:11] (03CR) 10BBlack: "I don't know, but just FYI I plan to do several cleanup commits early next week to rip out a bunch of SSL-related things we don't need any" [puppet] - 10https://gerrit.wikimedia.org/r/171496 (owner: 10Dzahn) [22:15:14] (03CR) 10Cscott: [C: 031] Allow read access to fonts in OCG apparmor profile [puppet] - 10https://gerrit.wikimedia.org/r/176153 (owner: 10GWicke) [22:17:15] (03CR) 10Gage: [C: 032] Allow read access to fonts in OCG apparmor profile [puppet] - 10https://gerrit.wikimedia.org/r/176153 (owner: 10GWicke) [22:17:17] !log Bah there can only be one mediawiki-core-doxygen-publish job running, with all the merges that happened on mediawiki/core due to the release, there are currently six of them in the queue. They will all be processed eventually [22:17:21] Logged the message, Master [22:17:43] (03CR) 10Gage: [V: 032] Allow read access to fonts in OCG apparmor profile [puppet] - 10https://gerrit.wikimedia.org/r/176153 (owner: 10GWicke) [22:17:53] (03PS1) 10Cscott: Replace the 'download as PDF' link in the sidebar. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176154 [22:19:27] bd808: Not that I know of. The extension contains a submodule, but we're also seeing that non-submodule-related things aren't making it [22:20:04] (03CR) 10Cscott: [C: 04-2] "I think https://gerrit.wikimedia.org/r/176154 is the better solution right now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176140 (owner: 10Cscott) [22:21:41] RoanKattouw: Ah. got it. 
The lib/ve submodule is stale then [22:21:50] Well maybe [22:22:03] But what I'm saying is, there are also non-submodule changes that haven't made it out to beta [22:22:16] The submodules is at f0a63dc [22:22:38] !log Jenkins executors are in deadlock ( https://phabricator.wikimedia.org/T72597 ) [22:22:39] That's what it's at in master [22:22:42] Logged the message, Master [22:23:06] master being a14f88d of the extension repo [22:23:27] (03CR) 10GWicke: [C: 032] Replace the 'download as PDF' link in the sidebar. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176154 (owner: 10Cscott) [22:23:35] But Special:Version says that (cee93a0) is deployed [22:23:36] (03Merged) 10jenkins-bot: Replace the 'download as PDF' link in the sidebar. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176154 (owner: 10Cscott) [22:23:43] hmm... [22:24:20] !log gwicke Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 05s) [22:24:23] Logged the message, Master [22:24:40] !log restarted ocg [22:24:43] Logged the message, Master [22:25:52] cscott: puppet actually restarts ocg on apparmor config changes [22:25:59] RoanKattouw: on disk on deploymnet-mediawiki01 the gitinfo cache files says "head": "a14f88d80eec331a94cfdfc50da800c8fbc68ef3" [22:26:25] gwicke: yeah, i was more logging the fact that ocg is now running again [22:26:41] gwicke: since i think the last server admin log entries said it was stopped [22:26:48] RoanKattouw: And the same on deployment-mediawiki02 [22:26:50] ah, k [22:27:17] bd808: Hah now I am seeing correct behavior [22:27:55] cscott: alright, things look good so far [22:28:09] gwicke: yup, that's what i'm seeing too [22:28:27] yay for firedrills [22:28:28] we should keep an eye on syslog on the ocg boxes to see if there are apparmor messages for things we missed [22:28:42] yeah, it's a shame we can't get those into logstash somehow [22:29:04] bd808: WTF I don't know how it updated now, it was definitely serving me old code when I filed 
that bug
[22:29:09] Oh well
[22:29:09] gwicke: are you looking at /var/log/kern.log or somewhere else?
[22:29:15] It would be nice, though, if Special:Version showed accurate hashes?
[22:29:20] cscott: We could if it was ok with ops. logstash can take in syslog input nicely
[22:29:29] cscott: syslog has the same messages
[22:29:52] & has friendlier permissions
[22:29:53] Nov 26 22:29:43 ocg1001 kernel: [11073045.522981] type=1400 audit(1417040983.824:157): apparmor="DENIED" operation="open" profile="/usr/bin/nodejs-ocg" name="/etc/papersize" pid=37253 comm="nodejs-ocg" requested_mask="r" denied_mask="r" fsuid=997 ouid=0
[22:30:01] yes, i can read /var/log/syslog!
[22:30:48] cscott: you might be able to craft some rsyslog filter that looks for things related to ocg and get them forwarded to fluorine. There are rules for that with the hhvm servers
[22:31:02] If you got the data to fluorine we can stick it in logstash
[22:32:03] (CR) Dzahn: [C: -2] (WIP) generate wikimedia wiki entries from helper [dns] - https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: Dzahn)
[22:32:34] bd808: maybe monday, but not tonight
[22:32:35] (PS1) Cscott: Allow OCG to read default papersize. [puppet] - https://gerrit.wikimedia.org/r/176155
[22:32:40] gwicke: ^
[22:32:45] cscott: :)
[22:33:05] cscott: there are a few more messages
[22:33:35] might be worth rolling them into one update
[22:34:17] https://gist.github.com/gwicke/f456ea525c42a3df0a97
[22:39:44] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail
[22:46:01] !log Jenkins still in deadlock, will hard restart Jenkins and Zuul soonish.
[22:46:04] Logged the message, Master
[22:48:40] gwicke: yup, i'll get to those in a bit
[22:49:36] cscott: updated the gist at https://gist.github.com/gwicke/f456ea525c42a3df0a97
[22:50:15] offline for about 30 minutes, bbl
[22:50:56] (CR) GWicke: "Could you add handling for the errors in https://gist.githubusercontent.com/gwicke/f456ea525c42a3df0a97/raw/5fd1ccf0842dc716bb452934d6841e" [puppet] - https://gerrit.wikimedia.org/r/176155 (owner: Cscott)
[22:55:45] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:56:13] !jenkins bye
[22:56:28] where is our salt bot?
[22:56:42] !log Killing Jenkins, it is deadlocked beyond repair
[22:56:48] Logged the message, Master
[23:03:05] poor Jenkins
[23:04:26] so here the majordomo is the victim and not the assassin
[23:08:59] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 {channel:frontend.error,request:{id:1417043337527-01089},error:{message:Status check failed (redis failure?)}} - 232 bytes in 0.087 second response time
[23:09:39] hm wassup ocg
[23:09:41] gwicke is afk
[23:10:24] redis password again?
[23:11:05] cscott: ocg.svc.eqiad.wmnet dead :d
[23:12:01] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.012 second response time
[23:14:30] (PS2) Cscott: Further improvements to OCG apparmor profile. [puppet] - https://gerrit.wikimedia.org/r/176155
[23:14:50] hrm.
[23:15:59] (PS1) Yuvipanda: tools: Remove experimental mongo role/class [puppet] - https://gerrit.wikimedia.org/r/176169
[23:17:59] (CR) Gage: [C: 1] tools: Remove experimental mongo role/class [puppet] - https://gerrit.wikimedia.org/r/176169 (owner: Yuvipanda)
[23:18:12] (PS2) Yuvipanda: tools: Remove experimental mongo role/class [puppet] - https://gerrit.wikimedia.org/r/176169
[23:18:17] :D
[23:18:21] thanks jgage
[23:19:37] (CR) Yuvipanda: [C: 2] tools: Remove experimental mongo role/class [puppet] - https://gerrit.wikimedia.org/r/176169 (owner: Yuvipanda)
[23:19:50] YuviPanda: so, why did the redis password disappear the first time, i never understood what happened there
[23:20:21] cscott: sorry, I didn't either. wasn't following the conversation too closely...
[23:30:24] config.redis.port = 6379;
[23:30:24] config.redis.password = "";
[23:30:42] so puppet isn't writing the Sekret Password into the ocg config file
[23:31:32] _joe_, mutante: any ideas?
[23:33:27] godog: ping re ocg redis password fail
[23:38:05] erg, i need to get an ops help here, i don't have root on ocg100x
[23:38:09] hmm
[23:38:18] it's 5AM, but looks like I'm the only one on IRC atm
[23:38:21] cscott: how can I help?
[23:38:22] !log Jenkins all happy after a restart. Crashing to bed
[23:38:27] Logged the message, Master
[23:38:29] puppet last ran on ocg1002 15 minutes ago, looks like it has the right password
[23:38:35] looking
[23:38:54] I'm not sure how exactly the password is set. Don't see anything in hiera.
[23:39:00] YuviPanda: sudo -u ocg more /etc/ocg/mw-ocg-service.js
[23:39:19] YuviPanda: It would probably come from ops/private.git
[23:39:27] bd808: yup, I'm looking at it right now
[23:39:44] puppet:modules/ocg/templates/mw-ocg-service.js.erb
[23:39:52] config.redis.password = "<%= @redis_password %>";
[23:40:02] and yes, it's then magic past that point
[23:40:02] yeah
[23:40:08] found it in hiera/private
[23:40:27] cscott: so which machine doesn't have the password?
[23:40:27] but the strange thing is that it's been picked up on ocg1002 but not ocg1001 or ocg1003
[23:40:31] ocg1001?
[23:40:32] looking
[23:40:51] YuviPanda: ocg1001 and ocg1003
[23:40:57] puppet's running on 1001
[23:40:58] atm
[23:41:13] YuviPanda: Do you know where your teenager is at 5 o'clock in the morning?
[23:41:34] hashar: heh, in the #wikimedia-operations channel? :)
[23:41:40] The after hours club? The clubs attract thousands of Chicago area young people. Some say they come looking for drugs, dirty dancing and pounding techno music.
[23:41:59] ocg1003 says that puppet was last run 9 minutes ago, but it doesn't have the redis password
[23:43:31] cscott: interesting, it got the password, and about two puppet runs ago decided to set it back to ''
[23:43:54] yeah, i keep expecting to see ocg1002 forget the password at some point too
[23:44:00] YuviPanda: sorry 5am always make me think of a popular music sample which I quoted above ;D
[23:44:03] some race condition or something?
[23:44:25] hashar: aaah :)
[23:44:36] cscott: yeah, I wouldn't be surprised if that happened with ocg1002
[23:44:47] cscott: I think this might be also a bug in our hiera backend? unsure.
[23:45:09] one option is to just put the password manually there, and disable puppet until _joe_ wakes up
[23:45:12] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[23:45:29] huh
[23:45:33] that's weird, nothing on palladium
[23:45:41] maybe things aren't being pushed through?
[23:45:45] that would also explain the ocg issue
[23:45:46] * YuviPanda looks
[23:46:42] nope, private is updated there
[23:46:43] YuviPanda: disabling puppet is fine with me, although i've got https://gerrit.wikimedia.org/r/176155 i've eventually to have pushed through puppet
[23:47:11] cscott: true, but I guess ocg is down atm?
[23:47:55] (CR) Cscott: "added rules to cover https://gist.github.com/gwicke/f456ea525c42a3df0a97" [puppet] - https://gerrit.wikimedia.org/r/176155 (owner: Cscott)
[23:48:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[23:48:25] YuviPanda: yes, ocg is down, so getting it back up would be good.
[23:48:44] !log manually ran puppet merge on strontium, puppet merge on palladium didn't sync
[23:48:47] Logged the message, Master
[23:49:00] cscott: ok, I'm going to hack up a puppet change for this now. can you file a phab task so I can reference it?
[23:51:17] cscott: just bypassing hiera for now.
[23:51:48] * cscott hasn't had to deal with phab yet
[23:51:51] ah, heh
[23:51:54] let me just do that then
[23:51:54] guess i've got to bite the bullet some time
[23:52:00] YuviPanda: you're my hero
[23:52:13] cscott: I collect payment in form of whiskey/beer
[23:52:31] if you can wait until january, i'm good for it
[23:52:52] https://phabricator.wikimedia.org/T76111
[23:53:58] cscott: ^ would be great if you can add more details tho :)
[23:53:58] (PS1) Yuvipanda: ocg: Temp hack to bypass hiera for redis passwords [puppet] - https://gerrit.wikimedia.org/r/176181
[23:56:07] (PS1) Catrope: Followup 313c29f: correct spelling of wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/176182
[23:56:09] (CR) Yuvipanda: [C: 2] ocg: Temp hack to bypass hiera for redis passwords [puppet] - https://gerrit.wikimedia.org/r/176181 (owner: Yuvipanda)
[23:57:40] ok, that's strange.