[00:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141126T0000).
[00:01:34] (PS1) BryanDavis: logstash: Rules for processing MW input via Redis [puppet] - https://gerrit.wikimedia.org/r/175896
[00:03:04] greg-g: howdy. LQT conversion on officewiki didn't take (API calls to private wiki), can we retry on Wednesday 26? It's only 8 pages, it's only on officewiki, What Could Go Wrong
[00:03:10] whee nothing to deploy
[00:03:48] ™
[00:04:18] !log restarted eventlogging mysql-m2-master consumer. It seems it could no longer write to the database.
[00:04:20] Logged the message, Master
[00:05:53] spagewmf: yessir
[00:06:42] greg-g: thanks giving. Is 9:30am-10:30am OK to get out of the way of the train in time
[00:07:13] spagewmf: perfect
[00:31:05] greg-g: Thoughts about me renaming the "ve-deploy-2014-11-26 (MW 1.25wmf10)" projects to "WMF-Deploy-…" so others feel free to use them?
[00:33:36] !log power down db2033 for reassignement to codfw frack
[00:33:38] Logged the message, Master
[00:35:28] (CR) Dzahn: "how to add this to deploy schedule" [mediawiki-config] - https://gerrit.wikimedia.org/r/173083 (owner: JanZerebecki)
[00:38:42] James_F: innnnnnnteresting
[00:39:01] greg-g: I've already started doing it for OOUI tasks.
[00:39:07] * greg-g nods
[00:39:13] purpose?
[00:39:18] greg-g: (Previously this wasn't possible because they were VE milestones, and OOUI was outside of that.)
[00:39:35] Mostly it's so I can point to a "this is what changes went out that week" log.
[00:39:45] I write the weekly changelog, after all.
[00:40:41] (PS2) BryanDavis: logstash: Rules for processing MW input via Redis [puppet] - https://gerrit.wikimedia.org/r/175896
[00:41:10] I don't see the harm, I'm just having a hard time coming up with when another team would use it
[00:42:15] greg-g: Sure.
[00:44:46] (CR) BryanDavis: "Cherry-picked to deployment-salt for testing. I expect there will be some adjustments needed here as I test out the firehose of Monolog ev" [puppet] - https://gerrit.wikimedia.org/r/175896 (owner: BryanDavis)
[00:47:29] (CR) Aaron Schulz: [C: 2] Remove obsolete profiling settings [mediawiki-config] - https://gerrit.wikimedia.org/r/164011 (owner: PleaseStand)
[00:47:38] (Merged) jenkins-bot: Remove obsolete profiling settings [mediawiki-config] - https://gerrit.wikimedia.org/r/164011 (owner: PleaseStand)
[00:48:07] !log aaron Synchronized wmf-config/StartProfiler.php: Remove obsolete profiling settings (duration: 00m 06s)
[00:48:10] Logged the message, Master
[00:54:26] (PS1) Dzahn: wikistats: add cron to enabled wikia updates [puppet] - https://gerrit.wikimedia.org/r/175904
[00:55:03] (PS2) Dzahn: wikistats: add cron to enable wikia updates [puppet] - https://gerrit.wikimedia.org/r/175904
[00:55:50] (CR) Dzahn: [C: 2] wikistats: add cron to enable wikia updates [puppet] - https://gerrit.wikimedia.org/r/175904 (owner: Dzahn)
[01:05:27] (CR) Springle: [C: 1] mha: replace pmtpa with codfw? [puppet] - https://gerrit.wikimedia.org/r/173464 (owner: Dzahn)
[01:09:23] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:16:28] (PS5) Dzahn: mha: replace pmtpa with codfw [puppet] - https://gerrit.wikimedia.org/r/173464
[01:23:18] !log restarted logstash on logstash1001; no events from log2udp relay being recorded
[01:23:21] Logged the message, Master
[01:23:56] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[01:46:49] PROBLEM - HHVM busy threads on mw1235 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [115.2]
[01:47:57] (CR) Dzahn: [C: 2] mha: replace pmtpa with codfw [puppet] - https://gerrit.wikimedia.org/r/173464 (owner: Dzahn)
[01:50:45] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:53:42] PROBLEM - HHVM queue size on mw1232 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [80.0]
[01:53:43] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:53:53] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [115.2]
[01:54:43] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:55:13] RECOVERY - HHVM busy threads on mw1235 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:56:33] RECOVERY - HHVM queue size on mw1232 is OK: OK: Less than 1.00% above the threshold [10.0]
[01:57:04] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:57:53] PROBLEM - HHVM busy threads on mw1231 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2]
[01:58:32] PROBLEM - HHVM busy threads on mw1226 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:58:42] PROBLEM - HHVM busy threads on mw1234 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:59:24] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[02:00:12] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:01:43] RECOVERY - HHVM busy threads on mw1226 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:01:43] RECOVERY - HHVM busy threads on mw1234 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:02:24] RECOVERY - HHVM busy threads on mw1227 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:02:24] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:02:32] RECOVERY - HHVM busy threads on mw1233 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:02:45] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:03:23] RECOVERY - HHVM busy threads on mw1222 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:03:38] RECOVERY - HHVM busy threads on mw1231 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:18:25] !log l10nupdate Synchronized php-1.25wmf8/cache/l10n: (no message) (duration: 00m 03s)
[02:18:29] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-26 02:18:29+00:00
[02:18:30] Logged the message, Master
[02:18:34] Logged the message, Master
[02:30:16] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s)
[02:30:19] Logged the message, Master
[02:30:20] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-26 02:30:20+00:00
[02:30:23] Logged the message, Master
[03:13:13] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: puppet fail
[03:31:59] (PS1) GWicke: Move restbase config to regex.yaml [puppet] - https://gerrit.wikimedia.org/r/175939
[03:32:38] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:34:06] * gwicke looks around for opsens with merge rights
[03:40:29] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[03:43:21] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 1.00% above the threshold [76.8]
[03:52:46] (PS2) GWicke: Move restbase config to regex.yaml [puppet] - https://gerrit.wikimedia.org/r/175939
[03:55:55] (CR) Tim Starling: "I would still want this change, or something like it. It's nice to be able to profile individual requests without modifying the output, es" [mediawiki-config] - https://gerrit.wikimedia.org/r/174372 (owner: Tim Starling)
[04:18:19] Who broke OAuth logins?
[04:24:14] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Nov 26 04:24:14 UTC 2014 (duration 24m 13s)
[04:24:20] Logged the message, Master
[04:28:03] James_F: do you know who I should ping about OAuth breakage?
[04:28:25] Login seems to be completely broken for all the apps I tried.
[04:28:58] I don't see a "cmjohnson" in the channel (per chan topic)
[04:31:55] hi
[04:32:13] rageoss, can you give us some specific examples to aid troubleshooting?
[04:32:22] sorry, ragesoss
[04:32:55] i haven't worked on that subsystem before but i'll take a look
[04:32:58] jgage: some examples listed here: https://phabricator.wikimedia.org/T75968
[04:33:02] thanks
[04:33:40] jgage: the wizard.wikiedu.org one has error stuff enabled, so there's lots of environment stuff you can see if you try that one.
[04:33:50] great
[04:34:19] jgage: there weren't any commits to the OAuth extension lately, so I guess the breakage is probably somewhere else.
[04:34:28] hmm
[04:37:22] the hello world works for me. can anyon confirm whether oauth is working or fails for them?
[04:37:41] i used a regular nonprivileged account
[04:38:03] jgage: did you try the "post to talk page" or "verify your identity"?
[04:38:12] For me, that revealed that I was not in fact logged in.
[04:38:13] i tried verify your identity
[04:39:10] with the wikiedu wizard i was prompted to auth and now i'm at the assignment design wizard form
[04:40:04] jgage: that's odd.
[04:40:11] ragesoss have you tried a second browser?
[04:40:22] with both privileged and unprivileged accounts, it's broken for me, on Chrome and Firefox.
[04:40:26] hm ok
[04:40:46] (also broken for my developer, on a different IP, etc)
[04:41:01] hmmm
[04:42:11] confirmed, https://phabricator.wikimedia.org/T75968
[04:43:05] maybe i'm having success because of cached credentials, because i'm logged in to phab
[04:43:12] * jgage tries another browser
[04:43:37] * ragesoss was also logged in to Phab
[04:43:38] to be clear i believe that there's a problem i just need to be able to reproduce it for troubleshooting
[04:44:00] I'm logged in with my session, but I tried to login with my other account (personal) and it didn't work
[04:44:33] yeah... odd that you were able to get the wizard and the hello world app to work, jgage.
[04:45:18] phab login worked for me in a clean browser
[04:45:41] that's new...
[04:45:41] weird
[04:45:54] I got 503 error on phab login that time.
[04:46:03] actually, not new... happened once to me earlier today.
[04:46:35] jgage: ok, yeah, in an incognito chrome window it worked (loging into phab)
[04:46:37] i don't see any oauth documentation on wikitech
[04:46:48] it wouldn't be there :/
[04:46:49] (to be clear, that 503 happened before I got to the oauth page on the wiki)
[04:47:04] (so a phab problem, not an oauth problem)
[04:47:41] I just got the oauth_token error in an incognito window in chrome.
[04:47:55] https://www.mediawiki.org/wiki/Extension:OAuth
[04:48:03] thanks greg-g
[04:49:42] TimStarling: I hate to bother you on this, but we're having intermittent oauth authentication issues. I got it once when loggin into phab via mw.org, ragesoss got it on other consumers. See: https://phabricator.wikimedia.org/T75968
[04:49:55] hi
[04:50:00] * legoktm reads up
[04:50:02] oh, it's a lego!
[04:50:23] is it possible that this is related to hhvm changes today? nodes were merged from two pools to one or something.
[04:52:05] uh, maybe
[04:52:26] * duploktm grumbles about flaky internet
[04:52:27] ori: around?
[04:52:31] seems like the components involved are appservers, memcached, mysql
[04:52:42] I remember magnus having an issue with OAuth that was hhvm related
[04:53:09] jgage: why do you think those components?
[04:53:18] just reading the oauth extension url you pasted
[04:53:22] because i know nothing about it
[04:53:27] * greg-g nods
[04:53:33] what does this 503 look like?
[04:54:50] TimStarling: the 503, which I think is not connected to the OAuth problem, looks like a normal wikimedia server 503 error. Let me see if I can find it in my history.
[04:55:24] ragesoss what time (utc) did you first observe this problem?
[04:55:34] TimStarling: Request: POST http://phabricator.wikimedia.org/auth/login/mediawiki:mediawiki/, from 10.64.0.172 via cp1044 cp1044 ([10.64.0.172]:80), Varnish XID 1588971278
[04:55:35] Forwarded for: 2601:8:b100:9c0:bdd1:40d:6643:e5d6, 10.64.0.172
[04:55:35] Error: 503, Service Unavailable at Wed, 26 Nov 2014 04:55:23 GMT
[04:56:22] https://old-bugzilla.wikimedia.org/show_bug.cgi?id=72384 is what I'm thinking about, but doesn't seem related here
[04:56:27] jgage: two hours ago, for the OAuth issue.
[04:56:32] thank you
[04:56:57] (That 503 error is fresh; I just repro it)
[04:57:48] I'm not really sure how to debug oauth tbh...
[04:59:01] I just logged into quarry fine.
[04:59:10] via OAuth.
[04:59:23] I'm... going to step away
[04:59:50] I can authorize-ish with oauth-hello-world.
[04:59:56] I just logged into quarry as well, after earlier fails.
[05:00:03] legoktm: feel free to use the contacts page on officewiki to call whoever you need if it is deemed worth it (you and jgage and tim can decide)
[05:00:31] do we know who is knowledgeable about oauth?
[05:00:37] csteipp
[05:00:49] aaron, anomie, and tim?
[05:00:55] those too
[05:01:28] so the 503 error is not related, and oauth is just plain "not working"
[05:01:34] no further information?
[05:01:55] TimStarling: the error I got from phab was:
[05:01:55] Unhandled Exception ("Exception")
[05:01:56] Expected 'oauth_token' in response!
[05:02:35] what URL?
[05:02:59] TimStarling: hitting https://tools.wmflabs.org/oauth-hello-world/index.php?action=identify after authorizing the application randomly works otherwise it gives Invalid identify response: {"error":"mwoauth-oauth-exception"}
[05:03:35] (sent privately)
[05:03:57] so... do you have the MW API response?
[05:04:39] i get about 50% success/fail on that url out of 10 tries
[05:05:47] * greg-g has to go, kid crying
[05:05:57] TimStarling: if I'm reading the code right in anomie's tool, the API response is just {"error":"mwoauth-oauth-exception"}
[05:06:15] Also, in https://old-bugzilla.wikimedia.org/show_bug.cgi?id=72384#c4 anomie said "That sounds like you hit a wiki using HHVM when the OAuth authorization was done using Zend, or vice versa. For some reason the OAuth stuff doesn't seem to be shared between the two."
[05:06:27] oho
[05:06:45] ruh roh
[05:07:17] _joe_ should be awake in an hour or two
[05:10:15] so if there is an exception in Special:OAuth, it should be logged to the OAuth log channel
[05:11:20] and the full error message text should be in the output
[05:12:07] the OAuth log channel goes to /dev/null
[05:12:13] whee
[05:12:16] >.>
[05:12:36] that's fixable though
[05:13:49] (PS1) Legoktm: Add debug log group for OAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/175945
[05:13:52] TimStarling: ^
[05:14:39] (CR) Tim Starling: [C: 2 V: 2] Add debug log group for OAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/175945 (owner: Legoktm)
[05:15:41] !log tstarling Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s)
[05:15:46] Logged the message, Master
[05:16:16] ok, well it's a bit noisy, but better than nothing
[05:17:39] I triggered a few exceptions on mediawikiwiki
[05:18:05] 2014-11-26 05:16:00 mw1239 mediawikiwiki: MediaWiki\Extensions\OAuth\SpecialMWOAuth::execute: Exception Invalid consumer key
[05:19:35] TimStarling: is there a list of which servers are running hhvm and which are zend?
[05:20:11] not that I know of
[05:20:33] i think ori showed me a way to see on ganglia once... /me looks through irc logs
[05:20:45] hmm, I guess I can just login to them and see what version of php they have.
[05:21:00] 2014-11-26 05:20:29 mw1188 mediawikiwiki: MediaWiki\Extensions\OAuth\SpecialMWOAuth::execute: Exception Sorry, something went wrong connecting this application.
[05:21:29] there were 4 exceptions when I tried to log in, that was the text of three of them
[05:23:52] TimStarling: that's the one I see a lot when I go back on my browser to the OAuth login page after already being logged in.
[05:24:05] (when things are working normally)
[05:24:38] it was mwoauthdatastore-request-token-not-found
[05:25:09] It provides a nice URL: https://www.mediawiki.org/wiki/Help:OAuth/Errors#E004
[05:26:10] And says sorry. I mean, that's pretty nice.
[05:26:51] so the consumer token is some kind of long hashy thing
[05:26:59] does MW give it to the application at some point?
[05:29:21] right, so it is fixed for a given consumer
[05:31:34] I think applications have to be approved and access can be revoked.
[05:31:53] how is oauth configured in phabricator?
[05:32:08] where does it get the token from? is it sending the right token?
[05:33:11] I'd guess the token is stored in the private puppet repo?
[05:33:25] manifests/role/phabricator.pp?
[05:33:43] Oh, there's a module as well.
[05:34:29] auth to phab works sometimes, similar to the hello world app. so it seems to get the right token at least some of the time.
[05:34:52] you mean giving the right token?
[05:35:23] yes, sorry
[05:35:42] If it's both Phabricator and the Hello World app having issues, it's probably MediaWiki.org's OAuth that's gone weird?
[05:36:49] this all seems to match the behavior that anomie described regarding a cluster with both hhvm and zend hosts: https://old-bugzilla.wikimedia.org/show_bug.cgi?id=72384#c4
[05:37:01] Carmela: obviously, but that doesn't answer my question
[05:39:13] I'm a little surprised we're still using memcached for sessions.
[05:39:30] Carmela: That's not really helpful.
[05:40:07] jgage: ok, but no isolation was done
[05:40:27] right
[05:41:40] 2014-11-25 20:39:23 mw1232 mediawikiwiki: Memcached error for key "mediawikiwiki:messages:en:status" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[05:41:54] yes, well
[05:42:07] I guess that patch was never merged upstream
[05:42:41] 11212.. nonstandard memcached port
[05:44:08] so would memcached failure cause oauth failure?
[05:46:46] this is going to take a while
[05:47:46] Fwiw phab oauth config is all in d. And hasn't changed in awhile
[05:47:55] In the db :)
[05:48:19] ;)
[05:48:28] is the consumer key in the DB?
[05:49:50] In phab db yes, I think it was registered under mukunda's account
[05:52:04] so you can confirm that?
[05:53:00] TimStarling: I can provide keys for my app, if that will help.
[05:53:29] no
[05:54:04] we know OAuthServer::get_consumer() throws an exception
[05:54:31] so most likely $request->get_parameter( "oauth_consumer_key" ) returns something that is false
[05:58:47] $request is probably from MWOAuthRequest::fromRequest()
[06:00:26] which fills in parameters from GET parameters, headers and post data
[06:00:28] TimStarling: yes I can verify the consumer key, is that helpful?
[06:00:38] so it is hard to see how it is false without some amount of client involvement
[06:01:46] TimStarling: PM'd in case it's useful for you
[06:07:07] !log tstarling Synchronized php-1.25wmf9/extensions/OAuth/lib/OAuth.php: (no message) (duration: 00m 06s)
[06:07:11] Logged the message, Master
[06:09:06] !log tstarling Synchronized php-1.25wmf9/extensions/OAuth/lib/OAuth.php: (no message) (duration: 00m 06s)
[06:12:03] chasemp: What happened to OAuth?
[06:12:27] csteipp: not sure, I saw https://phabricator.wikimedia.org/T75968 get logged. Seems to be a general MW oauth issue
[06:12:57] possibly related to mixing hhvm and zend nodes? I don't have much insight into it tbh, just wanted to see if I could be of assistance on the phab as a client front
[06:13:34] Hmm.. intermittent? WFM just now..
[06:15:42] csteipp: yeah, i'm getting about 50% error rate. TimStarling is debugging. according to https://old-bugzilla.wikimedia.org/show_bug.cgi?id=72384#c4 it's a problem of hitting a wiki with zend when oauth was done with hhvm or vice versa.
[06:15:52] well, potentially
[06:16:21] csteipp: we enabled the oauth debug log, it's at fluorine:/a/mw-log/oauth.log
[06:16:58] also I just did a patch to log some extra data, if 21MB of logs in the last half hour is not enough
[06:17:19] My first guess is hmac on hhvm might be slightly different.
[06:19:07] is there documentation of the client/server request flow?
[06:20:18] https://www.mediawiki.org/wiki/OAuth/For_Developers#mediaviewer/File:OAuth-basicSVG.svg I guess
[06:20:45] TimStarling: https://www.mediawiki.org/wiki/Auth_systems/OAuth/Design or https://www.mediawiki.org/wiki/OAuth/For_Developers
[06:21:17] OAuthServer::get_consumer() throws an exception "Invalid consumer key"
[06:21:27] Do hhvm and zend share memcache?
[06:21:42] yes
[06:22:14] I am struggling to understand how this is possible since apparently the consumer key is persistently configured
[06:22:28] I did a var_export of the request at that point
[06:23:45] this is typical: http://paste.tstarling.com/p/DVRMXL.html
[06:23:53] definitely no consumer key
[06:27:33] Is it always the /token call that throws the exception?
[06:27:59] no, we logged one at /identify
[06:28:08] at 06:13:56
[06:28:30] for commonswiki, not mediawikiwiki
[06:28:36] 'http_url' => 'https://commons.wikimedia.org/wiki/Special:OAuth/identify',
[06:30:01] Both those come from the remote server, so it's pretty certain that they're not randomly leaving off their client token. And all the OAuth parameters are missing on that paste. There should be a token, signature, signature method, etc.
[06:30:39] well, it could be a bug in MWOAuthRequest in getting the parameters from the environment
[06:31:40] where is it meant to be? authorization header, post or get?
[06:31:44] Yeah, that's possible. We also had an issue at one point that if one of the functions was called twice, it didn't have all the OAuth info the second time.
[06:31:53] Yeah, Authorization header is the normal method
[06:31:58] GET is allowed as a backup
[06:32:41] <_joe_> jgage: searching for me?
[06:33:43] <_joe_> oh just read the backlog
[06:34:01] ok, logging that
[06:34:02] !log tstarling Synchronized php-1.25wmf9/extensions/OAuth/lib/OAuth.php: (no message) (duration: 00m 05s)
[06:34:05] Logged the message, Master
[06:34:33] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: puppet fail
[06:34:50] <_joe_> chasemp: do we have a confirmation that hhvm / zend mixing is the problem?
[06:35:02] I wonder if WebRequest::ggetRawInput isn't happy on hhvm
[06:35:10] yeah, blank
[06:35:23] <_joe_> ugh
[06:35:37] <_joe_> should we rollback to having the two separate pools?
[06:35:48] (CR) Yuvipanda: kill facilities.pp, move to nagios_common (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173999 (owner: Dzahn)
[06:36:03] <_joe_> In case, it will take me ~ 20 minutes to do so
[06:36:16] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:36:16] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:17] <_joe_> (plus all the puppet runs)
[06:36:18] (CR) Yuvipanda: [C: -2] "Moving -1 to -2, since I'm more strongly inclined now." [puppet] - https://gerrit.wikimedia.org/r/173999 (owner: Dzahn)
[06:36:34] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:44] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:36:47] _joe_: we probably have enough information, so you may as well
[06:36:51] TimStarling: We need access to the raw values there-- if there's another way to do that in hhvm, we can patch that. That's what the signature is checked against, so the values can't be touched before OAuth gets them.
[06:37:12] <_joe_> TimStarling: well it's not like I'd do that if we are going to have a patch soon
[06:37:13] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:14] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:36] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:49] let's do a small test case
[06:37:53] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:53] <_joe_> I'll work towards that anyways
[06:38:18] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:18] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:18] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:23] * _joe_ just got out of bed
[06:38:43] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:44] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:14] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:25] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:39:36] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:39] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:04] !log tstarling Synchronized live-1.5/oauth-headers.php: (no message) (duration: 00m 05s)
[06:40:09] Logged the message, Master
[06:42:04] why does it say file not found? it's not RA or something is it?
[06:42:46] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:43:15] !log tstarling Synchronized w/oauth-headers.php: (no message) (duration: 00m 06s)
[06:43:16] never mind, I'm too old
[06:43:17] Logged the message, Master
[06:44:21] (PS1) Giuseppe Lavagetto: Revert "mediawiki: move most servers from the hhvm to the standard pool" [puppet] - https://gerrit.wikimedia.org/r/175950
[06:44:38] (PS1) Giuseppe Lavagetto: Revert "varnish: remove redirection to the hhvm pool" [puppet] - https://gerrit.wikimedia.org/r/175951
[06:45:04] <_joe_> ok, whenever we decide, I'm ready to rollback and create the hhvm pool again
[06:45:27] what's a zend server?
[06:45:31] <_joe_> we should also revert mediawiki-config changes, btw
[06:45:46] i.e. the hostname of one instance thereof
[06:45:48] <_joe_> TimStarling: do you want a host?
[06:45:53] <_joe_> mw1040
[06:45:58] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:30] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:27] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:47:28] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:32] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:51] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:52] Wonder if $HTTP_RAW_POST_DATA works on hhvm?
[06:47:56] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:57] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:11] !log tstarling Synchronized w/oauth-headers.php: (no message) (duration: 00m 05s)
[06:48:12] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:12] Logged the message, Master
[06:48:15] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:42] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:48] botspam
[06:48:55] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:48:56] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:48:57] maybe we should talk in #mediawiki-core?
[06:49:02] <_joe_> ok
[06:57:05] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:53] PROBLEM - puppet last run on db1006 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:05:13] PROBLEM - puppet last run on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:05:13] PROBLEM - check if dhclient is running on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:05:14] PROBLEM - ircecho_service_running on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:07:30] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix passing of the Authorization header in HAT [puppet] - 10https://gerrit.wikimedia.org/r/175952 [07:07:35] <_joe_> csteipp_afk: I think we nailed it ^^ [07:07:54] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [07:07:59] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [07:07:59] RECOVERY - check if dhclient is running on neon is OK: PROCS OK: 0 processes with command name dhclient [07:11:09] yay _joe_ [07:11:30] <_joe_> what was the bug again? [07:12:10] https://phabricator.wikimedia.org/T75968 [07:12:39] (03CR) 10Giuseppe Lavagetto: [C: 032] "see https://phabricator.wikimedia.org/T75968" [puppet] - 10https://gerrit.wikimedia.org/r/175952 (owner: 10Giuseppe Lavagetto) [07:13:46] RECOVERY - puppet last run on db1006 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:14:23] <_joe_> jgage: next time if you get this is hhvm-related, feel free to phone me [07:14:38] <_joe_> well, if it's related to something I'm working on in general [07:14:45] <_joe_> I can usually wake up [07:15:30] <_joe_> I hope you don't speak italian, so that the swear words will sound obscure and funny to you, but it's really ok :P [07:15:57] <_joe_> ok in ~ 20 minutes, oauth should be unbroken [07:16:16] <_joe_> I don't feel like pushing the puppet change across the board [07:21:45] <_joe_> now I can take my morning shower I guess :) [07:22:26] * YuviPanda breaks something else for _joe_ to fix [07:23:30] <_joe_> YuviPanda: it really was tim doing all the important guesswork, I just wrote the apache fix [07:23:35] :) [07:30:32] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [07:33:16] RECOVERY - HHVM busy threads on mw1223 is OK: 
OK: Less than 1.00% above the threshold [76.8] [07:35:08] <_joe_> jgage: can you try oauth again? [08:10:01] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [08:12:39] RECOVERY - HHVM busy threads on mw1223 is OK: OK: Less than 1.00% above the threshold [76.8] [08:29:16] _joe_ looks good, i tried the hello world app 10 times, all successful [08:29:36] and i will gladly wake you up to hear cursing in italian if the need arises :D [08:30:44] first attempt was pretty slow, maybe 30s. but it didn't fail! [08:30:55] after that it was speedy [08:32:07] * jgage zzz [08:35:11] <_joe_> jgage: good night [08:50:57] jenkins seems to be stuck [08:52:16] Nemo_bis: I doubt that it was millions of users and I doubt it was init7-related :) [08:52:45] Nemo_bis: also, do not assume that routing is symmetric [08:53:01] the path to wikimedia vs. the path *from* wikimedia could be entirely different and the issue could be in either direction [08:53:40] so traceroutes, while very helpful, are not giving the whole picture; that's why we usually want an IP as well (feel free to mask it to a /24 for privacy reasons) [09:18:01] (03PS5) 10Faidon Liambotis: realm: remove pmtpa, add codfw [puppet] - 10https://gerrit.wikimedia.org/r/173476 (owner: 10Dzahn) [09:19:01] (03CR) 10Faidon Liambotis: [C: 032] realm: remove pmtpa, add codfw [puppet] - 10https://gerrit.wikimedia.org/r/173476 (owner: 10Dzahn) [09:20:44] (03CR) 10Faidon Liambotis: "Ping?" 
[puppet] - https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: Tim Landscheidt)
[09:27:15] (PS2) Faidon Liambotis: geoip: kill geoliteupdate in favor of geoipupdate [puppet] - https://gerrit.wikimedia.org/r/175571
[09:28:04] (CR) Faidon Liambotis: [C: 2] geoip: kill geoliteupdate in favor of geoipupdate [puppet] - https://gerrit.wikimedia.org/r/175571 (owner: Faidon Liambotis)
[09:43:44] (PS1) Yuvipanda: shinken: Add checks for labs infrastructure [puppet] - https://gerrit.wikimedia.org/r/175964
[09:48:00] (PS1) Giuseppe Lavagetto: reimage: add a few configs, beautify output [puppet] - https://gerrit.wikimedia.org/r/175965
[09:49:43] (PS2) Yuvipanda: shinken: Add checks for labs infrastructure [puppet] - https://gerrit.wikimedia.org/r/175964
[09:52:56] (PS2) Giuseppe Lavagetto: reimage: add a few configs, beautify output [puppet] - https://gerrit.wikimedia.org/r/175965
[09:53:30] (PS3) Yuvipanda: shinken: Add checks for labs infrastructure [puppet] - https://gerrit.wikimedia.org/r/175964
[09:54:12] _joe_: this looks like it could get a rewrite in python :)
[09:54:53] <_joe_> paravoid: don't tempt me
[09:55:02] <_joe_> (not that I didn't think of that)
[09:55:10] I'm sure you did :)
[09:55:16] we should think of ways to automate this even further
[09:55:35] puppet has some new autosigning features that we could perhaps use
[09:55:52] <_joe_> paravoid: actually, the best thing would be to run a script from iron, or having the install ssh key on palladium
[09:55:58] then...
IPMI + BIOS + iDRAC automation
[09:56:43] <_joe_> so I can make the depool/clean/reboot to pxe/sign/enable and run puppet/sign salt/run puppet again/ cycle work from a single machine
[09:57:36] <_joe_> yeah right now the more time-consuming thing of reimaging is going to be enabling hyperthreading
[09:58:55] <_joe_> because well, you need to reboot into bios, enable it, get out of it, and I don't think we can automate that ATM; I should look into it
[09:59:02] oh I've looked into it
[09:59:06] wsman and all that glory
[09:59:21] mjg59 has a new library that I should probably check though
[09:59:59] paravoid: the IP was included
[10:00:11] I just didn't make it explicit
[10:00:19] ah sorry, didn't see that
[10:00:56] <_joe_> https://github.com/jtallieu/dell-wsman-client-api-python/ looks quite unmaintained
[10:01:13] https://github.com/nebula/firmware_config is mjg59's
[10:01:21] depends on openwsman apparently
[10:01:50] <_joe_> mmmh I could experiment with that
[10:02:08] I've experimented extensively with wsman in the past
[10:02:11] too messy/complicated
[10:02:15] <_joe_> ok
[10:02:19] not saying no
[10:02:52] <_joe_> well the one you linked seems incredibly simple as an API
[10:02:55] yes
[10:03:01] I'm building openwsman now
[10:03:02] let's see..
[10:03:06] <_joe_> oh ok :)
[10:03:56] wsman-dispatcher.c:924:9: error: variable 'resUriMatch' set but not used [-Werror=unused-but-set-variable]
[10:04:00] grumble
[10:04:15] Werror sillyness
[10:04:42] paravoid: anyway, this time they were much faster at resolving the issue; I don't know if the reason is that they learnt, or that I managed to tell several users they had to complain to the ISP, or that the blackout was really total this time (packet loss 100%)
[10:04:58] and yes it's millions users for that ISP
[10:05:03] if it was the same issue
[10:06:49] (PS4) Yuvipanda: shinken: Add checks for labs infrastructure [puppet] - https://gerrit.wikimedia.org/r/175964
[10:07:40] _joe_: and that wouldn't do it for the HPs btw
[10:07:48] but akosiaris had previously worked on automating some of that
[10:07:56] not sure if it was BIOS too or just iLOs though
[10:08:15] <_joe_> paravoid: right
[10:08:38] and, well, mjg59 had worked on that too, not sure what happened with that
[10:08:50] he had previously said he'd submit it as a kernel module
[10:08:50] <_joe_> sorry I'm a bit slow today, I woke up to a UBN! bug on HHVM and two coffee later I'm still groggy
[10:09:12] http://mjg59.dreamwidth.org/25686.html
[10:09:36] https://lkml.org/lkml/2013/9/4/22 probably?
[10:11:35] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333
[10:12:13] ...which isn't merged
[10:16:23]