[00:21:55] !log started second machine (nobelium) performing copy of elasticsearch indices to codfw with 40 threads [00:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:26:56] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [00:28:45] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [00:28:52] Krenair: I don't think so since it varies so much [00:29:05] could probably maybe hackup puppet-compiler-type stuff though [00:29:08] I've no idea how that works [00:31:44] https://groups.google.com/forum/#!topic/puppet-users/8rv-0XB8g6g [00:38:11] seems to sort of half work [01:06:35] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0] [01:13:16] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [01:16:55] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [01:22:26] PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:55] RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.075 second response time [01:31:37] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0] [01:48:41] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: puppet fail [01:55:08] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: puppet fail [02:05:01] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [02:14:40] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [02:23:36] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [02:28:25] RECOVERY 
- High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0] [02:33:00] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 23s) [02:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:34] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-24 02:37:34+00:00 [02:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:19:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [04:24:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [04:35:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [04:44:06] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:44:56] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [04:48:17] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [04:49:16] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [04:54:36] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [04:55:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [05:00:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [05:04:15] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [05:11:16] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [05:11:32] !log Set an email address for user "Ymnes", after request. Confirmed by several, including. 
[05:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:14:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [05:20:13] (PS1) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) [05:20:15] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [05:21:09] (CR) jenkins-bot: [V: -1] Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson) [05:22:06] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [05:22:16] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [05:23:06] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [05:28:13] (PS2) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) [05:28:50] (CR) jenkins-bot: [V: -1] Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson) [05:30:38] (PS3) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) [05:31:50] (CR) Hydriz: [C: -1] Generate weekly cirrussearch dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson) [05:31:52] (PS4) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) [05:33:14] (CR) Hydriz: [C: -1] Generate
weekly cirrussearch dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson) [05:44:12] !log ori@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/includes/Data/Index/FeatureIndex.php: (no message) (duration: 00m 17s) [05:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:44:31] (CR) EBernhardson: Generate weekly cirrussearch dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson) [05:44:51] (PS5) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) [05:47:41] (PS6) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) [05:50:20] (CR) Hydriz: [C: 1] "Format of filename reviewed. Thanks a bunch!"
[puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson) [05:52:06] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [05:55:47] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [05:55:56] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [05:56:46] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [05:57:18] (PS1) Legoktm: Add apache rewrite rule for UrlShortener on beta cluster [puppet] - https://gerrit.wikimedia.org/r/248601 (https://phabricator.wikimedia.org/T116444) [05:57:45] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [05:58:18] (PS2) Legoktm: Add apache rewrite rule for UrlShortener on beta cluster [puppet] - https://gerrit.wikimedia.org/r/248601 (https://phabricator.wikimedia.org/T116444) [06:00:53] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Oct 24 06:00:53 UTC 2015 (duration 0m 52s) [06:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:03:15] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [06:06:36] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [06:11:56] PROBLEM - Restbase endpoints health
on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [06:12:56] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [06:18:16] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [06:22:36] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [06:22:38] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [06:23:35] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [06:24:19] (PS1) Legoktm: Set up UrlShortener extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444) [06:26:25] (PS1) Ori.livneh: Configure nova's nutcracker not to eject hosts [puppet] - https://gerrit.wikimedia.org/r/248603 [06:26:30] (PS2) Legoktm: Set up UrlShortener extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444) [06:26:38] (PS2) Ori.livneh: Configure nova's nutcracker not to eject hosts [puppet] - https://gerrit.wikimedia.org/r/248603 [06:26:51] (CR) Ori.livneh: [C: 2 V: 2] Configure nova's nutcracker not to eject hosts [puppet] - https://gerrit.wikimedia.org/r/248603 (owner: Ori.livneh) [06:31:16] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:26] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:26] PROBLEM - puppet
last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:45] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:56] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 3 failures [06:35:52] (PS3) Ori.livneh: Add apache rewrite rule for UrlShortener on beta cluster [puppet] - https://gerrit.wikimedia.org/r/248601 (https://phabricator.wikimedia.org/T116444) (owner: Legoktm) [06:35:58] (CR) Ori.livneh: [C: 2 V: 2] Add apache rewrite rule for UrlShortener on beta cluster [puppet] - https://gerrit.wikimedia.org/r/248601 (https://phabricator.wikimedia.org/T116444) (owner: Legoktm) [06:36:25] \o/ [06:37:10] ori: thanks :) I'll set up the novaproxy now [06:37:17] np [06:40:36] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [06:41:26] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [06:41:57] hm :/ [06:42:01] http://w.beta.wmflabs.org/ 400 Bad Request [06:42:15] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [06:42:27] (I just forced a puppet run on deployment-mediawiki01) [06:42:53] legoktm: before I go debug that, a quick question --- have you discussed with Aaron the fact that the extension performs a database write on GET requests? [06:43:09] in UrlShortenerUtils::maybeCreateShortCode() [06:43:11] uhh, it shouldn't be?
[06:43:18] that code path should only be hit on POST requests [06:43:26] oh, ok then [06:43:42] you have to POST to Special:UrlShortener or action=shortenurl [06:43:59] (because Aaron made me change it a month ago :P) [06:44:20] ori@deployment-mediawiki01:~$ curl -I -H 'host: w.beta.wmflabs.org' localhost [06:44:21] HTTP/1.1 200 OK [06:44:36] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out [06:44:36] so maybe a nova proxy thing? [06:45:22] yeah, the 400 is coming from nginx [06:45:31] > Successfully created new proxy w.beta.wmflabs.org for backend deployment-mediawiki01.deployment-prep.eqiad.wmflabs:80. [06:45:46] Maybe it doesn't like the w.beta? [06:45:55] try 15.beta [06:45:57] * ori ducks [06:46:02] All the other proxies set up for beta use -beta [06:46:04] xP [06:46:40] hrm. i'll look at the nginx config. where did you see: > Successfully created new proxy w.beta.wmflabs.org for backend deployment-mediawiki01.deployment-prep.eqiad.wmflabs:80. ? [06:47:07] that's the confirmation message after I added the proxy in https://wikitech.wikimedia.org/wiki/Special:NovaProxy [06:47:31] have you done that before, and if so, do you know if it normally works instantly? maybe it requires a puppet run on the proxy host [06:47:59] in the past it typically works in 2-3 minutes, which I waited [06:48:07] but I've never added one with a . in it [06:49:17] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [06:51:21] legoktm: yeah, i created a w-beta proxy and that one works [06:51:36] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.000 second response time on port 9042 [06:51:53] so that must be it. 
but instead of relying on instance-proxy, you could just have that hostname go directly to the instance, if it has a public ip [06:52:59] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep#Instances_for_this_project doesn't show a public IP [06:53:35] i could allocate one, but i'm going to look at the dynamicproxy code for a minute first to see if i can spot why w. fails [06:54:16] ok. we could also use w-beta :) [06:54:36] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [06:56:55] legoktm: that might be easiest, yeah [06:57:15] (PS1) Legoktm: Switch UrlShortener to w-beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/248604 [06:57:22] heh [06:57:33] (CR) Ori.livneh: [C: 2 V: 2] Switch UrlShortener to w-beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/248604 (owner: Legoktm) [07:00:05] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [07:00:59] (PS3) Legoktm: Set up UrlShortener extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444) [07:01:24] progress! http://w-beta.wmflabs.org/ "Domain not configured" [07:02:01] oh, I don't think puppet ran [07:02:18] and now that I ran puppet, back to 400 :( [07:03:17] * legoktm reads apache docs [07:03:55] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [07:04:14] it's not apache [07:04:49] you think it's the proxy again? [07:05:38] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:05:42] yeah, curl is working fine on deployment-mediawiki01 :/ [07:06:04] and by working fine I mean "Domain not configured" [07:06:08] right [07:06:19] do you know the instance name of the dynamicproxy host?
[07:06:35] i guess it's whatever w-beta.wmflabs.org resolves to [07:07:31] novaproxy-01, 02 [07:07:39] https://wikitech.wikimedia.org/wiki/Nova_Resource:Project-proxy [07:09:15] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [07:10:05] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [07:20:55] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:15] legoktm: i temporarily enabled debug error logging in nginx and captured the logs for one request to w-beta: https://dpaste.de/83Mz/raw [07:23:00] 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http script var: "http://deployment-mediawiki01.deployment-prep.eqiad.wmflabs:80" [07:23:03] so that bit works [07:23:16] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [07:23:18] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [07:23:31] 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy status 400 "400 Bad Request" [07:23:31] 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "Date: Sat, 24 Oct 2015 07:19:58 GMT" [07:23:31] 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "Server: Apache" [07:23:31] 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "X-Powered-By: HHVM/3.3.0-static" [07:24:00] so it looks like apache is sending the 400? 
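Editorial note: the nginx debug lines quoted just above are what show the 400 arriving from the backend rather than being generated by the proxy. A minimal sketch, assuming debug output in exactly the shape pasted in this log, of pulling the upstream status and response headers out of such lines. The function name `summarize_upstream` and both regular expressions are illustrative, not part of nginx or of anything used in the channel.

```python
import re

# Patterns modelled on the nginx debug lines quoted in the log
# (assumption: other nginx versions may word these lines differently).
STATUS_RE = re.compile(r'http proxy status (\d{3}) "([^"]*)"')
HEADER_RE = re.compile(r'http proxy header: "([^:"]+): ([^"]*)"')


def summarize_upstream(lines):
    """Return (status_code, headers) for the upstream response in debug lines."""
    status = None
    headers = {}
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            status = int(match.group(1))
            continue
        match = HEADER_RE.search(line)
        if match:
            headers[match.group(1)] = match.group(2)
    return status, headers


# Lines copied from the log above.
SAMPLE = [
    '2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy status 400 "400 Bad Request"',
    '2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "Date: Sat, 24 Oct 2015 07:19:58 GMT"',
    '2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "Server: Apache"',
    '2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "X-Powered-By: HHVM/3.3.0-static"',
]

status, headers = summarize_upstream(SAMPLE)
print(status, headers.get("Server"))
```

A `Server: Apache` header next to the 400 is the hint that the response body came from the backend vhost, which matches the conclusion reached in the channel.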
[07:24:15] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [07:24:23] yes, maybe it has to do with the other headers the proxy sends [07:24:25] like X-Forwarded-For [07:24:32] * ori curls [07:25:29] yep [07:25:43] # curl -I -H 'host: w-beta.wmflabs.org' -H 'X-Forwarded-For: 127.0.0.1' -H 'X-Forwarded-Proto: http' deployment-mediawiki01.deployment-prep.eqiad.wmflabs [07:25:43] HTTP/1.1 400 Bad Request [07:26:02] i told you it was apache >.> [07:26:25] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:26:25] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [07:26:26] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:27] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:27:46] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:27:47] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:47] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:11] even curl -H 'host: w-beta.wmflabs.org' deployment-mediawiki01.deployment-prep.eqiad.wmflabs is 400'ing [07:28:15] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:58] legoktm: it's your rewrite rule [07:29:02] if i comment it out the vhost works [07:29:13] hmm [07:30:22] I just took it from https://www.mediawiki.org/wiki/Extension:UrlShortener#Rewrite_rules, and got rid of the /r/ [07:33:35] i got it [07:35:41] what was it? 
:) [07:36:22] (PS1) Ori.livneh: UrlShortener on beta: fix RewriteRule [puppet] - https://gerrit.wikimedia.org/r/248607 [07:36:27] legoktm: that ^ [07:36:56] (CR) Ori.livneh: [C: 2 V: 2] UrlShortener on beta: fix RewriteRule [puppet] - https://gerrit.wikimedia.org/r/248607 (owner: Ori.livneh) [07:36:58] (CR) Legoktm: [C: 1] UrlShortener on beta: fix RewriteRule [puppet] - https://gerrit.wikimedia.org/r/248607 (owner: Ori.livneh) [07:37:04] ok, makes sense [07:38:06] you could probably make it w.beta again if you really wanted to [07:38:11] since that turned out not to be related [07:39:47] meh [07:40:32] ran puppet, w-beta.wmflabs.org works now [07:41:46] http://w-beta.wmflabs.org/ works but not http://w-beta.wmflabs.org/foo [07:41:56] it still doesn't like the rewrite rule [07:42:01] we're just not hitting it anymore [07:42:05] for the / case [07:42:05] :/ [07:42:21] I wonder if it doesn't like the domain [07:43:55] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [07:44:52] ori: is there a way to get extra debug output from apache? [07:45:18] error.log is empty [07:46:26] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [07:46:27] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [07:46:29] legoktm: why did you pick the PT flag?
[07:47:00] a) it was on the wiki page b) that's how the internal example on https://httpd.apache.org/docs/2.4/rewrite/remapping.html does it [07:47:23] (really a, I discovered b after the fact). [07:47:36] with [L] it works [07:48:05] but via a 302 [07:48:11] is that what you want? [07:49:07] like a 302 to UrlRedirector? no, it should be handled internally [07:49:30] the client should directly go from w-beta --> actual target [07:50:10] the difference between L and PT is that with L the rewritten URL is treated as final, and with PT, cycles through the rewrite rules again with the new URL, allowing further transformations to occur [07:50:36] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out [07:50:39] so PT is correct, but the bad news is that the 400 is caused by some other rewrite rule, who knows which [07:51:16] the meta configuration is in remnant.conf [07:52:16] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.025 second response time on port 9042 [07:55:11] legoktm: ok so i disabled puppet temporarily and enabled debug logging for the rewrite module in /etc/apache2/apache2.conf by adding the lines: [07:55:13] LogLevel debug rewrite:trace3 [07:55:14] ErrorLog /var/log/apache2/error.log [07:55:19] now there's actually data there [07:56:18] [Sat Oct 24 07:55:56.946360 2015] [rewrite:trace2] [pid 4620] mod_rewrite.c(468): [client 10.68.21.68:40196] 10.68.21.68 - - [w-beta.wmflabs.org/sid#7f7ed6534d88][rid#7f7ed63db0a0/initial] forcing 'http://meta.wikimedia.beta.wmflabs.org/w/index.php' to get passed through to next API URI-to-filename handler [07:56:18] [Sat Oct 24 07:55:56.946368 2015] [core:error] [pid 4620] [client 10.68.21.68:40196] AH00126: Invalid URI in request GET /foo HTTP/1.1 [07:58:16] http://fpaste.org/283174/14456734/raw/ is the full request logs [07:58:51] our rewrite rule looks fine [07:59:54] it should hit [07:59:54] ProxyPassMatch ^/w/(.*\.(php|hh))$ 
fcgi://127.0.0.1:9000/srv/mediawiki/docroot/wikimedia.org/w/$1 [08:05:16] ori: I'm going to sleep now, thanks for helping out with this :) [08:05:30] legoktm: np, i was gonna give up for the night too [08:05:32] ttyl [08:05:36] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25) [08:05:38] i'll re-enable puppet etc [08:08:06] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [08:09:46] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [08:12:06] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out [08:13:26] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [08:13:47] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 3.003 second response time on port 9042 [08:15:06] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [08:18:37] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [08:19:45] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. [08:24:08] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [08:25:05] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [08:56:13] operations, Analytics-Backlog, Datasets-General-or-Unknown: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1750184 (Addshore) Hmm, this isn't a duplicate..? lists.wm.o != dumps.wm.o !!!
[08:57:51] operations, Analytics-Backlog, Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1750185 (Addshore) Well, this mainly applies to dumps.wm.o (which the other ticket was open for). But I was looking to se... [08:59:37] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [08:59:46] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [09:01:45] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [09:03:16] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [09:04:25] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [09:07:06] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [09:08:36] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [09:23:47] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [09:24:36] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [09:24:47] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[09:41:35] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, sessions up: 44, down: 1, shutdown: 0; Peering with AS1273 not established - CW [09:43:25] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [09:59:36] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [10:04:05] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [10:04:15] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [10:13:06] PROBLEM - puppet last run on mw2042 is CRITICAL: CRITICAL: puppet fail [10:14:05] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [10:19:36] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:24:47] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [10:25:36] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [10:25:48] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [10:27:46] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: Puppet has 1 failures [10:35:55] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [10:36:37] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:36:45] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [10:41:56] RECOVERY - puppet last run on mw2042 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:43:47] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [10:48:55] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [10:49:16] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:52:56] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:07:07] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [11:12:28] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [11:15:46] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:45] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [11:25:06] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [11:26:05] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [11:29:05] PROBLEM - Cassandra CQL query interface on restbase-test2002 is CRITICAL: Connection refused [11:29:06] PROBLEM - Restbase endpoints health on restbase-test2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:30:16] PROBLEM - Restbase root url on restbase-test2002 is CRITICAL: Connection refused [11:30:36] PROBLEM - Cassandra database on restbase-test2002 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [11:40:46] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail [11:45:25] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:47:56] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [11:49:46] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [11:50:45] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:54:16] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [11:55:25] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:59:46] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [12:08:06] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:16:06] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [12:21:38] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:25:06] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [12:25:07] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [12:26:06] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [12:57:05] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [12:57:47] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [12:57:47] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [13:07:56] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [13:08:46] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [13:13:35] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:16] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:05] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [13:22:36] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:25:06] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [13:25:07] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [13:26:05] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [14:04:35] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures [14:05:20] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:06] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [14:07:46] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [14:08:46] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [14:13:16] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:21:06] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [14:25:36] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [14:26:17] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [14:31:16] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:44:16] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [14:44:17] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [14:45:26] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:54:22] (PS1) Alex Monk: Fix w-beta.wmflabs.org redirect [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) [15:09:46] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [15:20:37] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [15:24:43] (PS1) Alex Monk: openstack: Remove havana/icehouse files [puppet] - https://gerrit.wikimedia.org/r/248619 [15:25:56] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [15:25:57] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [15:26:56] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [15:42:16] (PS1) Alex Monk: dynamicproxy: Empty data from initial-data.db [puppet] - https://gerrit.wikimedia.org/r/248622 [15:42:28] modules/mw_rc_irc/files/upstart/ircecho.conf:exec /usr/local/bin/udpmxircecho.py rc-pmtpa localhost [15:42:28] modules/requesttracker/files/rt.aliases:pmtpa: pmtpa@phabricator.wikimedia.org [15:56:38] Krenair: tampa lives! [16:19:53] Yeah, because people hardcode shit, the rc bot on not pmtpa is still called the same [16:19:57] Changing stuff is hard [16:27:32] How did it end up getting called rc-pmtpa? [16:29:55] It was called "rc" at some point [16:30:35] But it's part of the API now [16:31:51] yes, so how did it get the -pmtpa suffix? 
[16:33:57] (CR) Legoktm: Fix w-beta.wmflabs.org redirect (1 comment) [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) (owner: Alex Monk) [16:34:21] Krenair: thanks for looking at this :) [16:34:52] ah, you're right [16:46:08] have been fiddling with this locally, I think after adding the protocol it might work legoktm [16:46:10] (PS2) Alex Monk: Fix w-beta.wmflabs.org redirect [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) [16:46:12] will test in beta [16:47:41] i tried that at one point last night and i don't think it worked [16:51:52] (PS7) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) [17:01:21] ori, what about enabling proxy_http and using ProxyPass "http://meta.wikimedia.beta.wmflabs.org/wiki/Special:UrlRedirector/" ? [17:01:36] yeah that might work [17:02:13] am trying it [17:05:26] PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:37] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [17:07:06] RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.771 second response time [17:07:26] ori, legoktm: http://w-beta.wmflabs.org/t [17:07:36] \o/ [17:07:43] well done [17:07:45] submit a patch [17:10:50] (PS3) Alex Monk: Fix w-beta.wmflabs.org redirect [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) [17:11:59] ori, did you merge "UrlShortener on beta: fix RewriteRule" on deployment-puppetmaster? 
[17:12:16] yes [17:12:20] please use rebase [17:12:57] we ended up with your "Merge branch 'production' of https://gerrit.wikimedia.org/r/operations/puppet into production" with your commit on top of all of the live hacks [17:13:05] I've cleared it up now [17:13:08] thanks [17:15:36] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0] [17:15:48] ori, do you know how the @phabricator.wikimedia.org email addresses are set up? [17:16:12] Krenair: no clue, sorry [17:18:32] (CR) Alex Monk: "Cherry-picked on deployment-puppetmaster" [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) (owner: Alex Monk) [17:29:15] (CR) Ori.livneh: [C: 2] Fix w-beta.wmflabs.org redirect [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) (owner: Alex Monk) [17:32:40] '(.*\.)?wikivoyage\.beta\.wmflabs\.org', // None in beta? [17:32:46] legoktm, I was thinking maybe we should make one at some point [17:34:37] Why is QuickSurveys not in the extension lists? 
[17:40:24] (PS1) Alex Monk: Add QuickSurveys to extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/248632 [17:46:58] (PS1) Alex Monk: Change Venetian Wikipedia logo per admin request [mediawiki-config] - https://gerrit.wikimedia.org/r/248633 (https://phabricator.wikimedia.org/T116476) [17:55:57] (PS1) Alex Monk: Checkout instead of cherry-pick [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634 [17:56:21] (CR) jenkins-bot: [V: -1] Checkout instead of cherry-pick [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634 (owner: Alex Monk) [17:57:20] (PS2) Alex Monk: Checkout instead of cherry-pick [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634 [17:59:57] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps DWDM] [18:00:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 1, unused: 0 xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps DWDM] [18:56:06] Krenair: yaaaaay [18:57:33] Krenair: yeah, having a wikivoyage is probably a good idea since they have custom extensions [19:29:51] Ops-Access-Requests, operations, Repository-Ownership-Requests: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1750837 (Krenair) You want to be added to the gerrit wmf-deployment group only? Or you want actual deployment rights on the cluster? 
[19:34:52] !log deployed https://gerrit.wikimedia.org/r/#/c/248638/ and restarted apache on iridium [19:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:54] Puppet, Labs, Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1750852 (Krenair) [19:55:44] (PS1) Alex Monk: beta: Add enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/248639 [19:57:22] Ops-Access-Requests, operations, Repository-Ownership-Requests: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1750858 (JanZerebecki) I didn't differentiate there. So yes both. Is there any use in having only the gerrit group? (Even for Wikibase which h... [19:58:03] Ops-Access-Requests, operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1750859 (Krenair) [20:01:04] operations, Labs, wikitech.wikimedia.org: distribution upgrade for wikitech-static instance - https://phabricator.wikimedia.org/T94585#1750861 (Aklapper) [20:03:45] Ops-Access-Requests, operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1750992 (Krenair) wmf-deployment can be added by any other deployer (or ops) once you get access on the cluster. I am aware of only one person who has wmf-deployment gerrit acc... [20:08:55] Puppet, Labs, Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1751011 (Negative24) Open>Resolved a:Negative24 `role::phabricator::main` isn't the right Puppet class to use in Labs. I'm pretty sure the error had to do with the site variables. I walked @... [20:09:29] &win /win 19 [20:10:24] ? [20:10:56] mistyped irssi command, was not intended to be sent here [20:11:54] ah :) [20:12:16] twentyafterfour: does that mean someone now needs to correct these tasks permissions? 
[20:12:39] jzerebec1i: yes I'm on it [20:12:50] thx [20:17:09] jzerebecki: fixed [20:17:57] PROBLEM - RAID on db1030 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [21:08:40] (CR) Luke081515: [C: 1] beta: Add enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/248639 (owner: Alex Monk) [21:29:28] Puppet, Labs, Phabricator, Patch-For-Review: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1751092 (Negative24) Resolved>Open Those two commits ensure the directory is created but doesn't install the security extension... [21:30:35] twentyafterfour: ^ I can do that now if you want [21:30:54] turns out we did have a task :) [21:32:31] Puppet, Labs, Phabricator, Patch-For-Review: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1751095 (mmodell) I think we want the security extension in labs. At least until we deprecate its use. I'm in the process of develop... [21:40:17] (PS1) Negative24: phabricator: Set security ext tag for labs [puppet] - https://gerrit.wikimedia.org/r/248646 (https://bugzilla.wikimedia.org/104904) [21:41:54] (PS2) Negative24: phabricator: Set security ext tag for labs [puppet] - https://gerrit.wikimedia.org/r/248646 (https://phabricator.wikimedia.org/T104904) [21:45:31] (PS3) Negative24: phabricator: Set security ext tag for labs [puppet] - https://gerrit.wikimedia.org/r/248646 (https://phabricator.wikimedia.org/T104904) [21:48:12] operations, Database, Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1751121 (jcrespo) [21:50:26] PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:52:31] operations, ops-eqiad: db1030 RAID degraded (disk failed) - https://phabricator.wikimedia.org/T116499#1751123 (jcrespo) NEW [21:53:29] ACKNOWLEDGEMENT - RAID on db1030 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T116499 [22:23:18] Ops-Access-Requests, operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751145 (hoo) While (probably) not a formal requirement, I think you should also get access to the `mediawiki` gerrit group beforehand. [22:30:05] (CR) BryanDavis: "> does exported resources work in beta (cluster)" [puppet] - https://gerrit.wikimedia.org/r/179121 (owner: Giuseppe Lavagetto) [22:37:26] Ops-Access-Requests, operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751166 (Krenair) >>! In T116487#1751145, @hoo wrote: > While (probably) not a formal requirement, I think you should also get access to the `mediawiki` gerrit group beforehand... [22:48:55] PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:47] PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:19:36] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 1, unused: 0 [23:20:06] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail [23:20:45] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [23:22:16] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [23:26:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 27.27% of data above the critical threshold [500.0] [23:34:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:48:06] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:36] PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.