[00:21:55] <ebernhardson>	 !log started second machine (nobelium) performing copy of elasticsearch indices to codfw with 40 threads
[00:22:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:26:56] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[00:28:45] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[00:28:52] <YuviPanda>	 Krenair: I don't think so since it varies so much
[00:29:05] <YuviPanda>	 could probably maybe hackup puppet-compiler-type stuff though
[00:29:08] <YuviPanda>	 I've no idea how that works
[00:31:44] <Krenair>	 https://groups.google.com/forum/#!topic/puppet-users/8rv-0XB8g6g
[00:38:11] <Krenair>	 seems to sort of half work
[01:06:35] <icinga-wm>	 PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
[01:13:16] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[01:16:55] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[01:22:26] <icinga-wm>	 PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:25:55] <icinga-wm>	 RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.075 second response time
[01:31:37] <icinga-wm>	 RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0]
[01:48:41] <icinga-wm>	 PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: puppet fail
[01:55:08] <icinga-wm>	 PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: puppet fail
[02:05:01] <icinga-wm>	 PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[02:14:40] <icinga-wm>	 RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[02:23:36] <icinga-wm>	 RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[02:28:25] <icinga-wm>	 RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:33:00] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 23s)
[02:33:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:34] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-24 02:37:34+00:00
[02:37:41] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:19:36] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[04:24:56] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[04:35:45] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[04:44:06] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:44:56] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[04:48:17] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[04:49:16] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[04:54:36] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[04:55:08] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[05:00:37] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[05:04:15] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[05:11:16] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[05:11:32] <hoo>	 !log Set an email address for user "Ymnes", after request. Confirmed by several, including.
[05:11:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:14:47] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[05:20:13] <grrrit-wm>	 (PS1) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[05:20:15] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[05:21:09] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson)
[05:22:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[05:22:16] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[05:23:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[05:28:13] <grrrit-wm>	 (PS2) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[05:28:50] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson)
[05:30:38] <grrrit-wm>	 (PS3) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[05:31:50] <grrrit-wm>	 (CR) Hydriz: [C: -1] Generate weekly cirrussearch dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson)
[05:31:52] <grrrit-wm>	 (PS4) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[05:33:14] <grrrit-wm>	 (CR) Hydriz: [C: -1] Generate weekly cirrussearch dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson)
[05:44:12] <logmsgbot>	 !log ori@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/includes/Data/Index/FeatureIndex.php: (no message) (duration: 00m 17s)
[05:44:18] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:44:31] <grrrit-wm>	 (CR) EBernhardson: Generate weekly cirrussearch dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson)
[05:44:51] <grrrit-wm>	 (PS5) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[05:47:41] <grrrit-wm>	 (PS6) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[05:50:20] <grrrit-wm>	 (CR) Hydriz: [C: 1] "Format of filename reviewed. Thanks a bunch!" [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson)
[05:52:06] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[05:55:47] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[05:55:56] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[05:56:46] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[05:57:18] <grrrit-wm>	 (PS1) Legoktm: Add apache rewrite rule for UrlShortener on beta cluster [puppet] - https://gerrit.wikimedia.org/r/248601 (https://phabricator.wikimedia.org/T116444)
[05:57:45] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[05:58:18] <grrrit-wm>	 (PS2) Legoktm: Add apache rewrite rule for UrlShortener on beta cluster [puppet] - https://gerrit.wikimedia.org/r/248601 (https://phabricator.wikimedia.org/T116444)
[06:00:53] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Oct 24 06:00:53 UTC 2015 (duration 0m 52s)
[06:01:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:03:15] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[06:06:36] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[06:11:56] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[06:12:56] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[06:18:16] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[06:22:36] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[06:22:38] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[06:23:35] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[06:24:19] <grrrit-wm>	 (PS1) Legoktm: Set up UrlShortener extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444)
[06:26:25] <grrrit-wm>	 (PS1) Ori.livneh: Configure nova's nutcracker not to eject hosts [puppet] - https://gerrit.wikimedia.org/r/248603
[06:26:30] <grrrit-wm>	 (PS2) Legoktm: Set up UrlShortener extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444)
[06:26:38] <grrrit-wm>	 (PS2) Ori.livneh: Configure nova's nutcracker not to eject hosts [puppet] - https://gerrit.wikimedia.org/r/248603
[06:26:51] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] Configure nova's nutcracker not to eject hosts [puppet] - https://gerrit.wikimedia.org/r/248603 (owner: Ori.livneh)
[06:31:16] <icinga-wm>	 PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:26] <icinga-wm>	 PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] <icinga-wm>	 PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] <icinga-wm>	 PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:26] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:45] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:45] <icinga-wm>	 PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:56] <icinga-wm>	 PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:35:52] <grrrit-wm>	 (PS3) Ori.livneh: Add apache rewrite rule for UrlShortener on beta cluster [puppet] - https://gerrit.wikimedia.org/r/248601 (https://phabricator.wikimedia.org/T116444) (owner: Legoktm)
[06:35:58] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] Add apache rewrite rule for UrlShortener on beta cluster [puppet] - https://gerrit.wikimedia.org/r/248601 (https://phabricator.wikimedia.org/T116444) (owner: Legoktm)
[06:36:25] <legoktm>	 \o/
[06:37:10] <legoktm>	 ori: thanks :) I'll set up the novaproxy now
[06:37:17] <ori>	 np
[06:40:36] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[06:41:26] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[06:41:57] <legoktm>	 hm :/
[06:42:01] <legoktm>	 http://w.beta.wmflabs.org/ 400 Bad Request
[06:42:15] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[06:42:27] <legoktm>	 (I just forced a puppet run on deployment-mediawiki01)
[06:42:53] <ori>	 legoktm: before I go debug that, a quick question --- have you discussed with Aaron the fact that the extension performs a database write on GET requests?
[06:43:09] <ori>	 in UrlShortenerUtils::maybeCreateShortCode()
[06:43:11] <legoktm>	 uhh, it shouldn't be?
[06:43:18] <legoktm>	 that code path should only be hit on POST requests
[06:43:26] <ori>	 oh, ok then
[06:43:42] <legoktm>	 you have to POST to Special:UrlShortener or action=shortenurl
[06:43:59] <legoktm>	 (because Aaron made me change it a month ago :P)
[06:44:20] <ori>	 ori@deployment-mediawiki01:~$ curl -I -H 'host: w.beta.wmflabs.org' localhost
[06:44:21] <ori>	 HTTP/1.1 200 OK
[06:44:36] <icinga-wm>	 PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out
[06:44:36] <ori>	 so maybe a nova proxy thing?
[06:45:22] <ori>	 yeah, the 400 is coming from nginx
[06:45:31] <legoktm>	 > Successfully created new proxy w.beta.wmflabs.org for backend deployment-mediawiki01.deployment-prep.eqiad.wmflabs:80. 
[06:45:46] <legoktm>	 Maybe it doesn't like the w.beta?
[06:45:55] <ori>	 try 15.beta
[06:45:57] * ori ducks
[06:46:02] <legoktm>	 All the other proxies set up for beta use -beta
[06:46:04] <legoktm>	 xP
[06:46:40] <ori>	 hrm. i'll look at the nginx config. where did you see: > Successfully created new proxy w.beta.wmflabs.org for backend deployment-mediawiki01.deployment-prep.eqiad.wmflabs:80. ?
[06:47:07] <legoktm>	 that's the confirmation message after I added the proxy in https://wikitech.wikimedia.org/wiki/Special:NovaProxy
[06:47:31] <ori>	 have you done that before, and if so, do you know if it normally works instantly? maybe it requires a puppet run on the proxy host
[06:47:59] <legoktm>	 in the past it typically works in 2-3 minutes, which I waited
[06:48:07] <legoktm>	 but I've never added one with a . in it
[06:49:17] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[06:51:21] <ori>	 legoktm: yeah, i created a w-beta proxy and that one works
[06:51:36] <icinga-wm>	 RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.000 second response time on port 9042
[06:51:53] <ori>	 so that must be it. but instead of relying on instance-proxy, you could just have that hostname go directly to the instance, if it has a public ip
[06:52:59] <legoktm>	 https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep#Instances_for_this_project doesn't show a public IP
[06:53:35] <ori>	 i could allocate one, but i'm going to look at the dynamicproxy code for a minute first to see if i can spot why w. fails
[06:54:16] <legoktm>	 ok. we could also use w-beta :)
[06:54:36] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[06:56:55] <ori>	 legoktm: that might be easiest, yeah
[06:57:15] <grrrit-wm>	 (PS1) Legoktm: Switch UrlShortener to w-beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/248604
[06:57:22] <ori>	 heh
[06:57:33] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] Switch UrlShortener to w-beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/248604 (owner: Legoktm)
[07:00:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[07:00:59] <grrrit-wm>	 (PS3) Legoktm: Set up UrlShortener extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/248602 (https://phabricator.wikimedia.org/T116444)
[07:01:24] <legoktm>	 progress! http://w-beta.wmflabs.org/ "Domain not configured"
[07:02:01] <legoktm>	 oh, I don't think puppet ran
[07:02:18] <legoktm>	 and now that I ran puppet, back to 400 :(
[07:03:17] * legoktm reads apache docs
[07:03:55] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[07:04:14] <ori>	 it's not apache
[07:04:49] <legoktm>	 you think it's the proxy again?
[07:05:38] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:05:42] <legoktm>	 yeah, curl is working fine on deployment-mediawiki01 :/
[07:06:04] <legoktm>	 and by working fine I mean "Domain not configured"
[07:06:08] <ori>	 right
[07:06:19] <ori>	 do you know the instance name of the dynamicproxy host?
[07:06:35] <ori>	 i guess it's whatever w-beta.wmflabs.org resolves to
[07:07:31] <legoktm>	 novaproxy-01, 02
[07:07:39] <legoktm>	 https://wikitech.wikimedia.org/wiki/Nova_Resource:Project-proxy
[07:09:15] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[07:10:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[07:20:55] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:22:15] <ori>	 legoktm: i temporarily enabled debug error logging in nginx and captured the logs for one request to w-beta: https://dpaste.de/83Mz/raw
[07:23:00] <ori>	 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http script var: "http://deployment-mediawiki01.deployment-prep.eqiad.wmflabs:80" 
[07:23:03] <ori>	 so that bit works
[07:23:16] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[07:23:18] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[07:23:31] <legoktm>	 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy status 400 "400 Bad Request"
[07:23:31] <legoktm>	 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "Date: Sat, 24 Oct 2015 07:19:58 GMT"
[07:23:31] <legoktm>	 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "Server: Apache"
[07:23:31] <legoktm>	 2015/10/24 07:19:58 [debug] 28554#28554: *292560211 http proxy header: "X-Powered-By: HHVM/3.3.0-static"
[07:24:00] <legoktm>	 so it looks like apache is sending the 400?
[07:24:15] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[07:24:23] <ori>	 yes, maybe it has to do with the other headers the proxy sends
[07:24:25] <ori>	 like X-Forwarded-For
[07:24:32] * ori curls
[07:25:29] <ori>	 yep
[07:25:43] <ori>	 # curl -I -H 'host: w-beta.wmflabs.org' -H 'X-Forwarded-For: 127.0.0.1' -H 'X-Forwarded-Proto: http' deployment-mediawiki01.deployment-prep.eqiad.wmflabs
[07:25:43] <ori>	 HTTP/1.1 400 Bad Request
[07:26:02] <ori>	 i told you it was apache >.>
[07:26:25] <icinga-wm>	 RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:26:25] <icinga-wm>	 RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[07:26:26] <icinga-wm>	 RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:27] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[07:27:46] <icinga-wm>	 RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[07:27:47] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:47] <icinga-wm>	 RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:11] <legoktm>	 even curl -H 'host: w-beta.wmflabs.org'  deployment-mediawiki01.deployment-prep.eqiad.wmflabs is 400'ing
[07:28:15] <icinga-wm>	 RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:58] <ori>	 legoktm: it's your rewrite rule
[07:29:02] <ori>	 if i comment it out the vhost works
[07:29:13] <legoktm>	 hmm
[07:30:22] <legoktm>	 I just took it from https://www.mediawiki.org/wiki/Extension:UrlShortener#Rewrite_rules, and got rid of the /r/
[07:33:35] <ori>	 i got it
[07:35:41] <legoktm>	 what was it? :)
[07:36:22] <grrrit-wm>	 (PS1) Ori.livneh: UrlShortener on beta: fix RewriteRule [puppet] - https://gerrit.wikimedia.org/r/248607
[07:36:27] <ori>	 legoktm: that ^
[07:36:56] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] UrlShortener on beta: fix RewriteRule [puppet] - https://gerrit.wikimedia.org/r/248607 (owner: Ori.livneh)
[07:36:58] <grrrit-wm>	 (CR) Legoktm: [C: 1] UrlShortener on beta: fix RewriteRule [puppet] - https://gerrit.wikimedia.org/r/248607 (owner: Ori.livneh)
[07:37:04] <legoktm>	 ok, makes sense
[07:38:06] <ori>	 you could probably make it w.beta again if you really wanted to
[07:38:11] <ori>	 since that turned out not to be related
[07:39:47] <legoktm>	 meh
[07:40:32] <ori>	 ran puppet, w-beta.wmflabs.org works now
[07:41:46] <ori>	 http://w-beta.wmflabs.org/ works but not http://w-beta.wmflabs.org/foo
[07:41:56] <ori>	 it still doesn't like the rewrite rule
[07:42:01] <ori>	 we're just not hitting it anymore
[07:42:05] <ori>	 for the / case
[07:42:05] <legoktm>	 :/
[07:42:21] <legoktm>	 I wonder if it doesn't like the domain
[07:43:55] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[07:44:52] <legoktm>	 ori: is there a way to get extra debug output from apache?
[07:45:18] <legoktm>	 error.log is empty
[07:46:26] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[07:46:27] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[07:46:29] <ori>	 legoktm: why did you pick the PT flag?
[07:47:00] <legoktm>	 a) it was on the wiki page b) that's how the internal example on https://httpd.apache.org/docs/2.4/rewrite/remapping.html does it
[07:47:23] <legoktm>	 (really a, I discovered b after the fact).
[07:47:36] <ori>	 with [L] it works
[07:48:05] <ori>	 but via a 302
[07:48:11] <ori>	 is that what you want?
[07:49:07] <legoktm>	 like a 302 to UrlRedirector? no, it should be handled internally
[07:49:30] <legoktm>	 the client should directly go from w-beta --> actual target 
[07:50:10] <ori>	 the difference between L and PT is that with L the rewritten URL is treated as final, and with PT, cycles through the rewrite rules again with the new URL, allowing further transformations to occur
[07:50:36] <icinga-wm>	 PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out
[07:50:39] <ori>	 so PT is correct, but the bad news is that the 400 is caused by some other rewrite rule, who knows which
[07:51:16] <legoktm>	 the meta configuration is in remnant.conf
[07:52:16] <icinga-wm>	 RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.025 second response time on port 9042
[07:55:11] <ori>	 legoktm: ok so i disabled puppet temporarily and enabled debug logging for the rewrite module in /etc/apache2/apache2.conf by adding the lines:
[07:55:13] <ori>	 LogLevel debug rewrite:trace3
[07:55:14] <ori>	 ErrorLog /var/log/apache2/error.log
[07:55:19] <ori>	 now there's actually data there
[07:56:18] <legoktm>	 [Sat Oct 24 07:55:56.946360 2015] [rewrite:trace2] [pid 4620] mod_rewrite.c(468): [client 10.68.21.68:40196] 10.68.21.68 - - [w-beta.wmflabs.org/sid#7f7ed6534d88][rid#7f7ed63db0a0/initial] forcing 'http://meta.wikimedia.beta.wmflabs.org/w/index.php' to get passed through to next API URI-to-filename handler
[07:56:18] <legoktm>	 [Sat Oct 24 07:55:56.946368 2015] [core:error] [pid 4620] [client 10.68.21.68:40196] AH00126: Invalid URI in request GET /foo HTTP/1.1
[07:58:16] <legoktm>	 http://fpaste.org/283174/14456734/raw/ is the full request logs
[07:58:51] <legoktm>	 our rewrite rule looks fine
[07:59:54] <legoktm>	 it should hit 
[07:59:54] <legoktm>	         ProxyPassMatch  ^/w/(.*\.(php|hh))$  fcgi://127.0.0.1:9000/srv/mediawiki/docroot/wikimedia.org/w/$1
[08:05:16] <legoktm>	 ori: I'm going to sleep now, thanks for helping out with this :)
[08:05:30] <ori>	 legoktm: np, i was gonna give up for the night too
[08:05:32] <ori>	 ttyl
[08:05:36] <icinga-wm>	 PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25)
[08:05:38] <ori>	 i'll re-enable puppet etc
[08:08:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[08:09:46] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[08:12:06] <icinga-wm>	 PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out
[08:13:26] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[08:13:47] <icinga-wm>	 RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 3.003 second response time on port 9042
[08:15:06] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[08:18:37] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[08:19:45] <icinga-wm>	 RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[08:24:08] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[08:25:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[08:56:13] <wikibugs>	 operations, Analytics-Backlog, Datasets-General-or-Unknown: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1750184 (Addshore) Hmm, this isn't a duplicate..? lists.wm.o != dumps.wm.o !!!
[08:57:51] <wikibugs>	 operations, Analytics-Backlog, Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1750185 (Addshore) Well, this mainly applies to dumps.wm.o (which the other ticket was open for). But I was looking to se...
[08:59:37] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[08:59:46] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[09:01:45] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[09:03:16] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[09:04:25] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[09:07:06] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[09:08:36] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[09:23:47] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[09:24:36] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[09:24:47] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[09:41:35] <icinga-wm>	 PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, sessions up: 44, down: 1, shutdown: 0 - Peering with AS1273 not established - CW
[09:43:25] <icinga-wm>	 RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[09:59:36] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:04:05] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:04:15] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:13:06] <icinga-wm>	 PROBLEM - puppet last run on mw2042 is CRITICAL: CRITICAL: puppet fail
[10:14:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[10:19:36] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:24:47] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[10:25:36] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[10:25:48] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[10:27:46] <icinga-wm>	 PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:35:55] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:36:37] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:36:45] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:41:56] <icinga-wm>	 RECOVERY - puppet last run on mw2042 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[10:43:47] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[10:48:55] <icinga-wm>	 PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail
[10:49:16] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:52:56] <icinga-wm>	 RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[11:07:07] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[11:12:28] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[11:15:46] <icinga-wm>	 RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:19:45] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[11:25:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[11:26:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[11:29:05] <icinga-wm>	 PROBLEM - Cassandra CQL query interface on restbase-test2002 is CRITICAL: Connection refused
[11:29:06] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase-test2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[11:30:16] <icinga-wm>	 PROBLEM - Restbase root url on restbase-test2002 is CRITICAL: Connection refused
[11:30:36] <icinga-wm>	 PROBLEM - Cassandra database on restbase-test2002 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 111 (cassandra), command name java, args CassandraDaemon
[11:40:46] <icinga-wm>	 PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail
[11:45:25] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:47:56] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[11:49:46] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[11:50:45] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:54:16] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[11:55:25] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:59:46] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[12:08:06] <icinga-wm>	 RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[12:16:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[12:21:38] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:25:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[12:25:07] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[12:26:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[12:57:05] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[12:57:47] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[12:57:47] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[13:07:56] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[13:08:46] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[13:13:35] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:16] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:17:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[13:22:36] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:25:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[13:25:07] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[13:26:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[14:04:35] <icinga-wm>	 PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:05:20] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:06:06] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[14:07:46] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[14:08:46] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[14:13:16] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:21:06] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[14:25:36] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[14:26:17] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[14:31:16] <icinga-wm>	 RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:44:16] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[14:44:17] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[14:45:26] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:54:22] <grrrit-wm>	 (PS1) Alex Monk: Fix w-beta.wmflabs.org redirect [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444)
[15:09:46] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[15:20:37] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[15:24:43] <grrrit-wm>	 (PS1) Alex Monk: openstack: Remove havana/icehouse files [puppet] - https://gerrit.wikimedia.org/r/248619
[15:25:56] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[15:25:57] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[15:26:56] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[15:42:16] <grrrit-wm>	 (PS1) Alex Monk: dynamicproxy: Empty data from initial-data.db [puppet] - https://gerrit.wikimedia.org/r/248622
[15:42:28] <Krenair>	 modules/mw_rc_irc/files/upstart/ircecho.conf:exec /usr/local/bin/udpmxircecho.py rc-pmtpa localhost
[15:42:28] <Krenair>	 modules/requesttracker/files/rt.aliases:pmtpa: pmtpa@phabricator.wikimedia.org
[15:56:38] <ori>	 Krenair: tampa lives!
[16:19:53] <Reedy>	 Yeah, because people hardcode shit, the rc bot on not pmtpa is still called the same
[16:19:57] <Reedy>	 Changing stuff is hard
[16:27:32] <Krenair>	 How did it end up getting called rc-pmtpa?
[16:29:55] <Krinkle>	 It was called "rc" at some point
[16:30:35] <Krinkle>	 But it's part of the API now
[16:31:51] <Krenair>	 yes, so how did it get the -pmtpa suffix?
[16:33:57] <grrrit-wm>	 (CR) Legoktm: Fix w-beta.wmflabs.org redirect (1 comment) [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) (owner: Alex Monk)
[16:34:21] <legoktm>	 Krenair: thanks for looking at this :)
[16:34:52] <Krenair>	 ah, you're right
[16:46:08] <Krenair>	 have been fiddling with this locally, I think after adding the protocol it might work legoktm 
[16:46:10] <grrrit-wm>	 (PS2) Alex Monk: Fix w-beta.wmflabs.org redirect [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444)
[16:46:12] <Krenair>	 will test in beta
[16:47:41] <ori>	 i tried that at one point last night and i don't think it worked
[16:51:52] <grrrit-wm>	 (PS7) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[17:01:21] <Krenair>	 ori, what about enabling proxy_http and using <Location "/"> ProxyPass "http://meta.wikimedia.beta.wmflabs.org/wiki/Special:UrlRedirector/" </Location> ?
[17:01:36] <ori>	 yeah that might work
[17:02:13] <Krenair>	 am trying it
[17:05:26] <icinga-wm>	 PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:06:37] <icinga-wm>	 PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[17:07:06] <icinga-wm>	 RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.771 second response time
[17:07:26] <Krenair>	 ori, legoktm: http://w-beta.wmflabs.org/t
[17:07:36] <ori>	 \o/
[17:07:43] <ori>	 well done
[17:07:45] <ori>	 submit a patch
[17:10:50] <grrrit-wm>	 (PS3) Alex Monk: Fix w-beta.wmflabs.org redirect [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444)
[17:11:59] <Krenair>	 ori, did you merge "UrlShortener on beta: fix RewriteRule" on deployment-puppetmaster?
[17:12:16] <ori>	 yes
[17:12:20] <Krenair>	 please use rebase
[17:12:57] <Krenair>	 we ended up with your "Merge branch 'production' of https://gerrit.wikimedia.org/r/operations/puppet into production" with your commit on top of all of the live hacks
[17:13:05] <Krenair>	 I've cleared it up now
[17:13:08] <ori>	 thanks
[17:15:36] <icinga-wm>	 RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0]
[17:15:48] <Krenair>	 ori, do you know how the @phabricator.wikimedia.org email addresses are set up?
[17:16:12] <ori>	 Krenair: no clue, sorry
[17:18:32] <grrrit-wm>	 (CR) Alex Monk: "Cherry-picked on deployment-puppetmaster" [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) (owner: Alex Monk)
[17:29:15] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] Fix w-beta.wmflabs.org redirect [puppet] - https://gerrit.wikimedia.org/r/248617 (https://phabricator.wikimedia.org/T116444) (owner: Alex Monk)
[17:32:40] <Krenair>	 '(.*\.)?wikivoyage\.beta\.wmflabs\.org', // None in beta?
[17:32:46] <Krenair>	 legoktm, I was thinking maybe we should make one at some point
[17:34:37] <Krenair>	 Why is QuickSurveys not in the extension lists?
[17:40:24] <grrrit-wm>	 (PS1) Alex Monk: Add QuickSurveys to extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/248632
[17:46:58] <grrrit-wm>	 (PS1) Alex Monk: Change Venetian Wikipedia logo per admin request [mediawiki-config] - https://gerrit.wikimedia.org/r/248633 (https://phabricator.wikimedia.org/T116476)
[17:55:57] <grrrit-wm>	 (PS1) Alex Monk: Checkout instead of cherry-pick [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634
[17:56:21] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Checkout instead of cherry-pick [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634 (owner: Alex Monk)
[17:57:20] <grrrit-wm>	 (PS2) Alex Monk: Checkout instead of cherry-pick [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634
[17:59:57] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps DWDM]
[18:00:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 1, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps DWDM]
[18:56:06] <legoktm>	 Krenair: yaaaaay
[18:57:33] <legoktm>	 Krenair: yeah, having a wikivoyage is probably a good idea since they have custom extensions
[19:29:51] <wikibugs>	 Ops-Access-Requests, operations, Repository-Ownership-Requests: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1750837 (Krenair) You want to be added to the gerrit wmf-deployment group only? Or you want actual deployment rights on the cluster?
[19:34:52] <twentyafterfour>	 !log deployed https://gerrit.wikimedia.org/r/#/c/248638/ and restarted apache on iridium
[19:34:58] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:53:54] <wikibugs>	 Puppet, Labs, Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1750852 (Krenair)
[19:55:44] <grrrit-wm>	 (PS1) Alex Monk: beta: Add enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/248639
[19:57:22] <wikibugs>	 Ops-Access-Requests, operations, Repository-Ownership-Requests: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1750858 (JanZerebecki) I didn't differentiate there. So yes both. Is there any use in having only the gerrit group? (Even for Wikibase which h...
[19:58:03] <wikibugs>	 Ops-Access-Requests, operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1750859 (Krenair)
[20:01:04] <wikibugs>	 operations, Labs, wikitech.wikimedia.org: distribution upgrade for wikitech-static instance - https://phabricator.wikimedia.org/T94585#1750861 (Aklapper)
[20:03:45] <wikibugs>	 Ops-Access-Requests, operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1750992 (Krenair) wmf-deployment can be added by any other deployer (or ops) once you get access on the cluster. I am aware of only one person who has wmf-deployment gerrit acc...
[20:08:55] <wikibugs>	 Puppet, Labs, Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1751011 (Negative24) Open>Resolved a:Negative24 `role::phabricator::main` isn't the right Puppet class to use in Labs. I'm pretty sure the error had to do with the site variables. I walked @...
[20:09:29] <jzerebec1i>	 &win /win 19
[20:10:24] <Krenair>	 ?
[20:10:56] <jzerebec1i>	 mistyped irssi command, was not intended to be sent here
[20:11:54] <Krenair>	 ah :)
[20:12:16] <jzerebec1i>	 twentyafterfour: does that mean someone now needs to correct these tasks permissions?
[20:12:39] <twentyafterfour>	 jzerebec1i: yes I'm on it
[20:12:50] <jzerebec1i>	 thx
[20:17:09] <twentyafterfour>	 jzerebecki: fixed
[20:17:57] <icinga-wm>	 PROBLEM - RAID on db1030 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[21:08:40] <grrrit-wm>	 (CR) Luke081515: [C: 1] beta: Add enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/248639 (owner: Alex Monk)
[21:29:28] <wikibugs>	 Puppet, Labs, Phabricator, Patch-For-Review: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1751092 (Negative24) Resolved>Open Those two commits ensure the directory is created but don't install the security extension...
[21:30:35] <Negative24>	 twentyafterfour: ^ I can do that now if you want
[21:30:54] <Negative24>	 turns out we did have a task :)
[21:32:31] <wikibugs>	 Puppet, Labs, Phabricator, Patch-For-Review: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1751095 (mmodell) I think we want the security extension in labs. At least until we deprecate its use. I'm in the process of develop...
[21:40:17] <grrrit-wm>	 (PS1) Negative24: phabricator: Set security ext tag for labs [puppet] - https://gerrit.wikimedia.org/r/248646 (https://bugzilla.wikimedia.org/104904)
[21:41:54] <grrrit-wm>	 (PS2) Negative24: phabricator: Set security ext tag for labs [puppet] - https://gerrit.wikimedia.org/r/248646 (https://phabricator.wikimedia.org/T104904)
[21:45:31] <grrrit-wm>	 (PS3) Negative24: phabricator: Set security ext tag for labs [puppet] - https://gerrit.wikimedia.org/r/248646 (https://phabricator.wikimedia.org/T104904)
[21:48:12] <wikibugs>	 operations, Database, Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1751121 (jcrespo)
[21:50:26] <icinga-wm>	 PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:31] <wikibugs>	 operations, ops-eqiad: db1030 RAID degraded (disk failed) - https://phabricator.wikimedia.org/T116499#1751123 (jcrespo) NEW
[21:53:29] <icinga-wm>	 ACKNOWLEDGEMENT - RAID on db1030 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T116499
[22:23:18] <wikibugs>	 Ops-Access-Requests, operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751145 (hoo) While (probably) not a formal requirement, I think you should also get access to the `mediawiki` gerrit group beforehand.
[22:30:05] <grrrit-wm>	 (CR) BryanDavis: "> does exported resources work in beta (cluster)" [puppet] - https://gerrit.wikimedia.org/r/179121 (owner: Giuseppe Lavagetto)
[22:37:26] <wikibugs>	 Ops-Access-Requests, operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751166 (Krenair) >>! In T116487#1751145, @hoo wrote: > While (probably) not a formal requirement, I think you should also get access to the `mediawiki` gerrit group beforehand...
[22:48:55] <icinga-wm>	 PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:18:47] <icinga-wm>	 PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:19:36] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 1, unused: 0
[23:20:06] <icinga-wm>	 PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail
[23:20:45] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0
[23:22:16] <icinga-wm>	 PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[23:26:56] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 27.27% of data above the critical threshold [500.0]
[23:34:06] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:48:06] <icinga-wm>	 RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:48:36] <icinga-wm>	 PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.