[00:04:15] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:07:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:08:16] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[00:09:05] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[00:13:41] operations, OTRS, user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1939403 (Aklapper)
[00:15:24] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:25] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[00:44:55] PROBLEM - restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:55] RECOVERY - restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[00:57:25] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:25] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:01:15] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: puppet fail
[01:28:24] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[01:28:55] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:35:24] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:36:45] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail
[01:47:24] !log restarting HHVM on mw1120, mw1125, mw1127, mw1132, mw1148; OOM
[01:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:57:44] PROBLEM - Apache HTTP on mw1127 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time
[01:57:46] PROBLEM - HHVM rendering on mw1125 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time
[01:57:55] PROBLEM - HHVM rendering on mw1127 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time
[01:58:05] PROBLEM - Apache HTTP on mw1125 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time
[01:58:24] PROBLEM - HHVM rendering on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time
[01:58:36] PROBLEM - Apache HTTP on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time
[02:04:05] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:13:44] YuviPanda: so.. what's happenin'??
[02:13:58] hey dbrant
[02:14:08] dbrant: the mobileapps service is flapping
[02:14:16] dbrant: first question is - does it affect only beta users or everyone?
[02:14:31] YuviPanda: ah! yes, it's only beta users
[02:14:35] ok
[02:14:41] do we care massively?
[02:14:49] and do you have a way to switch them over to the action API?
[02:15:56] so current status is that we've been getting timeout alerts from mobileapps and restbase
[02:16:13] YuviPanda: not *too* massively... Whether the app uses the service is controlled by a remote config variable (checked once a day)
[02:16:19] and there has been a massive drop in traffic in both since approximately 21:40Z
[02:16:25] dbrant: I also left voicemail for bearnd and stephen, btw.
[02:16:31] YuviPanda: hello!
[02:16:47] gwicke's original theory was that it was MW API related, but I doubt it is
[02:16:51] bernd is on parental leave so hopefully he can get a pass today
[02:16:52] YuviPanda: ok good, (although Bernd is on paternity leave!)
[02:16:56] hey niedzielski!
[02:17:05] niedzielski: dbrant ah, I wasn't aware. hopefully he doesn't respond.
[02:18:21] It's working for me...
[02:18:28] what kind of errors would i be seeing?
[02:18:30] YuviPanda: sorry, coming in a bit late to the convo. so i believe the code is structured so that users will fall back to the mw api
[02:18:55] oh wait, yes, maybe i'm just seeing it fall back by default.
[02:19:04] dbrant: so I saw a couple of 'timeout errors' when using the app on my phone, and grafana has https://grafana.wikimedia.org/dashboard/db/restbase which has gone all flat
[02:19:37] they're all suspiciously flat
[02:20:01] hmm, my app is definitely using RB. i'm able to look at articles that have pronunciations and geolocation buttons, which are only available via RB.
[02:21:11] https://grafana.wikimedia.org/dashboard/db/mobileapps shows a drop in the request rate
[02:21:41] now, it could be resource starvation because requests take a longer time to respond
[02:21:48] dbrant: things seem to be working as expected on my end too
[02:22:00] it's not clear from the second graph
[02:22:03] yeah, what niedzielski said. there should be an automatic fallback to the mw api after the first error
[02:22:29] bearND: go back to being on parental leave!
[02:22:34] root@scb1001:~# grep ETIMEDOUT /srv/log/mobileapps/main.log | wc -l
[02:22:37] 611
[02:22:59] YuviPanda: it's ok. it's the weekend anyways. lol
[02:23:00] root@scb1001:~# grep ETIMEDOUT /srv/log/mobileapps/main.log | head -1 | jq '.time'
[02:23:03] "2016-01-16T21:52:20.467Z"
[02:23:07] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Service+Cluster+B+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[02:23:20] yup
[02:23:27] drop in network and CPU util too
[02:23:35] maybe that's rb not passing requests back to mobileapps
[02:23:48] and mobileapps falling back to mobileview api?
[02:24:02] niedzielski: bearND dbrant do you have a way to measure how many fallbacks happened?
[02:24:50] hm, i think that's in our sharedpreferences and also in event logging (but beta event logging was kind of broken on friday)? checking
[02:26:03] niedzielski: prod eventlogging is also broken for normal people (replication to analytics-store is super slow) but I can run small quick queries on the EL master if needed. you can also use kafkacat to look at it in realtime (I can help check this)
[02:26:32] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 41s)
[02:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:48] it's definitely not failing or falling back for me...
[02:27:28] i don't think we send any events when we fall back
[02:27:58] ok
[02:28:05] I'm leaning towards restbase-related
[02:28:20] 2016-01-17T01:44:28.467Z warn restbase1001 Setting host 10.64.0.223:9042 as DOWN
[02:28:29] 2016-01-17T01:44:28.467Z warn restbase1001 Setting host 10.64.32.178:9042 as DOWN
[02:28:32] etc.
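
The two one-off commands above (counting ETIMEDOUT lines and pulling the first timestamp) generalize into a quick triage step. A minimal sketch, assuming — as the jq query above suggests — that main.log is one JSON object per line with an ISO-8601 "time" field:

    # Bucket ETIMEDOUT errors by hour to see when the timeouts started;
    # .time[0:13] keeps "YYYY-MM-DDTHH" from the timestamp string.
    grep ETIMEDOUT /srv/log/mobileapps/main.log \
        | jq -r '.time[0:13]' \
        | sort | uniq -c
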
[02:28:47] tons of "Heap memory limit temporarily exceeded"
[02:28:55] I suppose that's cassandra?
[02:29:05] or is that nodejs's?
[02:29:05] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: puppet fail
[02:29:20] * YuviPanda finds out the machines with that ip
[02:29:27] that's just restbase
[02:29:32] yeah ok
[02:29:34] and it was for all the restbase cluster
[02:29:47] (these are from logstash -> restbase -> last 24h)
[02:30:41] dbrant: so i think requestSuccesses is a magic number for rb failures. mine is at -1 (rb failure) which may have been from a previous test
[02:31:30] paravoid: I'm going to try paging marko again and if that fails call urandom
[02:31:49] req/s seem to have started rising again
[02:32:04] e.g. http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Service+Cluster+B+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[02:32:10] and https://grafana.wikimedia.org/dashboard/db/restbase
[02:32:20] dbrant: yeah, i reset it and do not see it going back to -1
[02:32:50] Completed flushing /var/lib/cassandra/data/local_group_default_T_mobileapps_lead/data-13482db08e0b11e5adc765f8a3ffd5d4/local_group_default_T_mobileapps_lead-data-tmp-ka-165-Data.db (11.069MiB) for commitlog position ReplayPosition(segmentId=1449795889502,
[02:32:56] Completed flushing /var/lib/cassandra/data/local_group_default_T_title__revisions/idx_by_rev_ever-a6ef7ab0103d11e5a41a5926693ccb22/local_group_default_T_title__revisions-idx_by_rev_ever-tmp-ka-6909-Data.db (1.299MiB) for commitlog position ReplayPosition(
[02:33:00] these were *just* now
[02:33:01] dbrant: the logs look good on my device too
[02:33:02] niedzielski: it remains "unset" for me
[02:34:03] lots of "Error in Cassandra table storage backend" too
[02:34:17] no response from marko, calling urandom now
[02:34:41] thank you so much YuviPanda
[02:35:47] dbrant: niedzielski do you think we should flip the setting to make apps stop using rb for now?
[02:35:51] dbrant: is it worth looking at the SessionFunnel.apiMode event log?
[02:36:17] paravoid: no luck with urandom either, 'the person you are calling has a voice mailbox that is not setup yet'
[02:36:20] * YuviPanda leaves him an SMS instead
[02:36:55] YuviPanda: we're not repro-ing the error and the app is designed to handle this scenario. since this is beta, i think we should let it run
[02:37:01] dbrant: ^^
[02:37:21] niedzielski: hmm, ok. are the steps to move it back documented somewhere, in case we need to do it to help restbase recover?
[02:38:48] only peter from services is uncalled so far
[02:38:51] let me call him too
[02:38:58] nah it's fine
[02:39:10] don't
[02:39:11] YuviPanda: (checking now). the app config is a json file that's easy to change. the deployment and cache purge might be trickier
[02:39:25] paravoid: ok
[02:39:35] we can call filippo but I'm not sure if there is much point without any noticeable user impact
[02:39:37] niedzielski: is this the thing in the MobileApp extension?
[02:39:42] paravoid: yeah, probably.
[02:39:44] yep
[02:39:53] YuviPanda: https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json
[02:40:08] nice
[02:40:20] so switching the % there to 0
[02:40:22] should do it
[02:40:26] but slowly (over a day)
[02:40:30] correct
[02:40:36] * YuviPanda remembers this mechanism :D
[02:40:42] ;)
[02:41:15] but what do you lose for not using restbase?
[02:41:19] some piece of functionality?
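
For reference, the kill switch discussed just above works through the static config file linked at [02:39:53]. A minimal sketch of inspecting and (hypothetically) flipping it — the restbaseBetaPercent field name comes up later in the log; the surrounding JSON structure is an assumption:

    # Look at the remote config the Android beta app fetches once a day:
    curl -s https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json | jq .
    # Turning RESTBase off for beta users would mean landing a change that
    # sets the percentage field to zero, roughly:
    #     "restbaseBetaPercent": 0
    # followed by a deploy and cache purge; because clients only re-check
    # once a day, the switch takes effect slowly (over a day).
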
[02:41:35] PROBLEM - Restbase root url on restbase1009 is CRITICAL: Connection refused
[02:41:40] dbrant: niedzielski so if the different params aren't documented somewhere (along with the process to deploy them + cache purge if necessary), do set it up on wikitech (not right now though, next week is fine)
[02:41:41] yay
[02:41:43] YuviPanda: yeah, i don't think we have a wiki for it... restbaseBetaPercent needs to be zeroed. here's the steps for a previous deployment: T118965
[02:41:43] uh oh
[02:41:52] yeah restbase is fucked basically
[02:41:54] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.110, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[02:42:49] niedzielski: ah, nice. thank you!
[02:43:33] YuviPanda: i just fwd'd the email with some more detail on the cache purge. we'll wiki this
[02:44:07] niedzielski: thanks :)
[02:44:17] VE still working in prod FWIW.
[02:44:48] paravoid: yeah, some functionality would be lost; but right now it's only in our Beta app, and only for 55% of users. And it should fall back to the regular API...
[02:45:10] Is there anything outside of VE and mobileapps that's using restbase?
[02:45:18] which piece of functionality?
[02:45:30] YuviPanda: Nothing big. Some tools.
[02:46:11] paravoid: pronunciation and geolocation info, I think.
[02:46:14] something's happening, but it's not completely healthy yet: https://grafana.wikimedia.org/dashboard/db/restbase
[02:46:20] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 08m 53s)
[02:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:47:07] niedzielski: dbrant can you give me an example URL to hit the mobileapps service with?
[02:47:26] YuviPanda: https://rest.wikimedia.org/en.wikipedia.org/v1/page/mobile-sections/Neptune
[02:47:30] YuviPanda: https://en.m.wikipedia.org/api/rest_v1/page/mobile-sections-lead/Cleveland
[02:47:54] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.009 second response time
[02:48:05] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[02:48:10] heh. which one does the app hit?
[02:48:52] YuviPanda: er, the request i mentioned is for just getting the lead section. here would be the rest of the page https://en.m.wikipedia.org/api/rest_v1/page/mobile-sections-remaining/Cleveland
[02:49:07] right, but is it hitting rest.wikimedia.org or en.m.wikipedia.org?
[02:49:14] (probably doesn't matter, but still)
[02:49:45] oh sorry, it's en.m
[02:50:34] YuviPanda dbrant: right, en.m. the other form is more for dev work
[02:53:19] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Jan 17 02:53:19 UTC 2016 (duration 6m 59s)
[02:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:55:51] dbrant: niedzielski ok!
[02:56:25] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:56:39] dbrant: niedzielski we've got gwicke online now and I think you guys are good to go. I'll switch the app off restbase if we feel the need and notify you (and call you both again if necessary)
[02:57:13] YuviPanda: your dedication is admirable! (+ paravoid)
[02:57:17] YuviPanda: thanks man
[02:57:33] np. thanks for responding quickly!
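
The public URLs exchanged above double as a spot check anyone can run; a small sketch using only the endpoints named in the log:

    # Probe the two endpoints the app hits and print just the status codes;
    # anything other than 200 would suggest beta users are hitting the
    # automatic fallback to the action API.
    for p in mobile-sections-lead mobile-sections-remaining; do
        curl -s -o /dev/null -w "$p: %{http_code}\n" \
            "https://en.m.wikipedia.org/api/rest_v1/page/$p/Cleveland"
    done
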
[03:36:15] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:36:35] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:45] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:38:45] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.153 second response time
[03:39:46] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 65651 bytes in 0.500 second response time
[03:40:55] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.955 second response time
[03:41:05] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 65651 bytes in 0.406 second response time
[03:41:54] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.531 second response time
[03:42:54] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 65651 bytes in 1.069 second response time
[03:50:45] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: puppet fail
[03:50:58] operations, Mobile-Content-Service: Improve operational documentation for the MobileApps extension - https://phabricator.wikimedia.org/T123852#1939455 (yuvipanda) NEW
[03:51:24] operations, Mobile-Content-Service: Improve operational documentation for the MobileApps extension - https://phabricator.wikimedia.org/T123852#1939464 (yuvipanda)
[03:51:25] PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: puppet fail
[03:51:55] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: puppet fail
[03:52:05] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: puppet fail
[04:01:35] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[04:01:55] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:15] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[04:02:34] operations, Mobile-Content-Service: Improve operational documentation for the MobileApps extension - https://phabricator.wikimedia.org/T123852#1939467 (bearND) p: Triage>Normal
[04:18:45] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[04:19:15] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:19:25] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:20:14] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:24:01] operations, MediaWiki-API, Traffic, Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1939492 (GWicke) NEW
[04:55:24] (PS1) Andrew Bogott: Add a wiki + osm to labtestweb2001 [puppet] - https://gerrit.wikimedia.org/r/264558
[04:56:14] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[04:56:55] PROBLEM - Restbase root url on restbase1002 is CRITICAL: Connection refused
[04:57:25] PROBLEM - Restbase root url on restbase1008 is CRITICAL: Connection refused
[04:57:59] um
[04:58:15] PROBLEM - restbase endpoints health on restbase1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.221, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[04:58:48] !log started restbase on restbase1002
[04:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:59:01] I'm going to let 1008 be
[04:59:03] to see what happens to it
[04:59:04] RECOVERY - Restbase root url on restbase1002 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.014 second response time
[04:59:35] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.022 second response time
[05:00:25] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[05:00:25] RECOVERY - restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[05:00:38] (CR) Andrew Bogott: [C: +2] Add a wiki + osm to labtestweb2001 [puppet] - https://gerrit.wikimedia.org/r/264558 (owner: Andrew Bogott)
[05:00:39] it just recovers by itself
[05:00:41] ok
[05:00:43] fair enough
[05:00:47] something weird is going on here
[05:00:57] is there anything in syslog?
[05:01:46] lots of flooding by some metrics collector
[05:01:50] but nothing I can see besides that
[05:02:04] > Jan 17 04:52:33 restbase1002 systemd[1]: restbase.service: main process exited, code=exited, status=1/FAILURE
[05:02:06] Jan 17 04:52:33 restbase1002 systemd[1]: Unit restbase.service entered failed state.
[05:02:13] Jan 17 04:58:34 restbase1002 systemd[1]: Starting "restbase service"...
[05:02:15] Jan 17 04:58:34 restbase1002 systemd[1]: Started "restbase service".
[05:02:19] that was on 1002 where I manually started it
[05:02:54] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[05:03:02] Jan 17 04:51:24 restbase1008 systemd[1]: restbase.service: main process exited, code=exited, status=1/FAILURE
[05:03:04] Jan 17 04:52:25 restbase1008 systemd[1]: restbase.service stop-sigterm timed out. Killing.
[05:03:06] Jan 17 04:52:25 restbase1008 systemd[1]: Unit restbase.service entered failed state.
[05:03:10] then
[05:03:12] Jan 17 04:57:57 restbase1008 puppet-agent[38582]: (/Stage[main]/Restbase/Service::Node[restbase]/Base::Service_unit[restbase]/Service[restbase]/ensure) ensure changed 'stopped' to 'running'
[05:03:14] so puppet started it back up
[05:04:55] could you sample some request logs for /api/rest_v1/ ?
[05:05:14] and rest.wikimedia.org ?
[05:05:36] from kafka?
[05:05:48] or does rb log them somewhere?
[05:05:54] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: puppet fail
[05:06:01] actually, nevermind
[05:06:10] the vast majority of requests are internal
[05:06:21] so kafka / varnish wouldn't have them
[05:06:24] so wouldn't hit varnish, I suppose
[05:06:27] right
[05:06:47] (PS1) Andrew Bogott: Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560
[05:07:51] (CR) jenkins-bot: [V: -1] Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560 (owner: Andrew Bogott)
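
The systemd excerpts pasted above look hand-collected; a sketch of the same restart check, assuming journald keeps the restbase unit's logs on these hosts (restbase1008 is just the example from the log):

    # When did the unit die, and what brought it back?
    journalctl -u restbase --since "2016-01-17 04:00" \
        | grep -E "exited|failed state|Started"
    # Current unit state plus recent log lines:
    systemctl status restbase
    # Direct probe of the service port named in the Icinga checks:
    curl -s "http://localhost:7231/en.wikipedia.org/v1/?spec" >/dev/null \
        && echo up || echo down
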
[05:09:00] (PS2) Andrew Bogott: Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560
[05:09:51] (CR) jenkins-bot: [V: -1] Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560 (owner: Andrew Bogott)
[05:10:54] gwicke: logstash seems to have a lot more 'warn' in the last couple hours or so
[05:11:27] (PS3) Andrew Bogott: Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560
[05:12:00] normally, those levelPaths contain more info
[05:12:48] rather than just 'warn'?
[05:13:08] yeah, until this week it used to be warn/some/specific/thing
[05:13:17] which is useful for filtering / stats
[05:14:22] there were some changes in service-runner recently, which apparently dropped the suffix
[05:15:01] ok
[05:15:03] (CR) Andrew Bogott: [C: +2] Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560 (owner: Andrew Bogott)
[05:15:08] is there something you think needs investigating right now?
[05:15:24] or ok to let it be until daylight?
[05:15:46] I'm pretty tired myself, and I created a task for the restart issue
[05:15:57] plus metrics for the action api
[05:16:14] the restart rate looks low, and should not affect users
[05:16:39] ok
[05:16:49] so I think we can investigate this later
[05:16:56] feel free to call me if you need a root until Europe wakes up
[05:17:01] thanks
[05:17:07] yeah, I'll go too
[05:17:07] thanks as well!
[05:17:12] np
[05:17:14] cya
[05:38:55] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:41:08] (PS1) Andrew Bogott: Added self-signed cert for labtestwikitech [puppet] - https://gerrit.wikimedia.org/r/264561
[05:42:55] (PS2) Andrew Bogott: Added self-signed cert for labtestwikitech [puppet] - https://gerrit.wikimedia.org/r/264561
[05:44:22] (CR) Andrew Bogott: [C: +2] Added self-signed cert for labtestwikitech [puppet] - https://gerrit.wikimedia.org/r/264561 (owner: Andrew Bogott)
[05:50:04] PROBLEM - DPKG on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:51:44] PROBLEM - configured eth on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:51:44] PROBLEM - salt-minion processes on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:51:54] PROBLEM - dhclient process on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:52:15] PROBLEM - Disk space on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:53:04] PROBLEM - RAID on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:56:46] Sorry all, forgot to downtime a new testbox. Should be quiet now.
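
On the levelPath discussion at [05:12:00]–[05:14:22]: those values come from the service-runner JSON logs, so once the suffix is restored the warnings can be grouped again. A hedged sketch — the levelPath field name is from the conversation, and the log file path is a guess modeled on the mobileapps one earlier:

    # Tally warnings by their specific levelPath (e.g. warn/some/specific/thing):
    jq -r 'select(.levelPath) | .levelPath' /srv/log/restbase/main.log \
        | sort | uniq -c | sort -rn
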
[06:02:48] RECOVERY - puppet last run on mw2187 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:10:28] RECOVERY - salt-minion processes on labtestweb2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:10:37] RECOVERY - configured eth on labtestweb2001 is OK: OK - interfaces up
[06:10:47] RECOVERY - DPKG on labtestweb2001 is OK: All packages OK
[06:10:57] RECOVERY - dhclient process on labtestweb2001 is OK: PROCS OK: 0 processes with command name dhclient
[06:10:58] RECOVERY - Disk space on labtestweb2001 is OK: DISK OK
[06:11:47] RECOVERY - RAID on labtestweb2001 is OK: OK: no disks configured for RAID
[06:16:27] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:30:47] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: puppet fail
[06:31:07] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail
[06:31:19] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: puppet fail
[06:31:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:38] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:57] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:32:07] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:18] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:48] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:48] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:53:48] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:28] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:07] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:38] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:58] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:27] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:37] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:07] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:18:58] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[08:42:08] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused
[08:42:22] andrewbogott: is that you?
[08:42:39] PROBLEM - restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.220, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[08:42:48] paravoid: I have a shell session there but haven’t done anything
[08:42:56] besides look at things
[08:55:17] RECOVERY - restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[08:56:48] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.007 second response time
[09:35:48] PROBLEM - puppet last run on db1057 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:00:58] RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[10:35:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 816
[10:40:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 418851 Threads: 2 Questions: 2962907 Slow queries: 2733 Opens: 1339 Flush tables: 2 Open tables: 397 Queries per second avg: 7.073 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:07:44] operations, RESTBase-Cassandra: replace default Cassandra superuser - https://phabricator.wikimedia.org/T113622#1939635 (mark)
[11:15:55] (PS1) Giuseppe Lavagetto: restbase: cycle username [puppet] - https://gerrit.wikimedia.org/r/264582
[11:21:10] (CR) Giuseppe Lavagetto: [C: +2] restbase: cycle username [puppet] - https://gerrit.wikimedia.org/r/264582 (owner: Giuseppe Lavagetto)
[11:26:19] (PS1) Giuseppe Lavagetto: restbase: cycling the cassandra user (again) [puppet] - https://gerrit.wikimedia.org/r/264583
[11:26:48] (CR) Giuseppe Lavagetto: [C: +2 V: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/264583 (owner: Giuseppe Lavagetto)
[11:29:40] (PS1) Giuseppe Lavagetto: cassandra: change application username [puppet] - https://gerrit.wikimedia.org/r/264584
[11:30:30] (CR) Giuseppe Lavagetto: [C: +2 V: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/264584 (owner: Giuseppe Lavagetto)
[11:54:37] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.149, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[11:54:45] that's me ^
[11:55:18] PROBLEM - Restbase root url on restbase-test2001 is CRITICAL: Connection refused
[11:56:47] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.150, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
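
As an aside on the check_mysql alert at [10:35:17]: the 816 is MySQL's standard replication-lag counter, which can also be read on the host itself. A minimal sketch (assumes shell access and client credentials on db1008):

    # Replication lag in seconds, the same figure the Icinga check reports:
    mysql -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master
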
[11:57:07] PROBLEM - Restbase root url on restbase-test2002 is CRITICAL: Connection refused
[11:59:27] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 65884 bytes in 9.075 second response time
[11:59:29] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.147 second response time
[12:00:37] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time
[12:00:37] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 66195 bytes in 0.174 second response time
[12:03:49] RECOVERY - Restbase root url on restbase-test2002 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.146 second response time
[12:05:17] PROBLEM - HHVM rendering on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:05:37] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:05:47] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy
[12:05:48] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy
[12:06:28] RECOVERY - Restbase root url on restbase-test2001 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.119 second response time
[12:06:37] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:07:08] PROBLEM - Apache HTTP on mw1134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time
[12:07:38] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.593 second response time
[12:08:38] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 66188 bytes in 2.240 second response time
[12:09:27] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.100 second response time
[12:09:28] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 66196 bytes in 2.216 second response time
[12:33:38] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:59:18] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:54:17] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[15:00:57] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[15:26:58] PROBLEM - puppet last run on mw2058 is CRITICAL: CRITICAL: puppet fail
[15:54:48] RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:05:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[16:10:38] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[16:14:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:14:57] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:53:19] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:11:17] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:14:28] (CR) QChris: [C: -1] "The .war file is the '2.12 release war' from" [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
[17:14:53] operations, RESTBase: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1939928 (GWicke) NEW
[17:15:35] gwicke: yes, thank you :)
[17:18:58] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:29:28] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[17:31:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:36:57] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:37:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:38:08] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:57:59] qchris_: Do you want to create a project for extension-UserPageEditProtection or not?
[18:00:13] Luke081515: By project you mean 'Phabricator project'?
[18:00:24] qchris_: Yes
[18:00:36] I'd leave that to the owner of the repo.
[18:00:41] I just created the repo.
[18:01:05] ah, ok. For now I've tagged the repo as "mediawiki-extensions-other", so we'd have to update that if a project gets created
[18:01:25] so, I just wanted to know if I have to update it now or later ;)
[18:02:57] qchris_: You are welcome to add a project to new repos too, because at the moment I've got a backlog of 700 extension repos without projects... this takes some time
[18:05:22] (Only repos with the word "extension")
[18:57:57] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge.
[18:59:32] ACKNOWLEDGEMENT - HTTPS on labtestweb2001 is CRITICAL: Use of uninitialized value sans in concatenation (.) or string at /usr/lib/nagios/plugins/check_ssl line 185. andrew bogott why are these even here when the host is in downtime?
[18:59:32] ACKNOWLEDGEMENT - mysqld processes on labtestweb2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld andrew bogott why are these even here when the host is in downtime?
[20:59:15] operations, Discovery, Wikimedia-Logstash, Elasticsearch: Upgrade ElasticSearch to 1.7.4 - https://phabricator.wikimedia.org/T122697#1940111 (Reedy) https://github.com/ruflin/Elastica/issues/1036#issuecomment-172360242 >All 1.7 versions and even previous versions should not be a problem with 2.3.*....
[20:59:27] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:26:58] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:58:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 201, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: asw-c-eqiad:xe-1/1/2 {#2826} [10Gbps DF]
[21:58:17] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: asw-c-eqiad:xe-1/1/0 {#1984} [10Gbps DF]
[22:13:08] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[22:14:15] (PS1) Tim Landscheidt: puppetmaster: Fix git-sync-upstream for unclean rebases [puppet] - https://gerrit.wikimedia.org/r/264692
[22:15:17] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[22:57:15] (CR) Tim Landscheidt: [C: -1] "Pending clarification on T123890." [puppet] - https://gerrit.wikimedia.org/r/238662 (https://phabricator.wikimedia.org/T91874) (owner: Tim Landscheidt)
[23:34:26] Puppet, MediaWiki-Vagrant, Easy: MediaWiki-Vagrant guest OS clock gets out of sync - https://phabricator.wikimedia.org/T116507#1940251 (bd808) Maybe we could use a Vagrantfile solution like this one for configuring the VirtualBox provider rather than NTP insid...
[23:52:48] PROBLEM - puppet last run on mw2039 is CRITICAL: CRITICAL: Puppet has 1 failures