[00:04:15] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:07:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:08:16] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[00:09:05] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[00:13:41] operations, OTRS, user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1939403 (Aklapper)
[00:15:24] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:25] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[00:44:55] PROBLEM - restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:55] RECOVERY - restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[00:57:25] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:25] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:01:15] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: puppet fail
[01:28:24] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[01:28:55] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:35:24] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:36:45] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail
[01:47:24] !log restarting HHVM on mw1120, mw1125, mw1127, mw1132, mw1148; OOM
[01:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:57:44] PROBLEM - Apache HTTP on mw1127 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time
[01:57:46] PROBLEM - HHVM rendering on mw1125 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time
[01:57:55] PROBLEM - HHVM rendering on mw1127 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time
[01:58:05] PROBLEM - Apache HTTP on mw1125 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time
[01:58:24] PROBLEM - HHVM rendering on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time
[01:58:36] PROBLEM - Apache HTTP on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time
[02:04:05] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:13:44] YuviPanda: so.. what's happenin'??
[02:13:58] hey dbrant
[02:14:08] dbrant: the mobileapps service is flapping
[02:14:16] dbrant: first question is - does it affect only beta users or everyone?
[02:14:31] YuviPanda: ah! yes, it's only beta users
[02:14:35] ok
[02:14:41] do we care massively?
[02:14:49] and do you have a way to switch them over to the action API?
[02:15:56] so current status is that we've been getting timeout alerts from mobileapps and restbase
[02:16:13] YuviPanda: not *too* massively... Whether the app uses the service is controlled by a remote config variable (checked once a day)
[02:16:19] and there has been a massive drop in traffic in both since approximately 21:40Z
[02:16:25] dbrant: I also left voicemail for bearnd and stephen, btw.
[02:16:31] YuviPanda: hello!
[02:16:47] gwicke's original theory was that it was MW API related, but I doubt it is
[02:16:51] bernd is on parental leave so hopefully he can get a pass today
[02:16:52] YuviPanda: ok good, (although Bernd is on paternity leave!)
[02:16:56] hey niedzielski!
[02:17:05] niedzielski: dbrant ah, I wasn't aware. hopefully he doesn't respond.
[02:18:21] It's working for me...
[02:18:28] what kind of errors would i be seeing?
[02:18:30] YuviPanda: sorry, coming in a bit late to the convo. so i believe the code is structured so that users will fall back to the mw api
[02:18:55] oh wait, yes, maybe i'm just seeing it fall back by default.
[02:19:04] dbrant: so I saw a couple of 'timeout errors' when using the app on my phone, and grafana has https://grafana.wikimedia.org/dashboard/db/restbase which has gone all flat
[02:19:37] they're all suspiciously flat
[02:20:01] hmm, my app is definitely using RB. i'm able to look at articles that have pronunciations and geolocation buttons, which are only available via RB.
[02:21:11] https://grafana.wikimedia.org/dashboard/db/mobileapps shows a drop in the request rate
[02:21:41] now, it could be resource starvation because requests take a longer time to respond
[02:21:48] dbrant: things seem to be working as expected on my end too
[02:22:00] it's not clear from the second graph
[02:22:03] yeah, what niedzielski said. there should be an automatic fallback to the mw api after the first error
[02:22:29] bearND: go back to being on parental leave!
[02:22:34] root@scb1001:~# grep ETIMEDOUT /srv/log/mobileapps/main.log | wc -l
[02:22:37] 611
[02:22:59] YuviPanda: it's ok. it's the weekend anyways. lol
[02:23:00] root@scb1001:~# grep ETIMEDOUT /srv/log/mobileapps/main.log | head -1 | jq '.time'
[02:23:03] "2016-01-16T21:52:20.467Z"
[02:23:07] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Service+Cluster+B+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[02:23:20] yup
[02:23:27] drop in network and CPU util too
[02:23:35] maybe that's rb not passing requests back to mobileapps
[02:23:48] and mobileapps falling back to mobileview api?
[02:24:02] niedzielski: bearND dbrant do you have a way to measure how many fallbacks happened?
[02:24:50] hm, i think that's in our sharedpreferences and also in event logging (but beta event logging was kind of broken on friday)? checking
[02:26:03] niedzielski: prod eventlogging is also broken for normal people (replication to analytics-store is super slow) but I can run small quick queries on the EL master if needed. you can also use kafkacat to look at it in realtime (I can help check this)
[02:26:32] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 41s)
[02:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:48] it's definitely not failing or falling back for me...
[02:27:28] i don't think we send any events when we fall back
[02:27:58] ok
[02:28:05] I'm leaning towards restbase-related
[02:28:20] 2016-01-17T01:44:28.467Z warn restbase1001 Setting host 10.64.0.223:9042 as DOWN
[02:28:29] 2016-01-17T01:44:28.467Z warn restbase1001 Setting host 10.64.32.178:9042 as DOWN
[02:28:32] etc.
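
The two one-off commands above (counting ETIMEDOUT lines and pulling the first timestamp) generalize into a quick triage step. A minimal sketch, assuming — as the jq query above suggests — that main.log is one JSON object per line with an ISO-8601 "time" field:

    # Bucket ETIMEDOUT errors by hour to see when the timeouts started;
    # .time[0:13] keeps "YYYY-MM-DDTHH" from the timestamp string.
    grep ETIMEDOUT /srv/log/mobileapps/main.log \
        | jq -r '.time[0:13]' \
        | sort | uniq -c
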
[02:28:47] tons of "Heap memory limit temporarily exceeded"
[02:28:55] I suppose that's cassandra?
[02:29:05] or is that nodejs's?
[02:29:05] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: puppet fail
[02:29:20] * YuviPanda finds out the machines with that ip
[02:29:27] that's just restbase
[02:29:32] yeah ok
[02:29:34] and it was for all the restbase cluster
[02:29:47] (these are from logstash -> restbase -> last 24h)
[02:30:41] dbrant: so i think requestSuccesses is a magic number for rb failures. mine is at -1 (rb failure) which may have been from a previous test
[02:31:30] paravoid: I'm going to try paging marko again and if that fails call urandom
[02:31:49] req/s seem to have started rising again
[02:32:04] e.g. http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Service+Cluster+B+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[02:32:10] and https://grafana.wikimedia.org/dashboard/db/restbase
[02:32:20] dbrant: yeah, i reset it and do not see it going back to -1
[02:32:50] Completed flushing /var/lib/cassandra/data/local_group_default_T_mobileapps_lead/data-13482db08e0b11e5adc765f8a3ffd5d4/local_group_default_T_mobileapps_lead-data-tmp-ka-165-Data.db (11.069MiB) for commitlog position ReplayPosition(segmentId=1449795889502,
[02:32:56] Completed flushing /var/lib/cassandra/data/local_group_default_T_title__revisions/idx_by_rev_ever-a6ef7ab0103d11e5a41a5926693ccb22/local_group_default_T_title__revisions-idx_by_rev_ever-tmp-ka-6909-Data.db (1.299MiB) for commitlog position ReplayPosition(
[02:33:00] these were *just* now
[02:33:01] dbrant: the logs look good on my device too
[02:33:02] niedzielski: it remains "unset" for me
[02:34:03] lots of "Error in Cassandra table storage backend" too
[02:34:17] no response from marko, calling urandom now
[02:34:41] thank you so much YuviPanda
[02:35:47] dbrant: niedzielski do you think we should flip the setting to make apps stop using rb for now?
[02:35:51] dbrant: is it worth looking at the SessionFunnel.apiMode event log?
[02:36:17] paravoid: no luck with urandom either, 'the person you are calling has a voice mailbox that is not setup yet'
[02:36:20] * YuviPanda leaves him an SMS instead
[02:36:55] YuviPanda: we're not repro-ing the error and the app is designed to handle this scenario. since this is beta, i think we should let it run
[02:37:01] dbrant: ^^
[02:37:21] niedzielski: hmm, ok. are the steps to move it back documented somewhere, in case we need to do it to help restbase recover?
[02:38:48] only peter from services is uncalled so far
[02:38:51] let me call him too
[02:38:58] nah it's fine
[02:39:10] don't
[02:39:11] YuviPanda: (checking now). the app config is a json file that's easy to change. the deployment and cache purge might be trickier
[02:39:25] paravoid: ok
[02:39:35] we can call filippo but I'm not sure if there is much point without any noticeable user impact
[02:39:37] niedzielski: is this the thing in the MobileApp extension?
[02:39:42] paravoid: yeah, probably.
[02:39:44] yep
[02:39:53] YuviPanda: https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json
[02:40:08] nice
[02:40:20] so switching the % there to 0
[02:40:22] should do it
[02:40:26] but slowly (over a day)
[02:40:30] correct
[02:40:36] * YuviPanda remembers this mechanism :D
[02:40:42] ;)
[02:41:15] but what do you lose for not using restbase?
[02:41:19] some piece of functionality?
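
For reference, the kill switch discussed just above works through the static config file linked at [02:39:53]. A minimal sketch of inspecting and (hypothetically) flipping it — the restbaseBetaPercent field name comes up later in the log; the surrounding JSON structure is an assumption:

    # Look at the remote config the Android beta app fetches once a day:
    curl -s https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json | jq .
    # Turning RESTBase off for beta users would mean landing a change that
    # sets the percentage field to zero, roughly:
    #     "restbaseBetaPercent": 0
    # followed by a deploy and cache purge; because clients only re-check
    # once a day, the switch takes effect slowly (over a day).
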
[02:41:35] PROBLEM - Restbase root url on restbase1009 is CRITICAL: Connection refused
[02:41:40] dbrant: niedzielski so if the different params aren't documented somewhere (along with the process to deploy them + cache purge if necessary), do set it up on wikitech (not right now though, next week is fine)
[02:41:41] yay
[02:41:43] YuviPanda: yeah, i don't think we have a wiki for it... restbaseBetaPercent needs to be zeroed. here's the steps for a previous deployment: T118965
[02:41:43] uh oh
[02:41:52] yeah restbase is fucked basically
[02:41:54] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.110, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[02:42:49] niedzielski: ah, nice. thank you!
[02:43:33] YuviPanda: i just fwd'd the email with some more detail on the cache purge. we'll wiki this
[02:44:07] niedzielski: thanks :)
[02:44:17] VE still working in prod FWIW.
[02:44:48] paravoid: yeah, some functionality would be lost; but right now it's only in our Beta app, and only for 55% of users. And it should fall back to the regular API...
[02:45:10] Is there anything outside of VE and mobileapps that's using restbase?
[02:45:18] which piece of functionality?
[02:45:30] YuviPanda: Nothing big. Some tools.
[02:46:11] paravoid: pronunciation and geolocation info, I think.
[02:46:14] something's happening, but it's not completely healthy yet: https://grafana.wikimedia.org/dashboard/db/restbase
[02:46:20] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 08m 53s)
[02:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:47:07] niedzielski: dbrant can you give me an example URL to hit the mobileapps service with?
[02:47:26] YuviPanda: https://rest.wikimedia.org/en.wikipedia.org/v1/page/mobile-sections/Neptune
[02:47:30] YuviPanda: https://en.m.wikipedia.org/api/rest_v1/page/mobile-sections-lead/Cleveland
[02:47:54] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.009 second response time
[02:48:05] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[02:48:10] heh. which one does the app hit?
[02:48:52] YuviPanda: er, the request i mentioned is for just getting the lead section. here would be the rest of the page https://en.m.wikipedia.org/api/rest_v1/page/mobile-sections-remaining/Cleveland
[02:49:07] right, but is it hitting rest.wikimedia.org or en.m.wikipedia.org?
[02:49:14] (probably doesn't matter, but still)
[02:49:45] oh sorry, it's en.m
[02:50:34] YuviPanda dbrant: right, en.m. the other form is more for dev work
[02:53:19] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Jan 17 02:53:19 UTC 2016 (duration 6m 59s)
[02:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:55:51] dbrant: niedzielski ok!
[02:56:25] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:56:39] dbrant: niedzielski we've got gwicke online now and I think you guys are good to go. I'll switch the app off restbase if we feel the need and notify you (and call you both again if necessary)
[02:57:13] YuviPanda: your dedication is admirable! (+ paravoid)
[02:57:17] YuviPanda: thanks man
[02:57:33] np. thanks for responding quickly!
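
The public URLs exchanged above double as a spot check anyone can run; a small sketch using only the endpoints named in the log:

    # Probe the two endpoints the app hits and print just the status codes;
    # anything other than 200 would suggest beta users are hitting the
    # automatic fallback to the action API.
    for p in mobile-sections-lead mobile-sections-remaining; do
        curl -s -o /dev/null -w "$p: %{http_code}\n" \
            "https://en.m.wikipedia.org/api/rest_v1/page/$p/Cleveland"
    done
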
[03:36:15] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:36:35] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:45] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:38:45] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.153 second response time
[03:39:46] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 65651 bytes in 0.500 second response time
[03:40:55] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.955 second response time
[03:41:05] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 65651 bytes in 0.406 second response time
[03:41:54] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.531 second response time
[03:42:54] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 65651 bytes in 1.069 second response time
[03:50:45] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: puppet fail
[03:50:58] operations, Mobile-Content-Service: Improve operational documentation for the MobileApps extension - https://phabricator.wikimedia.org/T123852#1939455 (yuvipanda) NEW
[03:51:24] operations, Mobile-Content-Service: Improve operational documentation for the MobileApps extension - https://phabricator.wikimedia.org/T123852#1939464 (yuvipanda)
[03:51:25] PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: puppet fail
[03:51:55] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: puppet fail
[03:52:05] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: puppet fail
[04:01:35] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[04:01:55] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:15] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[04:02:34] operations, Mobile-Content-Service: Improve operational documentation for the MobileApps extension - https://phabricator.wikimedia.org/T123852#1939467 (bearND) p: Triage>Normal
[04:18:45] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[04:19:15] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:19:25] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:20:14] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:24:01] operations, MediaWiki-API, Traffic, Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1939492 (GWicke) NEW
[04:55:24] (PS1) Andrew Bogott: Add a wiki + osm to labtestweb2001 [puppet] - https://gerrit.wikimedia.org/r/264558
[04:56:14] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[04:56:55] PROBLEM - Restbase root url on restbase1002 is CRITICAL: Connection refused
[04:57:25] PROBLEM - Restbase root url on restbase1008 is CRITICAL: Connection refused
[04:57:59] um
[04:58:15] PROBLEM - restbase endpoints health on restbase1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.221, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[04:58:48] !log started restbase on restbase1002
[04:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:59:01] I'm going to let 1008 be
[04:59:03] to see what happens to it
[04:59:04] RECOVERY - Restbase root url on restbase1002 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.014 second response time
[04:59:35] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.022 second response time
[05:00:25] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[05:00:25] RECOVERY - restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[05:00:38] (CR) Andrew Bogott: [C: +2] Add a wiki + osm to labtestweb2001 [puppet] - https://gerrit.wikimedia.org/r/264558 (owner: Andrew Bogott)
[05:00:39] it just recovers by itself
[05:00:41] ok
[05:00:43] fair enough
[05:00:47] something weird is going on here
[05:00:57] is there anything in syslog?
[05:01:46] lots of flooding by some metrics collector
[05:01:50] but nothing I can see besides that
[05:02:04] > Jan 17 04:52:33 restbase1002 systemd[1]: restbase.service: main process exited, code=exited, status=1/FAILURE
[05:02:06] Jan 17 04:52:33 restbase1002 systemd[1]: Unit restbase.service entered failed state.
[05:02:13] Jan 17 04:58:34 restbase1002 systemd[1]: Starting "restbase service"...
[05:02:15] Jan 17 04:58:34 restbase1002 systemd[1]: Started "restbase service".
[05:02:19] that was on 1002 where I manually started it
[05:02:54] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[05:03:02] Jan 17 04:51:24 restbase1008 systemd[1]: restbase.service: main process exited, code=exited, status=1/FAILURE
[05:03:04] Jan 17 04:52:25 restbase1008 systemd[1]: restbase.service stop-sigterm timed out. Killing.
[05:03:06] Jan 17 04:52:25 restbase1008 systemd[1]: Unit restbase.service entered failed state.
[05:03:10] then
[05:03:12] Jan 17 04:57:57 restbase1008 puppet-agent[38582]: (/Stage[main]/Restbase/Service::Node[restbase]/Base::Service_unit[restbase]/Service[restbase]/ensure) ensure changed 'stopped' to 'running'
[05:03:14] so puppet started it back up
[05:04:55] could you sample some request logs for /api/rest_v1/ ?
[05:05:14] and rest.wikimedia.org ?
[05:05:36] from kafka?
[05:05:48] or does rb log them somewhere?
[05:05:54] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: puppet fail
[05:06:01] actually, nevermind
[05:06:10] the vast majority of requests are internal
[05:06:21] so kafka / varnish wouldn't have them
[05:06:24] so wouldn't hit varnish, I suppose
[05:06:27] right
[05:06:47] (PS1) Andrew Bogott: Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560
[05:07:51] (CR) jenkins-bot: [V: -1] Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560 (owner: Andrew Bogott)
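
The systemd excerpts pasted above look hand-collected; a sketch of the same restart check, assuming journald keeps the restbase unit's logs on these hosts (restbase1008 is just the example from the log):

    # When did the unit die, and what brought it back?
    journalctl -u restbase --since "2016-01-17 04:00" \
        | grep -E "exited|failed state|Started"
    # Current unit state plus recent log lines:
    systemctl status restbase
    # Direct probe of the service port named in the Icinga checks:
    curl -s "http://localhost:7231/en.wikipedia.org/v1/?spec" >/dev/null \
        && echo up || echo down
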
[05:09:00] (PS2) Andrew Bogott: Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560
[05:09:51] (CR) jenkins-bot: [V: -1] Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560 (owner: Andrew Bogott)
[05:10:54] gwicke: logstash seems to have a lot more 'warn' in the last couple hours or so
[05:11:27] (PS3) Andrew Bogott: Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560
[05:12:00] normally, those levelPaths contain more info
[05:12:48] rather than just 'warn'?
[05:13:08] yeah, until this week it used to be warn/some/specific/thing
[05:13:17] which is useful for filtering / stats
[05:14:22] there were some changes in service-runner recently, which apparently dropped the suffix
[05:15:01] ok
[05:15:03] (CR) Andrew Bogott: [C: +2] Openstack Master: Don't error out on non-production realms. [puppet] - https://gerrit.wikimedia.org/r/264560 (owner: Andrew Bogott)
[05:15:08] is there something you think needs investigating right now?
[05:15:24] or ok to let it be until daylight?
[05:15:46] I'm pretty tired myself, and I created a task for the restart issue
[05:15:57] plus metrics for the action api
[05:16:14] the restart rate looks low, and should not affect users
[05:16:39] ok
[05:16:49] so I think we can investigate this later
[05:16:56] feel free to call me if you need a root until Europe wakes up
[05:17:01] thanks
[05:17:07] yeah, I'll go too
[05:17:07] thanks as well!
[05:17:12] np
[05:17:14] cya
[05:38:55] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:41:08] (PS1) Andrew Bogott: Added self-signed cert for labtestwikitech [puppet] - https://gerrit.wikimedia.org/r/264561
[05:42:55] (PS2) Andrew Bogott: Added self-signed cert for labtestwikitech [puppet] - https://gerrit.wikimedia.org/r/264561
[05:44:22] (CR) Andrew Bogott: [C: +2] Added self-signed cert for labtestwikitech [puppet] - https://gerrit.wikimedia.org/r/264561 (owner: Andrew Bogott)
[05:50:04] PROBLEM - DPKG on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:51:44] PROBLEM - configured eth on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:51:44] PROBLEM - salt-minion processes on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:51:54] PROBLEM - dhclient process on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:52:15] PROBLEM - Disk space on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:53:04] PROBLEM - RAID on labtestweb2001 is CRITICAL: Timeout while attempting connection
[05:56:46] Sorry all, forgot to downtime a new testbox. Should be quiet now.
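
On the levelPath discussion at [05:12:00]–[05:14:22]: those values come from the service-runner JSON logs, so once the suffix is restored the warnings can be grouped again. A hedged sketch — the levelPath field name is from the conversation, and the log file path is a guess modeled on the mobileapps one earlier:

    # Tally warnings by their specific levelPath (e.g. warn/some/specific/thing):
    jq -r 'select(.levelPath) | .levelPath' /srv/log/restbase/main.log \
        | sort | uniq -c | sort -rn
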
[06:02:48] RECOVERY - puppet last run on mw2187 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:10:28] RECOVERY - salt-minion processes on labtestweb2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:10:37] RECOVERY - configured eth on labtestweb2001 is OK: OK - interfaces up
[06:10:47] RECOVERY - DPKG on labtestweb2001 is OK: All packages OK
[06:10:57] RECOVERY - dhclient process on labtestweb2001 is OK: PROCS OK: 0 processes with command name dhclient
[06:10:58] RECOVERY - Disk space on labtestweb2001 is OK: DISK OK
[06:11:47] RECOVERY - RAID on labtestweb2001 is OK: OK: no disks configured for RAID
[06:16:27] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:30:47] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: puppet fail
[06:31:07] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail
[06:31:19] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: puppet fail
[06:31:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:38] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:57] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:32:07] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:18] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:48] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:48] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:53:48] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:28] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:07] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:38] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:58] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:27] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:37] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:07] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:18:58] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[08:42:08] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused
[08:42:22] andrewbogott: is that you?
[08:42:39] PROBLEM - restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.220, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[08:42:48] paravoid: I have a shell session there but haven’t done anything
[08:42:56] besides look at things
[08:55:17] RECOVERY - restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[08:56:48] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.007 second response time
[09:35:48] PROBLEM - puppet last run on db1057 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:00:58] RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[10:35:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 816
[10:40:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 418851 Threads: 2 Questions: 2962907 Slow queries: 2733 Opens: 1339 Flush tables: 2 Open tables: 397 Queries per second avg: 7.073 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:07:44] operations, RESTBase-Cassandra: replace default Cassandra superuser - https://phabricator.wikimedia.org/T113622#1939635 (mark)
[11:15:55] (PS1) Giuseppe Lavagetto: restbase: cycle username [puppet] - https://gerrit.wikimedia.org/r/264582
[11:21:10] (CR) Giuseppe Lavagetto: [C: +2] restbase: cycle username [puppet] - https://gerrit.wikimedia.org/r/264582 (owner: Giuseppe Lavagetto)
[11:26:19] (PS1) Giuseppe Lavagetto: restbase: cycling the cassandra user (again) [puppet] - https://gerrit.wikimedia.org/r/264583
[11:26:48] (CR) Giuseppe Lavagetto: [C: +2 V: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/264583 (owner: Giuseppe Lavagetto)
[11:29:40] (PS1) Giuseppe Lavagetto: cassandra: change application username [puppet] - https://gerrit.wikimedia.org/r/264584
[11:30:30] (CR) Giuseppe Lavagetto: [C: +2 V: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/264584 (owner: Giuseppe Lavagetto)
[11:54:37] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.149, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[11:54:45] that's me ^
[11:55:18] PROBLEM - Restbase root url on restbase-test2001 is CRITICAL: Connection refused
[11:56:47] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.150, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
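
As an aside on the check_mysql alert at [10:35:17]: the 816 is MySQL's standard replication-lag counter, which can also be read on the host itself. A minimal sketch (assumes shell access and client credentials on db1008):

    # Replication lag in seconds, the same figure the Icinga check reports:
    mysql -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master
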
[11:57:07] PROBLEM - Restbase root url on restbase-test2002 is CRITICAL: Connection refused
[11:59:27] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 65884 bytes in 9.075 second response time
[11:59:29] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.147 second response time
[12:00:37] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time
[12:00:37] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 66195 bytes in 0.174 second response time
[12:03:49] RECOVERY - Restbase root url on restbase-test2002 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.146 second response time
[12:05:17] PROBLEM - HHVM rendering on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:05:37] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:05:47] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy
[12:05:48] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy
[12:06:28] RECOVERY - Restbase root url on restbase-test2001 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.119 second response time
[12:06:37] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:07:08] PROBLEM - Apache HTTP on mw1134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time
[12:07:38] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.593 second response time
[12:08:38] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 66188 bytes in 2.240 second response time
[12:09:27] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.100 second response time
[12:09:28] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 66196 bytes in 2.216 second response time
[12:33:38] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:59:18] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:54:17] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[15:00:57] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[15:26:58] PROBLEM - puppet last run on mw2058 is CRITICAL: CRITICAL: puppet fail
[15:54:48] RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:05:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[16:10:38] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[16:14:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:14:57] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:53:19] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:11:17] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:14:28] (CR) QChris: [C: -1] "The .war file is the '2.12 release war' from" [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
[17:14:53] operations, RESTBase: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1939928 (GWicke) NEW
[17:15:35] gwicke: yes, thank you :)
[17:18:58] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:29:28] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[17:31:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:36:57] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:37:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:38:08] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:57:59] qchris_: Do you want to create a project for extension-UserPageEditProtection or not?
[18:00:13] Luke081515: By project you mean 'Phabricator project'?
[18:00:24] qchris_: Yes
[18:00:36] I'd leave that to the owner of the repo.
[18:00:41] I just created the repo.
[18:01:05] ah, ok. For now I've tagged the repo as "mediawiki-extensions-other", so we'd have to update that if a project gets created
[18:01:25] so, I just wanted to know if I have to update it now or later ;)
[18:02:57] qchris_: You are welcome to add a project to new repos too, because at the moment I've got a backlog of 700 extension repos without projects... this takes some time
[18:05:22] (Only repos with the word "extension")
[18:57:57] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge.
[18:59:32] ACKNOWLEDGEMENT - HTTPS on labtestweb2001 is CRITICAL: Use of uninitialized value sans in concatenation (.) or string at /usr/lib/nagios/plugins/check_ssl line 185. andrew bogott why are these even here when the host is in downtime?
[18:59:32] ACKNOWLEDGEMENT - mysqld processes on labtestweb2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld andrew bogott why are these even here when the host is in downtime?
[20:59:15] operations, Discovery, Wikimedia-Logstash, Elasticsearch: Upgrade ElasticSearch to 1.7.4 - https://phabricator.wikimedia.org/T122697#1940111 (Reedy) https://github.com/ruflin/Elastica/issues/1036#issuecomment-172360242 >All 1.7 versions and even previous versions should not be a problem with 2.3.*....
[20:59:27] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:26:58] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:58:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 201, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: asw-c-eqiad:xe-1/1/2 {#2826} [10Gbps DF]
[21:58:17] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: asw-c-eqiad:xe-1/1/0 {#1984} [10Gbps DF]
[22:13:08] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[22:14:15] (PS1) Tim Landscheidt: puppetmaster: Fix git-sync-upstream for unclean rebases [puppet] - https://gerrit.wikimedia.org/r/264692
[22:15:17] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[22:57:15] (CR) Tim Landscheidt: [C: -1] "Pending clarification on T123890." [puppet] - https://gerrit.wikimedia.org/r/238662 (https://phabricator.wikimedia.org/T91874) (owner: Tim Landscheidt)
[23:34:26] Puppet, MediaWiki-Vagrant, Easy: MediaWiki-Vagrant guest OS clock gets out of sync - https://phabricator.wikimedia.org/T116507#1940251 (bd808) Maybe we could use a Vagrantfile solution like this one for configuring the VirtualBox provider rather than NTP insid...
[23:52:48] PROBLEM - puppet last run on mw2039 is CRITICAL: CRITICAL: Puppet has 1 failures