[00:08:31] PROBLEM - Restbase root url on restbase1017 is CRITICAL: connect to address 10.64.32.129 and port 7231: Connection refused
[00:14:23] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) timed out before a response was received
[00:14:35] RECOVERY - Restbase root url on restbase1017 is OK: HTTP OK: HTTP/1.1 200 - 16164 bytes in 0.005 second response time
[00:15:31] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[03:33:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 985.36 seconds
[04:21:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 267.01 seconds
[05:00:48] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:01:19] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.32 ms
[05:01:56] bah, that paged =P
[05:02:45] silly icinga false positives (it seems)
[05:05:49] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[05:15:44] <_joe_> robh: not a false positive
[05:16:03] <_joe_> (I'm still waking up, I'll take a look at lvs3001 shortly)
[05:16:04] bah, i saw it come back but did not see the lvs3001 fail shortly after
[05:16:25] <_joe_> robh: I tried to sleep through it (it's 6 AM here)
[05:16:27] mostly cuz im not at home and using mifi to handle this from my car in a parking lot (not ideal)
[05:16:39] <_joe_> robh: I'll take a look in a few
[05:16:39] but, i can totally call/text folks
[05:16:53] <_joe_> don't worry, I got this, in a few :D
[05:16:56] ok, let me know if you want me to send out any further texts (cuz we saw an alert and then an ok)
[05:17:03] so i bet most folks think its fine
[05:17:07] <_joe_> nah no problem for now
[05:17:10] cool
[05:17:32] <_joe_> yeah, one should always take a look, even if it's immediately recovered
[05:17:47] cool, i was already in car when i got the page and hadn't pulled out of parking space =]
[05:17:55] now i shall, offline for a bit.
[05:18:06] <_joe_> yeah, no worries :)
[05:18:18] i got paged for the text-lb, but not for lvs
[05:18:24] fyi
[05:38:29] <_joe_> !log powercycling lvs3001
[05:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:13] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 64%, RTA = 83.44 ms
[06:28:27] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.007 second response time
[06:28:37] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:39] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
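An aside on the MariaDB Slave Lag alert at 03:33 above: it tripped at roughly 985 seconds and cleared at 267, so the critical threshold sits somewhere in between. A minimal sketch of that style of probe, assuming direct MySQL access; the hostname, credentials and thresholds here are illustrative guesses, not the production check:

    # Hypothetical replication-lag probe in the spirit of the icinga check above.
    import sys
    import pymysql

    WARN_SECONDS = 300   # assumed thresholds; the real ones are not in the log
    CRIT_SECONDS = 600

    conn = pymysql.connect(host="dbstore1002.example", user="monitor",
                           password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()

    lag = row["Seconds_Behind_Master"] if row else None
    if lag is None:
        print("UNKNOWN slave_sql_lag: replication not running")
        sys.exit(3)
    if lag >= CRIT_SECONDS:
        print(f"CRITICAL slave_sql_lag Replication lag: {lag:.2f} seconds")
        sys.exit(2)
    if lag >= WARN_SECONDS:
        print(f"WARNING slave_sql_lag Replication lag: {lag:.2f} seconds")
        sys.exit(1)
    print(f"OK slave_sql_lag Replication lag: {lag:.2f} seconds")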
[06:31:23] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R]
[06:32:47] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh]
[06:38:09] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.550 second response time
[06:38:19] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:57:23] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:45] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:53] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:04:08] (PS1) ArielGlenn: convert snapshot1005 to a regular dump runner [puppet] - https://gerrit.wikimedia.org/r/481285 (https://phabricator.wikimedia.org/T203382)
[07:24:37] (PS2) ArielGlenn: convert snapshot1005 to a regular dump runner [puppet] - https://gerrit.wikimedia.org/r/481285 (https://phabricator.wikimedia.org/T203382)
[07:28:27] (PS16) Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434)
[07:35:36] (PS3) ArielGlenn: convert snapshot1005 to a regular dump runner [puppet] - https://gerrit.wikimedia.org/r/481285 (https://phabricator.wikimedia.org/T203382)
[07:44:45] (PS4) ArielGlenn: convert snapshot1005 to a regular dump runner [puppet] - https://gerrit.wikimedia.org/r/481285 (https://phabricator.wikimedia.org/T203382)
[08:03:37] (PS5) ArielGlenn: convert snapshot1005 to a regular dump runner [puppet] - https://gerrit.wikimedia.org/r/481285 (https://phabricator.wikimedia.org/T203382)
[08:21:43] (CR) ArielGlenn: [C: +2] convert snapshot1005 to a regular dump runner [puppet] - https://gerrit.wikimedia.org/r/481285 (https://phabricator.wikimedia.org/T203382) (owner: ArielGlenn)
[08:33:06] (PS17) Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434)
[08:45:43] Operations, decommission, User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (fgiunchedi) a: fgiunchedi→RobH >>! In T199321#4840822, @RobH wrote: > @fgiunchedi: can you confirm these are ready for reclaim and disk wipe? I claimed it, but I likel...
[08:51:59] Operations, decommission, Patch-For-Review, User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (fgiunchedi)
[09:00:44] Operations, ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T212560 (fgiunchedi) Open→Invalid Host's raid controller locked up, a reboot brought things back. I'll update the controller firmware after the holidays JIC. ` ms-be2018:~$ cat /proc/mdstat Personalities : [ra...
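The ms-be2018 RAID task above was closed after a reboot, verified by eyeballing /proc/mdstat. A rough sketch of automating that eyeball check, assuming the usual mdstat layout where a status field like [UU_U] marks a missing member with an underscore (parsing is simplified):

    # Flag degraded md arrays by scanning /proc/mdstat for "_" in the [UU...]
    # status field of each array's detail line.
    import re

    def degraded_arrays(mdstat_text):
        arrays, current = [], None
        for line in mdstat_text.splitlines():
            header = re.match(r"^(md\d+)\s*:", line)
            if header:
                current = header.group(1)
            elif current and (status := re.search(r"\[([U_]+)\]", line)):
                if "_" in status.group(1):
                    arrays.append(current)
                current = None
        return arrays

    with open("/proc/mdstat") as f:
        bad = degraded_arrays(f.read())
    print("CRITICAL: degraded: " + ", ".join(bad) if bad else "OK: all md arrays clean")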
[09:02:15] (PS18) Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434)
[09:20:46] (PS19) Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434)
[09:21:41] Operations, RESTBase-Cassandra, Services: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (fgiunchedi) Agreed less workers will lessen the problem, though even per-worker logging (assuming different workers have different `pid` in...
[10:11:15] PROBLEM - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused
[10:14:55] (PS20) Giuseppe Lavagetto: elasticsearch: allow cross cluster communication [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: Mathew.onipe)
[10:17:19] RECOVERY - Restbase root url on restbase1018 is OK: HTTP OK: HTTP/1.1 200 - 16164 bytes in 0.007 second response time
[10:32:06] (PS21) Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434)
[10:42:26] (CR) Giuseppe Lavagetto: [C: -1] "Please be careful when merging this patch. It needs to be compiled on all clusters that don't define profile::elasticsearch::instances too" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: Mathew.onipe)
[10:47:15] (PS22) Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434)
[10:48:01] (CR) Mathew.onipe: elasticsearch: allow cross cluster communication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: Mathew.onipe)
[10:50:17] (CR) Mathew.onipe: elasticsearch: allow cross cluster communication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: Mathew.onipe)
[11:03:50] (CR) Mathew.onipe: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1002/14082/" [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: Mathew.onipe)
[11:09:05] (CR) Mathew.onipe: elasticsearch: allow cross cluster communication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: Mathew.onipe)
[11:28:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:28:47] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[11:32:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:32:25] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
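The "Restbase root url" checks above are plain HTTP probes against port 7231 (note the byte counts and response times in the recoveries). A small stand-in with the same shape, using the address from the restbase1018 alert; the production check is icinga's check_http, not this script:

    # Probe an HTTP root URL and report in roughly the same format as the alerts.
    import sys
    import time
    import urllib.request

    url = "http://10.64.48.97:7231/"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
            elapsed = time.monotonic() - start
            print(f"HTTP OK: HTTP/1.1 {resp.status} - {len(body)} bytes "
                  f"in {elapsed:.3f} second response time")
            sys.exit(0)
    except OSError as exc:  # covers connection refused as well as timeouts
        print(f"CRITICAL: {exc}")
        sys.exit(2)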
[11:51:27] Operations, Release Pipeline, Core Platform Team Backlog (Watching / External), Release-Engineering-Team (Watching / External), Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (fgiunchedi) Thanks for loo...
[11:51:50] Operations, decommission, Patch-For-Review, User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (fgiunchedi) a: RobH @RobH graphite100[13] confirmed ready for wipe/decom
[12:06:45] (PS23) Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434)
[12:14:12] (CR) Mathew.onipe: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/14083/" [puppet] - https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: Mathew.onipe)
[13:37:16] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (MoritzMuehlenhoff) npm 5.8 is now finally available in stretch-backports: https://lists.debian.org/debian-backports-changes/2018/12/threads.html
[14:34:53] (CR) Alexandros Kosiaris: [C: -2] "For the same reasons that I -1ed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475150/" [puppet] - https://gerrit.wikimedia.org/r/481215 (https://phabricator.wikimedia.org/T212327) (owner: Bstorm)
[14:36:36] (CR) Alexandros Kosiaris: [C: -2] "I've also found https://phabricator.wikimedia.org/T41785 that is pretty much the same topic." [puppet] - https://gerrit.wikimedia.org/r/481215 (https://phabricator.wikimedia.org/T212327) (owner: Bstorm)
[15:07:53] (PS1) ArielGlenn: permit rsync pulls from dumps primary nfs server to a peer [puppet] - https://gerrit.wikimedia.org/r/481299
[15:13:11] (PS2) ArielGlenn: permit rsync pulls from dumps primary nfs server to a peer [puppet] - https://gerrit.wikimedia.org/r/481299
[15:17:14] (CR) ArielGlenn: [C: +2] permit rsync pulls from dumps primary nfs server to a peer [puppet] - https://gerrit.wikimedia.org/r/481299 (owner: ArielGlenn)
[15:24:21] (PS1) ArielGlenn: ferm rules for rsync from dumps peers to dumps primary nfs server [puppet] - https://gerrit.wikimedia.org/r/481301
[15:29:35] (CR) ArielGlenn: [C: +2] ferm rules for rsync from dumps peers to dumps primary nfs server [puppet] - https://gerrit.wikimedia.org/r/481301 (owner: ArielGlenn)
[17:01:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:04:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:22:16] (PS1) ArielGlenn: use dumps nfs server fallback host for rsyncs to dump webserver etc [puppet] - https://gerrit.wikimedia.org/r/481303
[17:26:06] https://twitter.com/alicegoldfuss/status/1076944612432826368
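On ArielGlenn's rsync changes above (481299/481301): the ferm rules only open the port on the primary, so the peer does the pulling. A sketch of what the pull side might look like; hostname, rsync module name, paths and bandwidth cap are all hypothetical, since none of them appear in the log:

    # Illustrative pull from a dumps primary to a peer over the rsync protocol.
    import subprocess

    PRIMARY = "dumps-primary.example.wmnet"   # hypothetical hostname
    cmd = [
        "rsync", "-a", "--delete",
        "--bwlimit=40000",                    # be kind to the primary's NICs
        f"rsync://{PRIMARY}/dumps/",          # assumes an exported 'dumps' module
        "/srv/dumps/",
    ]
    subprocess.run(cmd, check=True)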
[17:33:41] I think I might be done for the day. rsync ate my brain. so... have a great holiday or vacation, hopefully everyone gets a couple days off
[18:01:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[18:02:43] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[18:02:45] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[18:03:55] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[18:03:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[18:05:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[18:07:27] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[18:07:35] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[18:08:45] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[18:08:49] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[18:09:53] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[18:11:13] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[18:13:31] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[18:17:27] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:19:41] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[18:19:53] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.930 second response time
[18:25:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[18:27:07] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[18:27:13] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:28:19] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[18:29:29] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[18:38:17] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.725 second response time
[18:42:01] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:43:05] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[18:43:07] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[18:44:13] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[18:44:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[18:48:01] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 0.886 second response time
[18:48:03] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:59:11] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:01:29] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[19:02:45] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[19:02:51] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.079 second response time
[19:03:53] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[19:03:55] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[19:05:03] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[19:05:07] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[19:05:07] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[19:06:13] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[19:06:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[19:07:27] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[19:07:31] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[19:07:33] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[19:07:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[19:08:51] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[19:08:59] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:10:01] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[19:10:03] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[19:11:15] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[19:12:21] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[19:12:27] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[19:12:29] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[19:12:37] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 7.569 second response time
[19:13:35] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[19:13:47] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[19:14:53] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[19:15:03] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[19:16:09] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[19:16:09] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[19:16:09] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[19:16:21] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:17:23] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[19:18:31] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[19:18:37] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[19:19:51] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[19:19:57] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[19:19:57] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[19:21:05] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[19:21:05] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[19:21:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[19:24:49] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[19:24:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[19:26:01] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[19:29:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[19:47:07] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.111 second response time
[19:50:55] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:03] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.132 second response time
[20:12:49] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:21:27] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.430 second response time
[20:25:09] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:30:01] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.039 second response time
[20:33:47] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:34:49] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[20:36:03] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
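Everything from 18:01 onward is one pattern: checks with a 10-second budget timing out and clearing a minute or two later. A toy poller that surfaces only the PROBLEM/RECOVERY transitions, with an illustrative endpoint URL (the real checks are icinga's, not this loop):

    # Probe an endpoint with the same 10 s budget the checks above use,
    # printing only state changes rather than every poll.
    import time
    import urllib.request

    URL = "http://scb1001.example:8000/"   # hypothetical service endpoint
    state = "OK"
    while True:
        try:
            urllib.request.urlopen(URL, timeout=10).read()
            new = "OK"
        except OSError:
            new = "CRITICAL"
        if new != state:
            label = "RECOVERY" if new == "OK" else "PROBLEM"
            print(f"{label} - endpoint is {new}")
            state = new
        time.sleep(60)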
[22:20:26] I confess to having restarted pdfrender on three of those hosts and it looks like they've shut up since then
[22:20:28] * apergos is off
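For the record, a sketch of the kind of restart apergos describes, assuming ssh access and sudo rights; the host list is illustrative (the log doesn't say which three hosts), and production tooling like cumin would normally do this instead:

    # Restart the pdfrender unit on a handful of hosts over ssh.
    # Hostnames are placeholders; substitute the hosts that are actually flapping.
    import subprocess

    HOSTS = ["scb1001.example", "scb1002.example", "scb1003.example"]
    for host in HOSTS:
        subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "pdfrender"],
                       check=True)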