[01:59:39] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 720.67 seconds
[03:36:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 959.76 seconds
[04:02:49] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 251.84 seconds
[04:08:23] RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
[05:28:17] PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003648, end_log_pos 920552713
[05:38:17] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 750.52 seconds
[06:30:07] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:38:39] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:39:26] 10Operations, 10Wikimedia-Site-requests, 10HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (10Krinkle) It seems that at P7727, @joe confirms this bug with HHVM, confirms that PHP 7 does not have the bug; and... found that changing `hhvm.server...
[07:20:53] !log Fix replication db1124:3318
[07:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:07] RECOVERY - MariaDB Slave SQL: s8 on db1124 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:34:39] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[08:02:31] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:37] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 75305 bytes in 0.687 second response time
[08:35:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:37:17] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[08:37:17] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[08:39:41] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[08:40:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[08:40:53] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[08:40:57] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:40:57] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[08:42:09] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[08:43:17] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[08:43:25] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
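Context for the 05:28 page and the 07:20 "Fix replication db1124:3318" entry above: error 1032 (HA_ERR_KEY_NOT_FOUND) means the replica could not find the row that a Delete_rows event on the master wanted to remove. The commands actually run on db1124 are not recorded in this log; the block below is only a hedged sketch of one common way a single broken event is skipped by hand on a MariaDB replica, with the host/port taken from the SAL entry and everything else assumed.

```sh
# Hypothetical sketch only -- not the commands behind "!log Fix replication db1124:3318".
# A common manual fix for a one-off HA_ERR_KEY_NOT_FOUND (1032) is to skip the single
# failing event and let replication resume, then verify consistency afterwards
# (e.g. with pt-table-checksum) since skipping hides the underlying data drift.
mysql -h db1124.eqiad.wmnet -P 3318 -e "
  STOP SLAVE;
  SET GLOBAL sql_slave_skip_counter = 1;  -- skip only the failing Delete_rows event
  START SLAVE;
  SHOW SLAVE STATUS\G"
```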
[08:44:39] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[08:44:39] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[08:47:09] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:48:19] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[08:48:21] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:48:23] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:51:59] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[08:52:21] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:54:23] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[08:54:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:55:41] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:55:43] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:55:53] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[08:56:26] (03PS1) 10ArielGlenn: make check-cumin-aliases happy again [puppet] - 10https://gerrit.wikimedia.org/r/481329
[08:56:55] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[08:58:09] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:59:39] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:00:37] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:00:47] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[09:04:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:06:35] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[09:06:41] (03CR) 10ArielGlenn: [C: 03+2] make check-cumin-aliases happy again [puppet] - 10https://gerrit.wikimedia.org/r/481329 (owner: 10ArielGlenn)
[09:07:53] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[09:07:53] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:09:19] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[09:10:17] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[09:10:17] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[09:10:21] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:11:33] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[09:11:37] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[09:12:43] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:13:40] !log swift eqiad-prod: more weight for ms-be10[44-50].eqiad.wmnet - T209618
[09:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:43] T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618
[09:16:27] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[09:16:27] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[09:17:39] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[09:17:39] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:18:47] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:18:51] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:19:57] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[09:21:23] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[09:22:51] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:23:41] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[09:23:49] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[09:23:57] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[09:25:01] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[09:26:09] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:26:09] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:26:13] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:29:55] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
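On the 09:13 SAL entry above, "more weight for ms-be10[44-50]" refers to raising the ring weight of the newly installed swift backends so they start taking a larger share of objects. The actual ring changes are not shown in this log; the following is only a sketch of how weight is typically bumped with swift-ring-builder, with a hypothetical builder file, device id, and weight value.

```sh
# Sketch under assumptions: builder file name, device id (d100) and weight are made up.
# Weight is usually raised in steps so the rebalance traffic stays manageable.
swift-ring-builder object.builder set_weight d100 2000
swift-ring-builder object.builder rebalance
# ...then push the regenerated object.ring.gz out to the swift hosts.
```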
[09:29:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[09:31:05] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:33:37] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[09:34:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:36:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:44:37] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[09:48:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:49:27] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 42.88 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1
[09:50:47] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:51:59] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[09:54:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[10:00:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:00:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[10:04:15] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[10:04:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[10:06:37] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:14:16] 10Operations, 10Security-Team: jalexander should be removed from security@ as his emails are bouncing - https://phabricator.wikimedia.org/T212621 (10Bawolff)
[10:19:47] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 88.84 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1
[10:38:50] 10Operations, 10DC-Ops, 10media-storage: ms-be raid setup / evaluation (currently using swraid on top of hwraid) - https://phabricator.wikimedia.org/T211231 (10fgiunchedi) a:05fgiunchedi→03RobH Confirmed all swift disks are presented to swift as raid0 arrays. Also IIRC these hosts are sporting hw raid be...
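A note on the citoid and restbase checks that flap through this log: both exercise the citation service end to end (Zotero scrape included), which is why "Ensure Zotero is working", "Scrapes sample page", and the restbase "Get citation for Darth Vader" probe tend to time out together when Zotero is slow. A rough equivalent of what the restbase probe asks for, issued against the public REST API rather than the internal check configuration, might look like the hedged example below (the 10s timeout is an assumption mirroring the check messages, not the monitoring config).

```sh
# Hedged example: roughly what the "Get citation for Darth Vader" probe exercises,
# via the public /data/citation/{format}/{query} REST endpoint.
curl --max-time 10 \
  'https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/Darth%20Vader'
```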
[10:42:04] 10Operations, 10DC-Ops, 10media-storage: ms-be raid setup / evaluation (currently using swraid on top of hwraid) - https://phabricator.wikimedia.org/T211231 (10Peachey88)
[12:44:41] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:45:47] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.041 second response time
[14:10:05] PROBLEM - Host wtp1028 is DOWN: PING CRITICAL - Packet loss = 100%
[14:18:13] !log powercycle wtp1028 - nothing on console
[14:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:39] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=wtp1028.eqiad.wmnet
[14:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:15] 10Operations, 10ops-eqiad: wtp1028 unresponsive - https://phabricator.wikimedia.org/T212624 (10fgiunchedi)
[14:33:05] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[14:33:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:35:27] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:38:03] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:40:29] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[14:40:35] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[14:42:53] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[14:42:57] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[14:42:59] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:44:09] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:45:11] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[14:45:21] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[14:45:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:47:31] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[14:47:53] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:48:59] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:49:07] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:50:13] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[14:50:21] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[14:56:19] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:56:25] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[14:57:21] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[14:57:37] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:58:29] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[15:11:11] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[15:12:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[15:19:31] 10Operations, 10Cloud-Services, 10DBA, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10MarcoAurelio)
[15:28:23] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Marostegui) p:05Triage→03Normal Let us know when this is created so we can sanitize it and give the green light to Cloud Team! Thanks
[15:34:07] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 749.35 seconds
[15:34:15] PROBLEM - MariaDB Slave Lag: s8 on db1087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 755.70 seconds
[15:34:30] checking
[15:34:37] thanks!
[15:34:46] thanks indeed!
[15:35:09] quick :-)
[15:35:17] and merry christmas :)
[15:35:21] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
[15:35:25] should be fine now
[15:35:28] RECOVERY - MariaDB Slave Lag: s8 on db1087 is OK: OK slave_sql_lag Replication lag: 0.10 seconds
[15:35:53] ok!
[15:37:36] * volans|off here
[15:38:00] sorry but I'm without wifi, took me a bit to get a hotspot working
[15:38:13] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.9995 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[15:38:15] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.9988 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[15:40:33] that loss is mw spamming with dblag messages btw
[15:43:43] (03PS1) 10MarcoAurelio: langlist: Add hyw.wikipedia [dns] - 10https://gerrit.wikimedia.org/r/481335 (https://phabricator.wikimedia.org/T212597)
[15:44:26] (03PS2) 10MarcoAurelio: langlist: Add hyw.wikipedia [dns] - 10https://gerrit.wikimedia.org/r/481335 (https://phabricator.wikimedia.org/T212597)
[15:46:06] (03CR) 10Liuxinyu970226: [C: 03+1] Close chairwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961) (owner: 10MarcoAurelio)
[15:47:34] (03CR) 10Liuxinyu970226: [C: 03+1] langlist: Add hyw.wikipedia [dns] - 10https://gerrit.wikimedia.org/r/481335 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio)
[15:47:52] !log roll-restart logstash on logstash100[789]
[15:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:50] (03CR) 10MarcoAurelio: "Wiki not yet created in case this info is relevant for the processing of this patch." [dns] - 10https://gerrit.wikimedia.org/r/481335 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio)
[15:52:45] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.9958 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[15:55:27] godog is it still losing packets?
[15:56:43] marostegui: it is yeah, trying to understand why
[16:02:03] there should be no more messages I think
[16:03:36] JVM threads plummeted
[16:04:10] receivers wedged somehow?
[16:04:40] the threads are the rolling restart, but yeah it looks like the receive buffers fill up and logstash isn't reading or not reading fast enough
[16:04:43] the pps loss rate is approximately equal to the input rate before the event
[16:07:19] some messages are making it to elasticsearch but clearly not enough heh
[16:10:38] they're also using a lot less CPU than they were before
[16:11:56] I see this in curator-logstash.log on logstash1009 2018-12-26 06:42:04,111 ERROR Failed to complete action: forcemerge. : Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out.
[16:11:58] (read timeout=21600))
[16:12:06] did it just not start up correctly?
[16:17:18] not sure if that's related, curator is the tool to cull old indices daily
[16:17:22] hm
[16:18:11] I tried depooling 1007, bouncing logstash and then repooling; once repooled, looking at ss it seems the receive buffers are hardly read from
[16:18:52] makes me think either the logstash pipelines are slow and/or stuck
[16:19:49] I'll try bumping the buffer size, though I suspect it won't do much
[16:19:55] on 1009 i see the one java logstash process (pid 11802) blocking on a futex and doing nothing else
[16:25:01] I've also captured a jstack on /tmp on logstash1007 for the logstash jvm
[16:29:11] as expected increasing receive buffer size didn't do anything
[16:29:39] I'll start disabling inputs, maybe there's a problematic one
[16:30:22] [2018-12-26T15:46:59,372][FATAL][logstash.runner ] An unexpected error occurred! {:error=>#, :backtra
[16:30:24] ce=>["org/jruby/RubyIO.java:3565:in `each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-3.2.1/lib/logstash/inputs/syslog.rb:182:in `tcp_re
[16:30:26] ceiver'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-3.2.1/lib/logstash/inputs/syslog.rb:167:in `tcp_listener'"]}
[16:30:46] oh, that was shortly after getting a SIGTERM, nevermind
[16:36:16] ok I tried clearing logstash's persistent queue on 1007 and that seems to have done the trick
[16:37:00] !log clear logstash persistent queue /var/lib/logstash on logstash100[789]
[16:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:49] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[16:39:18] "success"
[16:40:48] cdanis: looks like we're back, thanks for taking a look!
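For the record, the 16:36–16:37 fix above (clearing logstash's persistent queue under /var/lib/logstash and restarting) usually amounts to something like the sketch below. The unit name and queue path are assumptions based on the SAL entry and the stock Debian logstash layout, not a transcript of the commands actually run.

```sh
# Hedged sketch of "clear logstash persistent queue /var/lib/logstash":
# stop the service so the queue files are no longer in use, move the wedged
# queue aside (safer than deleting outright), then restart and watch the
# UDP packet-loss metric recover.
sudo systemctl stop logstash
sudo mv /var/lib/logstash/queue "/var/lib/logstash/queue.wedged.$(date +%s)"
sudo systemctl start logstash
```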
[16:41:25] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[16:41:44] making a note to follow up on this tomorrow, we might not even want/need the persistent queue now anyways
[16:42:29] glad you figured it out, I certainly wasn't going to ;)
[16:42:41] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.006881 https://grafana.wikimedia.org/dashboard/db/logstash
[16:43:12] heh, more on a hunch than actual hints from the logs, ironically
[16:44:05] I grepped them for all the cases I could find on Google of people needing to clear the persistent queue
[16:44:13] did not find anything :D
[16:45:28] /o\
[16:53:09] PROBLEM - Restbase root url on restbase1009 is CRITICAL: connect to address 10.64.48.110 and port 7231: Connection refused
[16:59:11] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 16164 bytes in 0.077 second response time
[18:24:21] (03PS1) 10BryanDavis: wmcs: Add postgres maps users for eqiad1-r region [puppet] - 10https://gerrit.wikimedia.org/r/481341 (https://phabricator.wikimedia.org/T212596)
[19:35:20] it appears that there's a need to do an urgent interwiki map update to stop linking to a "hijacked" site (sic) as reported
[19:35:37] I was wondering if there's any deployer around & willing to do it?
[21:37:46] :O
[21:40:35] this is really `serious stuff` channel then
[21:40:35] poor `interwiki map` :(
[21:41:11] Hauskatze you may want to file a task.
[21:41:32] paladox: I sent a couple of emails
[21:41:38] ah ok