[01:59:39] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 720.67 seconds
[03:36:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 959.76 seconds
[04:02:49] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 251.84 seconds
[04:08:23] RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
[05:28:17] PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003648, end_log_pos 920552713
[05:38:17] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 750.52 seconds
[06:30:07] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:38:39] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:39:26] 10Operations, 10Wikimedia-Site-requests, 10HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (10Krinkle) It seems that at P7727, @joe confirms this bug with HHVM, confirms that PHP 7 does not have the bug; and... found that changing `hhvm.server...
[07:20:53] !log Fix replication db1124:3318
[07:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:07] RECOVERY - MariaDB Slave SQL: s8 on db1124 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:34:39] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[08:02:31] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:37] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 75305 bytes in 0.687 second response time
[08:35:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:37:17] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[08:37:17] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[08:39:41] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[08:40:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[08:40:53] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[08:40:57] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:40:57] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[08:42:09] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[08:43:17] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[08:43:25] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
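Context for the 05:28 page and the 07:20 "Fix replication db1124:3318" entry above: error 1032 (HA_ERR_KEY_NOT_FOUND) means the replica could not find the row that a Delete_rows event on the master wanted to remove. The commands actually run on db1124 are not recorded in this log; the block below is only a hedged sketch of one common way a single broken event is skipped by hand on a MariaDB replica, with the host/port taken from the SAL entry and everything else assumed.

```sh
# Hypothetical sketch only -- not the commands behind "!log Fix replication db1124:3318".
# A common manual fix for a one-off HA_ERR_KEY_NOT_FOUND (1032) is to skip the single
# failing event and let replication resume, then verify consistency afterwards
# (e.g. with pt-table-checksum) since skipping hides the underlying data drift.
mysql -h db1124.eqiad.wmnet -P 3318 -e "
  STOP SLAVE;
  SET GLOBAL sql_slave_skip_counter = 1;  -- skip only the failing Delete_rows event
  START SLAVE;
  SHOW SLAVE STATUS\G"
```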
[08:44:39] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[08:44:39] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[08:47:09] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:48:19] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[08:48:21] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:48:23] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:51:59] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[08:52:21] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:54:23] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[08:54:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:55:41] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:55:43] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:55:53] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[08:56:26] (03PS1) 10ArielGlenn: make check-cumin-aliases happy again [puppet] - 10https://gerrit.wikimedia.org/r/481329
[08:56:55] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[08:58:09] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:59:39] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:00:37] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:00:47] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[09:04:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:06:35] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[09:06:41] (03CR) 10ArielGlenn: [C: 03+2] make check-cumin-aliases happy again [puppet] - 10https://gerrit.wikimedia.org/r/481329 (owner: 10ArielGlenn)
[09:07:53] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[09:07:53] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:09:19] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[09:10:17] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[09:10:17] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[09:10:21] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:11:33] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[09:11:37] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[09:12:43] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:13:40] !log swift eqiad-prod: more weight for ms-be10[44-50].eqiad.wmnet - T209618
[09:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:43] T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618
[09:16:27] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[09:16:27] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[09:17:39] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[09:17:39] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:18:47] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:18:51] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:19:57] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[09:21:23] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[09:22:51] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:23:41] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[09:23:49] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[09:23:57] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[09:25:01] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[09:26:09] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:26:09] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:26:13] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:29:55] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
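On the 09:13 SAL entry above, "more weight for ms-be10[44-50]" refers to raising the ring weight of the newly installed swift backends so they start taking a larger share of objects. The actual ring changes are not shown in this log; the following is only a sketch of how weight is typically bumped with swift-ring-builder, with a hypothetical builder file, device id, and weight value.

```sh
# Sketch under assumptions: builder file name, device id (d100) and weight are made up.
# Weight is usually raised in steps so the rebalance traffic stays manageable.
swift-ring-builder object.builder set_weight d100 2000
swift-ring-builder object.builder rebalance
# ...then push the regenerated object.ring.gz out to the swift hosts.
```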
[09:29:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[09:31:05] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:33:37] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[09:34:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:36:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:44:37] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[09:48:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:49:27] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 42.88 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1
[09:50:47] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[09:51:59] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[09:54:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[10:00:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:00:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[10:04:15] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[10:04:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[10:06:37] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:14:16] 10Operations, 10Security-Team: jalexander should be removed from security@ as his emails are bouncing - https://phabricator.wikimedia.org/T212621 (10Bawolff)
[10:19:47] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 88.84 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1
[10:38:50] 10Operations, 10DC-Ops, 10media-storage: ms-be raid setup / evaluation (currently using swraid on top of hwraid) - https://phabricator.wikimedia.org/T211231 (10fgiunchedi) a:05fgiunchedi→03RobH Confirmed all swift disks are presented to swift as raid0 arrays. Also IIRC these hosts are sporting hw raid be...
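A note on the citoid and restbase checks that flap through this log: both exercise the citation service end to end (Zotero scrape included), which is why "Ensure Zotero is working", "Scrapes sample page", and the restbase "Get citation for Darth Vader" probe tend to time out together when Zotero is slow. A rough equivalent of what the restbase probe asks for, issued against the public REST API rather than the internal check configuration, might look like the hedged example below (the 10s timeout is an assumption mirroring the check messages, not the monitoring config).

```sh
# Hedged example: roughly what the "Get citation for Darth Vader" probe exercises,
# via the public /data/citation/{format}/{query} REST endpoint.
curl --max-time 10 \
  'https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/Darth%20Vader'
```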
[10:42:04] 10Operations, 10DC-Ops, 10media-storage: ms-be raid setup / evaluation (currently using swraid on top of hwraid) - https://phabricator.wikimedia.org/T211231 (10Peachey88)
[12:44:41] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:45:47] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.041 second response time
[14:10:05] PROBLEM - Host wtp1028 is DOWN: PING CRITICAL - Packet loss = 100%
[14:18:13] !log powercycle wtp1028 - nothing on console
[14:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:39] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=wtp1028.eqiad.wmnet
[14:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:15] 10Operations, 10ops-eqiad: wtp1028 unresponsive - https://phabricator.wikimedia.org/T212624 (10fgiunchedi)
[14:33:05] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[14:33:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:35:27] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:38:03] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:40:29] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[14:40:35] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[14:42:53] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[14:42:57] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[14:42:59] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:44:09] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:45:11] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[14:45:21] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[14:45:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:47:31] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[14:47:53] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:48:59] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:49:07] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[14:50:13] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[14:50:21] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[14:56:19] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:56:25] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[14:57:21] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[14:57:37] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:58:29] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[15:11:11] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[15:12:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[15:19:31] 10Operations, 10Cloud-Services, 10DBA, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10MarcoAurelio)
[15:28:23] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Marostegui) p:05Triage→03Normal Let us know when this is created so we can sanitize it and give the green light to Cloud Team! Thanks
[15:34:07] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 749.35 seconds
[15:34:15] PROBLEM - MariaDB Slave Lag: s8 on db1087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 755.70 seconds
[15:34:30] checking
[15:34:37] thanks!
[15:34:46] thanks indeed!
[15:35:09] quick :-)
[15:35:17] and merry christmas :)
[15:35:21] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
[15:35:25] should be fine now
[15:35:28] RECOVERY - MariaDB Slave Lag: s8 on db1087 is OK: OK slave_sql_lag Replication lag: 0.10 seconds
[15:35:53] ok!
[15:37:36] * volans|off here
[15:38:00] sorry but I'm without wifi, took me a bit to get a hotspot working
[15:38:13] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.9995 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[15:38:15] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.9988 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[15:40:33] that loss is mw spamming with dblag messages btw
[15:43:43] (03PS1) 10MarcoAurelio: langlist: Add hyw.wikipedia [dns] - 10https://gerrit.wikimedia.org/r/481335 (https://phabricator.wikimedia.org/T212597)
[15:44:26] (03PS2) 10MarcoAurelio: langlist: Add hyw.wikipedia [dns] - 10https://gerrit.wikimedia.org/r/481335 (https://phabricator.wikimedia.org/T212597)
[15:46:06] (03CR) 10Liuxinyu970226: [C: 03+1] Close chairwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961) (owner: 10MarcoAurelio)
[15:47:34] (03CR) 10Liuxinyu970226: [C: 03+1] langlist: Add hyw.wikipedia [dns] - 10https://gerrit.wikimedia.org/r/481335 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio)
[15:47:52] !log roll-restart logstash on logstash100[789]
[15:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:50] (03CR) 10MarcoAurelio: "Wiki not yet created in case this info is relevant for the processing of this patch." [dns] - 10https://gerrit.wikimedia.org/r/481335 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio)
[15:52:45] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.9958 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[15:55:27] godog is it still losing packets?
[15:56:43] marostegui: it is yeah, trying to understand why
[16:02:03] there should be no more messages I think
[16:03:36] JVM threads plummeted
[16:04:10] receivers wedged somehow?
[16:04:40] the threads are the rolling restart, but yeah it looks like the receive buffers fill up and logstash isn't reading or not reading fast enough
[16:04:43] the pps loss rate is approximately equal to the input rate before the event
[16:07:19] some messages are making it to elasticsearch but clearly not enough heh
[16:10:38] they're also using a lot less CPU than they were before
[16:11:56] I see this in curator-logstash.log on logstash1009 2018-12-26 06:42:04,111 ERROR Failed to complete action: forcemerge. : Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out.
[16:11:58] (read timeout=21600))
[16:12:06] did it just not start up correctly?
[16:17:18] not sure if that's related, curator is the tool to cull old indices daily
[16:17:22] hm
[16:18:11] I tried depooling 1007, bouncing logstash and then repooling; once repooled, looking at ss it seems the receive buffers are hardly read from
[16:18:52] makes me think either the logstash pipelines are slow and/or stuck
[16:19:49] I'll try bumping the buffer size, though I suspect it won't do much
[16:19:55] on 1009 i see the one java logstash process (pid 11802) blocking on a futex and doing nothing else
[16:25:01] I've also captured a jstack on /tmp on logstash1007 for the logstash jvm
[16:29:11] as expected increasing receive buffer size didn't do anything
[16:29:39] I'll start disabling inputs, maybe there's a problematic one
[16:30:22] [2018-12-26T15:46:59,372][FATAL][logstash.runner ] An unexpected error occurred! {:error=>#, :backtra
[16:30:24] ce=>["org/jruby/RubyIO.java:3565:in `each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-3.2.1/lib/logstash/inputs/syslog.rb:182:in `tcp_re
[16:30:26] ceiver'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-3.2.1/lib/logstash/inputs/syslog.rb:167:in `tcp_listener'"]}
[16:30:46] oh, that was shortly after getting a SIGTERM, nevermind
[16:36:16] ok I tried clearing logstash's persistent queue on 1007 and that seems to have done the trick
[16:37:00] !log clear logstash persistent queue /var/lib/logstash on logstash100[789]
[16:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:49] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[16:39:18] "success"
[16:40:48] cdanis: looks like we're back, thanks for taking a look!
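For the record, the 16:36–16:37 fix above (clearing logstash's persistent queue under /var/lib/logstash and restarting) usually amounts to something like the sketch below. The unit name and queue path are assumptions based on the SAL entry and the stock Debian logstash layout, not a transcript of the commands actually run.

```sh
# Hedged sketch of "clear logstash persistent queue /var/lib/logstash":
# stop the service so the queue files are no longer in use, move the wedged
# queue aside (safer than deleting outright), then restart and watch the
# UDP packet-loss metric recover.
sudo systemctl stop logstash
sudo mv /var/lib/logstash/queue "/var/lib/logstash/queue.wedged.$(date +%s)"
sudo systemctl start logstash
```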
[16:41:25] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[16:41:44] making a note to follow up on this tomorrow, we might not even want/need the persistent queue now anyways
[16:42:29] glad you figured it out, I certainly wasn't going to ;)
[16:42:41] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.006881 https://grafana.wikimedia.org/dashboard/db/logstash
[16:43:12] heh, more on a hunch than actual hints from the logs, ironically
[16:44:05] I grepped them for all the cases I could find on Google of people needing to clear the persistent queue
[16:44:13] did not find anything :D
[16:45:28] /o\
[16:53:09] PROBLEM - Restbase root url on restbase1009 is CRITICAL: connect to address 10.64.48.110 and port 7231: Connection refused
[16:59:11] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 16164 bytes in 0.077 second response time
[18:24:21] (03PS1) 10BryanDavis: wmcs: Add postgres maps users for eqiad1-r region [puppet] - 10https://gerrit.wikimedia.org/r/481341 (https://phabricator.wikimedia.org/T212596)
[19:35:20] it appears that there's a need to do an urgent interwiki map update to stop linking to a "hijacked" site (sic) as reported
[19:35:37] I was wondering if there's any deployer around & willing to do it?
[21:37:46] :O
[21:40:35] this is really `serious stuff` channel then
[21:40:35] poor `interwiki map` :(
[21:41:11] Hauskatze you may want to file a task.
[21:41:32] paladox: I sent a couple of emails
[21:41:38] ah ok