[00:14:07] Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3611604 (Smalyshev) Open>Resolved a:Smalyshev
[03:08:26] (PS1) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[03:08:49] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[03:09:38] (PS2) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[03:10:00] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[03:13:57] (PS3) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[03:14:18] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[04:12:29] Operations, DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3611695 (Jayprakash12345) Also Sir When I use http://tools.wmflabs.org/pageviews It is showing an error. See Below. ``` hi.wikiversity.org is not a valid project or is currently unsupported. ```
[05:13:01] (PS4) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:16:04] (PS5) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:23:03] (PS6) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:23:27] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[05:39:27] (PS7) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:39:50] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[05:46:01] (PS8) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:52:03] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/7903/gerrit2001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[06:07:54] Commons down?
[06:08:22] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:09:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[06:09:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[06:11:22] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[06:21:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:21:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:21:32] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:21:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:27:42] PROBLEM - ores on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 8081: Connection refused
[06:34:10] ok, Commons is back
[06:42:52] RECOVERY - ores on scb1003 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 3.574 second response time
[07:11:22] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:37:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:37:42] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:37:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:38:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[07:45:42] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:45:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:46:22] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:46:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:01:32] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[10:07:12] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:08:12] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73713 bytes in 5.681 second response time
[10:21:42] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[11:12:52] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[11:13:12] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[11:13:42] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:14:12] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[12:07:18] Operations, DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3611749 (jcrespo) @Jayprakash12345 Unless I am wrong, that is a different issue, nor related to the analytics db servers- please file a separate ticket so @Analytics ops can have a look at it (it is probably no...
[12:29:35] (CR) Zoranzoki21: [C: 1] New 'abusefilter-helper' configuration for en.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/377473 (https://phabricator.wikimedia.org/T175684) (owner: MarcoAurelio)
[12:29:52] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 2033320
[12:39:52] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 368
[13:10:53] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[13:10:53] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:02] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:03] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:04] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:22] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:22] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:22] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[13:11:22] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:32] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:32] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:32] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:33] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:33] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:33] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[13:11:33] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:33] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:42] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:42] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:43] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:43] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:37:52] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:38:42] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 73826 bytes in 4.894 second response time
[14:07:22] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:08:22] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 73826 bytes in 8.085 second response time
[14:16:28] (CR) Paladox: gerrit: fix host for TLS cert/monitoring if on slave (1 comment) [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[14:17:17] (CR) Paladox: gerrit: fix host for TLS cert/monitoring if on slave (1 comment) [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[14:17:20] (CR) Paladox: [C: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[14:39:26] (PS3) Andrew Bogott: fullstack: add a 'success' stat [puppet] - https://gerrit.wikimedia.org/r/378175
[15:01:12] (Draft2) Zoranzoki21: Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037)
[15:02:21] (PS3) Zoranzoki21: Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037)
[15:03:53] (PS4) Zoranzoki21: Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037)
[15:18:39] (PS1) Ladsgroup: Add am.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/378395 (https://phabricator.wikimedia.org/T176042)
[15:19:57] (PS1) Ladsgroup: Apache config for am.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/378396 (https://phabricator.wikimedia.org/T176042)
[15:23:47] (CR) Zoranzoki21: [C: 1] Apache config for am.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/378396 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[15:24:05] (CR) Zoranzoki21: [C: 1] Add am.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/378395 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[15:25:27] (PS1) Andrew Bogott: nova: depool labvirt1016 [puppet] - https://gerrit.wikimedia.org/r/378397 (https://phabricator.wikimedia.org/T176044)
[15:27:28] (CR) Andrew Bogott: [C: 2] nova: depool labvirt1016 [puppet] - https://gerrit.wikimedia.org/r/378397 (https://phabricator.wikimedia.org/T176044) (owner: Andrew Bogott)
[15:34:17] !log rebooting labvirt1015 for T176044
[15:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:33] T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018 - https://phabricator.wikimedia.org/T176044
[15:40:36] !log rebooting labvirt1017
[15:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:45] Rip?
[15:47:56] Request from 84.81.160.164 via cp1052 cp1052, Varnish XID 720961630 Error: 503, Backend fetch failed at Sat, 16 Sep 2017 15:47:46 GMT
[15:48:02] Oh, there we go.
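For context on the checks above: the "NN% of data above the critical threshold [1000.0]" alerts are threshold checks against aggregate 5xx request-rate series in Graphite. A minimal sketch of pulling such a series from the Graphite render API, with a placeholder metric path (the real target behind these checks is not shown in this log):

```python
# Minimal sketch: fetch a 5xx rate series from Graphite's render API and report
# what fraction of recent datapoints sits above the alert threshold.
# The target path is a placeholder, not the exact metric behind the icinga check.
import requests

GRAPHITE = "https://graphite.wikimedia.org/render"
TARGET = "reqstats.text.5xx"   # hypothetical metric path
CRITICAL = 1000.0              # threshold quoted in the icinga output above

resp = requests.get(
    GRAPHITE,
    params={"target": TARGET, "from": "-30min", "format": "json"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json():
    # Graphite returns [[value, timestamp], ...]; values can be null.
    points = [v for v, _ts in series["datapoints"] if v is not None]
    over = [v for v in points if v > CRITICAL]
    pct = 100.0 * len(over) / len(points) if points else 0.0
    print(f"{series['target']}: {pct:.2f}% of datapoints above {CRITICAL}")
```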
[15:50:22] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200)
[15:50:52] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[15:51:22] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[15:51:31] hmm, publishing a comment on flow resulted in a 503
[15:51:34] retrying it and it works
[15:52:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[15:52:43] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[15:52:52] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:53:13] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200)
[15:53:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[15:54:17] Hm, still going on it seems.
[15:55:13] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[15:55:52] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0]
[15:58:58] (PS1) Ladsgroup: Add config for amwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042)
[16:01:01] (CR) Melos: [C: -1] "See phab discussions" [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:01:20] (CR) Zoranzoki21: [C: 1] Add config for amwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[16:03:20] (PS1) Ladsgroup: Add amwikimedia to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042)
[16:06:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:07:03] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:07:42] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:08:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:08:37] (PS1) Ladsgroup: Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042)
[16:10:19] (CR) Zoranzoki21: "Ok Melos." [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:12:13] (CR) Zoranzoki21: [C: 1] Add amwikimedia to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[16:12:59] (CR) Zoranzoki21: [C: 1] Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[16:24:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[16:24:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[16:25:19] Error: 503, Backend fetch failed at Sat, 16 Sep 2017 16:25:01 GMT
[16:25:21] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[16:25:23] getting such errors
[16:26:25] *cp1053 cp1053
[16:26:48] cc: paladox>
[16:26:51] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[16:27:12] Steinsplitter hi, could you create a task for ops please?
[16:27:21] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[16:29:59] Commons down again?
[16:30:17] cp1053
[16:31:01] We noticed performance issues at cs.wiki too. According to the history of this channel, you are looking into it.
[16:31:04] Is there any task?
[16:31:27] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612066 (Steinsplitter)
[16:31:28] fa.wp down: Request from 46.130.38.199 via cp1053 cp1053, Varnish XID 844005833
[16:31:28] Error: 503, Backend fetch failed at Sat, 16 Sep 2017 16:31:06 GMT
[16:31:36] robh:
[16:31:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
[16:31:41] moritzm:
[16:31:43] Yikes. https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json
[16:32:29] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612080 (Steinsplitter)
[16:32:41] Been having some error messages
[16:32:54] You guys updating something?
[16:33:04] We don't know.
[16:33:22] I don't think there have been any updates today
[16:33:28] there's no deploys at the weekend
[16:33:41] Started 45 minutes ago.
[16:34:01] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612066 (Sjoerddebruin) Got my first error 45 minutes ago, has been going on since.
[16:34:14] _joe_: akosiaris apergos bblack ema godog mark mutante paravoid Reedy volans|off
[16:34:22] sorry for mass ping
[16:34:26] but it's serious enough
[16:34:29] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612097 (Sjoerddebruin)
[16:34:35] I think we should create a task
[16:34:43] There is one ^
[16:34:49] paladox, T176047
[16:34:50] T176047: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047
[16:34:51] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612100 (Paladox) p:Triage>Unbreak!
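The user reports above all name the serving cache host ("via cp1052 cp1052", "via cp1053 cp1053"), which is what points the investigation at a single backend. A small sketch of tallying those hosts from pasted error reports (the sample strings are copied from this channel):

```python
# Minimal sketch: count which cache host appears in user-pasted 503 reports.
import re
from collections import Counter

reports = [
    "Request from 84.81.160.164 via cp1052 cp1052, Varnish XID 720961630 "
    "Error: 503, Backend fetch failed at Sat, 16 Sep 2017 15:47:46 GMT",
    "Request from 46.130.38.199 via cp1053 cp1053, Varnish XID 844005833 "
    "Error: 503, Backend fetch failed at Sat, 16 Sep 2017 16:31:06 GMT",
]

hosts = Counter()
for report in reports:
    match = re.search(r"via (cp\d+)", report)
    if match:
        hosts[match.group(1)] += 1

for host, count in hosts.most_common():
    print(f"{host}: {count} report(s)")
```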
[16:35:00] In the meantime I'll try to get hold of the Ops
[16:35:40] Amir1: it seems that the peak is gone, cp1053 showed mailbox lag
[16:35:47] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now&panelId=21&fullscreen
[16:35:57] this is an issue that we have been seeing recently
[16:36:43] https://phabricator.wikimedia.org/T175803
[16:37:02] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3603561 (Urbanecm) Dupe of T176047 ?
[16:37:42] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[16:39:11] (CR) Urbanecm: [C: 1] "LGTM. Normal change, somebody should remove the CR-1 as totally irrelevant." [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:39:43] Amir1: something interesting is that the MW fatals and cp1053's mailbox lag seem to align (16:20 - 16:35)
[16:39:52] <_joe_> Amir1: what's up?
[16:40:09] elukey: Thanks :)
[16:40:16] Sorry
[16:40:22] Did I just drop?
[16:40:24] _joe_: sorry for pinging on the weekend, but it seems we are all getting 503s
[16:40:25] _joe_ cp1053 mailbox lag
[16:40:35] the recurring issue
[16:40:35] <_joe_> ok
[16:40:44] buuut this time it seems to align with MW fatals
[16:40:45] I am also getting non-wikimedia ISP dropouts in the UK
[16:40:55] I opened wikipedia and got "Request from 46.130.38.199 via cp1053 cp1053, Varnish XID 844005833 Error: 503, Backend fetch failed at Sat, 16 Sep 2017 16:31:06 GMT"
[16:41:07] I was getting a string of those the other day as well
[16:41:08] <_joe_> Amir1: absolutely ok to ping
[16:41:18] <_joe_> elukey: mw fatals where?
[16:41:24] is it related to traffic
[16:41:34] <_joe_> oh yeah I see
[16:41:40] _joe_ https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now&panelId=21&fullscreen
[16:41:57] (CR) jerkins-bot: [V: -1] Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:41:58] fatalmonitor is not super horrible: https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@44136fa&_a=h@15c1f8f
[16:42:08] oops: https://logstash.wikimedia.org/goto/128c911ea9bda97d5ff9d690307c3fbc
[16:42:12] <_joe_> elukey: the fatals are from contacting ores, mostly
[16:42:19] <_joe_> so, discard those
[16:42:31] <_joe_> elukey: did you restart the varnish backend on cp1053 already?
[16:42:45] nope, it auto-recovered and now it is 0
[16:43:02] but we can definitely restart
[16:43:11] there were other hosts showing up the same issue earlier on
[16:43:33] ( I mean hours ago, back scrolling the chan)
[16:43:49] let's start with cp1053's backend then
[16:43:51] just to be sure
[16:44:00] shall I restart _joe_ ?
[16:44:32] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:45:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:45:23] (PS5) Zoranzoki21: Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037)
[16:46:02] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:49:29] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612145 (Samtar)
[16:49:31] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612147 (Samtar)
[16:49:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:50:07] (CR) Zoranzoki21: "@Urbanecm I made a edit, and jenkins-bot add +1.. And -1 automatic removed." [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:50:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:50:51] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612150 (Urbanecm) p:High>Unbreak! Breaking a lot of things.
[16:51:13] (CR) Urbanecm: [C: 1] Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:52:22] Urbanecm: still breaking? Varnish Webrequest 503s seem to be clearing (https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X?_g=h@97fe121&_a=h@1782aa7)
[16:53:07] !log restart varnish-backend on cp1073 (cache upload) for mailbox lag
[16:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:48] this one is currently showing up in icinga, different issue but it might eventually lead to 503s, so better to fix it now
[16:54:52] I have no access to logstash. I've raised the priority because the previous task was UBN and it was breaking a lot of things and as the thing which "fixed" it was a reboot (such fixes aren't permanent usually) I think it should be open&UBN for now
[16:56:34] Urbanecm: things are fine now, I am not saying that it is ok but High should be sufficient in my opinion.. The traffic team is aware and has been working really hard to find a solution, which sadly is buried in Varnish internals IIUC
[16:56:46] elukey: could be wrong, but on reboot normally the lag stays at 0 for a while, right?
[16:57:39] TheresNoTime: you are right, currently the mailbox lag issue can be fixed with a varnish restart (so not a complete reboot but the idea is the same :)
[16:58:09] elukey, feel free to revert me, I just thought it should be UBN.
[16:58:39] Urbanecm: ack, but I didn't mean to overstep you, I wanted to discuss it :)
[16:59:26] :)
[17:00:29] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612156 (Paladox) p:Unbreak!>High Changing to high as things are stable now. But when things break again we can set it to unbreak now.
[17:03:14] <3
[17:07:28] cp1053 has crept up to `24` already :/
[17:08:16] TheresNoTime: 503s ?
[17:08:30] no no, mailbox lag
[17:08:40] (PS1) Ladsgroup: dumps: Align box-shadow with WikimediaUI standard [puppet] - https://gerrit.wikimedia.org/r/378408
[17:08:46] ahhh no no the issues are like
[17:08:47] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now&panelId=21&fullscreen
[17:09:05] so when it rises up to huge values
[17:09:22] ahhh right!
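A rough sketch of the "expiry mailbox lag" figure being discussed, assuming it is the gap between the MAIN.exp_mailed and MAIN.exp_received counters reported by `varnishstat -j` (that derivation and the flat Varnish 4 JSON layout are both assumptions, not stated in this log):

```python
# Rough sketch only: estimate expiry mailbox lag on a cache host from
# varnishstat counters. Assumes lag = MAIN.exp_mailed - MAIN.exp_received
# and the flat JSON layout of `varnishstat -j` on Varnish 4.
import json
import subprocess

def mailbox_lag(instance=None):
    cmd = ["varnishstat", "-j"]
    if instance:
        cmd += ["-n", instance]
    stats = json.loads(subprocess.check_output(cmd))
    mailed = stats["MAIN.exp_mailed"]["value"]
    received = stats["MAIN.exp_received"]["value"]
    return mailed - received

if __name__ == "__main__":
    # 0 is the ideal value; the cp1073 alert above fired around 2 million.
    print(f"expiry mailbox lag: {mailbox_lag()}")
```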
[17:10:32] * TheresNoTime thought it would stay at 0 :-)
[17:11:43] that's the ideal value :D
[17:15:13] (CR) Samtar: [C: 1] Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[17:19:26] (CR) Zoranzoki21: [C: 1] Leave a comment that ACW must be loaded before VE [mediawiki-config] - https://gerrit.wikimedia.org/r/376791 (owner: MaxSem)
[17:23:24] cp1048 also had a peak just before 1700 - https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-3h&to=now&panelId=21&fullscreen&var-server=cp1048&var-datasource=eqiad%20prometheus%2Fops
[17:23:38] (CR) Zoranzoki21: "Why this patch can not be merged? I not seen conflicts with other patches." [mediawiki-config] - https://gerrit.wikimedia.org/r/376791 (owner: MaxSem)
[17:23:59] Might be worth checking other servers?
[17:27:36] cp1048's mailbox lag seems to be at 0 (and has been OK for 9 days) - 23k, given the context of 1.2 million, could just be a blip? ¯\_(ツ)_/¯
[17:34:02] I was seeing the peak on some other servers as well
[17:34:16] Pretty big blip
[17:34:17] XD
[17:34:58] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3612204 (Multichill)
[18:15:51] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3603561 (Yann) Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 966459488 Error: 503, Backend fetch failed at Sat, 16 Sep 2017 18:15:18 GMT
[18:16:16] (Out of context weekend report) For what it's worth, got several reports from people getting 503 Wikimedia error pages when just reading articles as logged-out reader. Refreshing twice or thrice made it go away.
[18:16:33] Still atm?
[18:16:38] (or some period ago)
[18:17:24] 18:13 UTC onwards, but the mailbox lag looks okay? o.O
[18:18:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[18:18:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[18:18:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[18:18:51] .__.
[18:19:59] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[18:28:39] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:28:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:29:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:30:09] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:30:43] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612223 (Paladox) Hmm, not stable now. [19:18:20] <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [19:18:29] <+icing...
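Since the Grafana panels referenced above are backed by the eqiad "prometheus/ops" datasource, one way to act on "might be worth checking other servers?" is a single query against the Prometheus HTTP API. In the sketch below both the Prometheus URL and the metric name are placeholders for illustration, not the real names behind that panel:

```python
# Sketch of the "check other servers" idea via the Prometheus HTTP API.
# The URL and metric name are placeholders; the real metric may differ.
import requests

PROMETHEUS = "http://prometheus.example.org/ops"   # placeholder URL
QUERY = "varnish_expiry_mailbox_lag"               # hypothetical metric name

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

# Instant-query responses are a vector of {metric labels, [timestamp, value]}.
for result in resp.json()["data"]["result"]:
    host = result["metric"].get("instance", "unknown")
    value = float(result["value"][1])
    flag = "  <-- worth a look" if value > 100000 else ""
    print(f"{host}: {value:.0f}{flag}")
```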
[18:32:49] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612226 (Samtar) It looks like cp1052 had a spike, but has since recovered {F9585689} `RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]`
[18:46:10] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 85790.32 seconds
[18:58:29] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:58:30] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:59:20] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.294 second response time
[18:59:39] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73833 bytes in 9.636 second response time
[19:47:19] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612304 (Yann) Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 34013240 Error: 503, Backend fetch failed at Sat, 16 Sep 2017 19:46:47 GMT
[19:51:09] PROBLEM - Disk space on mendelevium is CRITICAL: DISK CRITICAL - free space: / 9886 MB (43% inode=3%)
[19:51:19] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:51:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:53:37] something may be wrong with otrs
[19:53:39] akosiaris
[19:53:42] <icinga-wm> PROBLEM - Disk space on mendelevium is CRITICAL: DISK CRITICAL - free space: / 9886 MB (43% inode=3%)
[19:53:49] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[19:53:50] combined with a spike in the ticket creation graph on the otrs dashboard
[19:54:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[20:01:30] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:01:59] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:02:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:02:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:06:59] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89851.66 seconds
[20:10:49] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:24:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[20:25:29] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[20:25:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:26:59] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:32:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:33:00] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:33:30] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:34:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:39:19] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[20:39:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:47:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:17:49] RECOVERY - salt-minion processes on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:20:49] PROBLEM - salt-minion processes on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:24:42] (CR) Framawiki: [C: -1] "Not a good idea for me, see the task." [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[21:35:29] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3612478 (Tgr) Thanks, Rob! > I'm assigning this to you for your input on the above (additonal group name plus L3 signature. Please assign back to me w...
[21:56:29] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:57:29] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73828 bytes in 2.526 second response time
[22:05:45] !log compress older otrs directories to reclaim inodes - T171490
[22:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:02] T171490: mendelevium (otrs) running out of inodes - https://phabricator.wikimedia.org/T171490
[22:08:10] RECOVERY - Disk space on mendelevium is OK: DISK OK
[22:09:14] Operations, OTRS: mendelevium (otrs) running out of inodes - https://phabricator.wikimedia.org/T171490#3612568 (fgiunchedi) The growth of used inodes since a few hours was pretty steep, I compressed and removed the older otrs versions: ``` otrs-5.0.13 otrs-5.0.19 otrs-5.0.7 otrs-5.0.6 otrs-3.2.14.bak ot...
[22:14:59] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:16:32] godog, hey
[22:16:39] did you see the graph on the otrs dashboard?
[22:16:50] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73828 bytes in 3.499 second response time
[22:18:16] Krenair: no, which dashboard?
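The inode fix above (compressing and removing old otrs release trees) follows from first finding which directory trees hold the most files. A generic sketch of that survey, with an example path rather than mendelevium's actual layout:

```python
# Generic sketch: rank subdirectories by how many filesystem entries (inodes)
# they contain. The root path is an example, not taken from the log; on
# mendelevium the culprits were old otrs release trees, per the task comment.
import os
from collections import Counter

ROOT = "/opt"   # example path

counts = Counter()
for top in sorted(os.listdir(ROOT)):
    path = os.path.join(ROOT, top)
    if not os.path.isdir(path):
        continue
    total = 0
    for _dirpath, dirnames, filenames in os.walk(path):
        total += len(dirnames) + len(filenames)
    counts[top] = total

for name, total in counts.most_common(10):
    print(f"{total:>10}  {name}")
```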
[22:18:30] the one you see when you log into otrs
[22:19:17] there's a huge spike and it doesn't look like it's stopping
[22:21:26] Krenair: I don't think I have access to otrs :( I've bandaided the problem for now I hope, i.e. inode usage is at 38%
[22:21:43] ok
[22:22:28] come to think of it, I think the database server where those tickets actually get stored is a different machine/cluster
[22:22:36] i.e. not on mendelevium's disk
[22:23:21] though it's possible it's doing a lot of logging
[22:27:17] it is possible yeah, looks like a lot of tmp files
[22:31:10] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:32:09] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73808 bytes in 5.320 second response time
[22:33:21] Krenair: that's indeed impressive
[22:33:26] a spam campaign, probably
[22:34:03] yes
[22:34:24] delivery failures for people trying to impersonate our domains
[22:34:35] ouch
[22:41:02] hieradata/role/common/otrs.yaml:profile::otrs::database_host: m2-master.eqiad.wmnet
[22:41:57] $ dig m2-master.eqiad.wmnet @ns0.wikimedia.org +short
[22:41:57] dbproxy1002.eqiad.wmnet.
[22:41:57] 10.64.0.166
[22:42:15] which is a proxy for... m1, primary db1016 and secondary db1001
[22:43:27] nope I read the wrong part
[22:43:52] m2, master db1020 and secondary db2011
[22:46:30] db1020 traffic does look a bit strange towards the end https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1020&var-port=9104&from=now-7d&to=now
[22:47:21] write queries in particular
[22:47:42] and InnoDB IO operations
[22:48:52] and other stuff
[23:04:19] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89549.70 seconds
[23:27:19] hmmmmm
[23:27:38] slave lag OK because it's at 24-25 hours?
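The closing question lines up with the earlier recovery message reporting roughly 89,549 seconds (about 24.9 hours) of lag as OK, which would make sense if dbstore1001 is an intentionally delayed replica; that is an assumption here, not something the log states. A sketch of a lag check written with such a delay in mind (host, credentials, and the expected delay are all placeholders):

```python
# Sketch: report replication lag relative to an expected, intentional delay.
# The 24 h delay, the host name, and the credentials file are assumptions.
import os
import pymysql

EXPECTED_DELAY = 24 * 3600   # assumed intentional delay, in seconds
TOLERANCE = 3600             # how far past the delay still counts as OK

conn = pymysql.connect(
    host="dbstore1001.example",                          # placeholder host
    read_default_file=os.path.expanduser("~/.my.cnf"),   # client credentials
    cursorclass=pymysql.cursors.DictCursor,
)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

lag = status["Seconds_Behind_Master"]
if lag is None:
    print("CRITICAL: replication is not running")
elif lag <= EXPECTED_DELAY + TOLERANCE:
    print(f"OK: lag {lag}s is within the expected ~24h delay")
else:
    print(f"WARNING: lag {lag}s exceeds the expected delay")
```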