[00:14:07] Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3611604 (Smalyshev) Open>Resolved a:Smalyshev
[03:08:26] (PS1) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[03:08:49] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[03:09:38] (PS2) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[03:10:00] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[03:13:57] (PS3) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[03:14:18] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[04:12:29] Operations, DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3611695 (Jayprakash12345) Also Sir When I use http://tools.wmflabs.org/pageviews It is showing an error. See Below. ``` hi.wikiversity.org is not a valid project or is currently unsupported. ```
[05:13:01] (PS4) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:16:04] (PS5) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:23:03] (PS6) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:23:27] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[05:39:27] (PS7) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:39:50] (CR) jerkins-bot: [V: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[05:46:01] (PS8) Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360
[05:52:03] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/7903/gerrit2001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[06:07:54] Commons down?
[06:08:22] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:09:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[06:09:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[06:11:22] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[06:21:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:21:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:21:32] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:21:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:27:42] PROBLEM - ores on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 8081: Connection refused
[06:34:10] ok, Commons is back
[06:42:52] RECOVERY - ores on scb1003 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 3.574 second response time
[07:11:22] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:37:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:37:42] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:37:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:38:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[07:45:42] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:45:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:46:22] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:46:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:01:32] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[10:07:12] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:08:12] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73713 bytes in 5.681 second response time
[10:21:42] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[11:12:52] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[11:13:12] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[11:13:42] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:14:12] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[12:07:18] Operations, DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3611749 (jcrespo) @Jayprakash12345 Unless I am wrong, that is a different issue, nor related to the analytics db servers- please file a separate ticket so @Analytics ops can have a look at it (it is probably no...
[12:29:35] (CR) Zoranzoki21: [C: 1] New 'abusefilter-helper' configuration for en.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/377473 (https://phabricator.wikimedia.org/T175684) (owner: MarcoAurelio)
[12:29:52] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 2033320
[12:39:52] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 368
[13:10:53] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[13:10:53] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:02] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:03] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:04] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:22] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:22] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:22] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[13:11:22] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:32] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:32] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:32] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:33] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:33] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:33] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[13:11:33] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:33] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional)
[13:11:42] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:42] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:43] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:11:43] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[13:37:52] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:38:42] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 73826 bytes in 4.894 second response time
[14:07:22] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:08:22] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 73826 bytes in 8.085 second response time
[14:16:28] (CR) Paladox: gerrit: fix host for TLS cert/monitoring if on slave (1 comment) [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[14:17:17] (CR) Paladox: gerrit: fix host for TLS cert/monitoring if on slave (1 comment) [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[14:17:20] (CR) Paladox: [C: -1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/378360 (owner: Dzahn)
[14:39:26] (PS3) Andrew Bogott: fullstack: add a 'success' stat [puppet] - https://gerrit.wikimedia.org/r/378175
[15:01:12] (Draft2) Zoranzoki21: Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037)
[15:02:21] (PS3) Zoranzoki21: Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037)
[15:03:53] (PS4) Zoranzoki21: Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037)
[15:18:39] (PS1) Ladsgroup: Add am.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/378395 (https://phabricator.wikimedia.org/T176042)
[15:19:57] (PS1) Ladsgroup: Apache config for am.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/378396 (https://phabricator.wikimedia.org/T176042)
[15:23:47] (CR) Zoranzoki21: [C: 1] Apache config for am.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/378396 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[15:24:05] (CR) Zoranzoki21: [C: 1] Add am.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/378395 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[15:25:27] (PS1) Andrew Bogott: nova: depool labvirt1016 [puppet] - https://gerrit.wikimedia.org/r/378397 (https://phabricator.wikimedia.org/T176044)
[15:27:28] (CR) Andrew Bogott: [C: 2] nova: depool labvirt1016 [puppet] - https://gerrit.wikimedia.org/r/378397 (https://phabricator.wikimedia.org/T176044) (owner: Andrew Bogott)
[15:34:17] !log rebooting labvirt1015 for T176044
[15:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:33] T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018 - https://phabricator.wikimedia.org/T176044
[15:40:36] !log rebooting labvirt1017
[15:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:45] Rip?
[15:47:56] Request from 84.81.160.164 via cp1052 cp1052, Varnish XID 720961630 Error: 503, Backend fetch failed at Sat, 16 Sep 2017 15:47:46 GMT
[15:48:02] Oh, there we go.
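For context on the checks above: the "NN% of data above the critical threshold [1000.0]" alerts are threshold checks against aggregate 5xx request-rate series in Graphite. A minimal sketch of pulling such a series from the Graphite render API, with a placeholder metric path (the real target behind these checks is not shown in this log):

```python
# Minimal sketch: fetch a 5xx rate series from Graphite's render API and report
# what fraction of recent datapoints sits above the alert threshold.
# The target path is a placeholder, not the exact metric behind the icinga check.
import requests

GRAPHITE = "https://graphite.wikimedia.org/render"
TARGET = "reqstats.text.5xx"   # hypothetical metric path
CRITICAL = 1000.0              # threshold quoted in the icinga output above

resp = requests.get(
    GRAPHITE,
    params={"target": TARGET, "from": "-30min", "format": "json"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json():
    # Graphite returns [[value, timestamp], ...]; values can be null.
    points = [v for v, _ts in series["datapoints"] if v is not None]
    over = [v for v in points if v > CRITICAL]
    pct = 100.0 * len(over) / len(points) if points else 0.0
    print(f"{series['target']}: {pct:.2f}% of datapoints above {CRITICAL}")
```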
[15:50:22] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200)
[15:50:52] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[15:51:22] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[15:51:31] hmm, publishing a comment on flow resulted in a 503
[15:51:34] retrying it and it works
[15:52:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[15:52:43] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[15:52:52] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:53:13] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200)
[15:53:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[15:54:17] Hm, still going on it seems.
[15:55:13] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[15:55:52] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0]
[15:58:58] (PS1) Ladsgroup: Add config for amwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042)
[16:01:01] (CR) Melos: [C: -1] "See phab discussions" [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:01:20] (CR) Zoranzoki21: [C: 1] Add config for amwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[16:03:20] (PS1) Ladsgroup: Add amwikimedia to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042)
[16:06:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:07:03] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:07:42] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:08:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:08:37] (PS1) Ladsgroup: Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042)
[16:10:19] (CR) Zoranzoki21: "Ok Melos." [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:12:13] (CR) Zoranzoki21: [C: 1] Add amwikimedia to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[16:12:59] (CR) Zoranzoki21: [C: 1] Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[16:24:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[16:24:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[16:25:19] Error: 503, Backend fetch failed at Sat, 16 Sep 2017 16:25:01 GMT
[16:25:21] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[16:25:23] getting such errors
[16:26:25] *cp1053 cp1053
[16:26:48] cc: paladox>
[16:26:51] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[16:27:12] Steinsplitter hi, could you create a task for ops please?
[16:27:21] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[16:29:59] Commons down again?
[16:30:17] cp1053
[16:31:01] We noticed performance issues at cs.wiki too. According to the history of this channel, you are looking into it.
[16:31:04] Is there any task?
[16:31:27] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612066 (Steinsplitter)
[16:31:28] fa.wp down: Request from 46.130.38.199 via cp1053 cp1053, Varnish XID 844005833
[16:31:28] Error: 503, Backend fetch failed at Sat, 16 Sep 2017 16:31:06 GMT
[16:31:36] robh:
[16:31:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
[16:31:41] moritzm:
[16:31:43] Yikes. https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json
[16:32:29] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612080 (Steinsplitter)
[16:32:41] Been having some error messages
[16:32:54] You guys updating something?
[16:33:04] We don't know.
[16:33:22] I don't think there have been any updates today
[16:33:28] there's no deploys at the weekend
[16:33:41] Started 45 minutes ago.
[16:34:01] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612066 (Sjoerddebruin) Got my first error 45 minutes ago, has been going on since.
[16:34:14] _joe_: akosiaris apergos bblack ema godog mark mutante paravoid Reedy volans|off
[16:34:22] sorry for mass ping
[16:34:26] but it's serious enough
[16:34:29] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612097 (Sjoerddebruin)
[16:34:35] I think we should create a task
[16:34:43] There is one ^
[16:34:49] paladox, T176047
[16:34:50] T176047: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047
[16:34:51] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612100 (Paladox) p:Triage>Unbreak!
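The user reports above all name the serving cache host ("via cp1052 cp1052", "via cp1053 cp1053"), which is what points the investigation at a single backend. A small sketch of tallying those hosts from pasted error reports (the sample strings are copied from this channel):

```python
# Minimal sketch: count which cache host appears in user-pasted 503 reports.
import re
from collections import Counter

reports = [
    "Request from 84.81.160.164 via cp1052 cp1052, Varnish XID 720961630 "
    "Error: 503, Backend fetch failed at Sat, 16 Sep 2017 15:47:46 GMT",
    "Request from 46.130.38.199 via cp1053 cp1053, Varnish XID 844005833 "
    "Error: 503, Backend fetch failed at Sat, 16 Sep 2017 16:31:06 GMT",
]

hosts = Counter()
for report in reports:
    match = re.search(r"via (cp\d+)", report)
    if match:
        hosts[match.group(1)] += 1

for host, count in hosts.most_common():
    print(f"{host}: {count} report(s)")
```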
[16:35:00] In the meantime I'll try to get hold of the Ops
[16:35:40] Amir1: it seems that the peak is gone, cp1053 showed mailbox lag
[16:35:47] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now&panelId=21&fullscreen
[16:35:57] this is an issue that we have been seeing recently
[16:36:43] https://phabricator.wikimedia.org/T175803
[16:37:02] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3603561 (Urbanecm) Dupe of T176047 ?
[16:37:42] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[16:39:11] (CR) Urbanecm: [C: 1] "LGTM. Normal change, somebody should remove the CR-1 as totally irrelevant." [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:39:43] Amir1: something interesting is that the MW fatals and cp1053's mailbox lag seem to align (16:20 - 16:35)
[16:39:52] <_joe_> Amir1: what's up?
[16:40:09] elukey: Thanks :)
[16:40:16] Sorry
[16:40:22] Did I just drop?
[16:40:24] _joe_: sorry for pinging on the weekend, but it seems we are all getting 503s
[16:40:25] _joe_ cp1053 mailbox lag
[16:40:35] the recurring issue
[16:40:35] <_joe_> ok
[16:40:44] buuut this time it seems to align with MW fatals
[16:40:45] I am also getting non-wikimedia ISP dropouts in the UK
[16:40:55] I opened wikipedia and got "Request from 46.130.38.199 via cp1053 cp1053, Varnish XID 844005833 Error: 503, Backend fetch failed at Sat, 16 Sep 2017 16:31:06 GMT"
[16:41:07] I was getting a string of those the other day as well
[16:41:08] <_joe_> Amir1: absolutely ok to ping
[16:41:18] <_joe_> elukey: mw fatals where?
[16:41:24] is it related to traffic
[16:41:34] <_joe_> oh yeah I see
[16:41:40] _joe_ https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now&panelId=21&fullscreen
[16:41:57] (CR) jerkins-bot: [V: -1] Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:41:58] fatalmonitor is not super horrible: https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@44136fa&_a=h@15c1f8f
[16:42:08] oops: https://logstash.wikimedia.org/goto/128c911ea9bda97d5ff9d690307c3fbc
[16:42:12] <_joe_> elukey: the fatals are from contacting ores, mostly
[16:42:19] <_joe_> so, discard those
[16:42:31] <_joe_> elukey: did you restart the varnish backend on cp1053 already?
[16:42:45] nope, it auto-recovered and now it is 0
[16:43:02] but we can definitely restart
[16:43:11] there were other hosts showing up the same issue earlier on
[16:43:33] ( I mean hours ago, back scrolling the chan)
[16:43:49] let's start with cp1053's backend then
[16:43:51] just to be sure
[16:44:00] shall I restart _joe_ ?
[16:44:32] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:45:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:45:23] (PS5) Zoranzoki21: Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037)
[16:46:02] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:49:29] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612145 (Samtar)
[16:49:31] Operations: Error 503, Backend fetch failed. - https://phabricator.wikimedia.org/T176047#3612147 (Samtar)
[16:49:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:50:07] (CR) Zoranzoki21: "@Urbanecm I made a edit, and jenkins-bot add +1.. And -1 automatic removed." [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:50:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:50:51] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612150 (Urbanecm) p:High>Unbreak! Breaking a lot of things.
[16:51:13] (CR) Urbanecm: [C: 1] Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[16:52:22] Urbanecm: still breaking? Varnish Webrequest 503s seem to be clearing (https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X?_g=h@97fe121&_a=h@1782aa7)
[16:53:07] !log restart varnish-backend on cp1073 (cache upload) for mailbox lag
[16:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:48] this one is currently showing up in icinga, different issue but it might eventually lead to 503s, so better to fix it now
[16:54:52] I have no access to logstash. I've raised the priority because the previous task was UBN and it was breaking a lot of things and as the thing which "fixed" it was a reboot (such fixes aren't permanent usually) I think it should be open&UBN for now
[16:56:34] Urbanecm: things are fine now, I am not saying that it is ok but High should be sufficient in my opinion.. The traffic team is aware and has been working really hard to find a solution, which sadly is buried in Varnish internals IIUC
[16:56:46] elukey: could be wrong, but on reboot normally the lag stays at 0 for a while, right?
[16:57:39] TheresNoTime: you are right, currently the mailbox lag issue can be fixed with a varnish restart (so not a complete reboot but the idea is the same :)
[16:58:09] elukey, feel free to revert me, I just thought it should be UBN.
[16:58:39] Urbanecm: ack, but I didn't mean to overstep you, I wanted to discuss it :)
[16:59:26] :)
[17:00:29] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612156 (Paladox) p:Unbreak!>High Changing to high as things are stable now. But when things break again we can set it to unbreak now.
[17:03:14] <3
[17:07:28] cp1053 has crept up to `24` already :/
[17:08:16] TheresNoTime: 503s ?
[17:08:30] no no, mailbox lag
[17:08:40] (PS1) Ladsgroup: dumps: Align box-shadow with WikimediaUI standard [puppet] - https://gerrit.wikimedia.org/r/378408
[17:08:46] ahhh no no the issues are like
[17:08:47] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now&panelId=21&fullscreen
[17:09:05] so when it rises up to huge values
[17:09:22] ahhh right!
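A rough sketch of the "expiry mailbox lag" figure being discussed, assuming it is the gap between the MAIN.exp_mailed and MAIN.exp_received counters reported by `varnishstat -j` (that derivation and the flat Varnish 4 JSON layout are both assumptions, not stated in this log):

```python
# Rough sketch only: estimate expiry mailbox lag on a cache host from
# varnishstat counters. Assumes lag = MAIN.exp_mailed - MAIN.exp_received
# and the flat JSON layout of `varnishstat -j` on Varnish 4.
import json
import subprocess

def mailbox_lag(instance=None):
    cmd = ["varnishstat", "-j"]
    if instance:
        cmd += ["-n", instance]
    stats = json.loads(subprocess.check_output(cmd))
    mailed = stats["MAIN.exp_mailed"]["value"]
    received = stats["MAIN.exp_received"]["value"]
    return mailed - received

if __name__ == "__main__":
    # 0 is the ideal value; the cp1073 alert above fired around 2 million.
    print(f"expiry mailbox lag: {mailbox_lag()}")
```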
[17:10:32] * TheresNoTime thought it would stay at 0 :-)
[17:11:43] that's the ideal value :D
[17:15:13] (CR) Samtar: [C: 1] Add new throttle rules.. [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[17:19:26] (CR) Zoranzoki21: [C: 1] Leave a comment that ACW must be loaded before VE [mediawiki-config] - https://gerrit.wikimedia.org/r/376791 (owner: MaxSem)
[17:23:24] cp1048 also had a peak just before 1700 - https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-3h&to=now&panelId=21&fullscreen&var-server=cp1048&var-datasource=eqiad%20prometheus%2Fops
[17:23:38] (CR) Zoranzoki21: "Why this patch can not be merged? I not seen conflicts with other patches." [mediawiki-config] - https://gerrit.wikimedia.org/r/376791 (owner: MaxSem)
[17:23:59] Might be worth checking other servers?
[17:27:36] cp1048's mailbox lag seems to be at 0 (and has been OK for 9 days) - 23k, given the context of 1.2 million, could just be a blip? ¯\_(ツ)_/¯
[17:34:02] I was seeing the peak on some other servers as well
[17:34:16] Pretty big blip
[17:34:17] XD
[17:34:58] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3612204 (Multichill)
[18:15:51] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3603561 (Yann) Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 966459488 Error: 503, Backend fetch failed at Sat, 16 Sep 2017 18:15:18 GMT
[18:16:16] (Out of context weekend report) For what it's worth, got several reports from people getting 503 Wikimedia error pages when just reading articles as logged-out reader. Refreshing twice or thrice made it go away.
[18:16:33] Still atm?
[18:16:38] (or some period ago)
[18:17:24] 18:13 UTC onwards, but the mailbox lag looks okay? o.O
[18:18:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[18:18:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[18:18:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[18:18:51] .__.
[18:19:59] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[18:28:39] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:28:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:29:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:30:09] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:30:43] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612223 (Paladox) Hmm, not stable now. [19:18:20] <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [19:18:29] <+icing...
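Since the Grafana panels referenced above are backed by the eqiad "prometheus/ops" datasource, one way to act on "might be worth checking other servers?" is a single query against the Prometheus HTTP API. In the sketch below both the Prometheus URL and the metric name are placeholders for illustration, not the real names behind that panel:

```python
# Sketch of the "check other servers" idea via the Prometheus HTTP API.
# The URL and metric name are placeholders; the real metric may differ.
import requests

PROMETHEUS = "http://prometheus.example.org/ops"   # placeholder URL
QUERY = "varnish_expiry_mailbox_lag"               # hypothetical metric name

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

# Instant-query responses are a vector of {metric labels, [timestamp, value]}.
for result in resp.json()["data"]["result"]:
    host = result["metric"].get("instance", "unknown")
    value = float(result["value"][1])
    flag = "  <-- worth a look" if value > 100000 else ""
    print(f"{host}: {value:.0f}{flag}")
```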
[18:32:49] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612226 (Samtar) It looks like cp1052 had a spike, but has since recovered {F9585689} `RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]`
[18:46:10] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 85790.32 seconds
[18:58:29] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:58:30] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:59:20] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.294 second response time
[18:59:39] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73833 bytes in 9.636 second response time
[19:47:19] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3612304 (Yann) Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 34013240 Error: 503, Backend fetch failed at Sat, 16 Sep 2017 19:46:47 GMT
[19:51:09] PROBLEM - Disk space on mendelevium is CRITICAL: DISK CRITICAL - free space: / 9886 MB (43% inode=3%)
[19:51:19] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:51:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:53:37] something may be wrong with otrs
[19:53:39] akosiaris
[19:53:42] <icinga-wm> PROBLEM - Disk space on mendelevium is CRITICAL: DISK CRITICAL - free space: / 9886 MB (43% inode=3%)
[19:53:49] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[19:53:50] combined with a spike in the ticket creation graph on the otrs dashboard
[19:54:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[20:01:30] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:01:59] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:02:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:02:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:06:59] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89851.66 seconds
[20:10:49] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:24:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[20:25:29] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[20:25:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:26:59] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:32:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:33:00] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:33:30] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:34:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:39:19] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[20:39:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:47:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:17:49] RECOVERY - salt-minion processes on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:20:49] PROBLEM - salt-minion processes on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:24:42] (CR) Framawiki: [C: -1] "Not a good idea for me, see the task." [mediawiki-config] - https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: Zoranzoki21)
[21:35:29] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3612478 (Tgr) Thanks, Rob! > I'm assigning this to you for your input on the above (additonal group name plus L3 signature. Please assign back to me w...
[21:56:29] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:57:29] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73828 bytes in 2.526 second response time
[22:05:45] !log compress older otrs directories to reclaim inodes - T171490
[22:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:02] T171490: mendelevium (otrs) running out of inodes - https://phabricator.wikimedia.org/T171490
[22:08:10] RECOVERY - Disk space on mendelevium is OK: DISK OK
[22:09:14] Operations, OTRS: mendelevium (otrs) running out of inodes - https://phabricator.wikimedia.org/T171490#3612568 (fgiunchedi) The growth of used inodes since a few hours was pretty steep, I compressed and removed the older otrs versions: ``` otrs-5.0.13 otrs-5.0.19 otrs-5.0.7 otrs-5.0.6 otrs-3.2.14.bak ot...
[22:14:59] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:16:32] godog, hey
[22:16:39] did you see the graph on the otrs dashboard?
[22:16:50] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73828 bytes in 3.499 second response time
[22:18:16] Krenair: no, which dashboard?
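The inode fix above (compressing and removing old otrs release trees) follows from first finding which directory trees hold the most files. A generic sketch of that survey, with an example path rather than mendelevium's actual layout:

```python
# Generic sketch: rank subdirectories by how many filesystem entries (inodes)
# they contain. The root path is an example, not taken from the log; on
# mendelevium the culprits were old otrs release trees, per the task comment.
import os
from collections import Counter

ROOT = "/opt"   # example path

counts = Counter()
for top in sorted(os.listdir(ROOT)):
    path = os.path.join(ROOT, top)
    if not os.path.isdir(path):
        continue
    total = 0
    for _dirpath, dirnames, filenames in os.walk(path):
        total += len(dirnames) + len(filenames)
    counts[top] = total

for name, total in counts.most_common(10):
    print(f"{total:>10}  {name}")
```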
[22:18:30] the one you see when you log into otrs
[22:19:17] there's a huge spike and it doesn't look like it's stopping
[22:21:26] Krenair: I don't think I have access to otrs :( I've bandaided the problem for now I hope, i.e. inode usage is at 38%
[22:21:43] ok
[22:22:28] come to think of it, I think the database server where those tickets actually get stored is a different machine/cluster
[22:22:36] i.e. not on mendelevium's disk
[22:23:21] though it's possible it's doing a lot of logging
[22:27:17] it is possible yeah, looks like a lot of tmp files
[22:31:10] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:32:09] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73808 bytes in 5.320 second response time
[22:33:21] Krenair: that's indeed impressive
[22:33:26] a spam campaign, probably
[22:34:03] yes
[22:34:24] delivery failures for people trying to impersonate our domains
[22:34:35] ouch
[22:41:02] hieradata/role/common/otrs.yaml:profile::otrs::database_host: m2-master.eqiad.wmnet
[22:41:57] $ dig m2-master.eqiad.wmnet @ns0.wikimedia.org +short
[22:41:57] dbproxy1002.eqiad.wmnet.
[22:41:57] 10.64.0.166
[22:42:15] which is a proxy for... m1, primary db1016 and secondary db1001
[22:43:27] nope I read the wrong part
[22:43:52] m2, master db1020 and secondary db2011
[22:46:30] db1020 traffic does look a bit strange towards the end https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1020&var-port=9104&from=now-7d&to=now
[22:47:21] write queries in particular
[22:47:42] and InnoDB IO operations
[22:48:52] and other stuff
[23:04:19] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89549.70 seconds
[23:27:19] hmmmmm
[23:27:38] slave lag OK because it's at 24-25 hours?
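The closing question lines up with the earlier recovery message reporting roughly 89,549 seconds (about 24.9 hours) of lag as OK, which would make sense if dbstore1001 is an intentionally delayed replica; that is an assumption here, not something the log states. A sketch of a lag check written with such a delay in mind (host, credentials, and the expected delay are all placeholders):

```python
# Sketch: report replication lag relative to an expected, intentional delay.
# The 24 h delay, the host name, and the credentials file are assumptions.
import os
import pymysql

EXPECTED_DELAY = 24 * 3600   # assumed intentional delay, in seconds
TOLERANCE = 3600             # how far past the delay still counts as OK

conn = pymysql.connect(
    host="dbstore1001.example",                          # placeholder host
    read_default_file=os.path.expanduser("~/.my.cnf"),   # client credentials
    cursorclass=pymysql.cursors.DictCursor,
)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

lag = status["Seconds_Behind_Master"]
if lag is None:
    print("CRITICAL: replication is not running")
elif lag <= EXPECTED_DELAY + TOLERANCE:
    print(f"OK: lag {lag}s is within the expected ~24h delay")
else:
    print(f"WARNING: lag {lag}s exceeds the expected delay")
```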