[00:02:04] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:02:14] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:02:23] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:02:23] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:02:44] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:02:44] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:02:44] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[00:03:03] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[00:03:13] RECOVERY - DPKG on stat1005 is OK: All packages OK
[00:03:23] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[00:03:44] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[00:03:44] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[00:07:23] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:59:54] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:00:03] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:00:14] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:00:24] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:00:33] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:03:54] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[01:04:03] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:04:13] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509152645 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3968331 keys, up 4 minutes 2 seconds - replication_delay is 1509152645
[01:04:14] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[01:04:24] RECOVERY - DPKG on stat1005 is OK: All packages OK
[01:04:33] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[01:04:43] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[01:05:13] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3963247 keys, up 5 minutes 3 seconds - replication_delay is 0
[01:05:43] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3961975 keys, up 5 minutes 33 seconds - replication_delay is 0
[02:28:34] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:28:34] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:28:43] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:28:44] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:29:14] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:31:24] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:33:34] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[02:33:34] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[02:33:44] RECOVERY - DPKG on stat1005 is OK: All packages OK
[02:33:53] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[02:36:23] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:47:43] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0
[03:25:54] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 826.82 seconds
[04:01:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 129.01 seconds
[05:08:05] (03PS1) 10BryanDavis: wikireplicas: Add index for page_props.pp_value [puppet] - 10https://gerrit.wikimedia.org/r/386973 (https://phabricator.wikimedia.org/T140609)
[07:07:23] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 38, down: 1, dormant: 0, excluded: 1, unused: 0
[07:08:23] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0
[07:52:18] set downtime for stat1005, a lot of big jobs ongoing
[07:52:26] so it will not spam over the weekend
[07:52:44] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[07:57:07] Zayo maintenance announced --^
[08:06:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[10:15:03] 10Operations: Puppet wmf-style-guide: array of classes not detected properly - https://phabricator.wikimedia.org/T179230#3717442 (10Volans)
[10:51:33] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[10:52:13] !log restarted pdfrender on scb1001, was stuck since 2d with AssertionError: display is not set!
[10:52:17] cc mobrovac ^^^
[10:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:34] ACKNOWLEDGEMENT - IPMI Sensor Status on analytics1037 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical, Status = Critical] Volans See https://phabricator.wikimedia.org/T179192
[10:55:48] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3716364 (10Volans) Icinga is reporting it as critical: ``` Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical, Statu...
[11:19:59] (03PS1) 10Hoo man: Don't use Cirrus for wbsearchentities on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386998
[11:20:51] (03CR) 10Hoo man: [C: 032] "Beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386998 (owner: 10Hoo man)
[11:22:35] (03Merged) 10jenkins-bot: Don't use Cirrus for wbsearchentities on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386998 (owner: 10Hoo man)
[11:22:47] (03CR) 10jenkins-bot: Don't use Cirrus for wbsearchentities on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386998 (owner: 10Hoo man)
[11:24:10] !log hoo@tin Synchronized wmf-config/Wikibase-labs.php: Consistency sync (duration: 00m 50s)
[11:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:24] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:26:24] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:26:43] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:26:53] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:26:54] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:27:03] PROBLEM - HHVM rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:27:03] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:27:33] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:28:23] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:28:53] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time
[11:29:13] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.443 second response time
[11:32:03] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:32:23] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:32:28] 10Operations, 10media-storage, 10User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3717534 (10Jcb) I have tried again and these two files still cannot be deleted.
[11:34:13] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time
[11:37:23] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:37:33] RECOVERY - Disk space on stat1005 is OK: DISK OK
[11:39:13] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time
[11:40:13] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 74304 bytes in 9.264 second response time
[11:40:33] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 74303 bytes in 0.118 second response time
[11:41:54] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.378 second response time
[11:42:44] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.748 second response time
[11:43:14] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:43:44] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:44:30] https://en.wikipedia.orghttps://en.wikipedia.org
[11:44:34] oh woops
[11:44:43] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 74248 bytes in 4.397 second response time
[11:45:23] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:45:39] hmm wikipedia didn't load for me for a minute
[11:45:54] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time
[11:45:54] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:46:00] ops ^^
[11:46:11] (icinga warnings for mw*)
[11:47:34] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.042 second response time
[11:47:44] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:48:04] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:48:13] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 74246 bytes in 0.098 second response time
[11:48:33] PROBLEM - Check systemd state on mw1285 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:49:14] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.035 second response time
[11:49:43] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 74246 bytes in 0.097 second response time
[11:50:03] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.029 second response time
[11:51:22] hm… seems FS related … I had a hanging getdent on (non-tmpfs) /tmp on one of them
[11:51:58] hoo: mmm checking
[11:52:35] thanks
[11:53:10] !log restart hhvm on mw1285 - hhvm-dump-debug in /tmp/hhvm.17700.bt
[11:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:03] so three hosts showing this issue
[11:54:26] yeah like mw1280 https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1280
[11:54:32] yeah
[12:01:04] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1313.eqiad.wmnet
[12:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:12] elukey: Anything interesting about / or /tmp?
[12:04:22] not that I can see.. from the stacktraces on mw1285 (hhvm-dump-debug) it seems that threads were waiting for a condition and one of them was trying to do something with the translation cache
[12:04:29] it reminds me of https://wikitech.wikimedia.org/wiki/Incident_documentation/20170917-MediaWiki-API
[12:05:01] trying to dump mw1313's thread status too but it hangs
[12:05:58] mw1280 seems to have auto-solved by itself
[12:06:18] yeah… I saw that live
[12:06:33] my ls /tmp was hanging in getdent… and then it suddenly was all good again
[12:07:04] https://grafana-admin.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=mw1280&var-datasource=eqiad%20prometheus%2Fops
[12:07:16] sorry without the -admin
[12:10:13] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.078 second response time
[12:10:33] RECOVERY - HHVM rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 74248 bytes in 0.170 second response time
[12:10:46] !log manually killed (SIGTERM) hhvm on mw1313 - high load, hhvm-dump-debug not responsive
[12:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:03] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.071 second response time
[12:14:18] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1313.eqiad.wmnet
[12:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:33] RECOVERY - Check systemd state on mw1285 is OK: OK - running: The system is fully operational
[12:14:52] ok all hosts should be ok now, still not sure about the root cause
[12:22:02] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3716364 (10Cmjohnson) I am not sure what else can be done here. I’ve replaced the PSUs twice, upgraded f/w and they continue to burn through the fans. This group of s...
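The two conftool !log entries above are the usual depool/repool dance around an HHVM restart. A minimal sketch of the equivalent commands, assuming the standard confctl CLI on a cluster-management host and that this selector syntax matches the local setup:

```bash
# Depool mw1313 from the load-balancer pools before touching HHVM (selector syntax assumed).
sudo confctl select 'name=mw1313.eqiad.wmnet' set/pooled=no

# Restart the service and do a hypothetical local health check before repooling.
sudo systemctl restart hhvm
curl -sI -H 'Host: en.wikipedia.org' http://localhost/ | head -n1

# Repool once it answers again.
sudo confctl select 'name=mw1313.eqiad.wmnet' set/pooled=yes
```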
[12:37:32] going afk hoo, all stable, I've reported the issue
[12:37:51] Cool, thanks :)
[13:11:35] 10Operations, 10Traffic, 10WMF-Communications, 10HTTPS, 10Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#3717641 (10Florian) @BBlack Nevertheless: Thanks for your time and work you put into this! :)
[14:12:20] (03CR) 10ArielGlenn: [C: 032] clean up partially written files from previous 7z runs of same wiki and date [dumps] - 10https://gerrit.wikimedia.org/r/385368 (owner: 10ArielGlenn)
[14:16:45] (03PS2) 10ArielGlenn: fix multistream job so inprogress files are moved to real names when done [dumps] - 10https://gerrit.wikimedia.org/r/386155 (https://phabricator.wikimedia.org/T177523)
[14:18:02] (03CR) 10ArielGlenn: [C: 032] fix multistream job so inprogress files are moved to real names when done [dumps] - 10https://gerrit.wikimedia.org/r/386155 (https://phabricator.wikimedia.org/T177523) (owner: 10ArielGlenn)
[14:40:03] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:03:23] PROBLEM - Check Varnish expiry mailbox lag on cp4025 is CRITICAL: CRITICAL: expiry mailbox lag is 2019682
[15:10:03] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:53:03] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:55:47] bblack: ping
[15:55:49] it's happening again
[15:55:51] see cp1055
[15:56:19] ema: ^
[16:06:54] PROBLEM - High lag on wdqs1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0]
[16:07:10] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3717711 (10hoo) Happening again, this time on `cp1055`. Example from `mw1180`: ``` $ ss --tcp -r | grep -oP 'cp\d+' | sort |...
[16:07:13] you don't say
[16:07:33] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0]
[16:08:33] PROBLEM - High lag on wdqs1005 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0]
[16:14:09] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3717712 (10hoo) Also on `mw1180`: ``` $ sudo -u www-data ss --tcp -r -p > ss $ cat ss | grep -c FIN-WAIT-2 16 $ cat ss | grep...
[16:22:10] hoo, should that ^^ be triaged as high or unbreak?
[16:22:24] UBN by now
[16:22:27] ok
[16:22:56] thanks
[16:22:58] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3717713 (10Paladox) p:05Triage>03Unbreak! Spoke to hoo on irc, who agreed it's an UBN now.
[16:23:03] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[16:29:56] hmm looks like wdqs troubles are back. And third time seems to be at the same time of the day... this is very suspicious
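The truncated `ss` snippets quoted from T179156 above are the check being used to spot which cache host an appserver is piling up connections against; spelled out (run on an appserver such as mw1180), it is roughly:

```bash
# Count connections per cache host as seen from the appserver; a wedged cache backend
# shows up with a much higher count than its peers.
sudo -u www-data ss --tcp -r -p | grep -oP 'cp\d+' | sort | uniq -c | sort -rn

# Same thing, but only the half-closed connections: the appserver has sent its FIN
# and is still waiting for the varnish side to finish closing.
sudo -u www-data ss --tcp -r -p | grep FIN-WAIT-2 | grep -oP 'cp\d+' | sort | uniq -c
```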
[16:31:37] somewhere around 16:00 something goes wrong :(
[16:36:58] SMalyshev: The same problem again, see the ticket
[16:37:04] but no one from ops around to help
[16:37:11] so we will have to wait for it to get worse
[16:37:13] so that it pages
[16:39:13] hoo: hello :)
[16:39:29] I didn't follow the issue, can you give me a tl;dr ?
[16:39:42] and also what is the impact now
[16:39:58] I can try to contact the Traffic team in case it is needed
[16:40:10] elukey https://phabricator.wikimedia.org/T179156
[16:41:08] paladox: thanks, I saw the task, but there is no clear indication of what the state is :)
[16:41:23] hoo: ? any info?
[16:41:49] elukey: Well, too many open connections from cp1055 to the appservers
[16:41:59] brandon usually fixes this by restarting the varnish backend
[16:42:00] s
[16:43:03] hoo: this happens because the permanent fix has not been deployed or because it is still under investigation ?
[16:43:15] elukey: We have no real fix, yet
[16:45:38] hoo: current impact?
[16:46:02] elukey: Raised 503s, slowness
[16:46:04] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?refresh=5m&orgId=1
[16:46:11] but will fail harder if not touched
[16:47:52] so what runs around 16:00 that causes this?
[16:48:46] or even 15:50
[16:49:20] SMalyshev: Not sure… see the ticket
[16:51:54] !log restart varnish backend on cp1055 - mailbox lag + T179156
[16:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:01] T179156: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156
[16:52:25] elukey: That should have fixed it
[16:52:40] did you look at open tcp connections there? In which state were they?
[16:53:34] nope I didn't, icinga showed mailbox expiry lag so I thought to proceed (plus I trusted your data in the task about conns)
[16:54:28] the only thing in puppet cron that I see around this time is cron { 'refinery source guard' for which I have no idea if it could be the culprit
[16:54:41] SMalyshev: This is not happening for the first time
[16:55:04] If I had to guess, this is a side effect of e475e38d5ab479c65066e5dc32a85aa87e8645c6
[16:55:08] yeah but for the last 3 days it happens at the same time, which to me suggests common cause
[16:55:53] hmm 503 spiked from 21 to 39.
[16:56:54] SMalyshev: but on different hosts :S
[16:57:24] well if something outside is causing it, then what hosts are affected is probably random
[16:58:23] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[16:58:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[16:59:36] so from https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?refresh=5m&orgId=1&from=now-3h&to=now it seems going down
[17:01:05] hoo: do you have any good metric to verify whether the lag on wdqs is improving?
[17:01:28] sure, one sec
[17:01:47] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&panelId=8&fullscreen&orgId=1
[17:03:15] hoo: you got cp1055 from the api ss output?
[17:03:23] Hey! I'm stuck in a train with limited batteries and connectivity. Back online in ~2h.
[17:03:32] gehel: It's fine
[17:03:40] elukey: Saw it on Grafana
[17:03:47] I saw the frontend connections grow
[17:03:49] and then found that
[17:04:04] and then I picked a random app server to diagnose (no shell on the cp*s)
[17:04:21] hoo: which frontend?
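The "mailbox lag" in the 16:51 !log entry is the backlog of expiry messages queued to varnish's expiry thread. A rough way to read it off a cache host, assuming the MAIN.exp_mailed/MAIN.exp_received counters that the Icinga check appears to be based on:

```bash
# Expiry mailbox lag ~= objects handed to the expiry thread minus objects it has processed.
mailed=$(varnishstat -1 -f MAIN.exp_mailed | awk '{print $2}')
received=$(varnishstat -1 -f MAIN.exp_received | awk '{print $2}')
echo "expiry mailbox lag: $((mailed - received))"
```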
[17:04:40] gehel: frontend varnish requests
[17:04:44] on several varnishes
[17:04:59] cp1054 for example https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=16&fullscreen&orgId=1&var-server=cp1054&var-datasource=eqiad%20prometheus%2Fops
[17:05:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:05:45] So we are again on the 1k limit?
[17:06:20] ah there you go https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&from=now-3h&to=now
[17:06:23] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:06:25] Seems recovering by now... But no idea why
[17:06:28] this clearly indicates 1055
[17:06:37] gehel: I restarted varnish on cp1055
[17:07:34] should have been the one causing the lag iiuc but I have limited knowledge of this issue
[17:08:54] gehel: Indeed… but we might have maxed out 10k eventually as well
[17:11:10] yep wdqs is recovering lag is declining
[17:11:42] hoo: do we need to follow up with Brandon or anybody else to prevent another mess?
[17:11:53] elukey: I guess so
[17:12:11] if we don't know the key cause we could have the same happening tomorrow :(
[17:14:21] hoo: yeah, I was just stating the obvious, not saying we should change that limit
[17:15:03] so it seems to be happening around 3pm UTC?
[17:15:35] I'll try to be around at that time tomorrow (clocks changing this night, I'm probably going to be wrong..)
[17:17:28] elukey: Can you say when the restart varnish cron is on that machine?
[17:17:42] /etc/cron.d/varnish-backend-restart
[17:18:37] 32 3 * * 0 root /usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null
[17:19:33] ok… at least that's not it
[17:25:09] wdqs updater is not catching up as fast as it should, but varnish 500 seems under control again
[17:26:16] also, it looks like wdqs updater starts to fall behind >30 minutes before I see varnish 500 rise
[17:26:30] no idea what that means, just stating the facts I observe
[17:27:20] gehel: It gets slow first
[17:27:24] then slowly starts to fail
[17:28:11] I should look at yesterday's timeline, but I think the varnish connection number was rising faster than that
[17:29:18] less traffic on a saturday
[17:29:20] I guess
[17:29:46] yeah that might explain...
[17:30:06] need to disconnect... see you later...
[17:30:14] see you
[17:31:30] so I had a chat with Brandon on the phone, he is not close
[17:31:56] so this issue is likely to move from a cp server to another, and we'd need to keep everything monitored and act if needed
[17:32:10] (because of re-hashing etc.. after the backend restart on cp1055)
[17:32:36] (03Abandoned) 10ArielGlenn: Make a few more config settings parseable per-project. [dumps] - 10https://gerrit.wikimedia.org/r/386389 (https://phabricator.wikimedia.org/T178893) (owner: 10ArielGlenn)
[17:33:01] elukey: Yeah, it's going against cp1053 now
[17:33:02] it is likely that cp1053.eqiad.wmnet will cause issues
[17:33:08] exactly :D
[17:33:44] hoo@mw1180:~$ sudo -u www-data ss --tcp -r -p | grep FIN-WAIT-2 | grep -oP 'cp\d+' | sort | uniq -c
[17:33:44] 6 cp1053
[17:33:53] easy way to detect this ;)
[17:34:22] (1) cp1053.eqiad.wmnet
[17:34:23] ----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
[17:34:23] 1 10.2.2.18:8080
[17:34:23] 174 10.2.2.17:7231
[17:34:23] 498 10.2.2.22:80
[17:34:25] 606 10.2.2.1:80
[17:34:27] this is another angle :)
[17:34:34] netstat from the cp1053 host
[17:34:44] are they also in fin wait 2?
[17:35:00] (03PS1) 10ArielGlenn: Permit overrides section in dump config files and more per proj settings [dumps] - 10https://gerrit.wikimedia.org/r/387022 (https://phabricator.wikimedia.org/T178893)
[17:35:32] yeah some of them in FIN WAIT 1/2
[17:39:19] elukey: Does varnish mention running into idle timeout?
[17:40:32] hoo: do you mean its logs?
[17:40:39] yeah, I guess
[17:41:16] not that I can see on the fly
[17:46:14] hoo: so I am going away now since I gtg, but I am bringing my laptop with me
[17:46:39] ok
[17:46:43] I know that cp1053 might cause troubles soon but I'd prefer to wait before restarting
[17:49:08] * elukey afk
[18:15:39] I'm here now
[18:16:22] 1053 seems to be current target, I see that above too, it's slowly getting worse there
[18:17:38] so it seems like hoo is predicting the same, by looking at outbound conns from certain MW hosts
[18:18:12] which is one of the non-api appservers (mw1880) he was seeing
[18:19:15] I'm going to do a VCL reload on cp1053 and reduce the timeout parameters there to the backend, as an experiment
[18:21:20] !log cp1053 - manual VCL change, backends appservers+api_appservers, reduce connect/firstbyte/betweenbytes timeouts from 5/180/60 to 3/20/10
[18:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:20] so what are the other symptoms presently? just wdqs reporting high lag on updates to wikidata?
[18:28:31] !log cp4025 - restart backend for mailbox lag (upload@ulsfo, unrelated to text-cluster issues)
[18:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:23] RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 0
[18:34:50] When I visit en.wikipedia.org, I'm getting blocking(???) requests to https://meta.wikimedia.org/w/load.php?debug=false&lang=en&modules=ext.globalCssJs.user&only=scripts&skin=monobook&user=MZMcBride&version=078lqxd that are hanging.
[18:35:45] bblack ^^
[18:35:47] load.php URLs are taking over a minute to respond, hmm.
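For context on the 18:21 !log entry: in production the connect/first-byte/between-bytes timeouts live in the per-backend VCL definitions (hence the manual VCL change), but Varnish exposes the same knobs as global default parameters, which is a quick way to see what the three numbers refer to. A sketch, assuming Varnish 4's parameter names; per-backend VCL settings override these globals:

```bash
# Show the global default backend-fetch timeouts on a cache host.
varnishadm param.show connect_timeout
varnishadm param.show first_byte_timeout
varnishadm param.show between_bytes_timeout

# The experiment above lowered the equivalent per-backend values from 5s/180s/60s
# to 3s/20s/10s for the appservers and api_appservers backends (in VCL, not via param.set).
```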
[18:36:34] yeah sounds about right
[18:39:42] !log restarting varnish backend on cp1053 to move the lag/503 issues to another box and buy more time to debug
[18:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:53] 1065 is the new target
[18:42:11] (03PS1) 10BBlack: varnish: lower timeouts to api+appservers backends [puppet] - 10https://gerrit.wikimedia.org/r/387024
[18:42:37] (03CR) 10BBlack: [V: 032 C: 032] varnish: lower timeouts to api+appservers backends [puppet] - 10https://gerrit.wikimedia.org/r/387024 (owner: 10BBlack)
[18:43:20] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?refresh=5m&orgId=1 shoots up to 231
[18:43:23] from 5
[18:43:29] that's a big jump
[18:43:36] yes, it's expected
[18:43:43] there will be as pike every time I restart a backend
[18:43:46] ok
[18:43:46] s/pike/spike/
[18:43:48] thanks
[18:43:51] lol
[18:44:19] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-6h&to=now
[18:44:23] ^ is a better view
[18:44:43] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[18:44:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[18:45:39] and there's the laggy report from the stats alerter, about the earlier spike that's already gone :P
[18:46:49] so, if the experimental reduction of timeouts actually has a useful effect
[18:47:20] we might see it prevent the Bad Things from spreading to unrelated/normal queries, at the cost that any slow/long-running queries (probably the Bad Things causing problems that we can't identify) will fail harder and faster.
[18:47:39] I think 1053 was already too far gone to save with the timeouts, but they applied on 1065 before the pattern had been there for very long
[18:50:20] cp1065 does look to be in a better pattern so far
[18:50:36] bblack: I'm around if you need me. I can't think of anything that I might actually do, but ping me if you think otherwise...
[18:50:49] it's clearly still "the target" with elevated connection counts, but they're levelling off much lower (closer to ~200-300, instead of ramping up to an eventual ~3K, normal being ~100 or less)
[18:52:51] what's interesting is I still can't *see* these timeout-ing requests
[18:53:43] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:53:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:55:54] bblack: Did you see my comments on the ticket?
[18:56:02] ok now I'm noticing a few oddballs, with 3s failures and "no backend connection", maybe tcp connect() timeout->fail
[18:56:17] it looks more like failure to find an open connection in the pool, though....
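The slow fetches do start showing up in the log shortly after this; the exact varnishlog invocation isn't recorded, but a sketch of the kind of filter involved, assuming Varnish 4's VSL query syntax and a ~4s threshold:

```bash
# Show backend transactions whose fetch took more than ~4s from Bereq to Beresp,
# printing the request URL, the Host header and the timing records.
varnishlog -b -q 'Timestamp:Beresp[2] > 4.0' -i BereqURL -i Timestamp -I 'BereqHeader:^Host:'
```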
[18:56:45] ah here's a good one
[18:56:47] - BereqURL /w/api.php?action=flow-parsoid-utils&from=wikitext&to=html&content=%7B%7BQ%7C352804%7D%7D+has+date+P569%3A+1533-01-01%2C+but+%5Bhttp%3A%2F%2Fd-nb.info%2Fgnd%2F100198414+GND%5D+has+1534&title=Q352804&format=json
[18:56:51] - BereqHeader Host: www.wikidata.org
[18:56:54] - Timestamp Bereq: 1509216884.761646 0.000042 0.000042
[18:56:56] - Timestamp Beresp: 1509216965.538549 80.776945 80.776903
[18:56:59] - Timestamp Error: 1509216965.538554 80.776950 0.000005
[18:57:00] that's an 80s total time to fetch before it failed
[18:57:03] - Timestamp Start: 1509216970.911803 0.000000 0.000000
[18:57:09] on wikidata.org
[18:57:24] let me see if I can grab that one's whole log before it all scrolls off
[18:57:54] hoo: no, not yet
[19:00:12] https://phabricator.wikimedia.org/P6210
[19:01:18] I tried the same URL manually from curl just now and it responded quickly, though
[19:01:42] but the paste is a log of some Magnus tool hitting that URL via text-lb, and api_appservers taking 80s to timeout responding to it
[19:01:48] that API is not on the grafana dashboard
[19:01:56] but we could get avg response times from hadoop
[19:02:04] although with the 3/20/10 timeouts I wonder why it even took 80s
[19:02:07] * hoo does that
[19:02:23] unless it's trickling out bytes slowly (so refreshed the 10s between_bytes_timeout a few times before failing)
[19:02:42] (03PS2) 10MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793)
[19:03:43] I wonder what that API even is
[19:04:00] seems like an odd combination of things there right in the name+backend. MW API servers + Wikidata + Parsoid?
[19:04:49] another one
[19:04:51] - BereqURL /w/api.php?action=flow-parsoid-utils&from=wikitext&to=html&content=%7B%7BQ%7C557525%7D%7D+has+date+P569%3A+1879-01-05%2C+but+%5Bhttp%3A%2F%2Fd-nb.info%2Fgnd%2F123925037+GND%5D+has+1879-06-05&title=Q557525&format=json
[19:04:55] - BereqHeader Host: www.wikidata.org
[19:04:58] - Timestamp Bereq: 1509217259.045747 0.000054 0.000054
[19:05:00] - Timestamp Beresp: 1509217340.318837 81.273144 81.273090
[19:05:03] - Timestamp Error: 1509217340.318843 81.273150 0.000006
[19:05:03] same behavior, 80s timeout, flow-parsoid-utils API
[19:05:06] - Timestamp Start: 1509217340.450918 0.000000 0.000000
[19:05:30] not related to WD specifically
[19:05:37] but flow inline comment parsing, I suppose
[19:05:52] well so far the only two examples are on WD hostname though
[19:06:06] just hit another:
[19:06:11] - BereqURL /w/api.php?action=flow-parsoid-utils&from=wikitext&to=html&content=%7B%7BQ%7C598228%7D%7D+has+date+P569%3A+1874-07-07%2C+but+%5Bhttp%3A%2F%2Fd-nb.info%2Fgnd%2F1057215910+GND%5D+has+1875&title=Q598228&format=json
[19:06:15] - BereqHeader Host: www.wikidata.org
[19:06:17] - Timestamp Bereq: 1509217467.874020 0.000036 0.000036
[19:06:20] - Timestamp Beresp: 1509217549.214916 81.340933 81.340897
[19:06:22] - Timestamp Error: 1509217549.214923 81.340940 0.000007
[19:06:24] these are the only >3s timeout queries I'm seeing
[19:06:34] wtf :/
[19:08:35] bblack: https://phabricator.wikimedia.org/P6211
[19:08:47] that looks weird
[19:08:58] are failed/timed-out requests not making it to kafka -> hadoop?
[19:09:24] wait, that's the wrong day
[19:09:28] not today
[19:09:48] I've just changed my log filter so I have less noise to sift through
[19:09:55] now I'm only logging timeouts that seem to go 3s+
[19:09:59] err, 4s+
[19:13:16] bblack: But why would that leave the connections in fin wait 2?
[19:14:00] apache acting up on varnish killing them?
[19:14:19] I don't know all that level of detail yet for sure
[19:14:31] on the varnish side they tend to pile up in CLOSE_WAIT
[19:14:45] either way, I see RST flying in tcpdump, so the connection closes are certainly often-abnormal
[19:16:47] it's still causing mailbox lag to ramp on 1065, even with the better timeouts shielding it
[19:16:55] so it will eventually need a restart
[19:17:18] caught another, same thing
[19:17:20] - BereqURL /w/api.php?action=flow-parsoid-utils&from=wikitext&to=html&content=%7B%7BQ%7C1249554%7D%7D+has+date+P570%3A+1800-03-17%2C+but+%5Bhttp%3A%2F%2Fd-nb.info%2Fgnd%2F116209356+GND%5D+has+1806&title=Q1249554&format=json
[19:17:26] (on wikidata.org, ~81s timeout)
[19:17:43] and another
[19:17:45] - BereqURL /w/api.php?action=flow-parsoid-utils&from=wikitext&to=html&content=%7B%7BQ%7C1145407%7D%7D+has+date+P569%3A+1839-05-28%2C+but+%5Bhttp%3A%2F%2Fd-nb.info%2Fgnd%2F133998282+GND%5D+has+1829&title=Q1145407&format=json
[19:17:51] hm
[19:18:03] those have all been from one labs IP
[19:18:05] 10.68.21.169
[19:19:44] I'll stop reporting new ones, they're all basically the same, just different params
[19:20:41] above IP is tools-exec-1421
[19:21:36] hm… the max response times exploded between 13:00 and 14:00
[19:23:35] 10Operations, 10Ops-Access-Requests, 10Phabricator, 10Research: add the #acl*operations-team to the s9 analytics space for nda approvals - https://phabricator.wikimedia.org/T152718#3717797 (10Aklapper)
[19:24:04] tools.m+ might be magnus
[19:24:27] (03PS1) 10Hoo man: Half the Flow -> Parsoid timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387028 (https://phabricator.wikimedia.org/T179156)
[19:24:29] bblack: ^
[19:24:42] 100s as timeout seems a little too much for me
[19:25:41] as far as my data suggests, reqs outside of our problems are way faster than that
[19:26:57] sorry I'm off trying to narrow down the tool that's making those calls
[19:27:00] looking
[19:28:26] hoo: yeah, even 50 seems a little crazy
[19:28:33] is the user going to wait around 50s for it?
[19:28:49] I don't know
[19:30:37] it's a start anyways
[19:31:22] so shall we give it a try?
[19:35:29] yeah I think it's worth trying
[19:35:44] maybe it causes the 80s to move to 50s and gives some hint that these things are connected
[19:36:01] (03CR) 10Hoo man: [C: 032] Half the Flow -> Parsoid timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387028 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man)
[19:37:21] (03Merged) 10jenkins-bot: Half the Flow -> Parsoid timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387028 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man)
[19:37:31] (03CR) 10jenkins-bot: Half the Flow -> Parsoid timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387028 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man)
[19:38:27] I'm going to go ahead and restart varnish-be on cp1065 before its other stats get ugly and it causes 503s or whatever
[19:38:38] it'll bounce to a new target, but with the new timeout protections already in place from the get-go
[19:39:00] !log backend restart on cp1065
[19:39:03] !log hoo@tin Synchronized wmf-config/CommonSettings.php: Half the Flow -> Parsoid timeout (100s -> 50s) (T179156) (duration: 00m 51s)
[19:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:11] T179156: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156
[19:41:58] new target seems to be 1067
[19:42:03] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:42:03] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:42:14] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:42:22] oh, I think I see what's going on
[19:42:29] logging for long timeouts there now (cp1067)
[19:43:07] I've looked around at all the long-running bots on that tools-exec node, but I haven't found out which one it is yet
[19:43:14] maybe it's something else that just pops up periodically
[19:44:46] uh, problem seemed to at least briefly hit 1067 anyways, but may have now vanished from all nodes....
[19:45:08] maybe oldest requests were still open?
[19:45:31] the API requests just stopped
[19:45:38] well restarting cp1065 killed all the existing ones
[19:45:44] * hoo is watching them real-time (ish) on mwlog1001
[19:45:53] now it's going again
[19:45:55] cp1067 did have a brief spike of connection counts and mailbox lag, but recovered within a minute or two to normal state
[19:48:03] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.633 second response time
[19:48:03] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 74197 bytes in 2.678 second response time
[19:49:01] 10Operations, 10Puppet: Puppet wmf-style-guide: array of classes not detected properly - https://phabricator.wikimedia.org/T179230#3717808 (10Peachey88)
[19:50:54] thinks look happy again
[19:50:57] * things
[19:51:13] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.034 second response time
[19:54:01] so basically the requests causing issues stopped on their own again?
[19:54:17] what was "oh, I think I see what's going on"?
[19:57:46] Yeah, there are *a lot* of parses happening right now
[19:58:03] like a huge number… and I think the timing aligns with the problems fairly well
[19:58:11] but I still need to crunch some numbers
[19:58:19] ok
[20:03:15] I'm gonna update various bits in the ticket from my end: workaround, observations, etc
[20:04:49] 2 14:38
[20:04:49] 2 14:40
[20:04:50] 148 14:43
[20:04:50] 199 14:44
[20:04:59] Number of requests to that API per minute today
[20:05:11] starts going just 2 minutes before the first spike
[20:07:14] I think it's starting again? I have another 80s timeout on flow-parsoid at 20:04 now
[20:07:42] err no, it is an 80s timeout, but it's not flow-parsoid
[20:08:22] ukwiki: /wiki/%D0%9A%D0%BE%D1%80%D0%B8%D1%81%D1%82%D1%83%D0%B2%D0%B0%D1%87:PavloChemBot/%D0%9D%D0%B5%D0%BF%D0%B5%D1%80%D0%B5%D0%BA%D0%BB%D0%B0%D0%B4%D0%B5%D0%BD%D1%96_%D1%81%D1%82%D0%BE%D1%80%D1%96%D0%BD%D0%BA%D0%B8
[20:08:32] and a few other relatively-normal URLs had some random ~80s timeouts though, oddly
[20:08:40] looks like normal Chrome UAs
[20:09:01] well, the flow API aligns rather well time wise
[20:09:35] but the overall request times don't look too bad
[20:09:40] there are some very bad ones
[20:09:43] but just a few
[20:10:15] Just 532 of 34826 requests today > 1s
[20:14:55] yeah but it doesn't take many slow requests to trash a lot of other unrelated things
[20:15:28] 86 requests today that were over 50s
[20:15:55] cp1067 still isn't showing the full-blown symptoms, but there are minor signs that things aren't perfect there either
[20:16:06] mailbox lag is growing (but slower), and those random 80s timeouts
[20:18:21] hm… parsoid related?
[20:19:40] flow-parsoid-utils seem to have dropped to 0 now
[20:21:39] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3717847 (10BBlack) Updates from the Varnish side of things today (since I've been bad about getting commits/logs tagged onto th...
[20:21:43] RECOVERY - High lag on wdqs1004 is OK: OK: Less than 30.00% above the threshold [600.0]
[20:23:14] RECOVERY - High lag on wdqs1005 is OK: OK: Less than 30.00% above the threshold [600.0]
[20:23:53] cp1067 is still showing subtle signs. slightly-elevated connection counts, rising mailbox lag, etc. but not as bad as cp1065 was earlier.
[20:24:10] and I'm not seeing the ~80s timeouts, other than the few random ones reported earlier, yet
[20:24:46] whatever used the API earlier seems to have stopped
[20:25:45] the fin wait 2s are back
[20:26:18] it's always possible that the *real* problem lies elsewhere, and the flow-parsoid requests timing out were just a really solid example of fallout
[20:27:04] bblack: Can we set net.ipv4.tcp_fin_timeout on the apaches?
[20:27:12] the varnishes should have it
[20:27:14] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:28:20] well we could, but that seems like going a bit far, tuning low-level details to work around impacts of a higher-level problem
[20:29:09] that's true
[20:29:42] Could we tcpdump into the connection on the application side and see what's going on there?
[20:29:51] if we can find the "bad" one
[20:30:10] I've been doing lots of tcpdumps like that on the cache side, but it's hard to find the needles
[20:30:21] You could dump it all and then wait for the ones that go into fin wait 2?
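A sketch of the two ideas floated just above, from the appserver side: reading the kernel's FIN-WAIT-2 timeout, and capturing traffic with the problem cache host so a hung close can be inspected afterwards (interface name and port are assumptions):

```bash
# How long the kernel keeps orphaned sockets in FIN-WAIT-2 (the default is usually 60s).
sysctl net.ipv4.tcp_fin_timeout

# Capture appserver<->cp1067 traffic for later inspection of how connections are torn down;
# the varnish backends are assumed to reach the appserver on port 80, interface eth0 assumed.
sudo tcpdump -i eth0 -nn -s 128 -w /tmp/cp1067-teardown.pcap 'host cp1067.eqiad.wmnet and tcp port 80'
```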
[20:30:39] and see what happened there earlier… I'm just guessing mostly :/
[20:30:46] yeah, kinda
[20:31:18] FIN_WAIT_2 isn't really an abnormal state though
[20:31:45] but it shouldn't be for long… this is clearly out of the normal here
[20:32:01] not happening anywhere but the varnish showing the pile ups
[20:32:06] * to the
[20:32:23] (at least not for so long that I can see them in ss/netstat)
[20:33:16] so, FIN_WAIT_2 on the appserver side means: appserver closed the connection with FIN, Varnish acked it with FIN-ACK, and now appserver is waiting on Varnish's FIN.
[20:33:59] do they really stay in FIN_WAIT_2 for a long time, or is there just a lot of them cycling through that state quickly?
[20:34:55] I saw one for 10s+ right now on mw1180
[20:39:33] I just did some very stupidly simple collection and most of them close fast it seems
[20:39:47] but 4 lasted for 10s+
[20:39:57] out of 17
[20:40:07] (and some cut off, I only polled for a minute)
[20:40:12] so not overly accurate data
[20:40:43] right
[20:41:02] so, I know we see a lot of corresponding CLOSE_WAIT on the varnish side too, which aligns with your FIN_WAIT_2
[20:41:37] that means both sides are getting hung up in a consistent state. appserver sent FIN, Varnish saw it and sent FIN_ACK, Varnish has not sent its own FIN and the appserver has not received it.
[20:41:59] I don't know why Varnish would delay in the CLOSE_WAIT state before doing its active close and sending a FIN. It doesn't make a lot of sense.
[20:42:44] but perhaps this is a symptom rather than a cause. Varnish is getting behind on things in general (as noted by the mailbox lag rising), and just doesn't get around to the active closes in a timely fashion?
[20:43:18] hm… perf top it?
[20:44:34] that varnish is falling behind is a given, I don't think perf will give much insight there
[20:44:54] I do wonder why it doesn't close immediately though, and instead has some other event fire later, but that's more of a code dive
[20:45:26] or it's the getting stuck on the locks that's doing it
[20:45:35] (the same locks that lag out the mailbox for expiries)
[20:45:39] it actually got worse now
[20:45:47] avery fin wait 2 time is > 10s now
[20:46:24] * average
[20:46:41] well cp1067 mailbox lag is now ~167K and rising, it will only get worse if that's the basic mechanism
[20:47:40] (which means, there's 167K messages sitting in this "mailbox" queue, sent by random varnish i/o threads to the expiry thread, whose job it is to expire/free objects out of cache. that's how far behind it is on processing them, usually stuck on pthread locks and iops taking too long)
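The "very stupidly simple collection" mentioned above was presumably something along these lines; a sketch that samples once a second and counts connections to cache hosts sitting in FIN-WAIT-2 (hostname pattern taken from the earlier one-liners):

```bash
# Sample for a minute: timestamp plus the number of connections to cp* hosts in FIN-WAIT-2.
for i in $(seq 60); do
    printf '%s %s\n' "$(date +%H:%M:%S)" \
        "$(sudo -u www-data ss --tcp -r state fin-wait-2 | grep -cP 'cp\d+')"
    sleep 1
done
```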
[20:48:34] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[20:49:25] 375 conns in CLOSE_WAIT to api-or-appservers on cp1067 right now
[20:52:28] if (vbc->state == VBC_STATE_STOLEN) {
[20:52:30] (void)shutdown(vbc->fd, SHUT_WR);
[20:52:30] vbc->state = VBC_STATE_CLEANUP;
[20:52:30] VTAILQ_INSERT_HEAD(&tp->killlist, vbc, list);
[20:52:30] tp->n_kill++;
[20:54:47] so when it uses that path, it actually doesn't close it immediately, and defers to some future event callback to kill it
[20:54:56] Yeah, I see
[20:55:35] it's a strange state
[20:55:55] right above the "if state == VBC_STATE_STOLEN", there's an assertion that it's not stolen, then taking of a pthread lock :P
[20:56:04] there's clearly a logic error there somewhere
[20:56:33] I can understand how the state could change by the time the lock is acquired, but then it could've changed before the lock was acquired too, so why would the assert even always be true?
[20:57:14] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[20:57:28] assert(vbc->state == VBC_STATE_USED);
[20:57:36] Lck_Lock(&tp->mtx);
[20:57:36] tp->n_used--;
[20:57:37] if (vbc->state == VBC_STATE_STOLEN) {
[20:57:38] ...
[20:58:27] ("stolen" means an idle connection in the pool was stolen by another request to reuse it)
[20:58:33] anyways
[21:03:41] !log cp1067 (current target cache): disabling the relatively-new VCL that sets do_stream=false if !CL on applayer fetches...
[21:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:53] hoo: CLOSE_WAIT dropped off to ~0 on cp1067 after the above
[21:05:10] aha
[21:05:41] that was a guess after staring at varnish internals and thinking about other recent changes
[21:06:20] I wonder if it's *the* trigger here, and the "problematic" requests of whatever kind weren't problematic before that VCL change?
[21:06:44] one theory that would fit all of this behavior (well, most of it, I think)
[21:07:20] would be that perhaps we have uncacheable requests flowing through Varnish from api/app servers, where the API is streaming very long/slow outputs to the client?
[21:07:28] like some kind of unofficial websockets-like hack
[21:08:05] if it's known-uncacheable and we have do_stream=true (the default), it would just kinda work, even if this client was slowly streaming back chunked updates from the MW server for many minutes
[21:08:55] but do_stream=false would try to slurp the entire response into memory before forwarding the response onwards. and if the response never really stops streaming in reasonable time... it just keeps building up in varnish storage and never forwarding to the client.... causing many of the symptoms we're seeing at the Varnish layer.
[21:10:04] https://github.com/wikimedia/operations-debs-varnish4/blob/c46bc9ee9aefeed942cd486fca75614c296299e2/bin/varnishd/cache/cache_req_fsm.c#L205 https://github.com/wikimedia/operations-debs-varnish4/blob/c46bc9ee9aefeed942cd486fca75614c296299e2/bin/varnishd/cache/cache_busyobj.c#L269
[21:10:11] so seems varnish indeed just waits here forever
[21:10:17] anyway, I'm out for today
[21:11:02] yes, it makes the client wait "forever" until the backend server finishes sending the whole response for the non-streamed store-and-forward mode.
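For what it's worth, a crude client-side way to tell the two modes apart: with streaming, the first byte reaches the client long before the transfer finishes, while with store-and-forward the time-to-first-byte and the total time collapse together. The URL below is a placeholder, not one of the endpoints discussed above:

```bash
# Compare time-to-first-byte against total time for a slow, uncacheable response.
# do_stream=true: ttfb << total; store-and-forward (do_stream=false): ttfb ~= total.
curl -sN -o /dev/null -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
    'https://example.org/some-slow-uncacheable-endpoint'
```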
[21:11:09] :/
[21:11:21] but that's ok, so long as HTTP transactions are as we expect them to be
[21:11:44] AFAIK, we don't have websockets-like hacks flowing through text-lb->MW
[21:11:56] we shouldn't
[21:11:58] if we do, this probably isn't the only assumption it breaks
[21:12:07] would need to look for really long open connections, I suppose
[21:12:23] Anyway… see you tomorrow o/
[21:12:29] but long connections are valid, so long as they're long keepalives with reasonable get->response cycles within
[21:12:42] true
[21:12:44] the broken pattern would be one request and then just streaming dribbling response from server forever
[21:13:34] (even then, if the headers make it look cacheable it would probably break things, but if it looks uncacheable it might just happen to work, until our recent do_stream=false patch)
[21:15:05] also, FWIW, do_stream=false was the default/normal/basically-only behavior back when we were on varnish3 (most of recent years)
[21:15:24] it was the upgrade to varnish4 that brought the new default do_stream=true. so whatever it is, it didn't exist back then.
[21:16:04] anyways, I'm going to revert the do_stream=false patch across the board for now, on the assumption it's at least exacerbating the current problem, if not the actual cause
[21:16:43] (03PS1) 10BBlack: Revert "varnish: ensure consistent CL" [puppet] - 10https://gerrit.wikimedia.org/r/387059
[21:16:51] (03PS2) 10BBlack: Revert "varnish: ensure consistent CL" [puppet] - 10https://gerrit.wikimedia.org/r/387059
[21:16:59] (03CR) 10BBlack: [V: 032 C: 032] Revert "varnish: ensure consistent CL" [puppet] - 10https://gerrit.wikimedia.org/r/387059 (owner: 10BBlack)
[21:23:20] now I'm trying to think how I can best hunt down that one type of connection, if that theory holds
[21:24:08] it's on the backend that it causes problems, but the inbound connections to the backends are always long-lived from an upstream varnish. would need to find a long-lived frontend connection, which seems to have mostly unidirectional outbound traffic for a long time?
[21:27:24] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0]
[21:44:20] 10Operations, 10Traffic, 10HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3717886 (10BBlack) 05Open>03declined For all of the same good reasons pointed out a while back in e.g. https://blog.qualys.com/ssllabs/2016/09/06/is-http-public-key-pinni...
[22:05:40] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3717895 (10BBlack) A while after the above, @hoo started focusing on a different aspect of this we've been somewhat ignoring as...
[22:09:40] ^ and with that wall of text, I'm out for now!
[22:29:53] PROBLEM - puppet last run on oresrdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
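One way to start hunting for the hypothesised "one request, endless response" frontend connections would be the per-connection counters ss exposes; a sketch, assuming the TLS terminator in front of the varnish frontend listens on port 443:

```bash
# Dump per-connection info (byte counters, rtt, etc.) for client-facing connections.
# Candidates are long-lived entries whose bytes sent vastly exceed bytes received.
ss -tin '( sport = :443 )' | grep -A1 'ESTAB'
```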
[22:59:44] RECOVERY - puppet last run on oresrdb2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:30:54] (03PS2) 10ArielGlenn: Permit overrides section in dump config files and more per proj settings [dumps] - 10https://gerrit.wikimedia.org/r/387022 (https://phabricator.wikimedia.org/T178893)
[23:55:47] 10Operations, 10Traffic, 10HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (10Tgr) So are there any plans to implement Certificate Transparency instead, given that it is the replacement suggested by Google?