[01:46:47] (CR) Krinkle: "This makes sense as smaller first step (instead of going to the MultiWrite/mcrouter-first approach directly)." [mediawiki-config] - https://gerrit.wikimedia.org/r/440469 (owner: Aaron Schulz)
[01:46:49] (CR) Krinkle: [C: -1] Make mediawiki.org write to both nutcracker and mcrouter [mediawiki-config] - https://gerrit.wikimedia.org/r/440469 (owner: Aaron Schulz)
[02:08:09] Operations, Wikidata, Wikidata-Campsite, Wikimedia-General-or-Unknown, and 6 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4309612 (Krinkle)
[03:26:57] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 919.66 seconds
[03:30:35] Operations, Deployments, HHVM, Patch-For-Review, and 3 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#4309643 (Krinkle) @MoritzMuehlenhoff The alternative is to postpone the switching of MediaWiki localisation cache fr...
[03:37:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 266.80 seconds
[03:48:30] (CR) Krinkle: [C: 1] "Should the entry be removed from canary_appserver.yaml ?" [puppet] - https://gerrit.wikimedia.org/r/440822 (https://phabricator.wikimedia.org/T180183) (owner: Giuseppe Lavagetto)
[06:30:06] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:32:36] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints]
[06:57:56] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:00:26] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:15:16] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:26:06] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200)
[11:27:07] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy
[11:33:46] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[15:27:42] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:30] uh oh, checking
[15:28:34] wikidata?
[15:28:36] PROBLEM - Host graphite2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:28:42] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.018 second response time
[15:28:42] <_joe_> uh?
[15:28:44] oh
[15:28:48] that sounds bad
[15:29:02] is it related?
[15:29:23] (checking stats, rhetorical question)
[15:29:27] don't know yet, seems unlikely
[15:30:18] interface and querying seems up to me right now
[15:30:31] (query.wikidata.org)
[15:30:37] hey
[15:31:01] perhaps bad queries? I was looking at this
[15:31:03] I'm around, ideas so far of what's going on?
[15:31:05] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1&from=1529843455394&to=1529854255394&panelId=22&fullscreen&var-cluster_name=wdqs
[15:31:24] godog: just saw it too
[15:31:27] too many threads
[15:31:40] on 4 and 5
[15:31:51] I'm happy to keep looking at wdqs, can someone look at graphite2001 ?
[15:31:53] I would kick the process
[15:31:56] <_joe_> +1
[15:31:58] it isn't paging though
[15:32:01] I can check
[15:32:06] graphite
[15:32:37] jynus_: thanks!
[15:32:54] maybe an overload could trigger metrics overload
[15:33:03] I'll keep looking at wdqs
[15:33:14] <_joe_> godog: I would restart wdqs1004
[15:33:22] <_joe_> there is the usual restart-wdqs script
[15:33:39] ah!
[15:33:41] thanks _joe_
[15:33:46] <_joe_> sudo -i
[15:33:52] !log restart-wdqs on wdqs1004
[15:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:55] <_joe_> and it will depool, restart wdqs, repool
[15:34:13] # restart-wdqs
[15:34:13] Failed to restart wdqs.service: Unit wdqs.service not found.
[15:34:20] <_joe_> interesting
[15:34:28] <_joe_> so the puppetization is wrong
[15:34:29] on graphite, I see puppet running
[15:34:36] <_joe_> godog: lemme try
[15:34:53] _joe_: just did
[15:35:06] !log systemctl restart wdqs-blazegraph on wdqs1004
[15:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:20] oh, wrong host
[15:35:29] <_joe_> godog: remember to "pool" after it's done
[15:35:30] I cannot log in to 2001
[15:35:38] trying serial
[15:36:02] but so far looks like just a regular hw crash or similar
[15:36:03] ok, restart isn't returning anyway
[15:36:12] <_joe_> it will take time
[15:36:22] <_joe_> I doubt java will be able to kill those stuck threads
[15:36:41] <_joe_> so you will have to wait for the kill -9 from systemd that happens IIRC after 180 seconds
[15:36:51] yep, it crashed
[15:36:55] and stuck on reboot
[15:38:11] _joe_: indeed that's what happened
[15:38:46] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[15:39:14] <_joe_> I repooled wdqs1004
[15:39:28] I am going to force power reset graphite2001
[15:39:40] <_joe_> jynus_: +1
[15:40:00] <_joe_> godog: should I handle 1005?
[15:40:01] _joe_: how do we know wdqs is ok to repool on wdqs1004 ? just waiting?
[15:40:13] I'll do wdqs1005
[15:40:49] !log restart graphite2001
[15:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:52] <_joe_> godog: I looked at the logs :)
[15:40:59] !log restart wdqs-blazegraph on wdqs1005
[15:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:28] at least now it is doing something
[15:42:07] PROBLEM - WDQS HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[15:42:13] it got to grub, so that is something
[15:42:23] yep, looking ok
[15:42:35] will file a ticket, most likely there will be hw logs
[15:42:46] RECOVERY - Host graphite2001 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[15:42:56] and it is back
[15:43:16] RECOVERY - WDQS HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.077 second response time
[15:43:36] _joe_: heh, in logstash ?
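A minimal sketch of the manual sequence described above, for when the restart-wdqs wrapper is broken: depool, restart the Blazegraph unit, repool. It assumes the conftool "depool"/"pool" wrappers and the wdqs-blazegraph unit name mentioned in the log, run as root (sudo -i); it is not the actual restart-wdqs script.

```bash
#!/bin/bash
# Sketch of the depool -> restart -> repool sequence used on wdqs1004 above.
# Assumes the conftool "depool"/"pool" wrappers and the wdqs-blazegraph unit.
set -euo pipefail

depool                               # take the host out of the LVS pool first
systemctl restart wdqs-blazegraph    # systemd escalates to SIGKILL if stuck threads block the stop (~180s per _joe_)
systemctl is-active --quiet wdqs-blazegraph
pool                                 # repool only once the unit reports active again
```

The restart-wdqs script failed here because it pointed at a wdqs.service unit that no longer exists, hence the manual systemctl restart of wdqs-blazegraph followed by an explicit "pool".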
[15:43:46] however, now graphite2002 seems unhappy
[15:43:50] on metrics
[15:43:58] <_joe_> godog: journalctl -u wdqs-blazegraph -f
[15:44:23] oh, it got fixed, probably after the other got "repooled"
[15:45:07] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[15:45:10] graphite issue seems ok now, just needs hw check
[15:45:22] _joe_: thanks, I'm seeing only stack traces so far but the restart completed, I'll repool
[15:45:32] jynus: indeed, thanks for taking a look
[15:45:59] let me know if there is something else I can do about wikidata? looking at other metrics?
[15:47:44] seems blazegraph is back after the restart so I think we're ok
[15:48:07] I'll file follow-up tasks tomorrow
[15:55:16] looks like we have recovered, I'll be afk
[15:56:07] or not! looks like the threads are still climbing on wdqs1004/1005
[16:01:29] Operations, ops-codfw, monitoring: graphite2001 crashed - https://phabricator.wikimedia.org/T198041#4309934 (jcrespo)
[16:08:56] PROBLEM - Memory correctable errors -EDAC- on cp1053 is CRITICAL: 151 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1053&var-datasource=eqiad%2520prometheus%252Fops
[16:09:35] !log restart wdqs-updater on wdqs1005
[16:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:07] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[16:31:52] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:32:15] known, I'll silence it
[16:32:52] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.056 second response time
[16:35:56] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[16:53:29] godog: thanks!
[16:54:53] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042#4309966 (Gehel)
[17:01:39] gehel: np, silenced for two hours only tho, in case it needs extension
[17:02:18] godog: I silenced it until tomorrow. I'll try to have a deeper look once I get closer to a real laptop
[17:05:22] gehel: sounds good! thanks for taking a look
[17:05:36] godog :)
[17:13:46] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200)
[17:14:46] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy
[17:41:27] Operations, Deployments, HHVM, Patch-For-Review, and 3 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#4310002 (MoritzMuehlenhoff) @Krinkle Ok, that piece of context was missing. Makes sense, then.
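A small sketch of the post-restart checks used above before repooling: follow the Blazegraph journal (per the journalctl hint) and probe the local HTTP port that the "WDQS HTTP Port" check watches. Probing localhost on port 80 is an assumption inferred from the check output, not a documented procedure.

```bash
#!/bin/bash
# Post-restart sanity check sketch for a wdqs host.
# Unit name taken from the log; probing localhost:80 is an assumption based on
# the "WDQS HTTP Port" Icinga check output (HTTP/1.1 200 OK - 434 bytes).
journalctl -u wdqs-blazegraph --since '10 min ago' --no-pager | tail -n 50   # look for stack traces from startup

status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost/)
if [ "$status" = "200" ]; then
    echo "wdqs answering locally (HTTP $status), ok to pool"
else
    echo "wdqs not healthy yet (HTTP $status), keep it depooled" >&2
    exit 1
fi
```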
[18:17:12] PROBLEM - High lag on wdqs1004 is CRITICAL: 8720 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[18:17:38] Looking
[18:18:55] Looks like wdqs1004 is actually starting to catch up, lag is reported again and going down
[18:19:36] We're getting better, not sure why... :(
[19:40:19] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[19:40:34] looking, seems things are not stable yet...
[19:41:05] !log restart blazegraph on wdqs1004
[19:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:28] RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.023 second response time
[19:47:57] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042#4310044 (Gehel) Situation is better, but still not entirely stable (I just restarted blazegraph on wdqs1004)....
[20:40:59] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 62038 MB (12% inode=99%)
[20:43:09] RECOVERY - Disk space on elastic1019 is OK: DISK OK
[21:38:19] PROBLEM - High lag on wdqs1005 is CRITICAL: 6096 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[21:50:57] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042#4310071 (Gehel) wdqs1005 was lagging on updates. A few thread dumps for further analysis before restarting it:...
[21:51:23] !log restarting wdqs1005 after taking a few thread dumps - T198042
[21:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:26] T198042: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042
[22:00:38] PROBLEM - WDQS HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time
[22:01:48] RECOVERY - WDQS HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.042 second response time
[22:02:39] PROBLEM - High lag on wdqs1005 is CRITICAL: 7556 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:26:46] !log restarting wdqs1004 - T198042
[22:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:48] T198042: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042
[22:27:29] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[22:28:38] RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.117 second response time
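For the "taking a few thread dumps" step logged at 21:50-21:51, a rough sketch of how those dumps could be captured before the restart, assuming the JDK's jstack is available on the host; the output path and dump count are illustrative only, and jstack may need to run as the JVM's own user.

```bash
#!/bin/bash
# Capture a few JVM thread dumps before restarting Blazegraph, as done on wdqs1005.
# jstack availability and the /var/tmp output path are assumptions.
set -euo pipefail

pid=$(systemctl show -p MainPID wdqs-blazegraph | cut -d= -f2)

for i in 1 2 3; do
    jstack "$pid" > "/var/tmp/wdqs-blazegraph-threads.$i.txt"   # one dump every 10s to compare stuck threads
    sleep 10
done

systemctl restart wdqs-blazegraph
```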