[01:46:47] (CR) Krinkle: "This makes sense as smaller first step (instead of going to the MultiWrite/mcrouter-first approach directly)." [mediawiki-config] - https://gerrit.wikimedia.org/r/440469 (owner: Aaron Schulz)
[01:46:49] (CR) Krinkle: [C: -1] Make mediawiki.org write to both nutcracker and mcrouter [mediawiki-config] - https://gerrit.wikimedia.org/r/440469 (owner: Aaron Schulz)
[02:08:09] Operations, Wikidata, Wikidata-Campsite, Wikimedia-General-or-Unknown, and 6 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4309612 (Krinkle)
[03:26:57] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 919.66 seconds
[03:30:35] Operations, Deployments, HHVM, Patch-For-Review, and 3 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#4309643 (Krinkle) @MoritzMuehlenhoff The alternative is to postpone the switching of MediaWiki localisation cache fr...
[03:37:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 266.80 seconds
[03:48:30] (CR) Krinkle: [C: 1] "Should the entry be removed from canary_appserver.yaml ?" [puppet] - https://gerrit.wikimedia.org/r/440822 (https://phabricator.wikimedia.org/T180183) (owner: Giuseppe Lavagetto)
[06:30:06] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:32:36] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints]
[06:57:56] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:00:26] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:15:16] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:26:06] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200)
[11:27:07] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy
[11:33:46] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[15:27:42] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:30] uh oh, checking
[15:28:34] wikidata?
[15:28:36] PROBLEM - Host graphite2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:28:42] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.018 second response time
[15:28:42] <_joe_> uh?
[15:28:44] oh
[15:28:48] that sounds bad
[15:29:02] is it related?
[15:29:23] (checking stats, rhetorical question)
[15:29:27] don't know yet, seems unlikely
[15:30:18] interface and querying seems up to me right now
[15:30:31] (query.wikidata.org)
[15:30:37] hey
[15:31:01] perhaps bad queries? I was looking at this
[15:31:03] I'm around, ideas so far of what's going on?
[15:31:05] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1&from=1529843455394&to=1529854255394&panelId=22&fullscreen&var-cluster_name=wdqs
[15:31:24] godog: just saw it too
[15:31:27] too many threads
[15:31:40] on 4 and 5
[15:31:51] I'm happy to keep looking at wdqs, can someone look at graphite2001 ?
[15:31:53] I would kick the process
[15:31:56] <_joe_> +1
[15:31:58] it isn't paging though
[15:32:01] I can check
[15:32:06] graphite
[15:32:37] jynus_: thanks!
[15:32:54] maybe an overload could trigger metrics overload
[15:33:03] I'll keep looking at wdqs
[15:33:14] <_joe_> godog: I would restart wdqs1004
[15:33:22] <_joe_> there is the usual restart-wdqs script
[15:33:39] ah!
[15:33:41] thanks _joe_
[15:33:46] <_joe_> sudo -i
[15:33:52] !log restart-wdqs on wdqs1004
[15:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:55] <_joe_> and it will depool, restart wdqs, repool
[15:34:13] # restart-wdqs
[15:34:13] Failed to restart wdqs.service: Unit wdqs.service not found.
[15:34:20] <_joe_> interesting
[15:34:28] <_joe_> so the puppetization is wrong
[15:34:29] on graphite, I see puppet running
[15:34:36] <_joe_> godog: lemme try
[15:34:53] _joe_: just did
[15:35:06] !log systemctl restart wdqs-blazegraph on wdqs1004
[15:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:20] oh, wrong host
[15:35:29] <_joe_> godog: remember to "pool" after it's done
[15:35:30] I cannot log in to 2001
[15:35:38] trying serial
[15:36:02] but so far looks like just a regular hw crash or similar
[15:36:03] ok, restart isn't returning anyway
[15:36:12] <_joe_> it will take time
[15:36:22] <_joe_> I doubt java will be able to kill those stuck threads
[15:36:41] <_joe_> so you will have to wait for the kill -9 from systemd that happens IIRC after 180 seconds
[15:36:51] yep, it crashed
[15:36:55] and stuck on reboot
[15:38:11] _joe_: indeed that's what happened
[15:38:46] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[15:39:14] <_joe_> I repooled wdqs1004
[15:39:28] I am going to force power reset graphite2001
[15:39:40] <_joe_> jynus_: +1
[15:40:00] <_joe_> godog: should I handle 1005?
[15:40:01] _joe_: how do we know wdqs is ok to repool on wdqs1004 ? just waiting?
[15:40:13] I'll do wdqs1005
[15:40:49] !log restart graphite2001
[15:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:52] <_joe_> godog: I looked at the logs :)
[15:40:59] !log restart wdqs-blazegraph on wdqs1005
[15:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:28] at least now it is doing something
[15:42:07] PROBLEM - WDQS HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[15:42:13] it got to grub, so that is something
[15:42:23] yep, looking ok
[15:42:35] will file a ticket, most likely there will be hw logs
[15:42:46] RECOVERY - Host graphite2001 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[15:42:56] and it is back
[15:43:16] RECOVERY - WDQS HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.077 second response time
[15:43:36] _joe_: heh, in logstash ?
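A minimal sketch of the manual sequence described above, for when the restart-wdqs wrapper is broken: depool, restart the Blazegraph unit, repool. It assumes the conftool "depool"/"pool" wrappers and the wdqs-blazegraph unit name mentioned in the log, run as root (sudo -i); it is not the actual restart-wdqs script.

```bash
#!/bin/bash
# Sketch of the depool -> restart -> repool sequence used on wdqs1004 above.
# Assumes the conftool "depool"/"pool" wrappers and the wdqs-blazegraph unit.
set -euo pipefail

depool                               # take the host out of the LVS pool first
systemctl restart wdqs-blazegraph    # systemd escalates to SIGKILL if stuck threads block the stop (~180s per _joe_)
systemctl is-active --quiet wdqs-blazegraph
pool                                 # repool only once the unit reports active again
```

The restart-wdqs script failed here because it pointed at a wdqs.service unit that no longer exists, hence the manual systemctl restart of wdqs-blazegraph followed by an explicit "pool".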
[15:43:46] however, now graphite2002 seems unhappy
[15:43:50] on metrics
[15:43:58] <_joe_> godog: journalctl -u wdqs-blazegraph -f
[15:44:23] oh, it got fixed, probably after the other got "repooled"
[15:45:07] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
[15:45:10] graphite issue seems ok now, just needs hw check
[15:45:22] _joe_: thanks, I'm seeing only stack traces so far but the restart completed, I'll repool
[15:45:32] jynus: indeed, thanks for taking a look
[15:45:59] let me know if there is something else I can do about wikidata? looking at other metrics?
[15:47:44] seems blazegraph is back after the restart so I think we're ok
[15:48:07] I'll file follow-up tasks tomorrow
[15:55:16] looks like we have recovered, I'll be afk
[15:56:07] or not! looks like the threads are still climbing on wdqs1004/1005
[16:01:29] Operations, ops-codfw, monitoring: graphite2001 crashed - https://phabricator.wikimedia.org/T198041#4309934 (jcrespo)
[16:08:56] PROBLEM - Memory correctable errors -EDAC- on cp1053 is CRITICAL: 151 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1053&var-datasource=eqiad%2520prometheus%252Fops
[16:09:35] !log restart wdqs-updater on wdqs1005
[16:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:07] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[16:31:52] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:32:15] known, I'll silence it
[16:32:52] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.056 second response time
[16:35:56] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[16:53:29] godog: thanks!
[16:54:53] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042#4309966 (Gehel)
[17:01:39] gehel: np, silenced for two hours only tho, in case it needs extension
[17:02:18] godog: I silenced it until tomorrow. I'll try to have a deeper look once I get closer to a real laptop
[17:05:22] gehel: sounds good! thanks for taking a look
[17:05:36] godog :)
[17:13:46] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200)
[17:14:46] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy
[17:41:27] Operations, Deployments, HHVM, Patch-For-Review, and 3 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#4310002 (MoritzMuehlenhoff) @Krinkle Ok, that piece of context was missing. Makes sense, then.
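A small sketch of the post-restart checks used above before repooling: follow the Blazegraph journal (per the journalctl hint) and probe the local HTTP port that the "WDQS HTTP Port" check watches. Probing localhost on port 80 is an assumption inferred from the check output, not a documented procedure.

```bash
#!/bin/bash
# Post-restart sanity check sketch for a wdqs host.
# Unit name taken from the log; probing localhost:80 is an assumption based on
# the "WDQS HTTP Port" Icinga check output (HTTP/1.1 200 OK - 434 bytes).
journalctl -u wdqs-blazegraph --since '10 min ago' --no-pager | tail -n 50   # look for stack traces from startup

status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost/)
if [ "$status" = "200" ]; then
    echo "wdqs answering locally (HTTP $status), ok to pool"
else
    echo "wdqs not healthy yet (HTTP $status), keep it depooled" >&2
    exit 1
fi
```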
[18:17:12] PROBLEM - High lag on wdqs1004 is CRITICAL: 8720 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[18:17:38] Looking
[18:18:55] Looks like wdqs1004 is actually starting to catch up, lag is reported again and going down
[18:19:36] We're getting better, not sure why... :(
[19:40:19] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[19:40:34] looking, seems things are not stable yet...
[19:41:05] !log restart blazegraph on wdqs1004
[19:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:28] RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.023 second response time
[19:47:57] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042#4310044 (Gehel) Situation is better, but still not entirely stable (I just restarted blazegraph on wdqs1004)....
[20:40:59] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 62038 MB (12% inode=99%)
[20:43:09] RECOVERY - Disk space on elastic1019 is OK: DISK OK
[21:38:19] PROBLEM - High lag on wdqs1005 is CRITICAL: 6096 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[21:50:57] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042#4310071 (Gehel) wdqs1005 was lagging on updates. A few thread dumps for further analysis before restarting it:...
[21:51:23] !log restarting wdqs1005 after taking a few thread dumps - T198042
[21:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:26] T198042: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042
[22:00:38] PROBLEM - WDQS HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time
[22:01:48] RECOVERY - WDQS HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.042 second response time
[22:02:39] PROBLEM - High lag on wdqs1005 is CRITICAL: 7556 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:26:46] !log restarting wdqs1004 - T198042
[22:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:48] T198042: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042
[22:27:29] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[22:28:38] RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.117 second response time
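For the "taking a few thread dumps" step logged at 21:50-21:51, a rough sketch of how those dumps could be captured before the restart, assuming the JDK's jstack is available on the host; the output path and dump count are illustrative only, and jstack may need to run as the JVM's own user.

```bash
#!/bin/bash
# Capture a few JVM thread dumps before restarting Blazegraph, as done on wdqs1005.
# jstack availability and the /var/tmp output path are assumptions.
set -euo pipefail

pid=$(systemctl show -p MainPID wdqs-blazegraph | cut -d= -f2)

for i in 1 2 3; do
    jstack "$pid" > "/var/tmp/wdqs-blazegraph-threads.$i.txt"   # one dump every 10s to compare stuck threads
    sleep 10
done

systemctl restart wdqs-blazegraph
```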