[00:46:34] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:59:12] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 81635 bytes in 9.852 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:08:36] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:12:16] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 82035 bytes in 8.368 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:17:48] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:32:34] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 81955 bytes in 5.473 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:38:16] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:40:04] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 83275 bytes in 7.526 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:45:42] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[04:11:48] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 81635 bytes in 9.260 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[04:17:26] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[04:23:00] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 81675 bytes in 9.036 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[04:28:36] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[06:46:59] Looks like tendril db is struggling again
[06:51:44] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 81365 bytes in 5.212 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200510T0700)
[07:11:12] !log Truncate tendril.processlist_query_log T231185
[07:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:16] T231185: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185
[07:11:23] !log Restart mysql on db1115 T231185
[07:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:24] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[07:31:53] Operations, DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) This has happened again: ` root@cumin1001:~# mysql.py -hdb1115 -e "show processlist" | grep Copy 35178632 root 10.64.32.25 tendril Connect 15049 Copying to tmp table insert into processlist...
[07:34:40] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui)
[07:35:01] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) p:Triage→High
[07:35:35] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Right after restarting mysql: ` 07:35:12 up 5 days, 2:39, 2 users, load average: 492.22, 411.17, 215.39 `
[07:49:10] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 84943 bytes in 4.106 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[08:01:43] !log Disable unused events like %_schema T252324 T231185
[08:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:48] T231185: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185
[08:01:48] T252324: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324
[08:03:42] Operations, DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) All the events like `es1020_eqiad_wmnet_3306_schema` have been disabled across all the hosts as they were running every minute and we don't really use them.
[08:09:28] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Also noticed lots of: ` delete from global_status where server_id = @server_id and variable_name not like '%.%' ` They were taking ages to complete cause that table was around 15GB (and I don't think we e...
[08:10:56] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Most of the connections are getting stalled now on: ` 69415 root 10.64.32.13 tendril Connect 1084 updating delete from global_status where server_id = @server_id and variable_name not like '%.%' 0.000 6943...
[08:24:30] Operations, ops-eqiad: Check patch cable between analytics1052 and asw2-a-eqiad - https://phabricator.wikimedia.org/T252325 (elukey)
[08:25:26] PROBLEM - Host analytics1052 is DOWN: PING CRITICAL - Packet loss = 100%
[08:25:35] this is me --^
[08:30:36] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) I have finally truncated also `global_status` as its data isn't really super used and it gets repopulated by the cronjobs and by the events - the table was around 10GB: An example of its data: ` mysql:root...
[08:31:09] of course I have some problems with my pwstore, if anybody can powercycle analytics1052 I'd be grateful
[08:31:34] elukey: I can do it
[08:31:43] I am already working anyways :(
[08:31:52] thanks a lot marostegui <3
[08:32:32] i!log Power cycle analytics1052 per elukey's request
[08:36:44] RECOVERY - Host analytics1052 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[08:36:52] elukey: ^
[08:38:21] thanks a lot!
[08:43:28] marostegui: ok I restored my settings, all good, of course it happens when you need it -.-
[08:43:33] thanks again!
[08:43:42] elukey: happy to help :*
[08:44:25] !log Power cycle analytics1052 after eno1 issue
[08:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:39] (the extra i before !log :D)
[08:47:01] elukey: ah, thanks
[08:47:47] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) And it is now again almost unresponsive
[08:50:50] !log Restart mysql on db1115 to change buffer pool size from 20GB to 40GB T252324 (
[08:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:54] T252324: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324
[08:51:59] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) I have given double its buffer pool size: from 20 to 40GB to innodb and from 30GB to 40GB to tokudb.
[08:55:40] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[08:57:22] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 85757 bytes in 1.742 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[09:06:46] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) At this point I think we've either hit some sort of limits on concurrency (we have 2273 events enabled) or there's some underlying HW issue that is making the host perform slower than usual. Random connect...
[09:25:30] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) There are many of these entries: ` May 9 19:25:39 db1115 smartd[807]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 77 May 9 19:25:39 db1115 smartd[807]: D...
[09:25:52] !log Stop MySQL and restart db1115 - T252324
[09:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:55] T252324: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324
[09:30:34] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[09:34:14] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:32] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:40] PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:00] PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:22] PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus2003 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:39:36] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 85510 bytes in 1.151 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[09:40:08] PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus1003 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:40:14] PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus1004 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:42:26] PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus2004 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:44:07] ^ expected because of my restart
[09:44:33] Operations, DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) Fully rebooted the host, and started it with the event_scheduler disabled. I went to tendril's web and caught this query: ` select `server_id`, `variable_value` from tendril.global_variabl...
[09:56:27] !log Change scaling_governor from powersave to performance on db1115 - T252324
[09:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:30] T252324: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324
[09:58:16] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) The CPUs were running powersave as scaling_governor; I have changed them to performance. I don't expect miracles, but... The problem is still that not very big transactions take lots to finish and everyth...
[10:00:18] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:28] RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:50] RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:50] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus1003 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:01:52] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:56] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus1004 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:04:08] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus2004 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:08:52] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus2003 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:27:20] !log restarting blazegraph on wdqs1004: T242453
[10:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:23] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453
[10:47:14] Operations, DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui) This is quite insane - how can we have that crazy amount of rows for just one month of data: ` mysql:root@localhost [tendril]> show explain for 283543; +------+-------------+---------------...
[10:48:37] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Fully rebooted the host, and started it with the event_scheduler disabled. I went to tendril's web and caught this query: ` select `server_id`, `variable_value` from tendril.global_variables where `varia...
[10:49:35] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) This is quite insane - how can we have that crazy amount of rows for just one month of data: ` mysql:root@localhost [tendril]> show explain for 283543; +------+-------------+-------------------+-------+---...
[10:51:59] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) ` mysql:root@localhost [(none)]> show explain for 291612; +------+-------------+----------------------+-------+---------------+------+---------+------+-------------+-------------+ | id | select_type |...
[11:04:31] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) So looks like we have an event to purge that table but doesn't seem to be working at all based on the dates below: ` mysql:root@localhost [tendril]> show create event tendril_purge_global_status_log_5m; +-...
[11:05:38] !log Stop event scheduler on db1115 to perform a massive delete - T252324
[11:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:42] T252324: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324
[11:49:53] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) So looks like the above event was never able to execute when the scheduler is enabled due to, basically, locking issues due to all the concurrency: ` daemon.log:May 10 07:35:01 db1115 mysqld[9874]: 2020-05-...
[11:53:24] !help
[11:53:24] want docs? ask for "!wm-bot". all keywords? try "@regsearch .*"
[12:18:49] !log Start event scheduler on db1115 after a massive delete - T252324
[12:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:52] T252324: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324
[12:23:27] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Open→Resolved a:Marostegui I decided to finally fully truncate `global_status_log_5m` as we don't really use those values anyways. truncate + optimize fixed the thing. Tendril is now back at it...
[12:23:29] Operations, DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui)
[12:34:42] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Btw, the effect of truncating those tables on the used disk space: {F31811071}
[13:44:43] Operations, MediaWiki-Authentication-and-authorization, Security-Team, Traffic, Security: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604 (Tgr) >>! In T158604#5833372, @Bawolff wrote: > I'm not sure what you mean by "cookie-based C...
[14:20:18] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[14:23:28] (PS10) Bmansurov: Add recommendation-api chart [deployment-charts] - https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230)
[14:24:15] (CR) Bmansurov: "Rebased and removed the dependency per Michael's comment." [deployment-charts] - https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[14:26:49] Operations, Privacy Engineering, Research, Traffic, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (bmansurov) @leila heads up that one of the patches is removing the Facebook button.
[14:30:46] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (jcrespo) I would suggest to move zarcillo database away- it is very annoying to have the db down for an unrelated (to backups) reason, which will result in backup alerting on Monday. Unless you give me a good reason...
[14:33:16] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 57.97 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[14:39:42] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui) Fine with me
[16:45:22] Operations, Cloud-Services, Traffic, Wikimedia-Incident, cloud-services-team (Kanban): Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (MusikAnimal) I haven't seen it in a week or so. This has been an on-and-off issue since...
[17:00:29] Operations, DBA: Tendril mysql is stalling - https://phabricator.wikimedia.org/T252324 (Marostegui)
[17:26:32] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 131.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[18:20:00] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 32.54 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[20:43:37] Operations, SCB, Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 (Aklapper) >>! In T191199#4450050, @akosiaris wrote: > T199813 was closed today (nice work on it). I am thinking (and hoping) it was the root cause. I'll stall this task for a w...
[21:36:26] Operations, MediaWiki-General, Traffic, HTTPS: Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients - https://phabricator.wikimedia.org/T54253 (Krinkle) >>! In T227391#6123061, @Krinkle wrote: >>>! In T227391#6105593, @Krinkle wrote: >> […] >> >> | {F31811490} |...
[22:47:04] (PS3) Krinkle: mediawiki: increase php7 opcache capacity on mw1407 [puppet] - https://gerrit.wikimedia.org/r/587299
[22:47:09] (Abandoned) Krinkle: mediawiki: increase php7 opcache capacity on mw1407 [puppet] - https://gerrit.wikimedia.org/r/587299 (owner: Krinkle)