[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210321T0700)
[07:29:05] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:29:15] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:30:23] gerrit down again? :/
[07:33:14] again out of apache workers for some reason, https://grafana.wikimedia.org/d/L0-l1o0Mz/apache?orgId=1&refresh=1m&var-host=gerrit1001&var-port=9117
[07:33:39] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 904 bytes in 0.079 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:33:57] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20885 bytes in 4.944 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:34:06] still broken for me ^
[07:43:45] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:43:59] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:45:59] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 904 bytes in 0.456 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:46:19] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20960 bytes in 5.850 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:51:06] dunno why it recovered, not loading for me at least
[07:53:11] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:55:45] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[08:00:07] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 904 bytes in 6.014 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[08:00:19] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20991 bytes in 5.413 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[08:14:24] Mhmm. Gerrit has been at least shaky since 07:22.
[08:16:17] qchris: yes, been running out of apache workers recently, https://phabricator.wikimedia.org/T277127
[08:16:38] Yup. I was just thinking about which of the two (or both) to restart
[08:16:44] I'll just restart both of them.
[08:16:57] Any reservations against it?
[08:18:02] Looking at https://grafana.wikimedia.org/d/L0-l1o0Mz/apache?orgId=1&refresh=1m&var-host=gerrit1001&var-port=9117 it seems an apache restart might be sufficient.
[08:18:13] I'll go for that.
[08:18:35] !log Restarting apache on gerrit1001 (all apache workers busy)
[08:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:10] Gerrit is working again for me.
[08:19:17] works for me too
[08:19:24] Cool. Thanks for confirming.
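The "all apache workers busy" state that prompted the restart above shows up in Apache's mod_status output as BusyWorkers climbing while IdleWorkers drops to zero (the Grafana dashboard linked above presumably graphs the same numbers via the Prometheus Apache exporter on port 9117). A minimal sketch of a quick local check, assuming mod_status is enabled and reachable at /server-status on the host; the URL and threshold below are assumptions, not the actual monitoring configuration:

```python
# Minimal sketch: count busy/idle Apache workers from mod_status machine-readable
# output. The endpoint and threshold are assumptions used for illustration only.
import urllib.request

STATUS_URL = "http://localhost/server-status?auto"  # assumed mod_status endpoint
BUSY_THRESHOLD = 50                                 # illustrative alert threshold

def worker_counts(url=STATUS_URL):
    """Return (busy, idle) worker counts parsed from mod_status '?auto' output."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return int(stats.get("BusyWorkers", 0)), int(stats.get("IdleWorkers", 0))

if __name__ == "__main__":
    busy, idle = worker_counts()
    print(f"busy={busy} idle={idle}")
    if idle == 0 and busy >= BUSY_THRESHOLD:
        print("WARNING: Apache appears to be out of idle workers")
```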
[08:19:39] I'm curious what's causing that, hopefully that restart didn't remove any useful debugging information
[08:19:53] 👍
[08:29:05] Thanks
[08:29:16] Can we have a regular restart on it as a cron?
[08:29:27] I'd rather find the root cause :P
[08:29:34] but that can wait until Monday as long as it's working
[08:31:03] I mean this has been happening for months now and we are moving to gitlab so
[08:49:01] it's down again :(
[08:49:11] Mhmm. Gerrit's not reachable for me again.
[08:49:22] Majavah: Does Gerrit still work for you?
[08:49:32] qchris: no
[08:50:22] !log Restarting apache on gerrit1001 again (all apache workers busy again) see T277127
[08:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:30] T277127: Gerrit Apache out of workers - https://phabricator.wikimedia.org/T277127
[08:51:07] Gerrit is working again for me.
[08:51:18] works now again
[08:51:31] Thanks for confirming.
[08:57:13] qchris: looking at that grafana graph the number of used workers seems to just keep going up, I'm worried that it will just keep running out of them
[08:57:52] Yup. Same worries here.
[08:58:26] I'd probably leave it running until it runs out again. Then I'd restart both Gerrit and Apache.
[09:00:00] Would you suggest something else?
[09:00:37] not really, unfortunately
[09:01:48] unless there is some way to see what's using those workers
[09:12:43] According to Grafana, they are in read state. lsof on the connections does not look off either. But things are working now. I'll have a look again once Gerrit is choking. But according to the ticket, others have worked on it already. So I'd much rather leave it to them.
[09:18:26] thanks a lot for the restart!
[09:19:06] I am wondering one thing - could we take a gdb thread apply all bt of one of the apache workers next time to see what they are hanging on?
[09:19:39] it might give us some clue
[09:22:12] !log install apache2-bin-dbgsym on gerrit1001 - T277127
[09:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:21] T277127: Gerrit Apache out of workers - https://phabricator.wikimedia.org/T277127
[09:25:57] added the suggestion to --^
[10:06:49] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1058-production-search-omega-eqiad on elastic1058 is CRITICAL: 128.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-eqiad&var-instance=elastic1058&panelId=37
[10:11:17] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:11:31] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[10:12:02] Only 2 backend connections to gerrit are open :-/
[10:13:31] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 904 bytes in 2.933 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:13:41] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20512 bytes in 0.504 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[10:23:17] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[10:25:25] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:25:46] <_joe_> !log restarting gerrit on gerrit1001, using 45G of reserved memory
[10:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:14] _joe_: Is it ok if I also restart apache on gerrit1001? (Gerrit is still not accessible to me from the internet, but serves traffic just fine on lo on gerrit1001. And restarting apache typically brought it back)
[10:32:25] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 905 bytes in 7.443 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:32:31] <_joe_> no please
[10:32:35] Ok. I'll wait.
[10:32:37] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20528 bytes in 3.726 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[10:33:57] <_joe_> there is something very wrong with httpd too
[10:34:23] <_joe_> all apache threads are in R
[10:35:52] Yup. In the past apache got restarted (T277127) to make the problem temporarily go away. They asked for a backtrace, which I just uploaded to the ticket.
[10:35:52] T277127: Gerrit Apache out of workers - https://phabricator.wikimedia.org/T277127
[10:36:12] <_joe_> so, the problem is that gerrit the application is hosed, and that causes this kind of problem for apache
[10:36:20] <_joe_> so restarting just apache fixes nothing per se
[10:36:40] <_joe_> gerrit should be back now
[10:37:04] Gerrit works again for me. Confirmed.
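To make the gdb idea from earlier concrete: a rough sketch of how the "thread apply all bt" dumps could be captured the next time the workers hang, assuming gdb and the just-installed apache2-bin-dbgsym symbols are present and the script runs as root on gerrit1001. The pgrep pattern, output path, and worker limit are assumptions, not an agreed procedure.

```python
# Sketch: attach gdb briefly to a few apache2 worker processes and dump all
# thread backtraces, as suggested on T277127. Paths and limits are assumptions.
import datetime
import subprocess

def backtrace_apache_workers(limit=3, outfile="/tmp/apache-backtraces.txt"):
    """Dump 'thread apply all bt' for up to `limit` apache2 processes."""
    pids = subprocess.run(
        ["pgrep", "-f", "apache2"], capture_output=True, text=True, check=True
    ).stdout.split()
    with open(outfile, "a") as out:
        out.write(f"=== {datetime.datetime.now().isoformat()} ===\n")
        for pid in pids[:limit]:
            # -batch makes gdb run the -ex command and detach immediately, so
            # each worker is only paused for the duration of the dump.
            dump = subprocess.run(
                ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt"],
                capture_output=True, text=True,
            )
            out.write(f"--- PID {pid} ---\n{dump.stdout}\n{dump.stderr}\n")
    return outfile

if __name__ == "__main__":
    print("Backtraces written to", backtrace_apache_workers())
```

Attaching gdb pauses a process while the dump runs, so limiting this to a handful of workers keeps the impact small while still showing where the threads stuck in the R state are blocked.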
[10:47:39] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1058-production-search-omega-eqiad on elastic1058 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-eqiad&var-instance=elastic1058&panelId=37
[10:54:47] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1058-production-search-omega-eqiad on elastic1058 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-eqiad&var-instance=elastic1058&panelId=37
[11:11:46] SRE, CheckUser, Traffic, Patch-For-Review: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (Ladsgroup) Open→Resolved This is half resolved/half declined. The data is now available in the data lake and can be disclosed by...
[12:48:43] (CR) Gergő Tisza: [C: +1] linkrecommendation: Add Swagger UI environment variables [deployment-charts] - https://gerrit.wikimedia.org/r/673471 (https://phabricator.wikimedia.org/T277644) (owner: Kosta Harlan)
[16:20:13] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[16:22:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[18:32:58] SRE, Horizon, Traffic, Upstream, cloud-services-team (Kanban): Horizon Designate dashboard not allowing creation of NS records - https://phabricator.wikimedia.org/T204013 (Majavah) Open→Resolved a: Krenair Looks like this was fixed some time ago, we're now on Horizon Train and I se...
[18:35:14] (PS1) Urbanecm: hrwiki: Configure mentorship for Growth team features [mediawiki-config] - https://gerrit.wikimedia.org/r/673807 (https://phabricator.wikimedia.org/T275684)
[18:35:45] (CR) Urbanecm: [C: -2] "wait for a reply to T275684#6932819" [mediawiki-config] - https://gerrit.wikimedia.org/r/673807 (https://phabricator.wikimedia.org/T275684) (owner: Urbanecm)
[19:04:03] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 202254640 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:06:17] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 914760 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:25:13] (PS1) Andrew Bogott: nova-fullstack: change to a different experimental base image [puppet] - https://gerrit.wikimedia.org/r/673809
[19:26:21] (CR) Andrew Bogott: [C: +2] nova-fullstack: change to a different experimental base image [puppet] - https://gerrit.wikimedia.org/r/673809 (owner: Andrew Bogott)
[21:50:19] (CR) BPirkle: [C: +1] "Change looks good, one minor question but I wouldn't object to this being merged as-is." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/672579 (https://phabricator.wikimedia.org/T269326) (owner: Tim Starling)