[00:12:49] PROBLEM - SSH on mw2246.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:42:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:44:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:01:56] (PS2) Ryan Kemper: wdqs: new service alias query-preview [dns] - https://gerrit.wikimedia.org/r/668255 (https://phabricator.wikimedia.org/T266470)
[01:02:50] (CR) Ryan Kemper: [C: +2] wdqs: new service alias query-preview [dns] - https://gerrit.wikimedia.org/r/668255 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[01:04:19] !log T266470 merged https://gerrit.wikimedia.org/r/c/operations/dns/+/668255 && `ryankemper@authdns1001:~$ sudo authdns-update`
[01:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:26] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[01:11:54] (Abandoned) Ryan Kemper: kibana: only render if explicitly set [puppet] - https://gerrit.wikimedia.org/r/665058 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[01:14:01] RECOVERY - SSH on mw2246.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:15:08] (CR) Ladsgroup: "This would help for sure but the evictions is not because it's too small. This is sorta by design. The expiry of score cache is around six" [puppet] - https://gerrit.wikimedia.org/r/671186 (owner: Elukey)
[01:17:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:18:44] !log T266470 Re-enabled icinga service notifications for `Check no envoy runtime configuration is left persistent` on `wdqs100[9,10]`
[01:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:53] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[01:20:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:04:11] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:19] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:28:53] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:29:23] (PS1) Huji: Add deleterevision right to botadmin group on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/671402 (https://phabricator.wikimedia.org/T277358)
[04:19:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:21:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:35:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:38:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:11] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:40:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:43:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210313T0800)
[08:09:15] (CR) Elukey: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/671186 (owner: Elukey)
[08:12:29] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:20:22] (CR) Elukey: "Added, for reference, cache hits/misses (hopefully I used the right metric) rps for eqiad and codfw at the bottom of https://grafana-rw.wi" [puppet] - https://gerrit.wikimedia.org/r/671186 (owner: Elukey)
[08:35:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 34.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:37:29] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:54:51] SRE, Mail: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (Beeloser) The issue is still present and as such the dmarc record for wikiepedia.org is pretty much redundant. The dmarc policy is set to none so the benefit to be gained would be...
[13:55:11] PROBLEM - Swift https backend on ms-fe1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:57:29] RECOVERY - Swift https backend on ms-fe1005 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:48:30] SRE, Dumps-Generation, SDC General, Wikidata, wdwb-tech-focus: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (Nintendofan885)
[14:50:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:52:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:35:01] SRE, Mail: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (Nemo_bis) Is anyone sending email with a `wikipedia.org` domain? For a while, that was assumed not to happen.
[15:51:33] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[15:53:03] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[16:20:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gerrit,gerrit-metrics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:23:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:23:19] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:23:22] (CR) Ladsgroup: [C: -1] "If the comment is fixed, I'll deploy this on Monday (remind me to do so!)" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/671402 (https://phabricator.wikimedia.org/T277358) (owner: Huji)
[16:31:21] SRE, Mail: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (Beeloser) The data from Senderscore shows that there are systems sending email from wikipedia.org The Senderscore data however will only show the most egregious senders of these...
[16:47:51] SRE, Mail: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (Beeloser) If you look a little further in the Senderscore data you can see the subdomain ru.wikipedia.org is sending email {F34157709} Some of the associated senders have low rep...
[16:57:26] !log gerrit web interface is slow/timing out
[16:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gerrit,gerrit-metrics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:00:50] Reedy: topic?
[17:01:38] What about it? It's a weekend for starters
[17:02:09] Because there'll inevitably be someone asking and I'd argue topic is more visible than SAL to anyone who's on irc
[17:02:22] oh, I thought you were meaning clinic duty
[17:02:27] https://phabricator.wikimedia.org/T277127 again?
[17:02:39] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[17:02:45] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[17:02:45] yep, https://grafana.wikimedia.org/d/L0-l1o0Mz/apache?orgId=1&refresh=1m&var-host=gerrit1001&var-port=9117
[17:03:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:06:35] SRE, Dumps-Generation, SDC General, Wikidata, wdwb-tech-focus: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (Ladsgroup) If we can clean up the image table ({T275268}) it'll give us leeway for the structured data I assume.
[17:08:03] Reedy: not sure if it is related but renaming users is being extremely slow today
[17:08:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit-metrics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:10:11] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21661 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[17:12:35] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 904 bytes in 0.028 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[17:15:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:27:02] Reedy: I guess the topic can be changed back...
[17:37:34] twentyafterfour: o/ I see that you restarted apache2 on gerrit1001, can you !log it so we keep track?
[17:37:43] (I was about to do the same)
[17:38:06] elukey: sorry, I logged it in the releng channel
[17:38:26] ahhhh np!
[17:38:33] !log restarted apache on gerrit1001 to resolve apache worker exhaustion see T277127
[17:38:37] <3 thanks
[17:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:41] T277127: Gerrit Apache out of workers - https://phabricator.wikimedia.org/T277127
[17:38:56] also posted some of my findings on that task
[17:39:11] not that I found much
[17:40:00] the weird thing is that restarting httpd solves it, and what it does is only http proxy IIUC
[17:40:17] could it be possible that gerrit slows down for some reason, and httpd is left with proxy conns held open?
[17:42:04] elukey: yeah that seems possible
[17:42:27] I didn't see any signs of problems with gerrit itself
[17:42:43] but it does occasionally experience gc delays and whatnot
[17:48:41] (CR) Ladsgroup: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/671186 (owner: Elukey)
[17:52:52] (CR) Ladsgroup: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/671186 (owner: Elukey)
[18:01:59] (CR) Huji: Add deleterevision right to botadmin group on fawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/671402 (https://phabricator.wikimedia.org/T277358) (owner: Huji)
[18:02:33] (PS2) Huji: Add deleterevision right to botadmin group on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/671402 (https://phabricator.wikimedia.org/T277358)
[18:04:32] (CR) Ladsgroup: [C: +1] Add deleterevision right to botadmin group on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/671402 (https://phabricator.wikimedia.org/T277358) (owner: Huji)
[18:53:19] !log run schema changes for varbinary on wikitech (T269348)
[18:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:30] T269348: wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348
[18:58:39] PROBLEM - SSH on ms-be1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:01:59] !log change default charset of all core tables in labstestwiki to binary (T269348)
[19:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:06] T269348: wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348
[19:22:45] PROBLEM - Host ms-be1038 is DOWN: PING CRITICAL - Packet loss = 100%
[20:37:43] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: experiments - jiji - jiji, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:24:14] (PS2) Zabe: Enable DiscussionsTools for enwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851)
[21:51:31] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100%
[21:53:15] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[22:31:45] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:47:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:50:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:08:15] Did Gerrit being down page anyone? It probably should IMO
[23:37:05] took icinga a long time to realise