[00:37:31] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[00:44:31] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[00:47:20] SRE: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (Dzahn)
[00:48:39] SRE: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (Dzahn) Open→Resolved p:Low→Medium This is done. people1003 and people2002 on bullseye have completely replaced people1002 and people2001 on buster the buster...
[00:51:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:22:19] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[01:24:39] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[02:02:25] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[02:04:47] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:22:47] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:25:05] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:45:32] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/691466
[04:01:37] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[04:22:23] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[05:02:37] SRE, Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (Legoktm) @Platonides that's being discussed upstream at https://gitlab.com/warsaw/flufl.bounce/-/issues/7 if you want to chime in there :)
[05:50:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:53:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:20:28] (PS1) Majavah: P::kubernetes::deployment_server: Do not use ipv6 on beta [puppet] - https://gerrit.wikimedia.org/r/691494 (https://phabricator.wikimedia.org/T281986)
[06:30:34] (CR) Jcrespo: [C: +2] bacula: Reenable read-write ES database backups, disable read-only [puppet] - https://gerrit.wikimedia.org/r/690338 (https://phabricator.wikimedia.org/T282249) (owner: Jcrespo)
[06:35:29] RECOVERY - Disk space on backup2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops
[06:54:52] !log migrating most of last mailing lists of T280322
[06:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:56] T280322: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322
[07:18:17] (PS1) Legoktm: mailman3: Optionally enable memcached caching [puppet] - https://gerrit.wikimedia.org/r/691513 (https://phabricator.wikimedia.org/T282931)
[07:20:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:23:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:38:37] PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:39:46] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Nemo_bis) > Group H is basically done, hyperkitty import failed on wikitech-l and unblock-en-l Let me guess: the HTML archives were meddled with (to remove s...
[08:50:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:55:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:02:55] (CR) Addshore: "If this hadn't already happened for the WCQS I would have said let's not do this yet, and change all the URIs at once (wikidata and commons" [mediawiki-config] - https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590) (owner: Seddon)
[09:03:36] (CR) Alexandros Kosiaris: [C: -2] P::kubernetes::deployment_server: Do not use ipv6 on beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/691494 (https://phabricator.wikimedia.org/T281986) (owner: Majavah)
[09:07:18] (PS18) Elukey: Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[09:15:41] akosiaris: hi, what would you suggest for https://gerrit.wikimedia.org/r/691494?
my understanding is that beta-specific roles should not be used at all, so the only other option I see would be to move the if $::realm != 'labs' clause to role::deployment_server
[09:25:12] (CR) Majavah: [C: +1] "works fine on beta. Ideally beta would just use the service proxy like production but that's rather difficult due to service::catalog data" [puppet] - https://gerrit.wikimedia.org/r/688315 (https://phabricator.wikimedia.org/T277990) (owner: Giuseppe Lavagetto)
[09:27:07] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup) that and encoding mess.
[09:39:45] RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:49:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:52:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:57:14] (CR) Multichill: [C: +1] "Hi Adam," [mediawiki-config] - https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590) (owner: Seddon)
[10:09:17] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 4.829e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:48:21] (PS19) Elukey: Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[10:52:36] (CR) Elukey: "I was finally able to run a test on minikube like the following:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[11:44:17] SRE, Wikidata, Wikidata-Query-Service, User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Ladsgroup)
[12:02:25] (PS1) Mvolz: [wip] Updated outdated commands [deployment-charts] - https://gerrit.wikimedia.org/r/691599
[12:12:05] Heads up I accidentally forgot to deploy on one of the two servers on Thursday :/
[12:12:08] gonna do eqiad now
[12:12:11] whoops
[12:18:41] ugh nvm something is messed up.
[12:19:09] when I do diff it just wants to revert the chart??
[12:19:12] :/
[12:31:01] hmm, potentially a rebasing issue... :/ looks like codfw is on the new version but the old chart, and eqiad is on the old version but the new chart...
[12:33:21] !log set fr_quality to 0 for all revisions on several wikis (T279761)
[12:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:28] T279761: When reviewing pending changes, raw message ID "⧼revreview-hist-quality⧽" shown instead of human readable string - https://phabricator.wikimedia.org/T279761
[12:37:14] XioNoX: any chance you could help with helmfile stuff?
[12:37:56] I have one server on an old version of citoid but on the new chart, and another server with the old chart but on the new version of citoid
[12:38:05] So - inconsistent results on the user level.
[12:38:21] If I try to update eqiad, it wants to downgrade the chart.
[12:40:28] master deployment_charts repository looks correct - everything is current on master.
[13:50:36] (PS1) Ladsgroup: acme_chief: Migrate cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/691634 (https://phabricator.wikimedia.org/T273673)
[13:51:14] (CR) jerkins-bot: [V: -1] acme_chief: Migrate cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/691634 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[13:55:05] (PS1) Ladsgroup: Revert "Revert "prometheus: Migrate node_file_count cron to systemd timer"" [puppet] - https://gerrit.wikimedia.org/r/691317
[13:59:35] (PS2) Ladsgroup: Revert "Revert "prometheus: Migrate node_file_count cron to systemd timer"" [puppet] - https://gerrit.wikimedia.org/r/691317
[14:01:36] (PS2) Ladsgroup: acme_chief: Migrate cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/691634 (https://phabricator.wikimedia.org/T273673)
[14:01:38] (PS3) Ladsgroup: Revert "Revert "prometheus: Migrate node_file_count cron to systemd timer"" [puppet] - https://gerrit.wikimedia.org/r/691317
[14:03:41] (PS3) Ladsgroup: acme_chief: Migrate cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/691634 (https://phabricator.wikimedia.org/T273673)
[14:03:58] (PS4) Ladsgroup:
Revert "Revert "prometheus: Migrate node_file_count cron to systemd timer"" [puppet] - https://gerrit.wikimedia.org/r/691317
[14:09:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:12:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:22:27] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup)
[14:25:00] SRE, Wikimedia-Mailing-lists: Wikimedia-RU mailing list page has wrong encoding - https://phabricator.wikimedia.org/T135226 (Ladsgroup) Open→Resolved It's on mailman3 now: https://lists.wikimedia.org/postorius/lists/wikimedia-ru.lists.wikimedia.org/ Almost all mailing lists are now on mm3
[14:25:48] SRE, Wikimedia-Mailing-lists: Mailman sends bounce notification messages to list-admins with a subject line in Chinese language - https://phabricator.wikimedia.org/T278574 (Ladsgroup) Open→Resolved Except a very few mailing lists, all are now migrated to mailman3, I call this done.
[14:26:09] SRE, Wikimedia-Mailing-lists, Mobile: List archives on lists.wikimedia.org is not mobile friendly - https://phabricator.wikimedia.org/T190054 (Ladsgroup) Open→Resolved Except a very few mailing lists, all are now migrated to mailman3, I call this done.
[14:27:16] SRE, Wikimedia-Mailing-lists, I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (Ladsgroup) Open→Resolved This mailing list is on mm3 now https://lists.wikimedia.org/postorius/lists/wikics-l.lists.wikimedia.org/ E...
[14:27:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:28:09] SRE, Wikimedia-Mailing-lists, Upstream: https://lists.wikimedia.org/mailman/options/ doesn't set charset header - https://phabricator.wikimedia.org/T172929 (Ladsgroup) Open→Resolved Except a very few mailing lists, all are now migrated to mailman3, I call this done.
[14:29:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:30:08] SRE, Wikimedia-Mailing-lists, Upstream: Mailing lists need search function readded - https://phabricator.wikimedia.org/T19390 (Ladsgroup) Open→Resolved Except a very few mailing lists, all are now migrated to mailman3, I call this done.
[14:31:07] SRE, Wikimedia-Mailing-lists, Upstream: "From" at start of line becomes ">From" in pipermail - https://phabricator.wikimedia.org/T115329 (Ladsgroup) Open→Resolved Except a very few mailing lists, all are now migrated to mailman3, I call this done.
[14:32:07] SRE, Wikimedia-Mailing-lists, Mobile: Mailman on lists.wikimedia.org is not mobile friendly - https://phabricator.wikimedia.org/T190055 (Ladsgroup) Open→Resolved Except a very few mailing lists, all are now migrated to mailman3, I call this done.
[14:57:41] (CR) Giuseppe Lavagetto: [C: +1] New envoy upstream version 1.15.5 [docker-images/production-images] - https://gerrit.wikimedia.org/r/689950 (owner: Hnowlan)
[15:17:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:19:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:51:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:54:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:06:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:09:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:16:10] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Superset/Turnilo access - https://phabricator.wikimedia.org/T282947 (STei-WMF)
[16:17:16] SRE, discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (Aklapper)
[16:17:59] SRE, discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (Aklapper)
[16:21:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:23:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:51:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:53:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:58:26] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Superset/Turnilo access - https://phabricator.wikimedia.org/T282947 (Aklapper) Hi @Stei-WMF, please see the bullet points at https://phabricator.wikimedia.org/tag/ldap-access-requests/ - thanks!
[19:21:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:23:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:47:47] (PS1) Andrew Bogott: Add an OpenStack package class for Bullseye VMs [puppet] - https://gerrit.wikimedia.org/r/691741 (https://phabricator.wikimedia.org/T280801)
[19:49:41] (CR) Andrew Bogott: [C: +2] Add an OpenStack package class for Bullseye VMs [puppet] - https://gerrit.wikimedia.org/r/691741 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[19:57:07] (PS1) Andrew Bogott: cloud-vps VMs: use ssd for Bullseye [puppet] - https://gerrit.wikimedia.org/r/691744 (https://phabricator.wikimedia.org/T280801)
[20:00:29] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp5016 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 86371 seconds left https://wikitech.wikimedia.org/wiki/HTTPS
[20:01:20] (CR) Andrew Bogott: [C: +2] cloud-vps VMs: use ssd for Bullseye [puppet] - https://gerrit.wikimedia.org/r/691744 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[20:02:15] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp5016 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 86267 seconds left https://wikitech.wikimedia.org/wiki/HTTPS
[20:32:18] (PS1) Andrew Bogott: cloud-vps openstack packages: don't install python2 packages on bullseye [puppet] - https://gerrit.wikimedia.org/r/691746 (https://phabricator.wikimedia.org/T280801)
[20:33:39] (CR) Andrew Bogott: [C: +2] cloud-vps openstack packages: don't install python2 packages on bullseye
[puppet] - https://gerrit.wikimedia.org/r/691746 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[20:51:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:53:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:30:41] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Superset/Turnilo access for User:STei - https://phabricator.wikimedia.org/T282947 (Urbanecm)
[22:33:14] (PS1) Urbanecm: Enable SandboxLink at azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/691754 (https://phabricator.wikimedia.org/T282954)
[23:21:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:23:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:45:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:46:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring