[00:00:39] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:02:35] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:02:39] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2021-06-30 09:00:48 +0000 (expires in 45 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:04:07] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:07:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:23:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:29:03] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:32] !log restarted mailman3-web on lists1001, uwsgi looked like it got stuck, consuming all CPU/memory
[00:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:36:50] * legoktm grumbles
[02:38:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:39:48] !log restarting mailman3-web on lists1001 again
[02:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:29] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:41:25] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:39:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:42:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:47:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:49:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:49:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:52:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:32:04] (CR) Marostegui: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/690402 (https://phabricator.wikimedia.org/T282662) (owner: Jcrespo)
[06:54:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:56:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210516T0700)
[07:24:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:26:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:44:29] (CR) 20after4: "So --refresh-config doesn't actually update the server host name / git url." [puppet] - https://gerrit.wikimedia.org/r/670784 (owner: Hnowlan)
[08:32:05] (PS4) Jcrespo: prometheus-mysqld-exporter: Update generator to remove multisource exception [puppet] - https://gerrit.wikimedia.org/r/690402 (https://phabricator.wikimedia.org/T282662)
[08:32:22] (CR) jerkins-bot: [V: -1] prometheus-mysqld-exporter: Update generator to remove multisource exception [puppet] - https://gerrit.wikimedia.org/r/690402 (https://phabricator.wikimedia.org/T282662) (owner: Jcrespo)
[08:32:40] (PS5) Jcrespo: prometheus-mysqld-exporter: Update generator to remove multisource exception [puppet] - https://gerrit.wikimedia.org/r/690402 (https://phabricator.wikimedia.org/T282662)
[08:43:27] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-06-15 08:42:19 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[08:43:51] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-06-15 08:42:19 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[08:58:13] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is
alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[09:01:26] SRE, Toolforge, observability: Upgrade prometheus-redis-exporter - https://phabricator.wikimedia.org/T282963 (Majavah)
[09:01:43] SRE, Toolforge, observability: Upgrade prometheus-redis-exporter - https://phabricator.wikimedia.org/T282963 (Majavah)
[09:49:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:54:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:16:17] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:18:45] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:20:21] SRE, Toolforge, observability: Upgrade prometheus-redis-exporter - https://phabricator.wikimedia.org/T282963 (Majavah) 1.16 appears to be available in Debian bullseye: https://tracker.debian.org/pkg/prometheus-redis-exporter
[10:51:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:54:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:21:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:23:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:21:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:22:43] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 38.32 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:23:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:26:09] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[12:30:29] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 70.35 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:31:23] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[12:51:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:53:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:21:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:23:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:32:19] (CR) Majavah: [C: -1] "Does not seem to work anymore:" [software/tools-webservice] - https://gerrit.wikimedia.org/r/621776 (https://phabricator.wikimedia.org/T169695) (owner: BryanDavis)
[13:35:00] (CR) Majavah: [C: -1] "> Patch Set 8: Code-Review-1" [software/tools-webservice] - https://gerrit.wikimedia.org/r/621776 (https://phabricator.wikimedia.org/T169695) (owner: BryanDavis)
[13:45:23] (CR) Majavah: Use common k8s labels (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: Legoktm)
[14:03:11] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.157e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[14:04:22] (PS1) Andrew Bogott: Openstack VM client packages: don't install python-netaddr on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691955
[14:04:25] (PS1) Andrew Bogott: ldap::client::utils: Exclude python2 libraries from Bullseye [puppet] - https://gerrit.wikimedia.org/r/691956
[14:04:27] (PS1) Andrew Bogott: Diamond: Don't try to install python2 python-statsd on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691957
[14:10:54] (PS1) Andrew Bogott: cloud-vps: Don't install Diamond on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691958
[14:11:15] (Abandoned) Andrew Bogott: Diamond: Don't try to install python2 python-statsd on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691957 (owner: Andrew Bogott)
[14:14:29] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 142420424 and 13 seconds
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:16:49] (PS2) Andrew Bogott: Openstack VM client packages: don't install python-netaddr on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691955 (https://phabricator.wikimedia.org/T280801)
[14:16:51] (PS2) Andrew Bogott: cloud-vps: Don't install Diamond on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691958 (https://phabricator.wikimedia.org/T280801)
[14:16:52] (PS2) Andrew Bogott: ldap::client::utils: Exclude python2 libraries from Bullseye [puppet] - https://gerrit.wikimedia.org/r/691956 (https://phabricator.wikimedia.org/T280801)
[14:16:55] (PS1) Andrew Bogott: cloud-vps: don't install ldapsupportlib on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691959 (https://phabricator.wikimedia.org/T114063)
[14:16:59] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 452120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:22:16] (PS1) Andrew Bogott: Install systemd-timesyncd on Bullseye and later [puppet] - https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801)
[14:23:15] (CR) jerkins-bot: [V: -1] Install systemd-timesyncd on Bullseye and later [puppet] - https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[14:52:17] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
[14:57:21] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
[15:55:02] (PS2) Andrew Bogott: Install systemd-timesyncd on Bullseye and later [puppet] - https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801)
[16:06:07] (CR) Andrew Bogott: [C: -1] "This doesn't work because the package fails to configure; something else is creating the service user before we get here" [puppet] - https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[16:20:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:23:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:25:01] (CR) Andrew Bogott: [C: +2] Openstack VM client packages: don't install python-netaddr on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691955 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[17:03:18] (PS1) Majavah: toolserver_legacy: send 410 Gone for tile requests [puppet] - https://gerrit.wikimedia.org/r/692000 (https://phabricator.wikimedia.org/T282889)
[17:21:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:22:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:28:07] ^ yep, lists.wm.o is down
[17:29:27] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on
Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:29:39] !log restart mailman3-web
[17:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:53] sorry got here by ping of Martin
[17:30:01] thanks Amir1 :)
[17:30:17] It's this https://phabricator.wikimedia.org/T282957
[17:30:25] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:33:31] I'm half around for a while. Ping me, I think it needs happening once in a while
[17:37:05] SRE, Analytics, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Superset/Turnilo access for User:STei - https://phabricator.wikimedia.org/T282947 (Elitre) @elukey The username of your existing account on wikitech.wikimedia.org: User:STei Do you currently have shell access (Yes...
[17:37:57] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (Elitre) >>! In T282600#7088910, @Aklapper wrote: > @Elitre: I'm not after policies; I recommended that people separate roles like they already do on...
[17:44:10] (preferably telegram)
[19:37:29] Amir1: thanks :)
[19:38:38] !log restarted mailman3-web
[19:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:29] I go try to sleep a bit.
lego being around makes me not to worry
[19:40:38] good night :D
[19:41:26] (PS22) Effie Mouzeli: Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: Giuseppe Lavagetto)
[19:51:27] (CR) Andrew Bogott: [C: +2] cloud-vps: don't install ldapsupportlib on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691959 (https://phabricator.wikimedia.org/T114063) (owner: Andrew Bogott)
[19:52:00] (CR) Andrew Bogott: [C: +2] ldap::client::utils: Exclude python2 libraries from Bullseye [puppet] - https://gerrit.wikimedia.org/r/691956 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[19:52:09] (CR) Andrew Bogott: [C: +2] cloud-vps: Don't install Diamond on Bullseye [puppet] - https://gerrit.wikimedia.org/r/691958 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[20:22:23] (PS1) Andrew Bogott: cloud-vps: profile::openstack::eqiad1::version to 'victoria' [puppet] - https://gerrit.wikimedia.org/r/692026
[20:23:12] (CR) Andrew Bogott: [C: +2] cloud-vps: profile::openstack::eqiad1::version to 'victoria' [puppet] - https://gerrit.wikimedia.org/r/692026 (owner: Andrew Bogott)
[20:44:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:45:06] legoktm: ^
[20:45:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:45:27] SIGH
[20:46:12] !log restarted mailman3-web
[20:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:30] I'm stepping out for a few minutes then I'll get gdb setup
[20:46:43] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:47:01] Also set a ping for lists1001
[20:47:37] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:50:02] Ty legoktm
[21:07:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:09:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:15:37] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:45:12] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=tawiki wikilove # T280326
[22:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:17] T280326: WikiLove Extension in Tamil Wikipedia - https://phabricator.wikimedia.org/T280326
[23:18:27] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:33:45] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook