[00:06:25] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1118.eqiad.wmnet'] ` and were **ALL** successful.
[00:07:22] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) Ok, so this was a bit timeconsuming to get setup. all of the checkboxes are updated in the task description, however, the boot order must still be c...
[00:17:08] (CR) Ladsgroup: mailman3: Add parts for Postorius (web interface) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[00:17:14] SRE, Wikimedia-Logstash, Patch-For-Review, Sustainability (Incident Followup): Logstash pipeline crashes on non-UTF8 log messages. - https://phabricator.wikimedia.org/T233662 (colewhite) Open→Resolved a:colewhite We haven't seen this happen in a long while and several potential mitiga...
[00:18:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:16] (PS4) Ladsgroup: mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542)
[00:41:40] SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (wiki_willy) Dell provided some docs that show DYV8773 should be onsite, and John confirmed all 25 were received. @Cmjohnson - it probably got mixed in, with one of the o...
[00:44:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:11:18] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6873538320 and 496 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:15:44] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 50056 and 356 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:35:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:37:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:43:24] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:40] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:59] (PS4) Andrew Bogott: Add designate packages and manifests for openstack/train [puppet] - https://gerrit.wikimedia.org/r/656502 (https://phabricator.wikimedia.org/T261135)
[02:17:40] (PS1) Andrew Bogott: keystone policy: replace the 'owner' rule [puppet] - https://gerrit.wikimedia.org/r/656528 (https://phabricator.wikimedia.org/T272117)
[02:18:41] (CR) Andrew Bogott: [C: +2] keystone policy: replace the 'owner' rule [puppet] - https://gerrit.wikimedia.org/r/656528 (https://phabricator.wikimedia.org/T272117) (owner: Andrew Bogott)
[02:25:03] (CR) Andrew Bogott: [C: +2] Add designate packages and manifests for openstack/train [puppet] - https://gerrit.wikimedia.org/r/656502 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[02:38:30] (PS1) Ladsgroup: query_service: Migrate hiera() to lookup() in gui [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953)
[02:47:54] (PS1) Ladsgroup: eventlogging: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/656531 (https://phabricator.wikimedia.org/T209953)
[02:50:23] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27500/" [puppet] - https://gerrit.wikimedia.org/r/656531 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[02:51:59] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27501/" [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[03:58:13] (PS1) Andrew Bogott: designate nova_fixed_multi: update to catch up with upstream changes [puppet] - https://gerrit.wikimedia.org/r/656533 (https://phabricator.wikimedia.org/T261135)
[03:59:54] (PS2) Andrew Bogott: designate nova_fixed_multi: update to catch up with upstream changes [puppet] - https://gerrit.wikimedia.org/r/656533 (https://phabricator.wikimedia.org/T261135)
[04:03:22] (CR) Andrew Bogott: [C: +2] designate nova_fixed_multi: update to catch up with upstream changes [puppet] - https://gerrit.wikimedia.org/r/656533 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210116T0800)
[09:02:18] PROBLEM - HP RAID on ms-be1032 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:02:20] ACKNOWLEDGEMENT - HP RAID on ms-be1032 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T272209 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:02:24] SRE, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (ops-monitoring-bot)
[09:02:32] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-02-15 09:02:12 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[09:02:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[09:04:14] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-02-15 09:02:12 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[13:29:13] (PS1) QChris: Add .gitreview [debs/phalerts] - https://gerrit.wikimedia.org/r/656565
[13:29:15] (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/phalerts] - https://gerrit.wikimedia.org/r/656565 (owner: QChris)
[13:42:52] PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 62370 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[15:21:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:24:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:30:40] RECOVERY - Disk space on maps1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[16:39:40] (PS1) Esanders: DiscussionTools: Enable new topic tool by default on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077)
[16:41:12] (CR) Esanders: "See task for deployment date" [mediawiki-config] - https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077) (owner: Esanders)
[19:08:04] (CR) ArielGlenn: "I'll test the script itself in deployment-prep, unless anyone else would like to do it (snapshot02 instance, as the dumpsgen user). Probab" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)