[00:06:25] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1118.eqiad.wmnet'] `  and were **ALL** successful.
[00:07:22] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) Ok, so this was a bit timeconsuming to get setup.  all of the checkboxes are updated in the task description, however, the boot order must still be c...
[00:17:08] <wikibugs>	 (CR) Ladsgroup: mailman3: Add parts for Postorius (web interface) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[00:17:14] <wikibugs>	 SRE, Wikimedia-Logstash, Patch-For-Review, Sustainability (Incident Followup): Logstash pipeline crashes on non-UTF8 log messages. - https://phabricator.wikimedia.org/T233662 (colewhite) Open→Resolved a:colewhite We haven't seen this happen in a long while and several potential mitiga...
[00:18:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:16] <wikibugs>	 (PS4) Ladsgroup: mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542)
[00:41:40] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (wiki_willy) Dell provided some docs that show DYV8773 should be onsite, and John confirmed all 25 were received.  @Cmjohnson - it probably got mixed in, with one of the o...
[00:44:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:11:18] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6873538320 and 496 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:15:44] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 50056 and 356 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:35:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:37:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:43:24] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:40] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:59] <wikibugs>	 (PS4) Andrew Bogott: Add designate packages and manifests for openstack/train [puppet] - https://gerrit.wikimedia.org/r/656502 (https://phabricator.wikimedia.org/T261135)
[02:17:40] <wikibugs>	 (PS1) Andrew Bogott: keystone policy: replace the 'owner' rule [puppet] - https://gerrit.wikimedia.org/r/656528 (https://phabricator.wikimedia.org/T272117)
[02:18:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] keystone policy: replace the 'owner' rule [puppet] - https://gerrit.wikimedia.org/r/656528 (https://phabricator.wikimedia.org/T272117) (owner: Andrew Bogott)
[02:25:03] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Add designate packages and manifests for openstack/train [puppet] - https://gerrit.wikimedia.org/r/656502 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[02:38:30] <wikibugs>	 (PS1) Ladsgroup: query_service: Migrate hiera() to lookup() in gui [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953)
[02:47:54] <wikibugs>	 (PS1) Ladsgroup: eventlogging: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/656531 (https://phabricator.wikimedia.org/T209953)
[02:50:23] <wikibugs>	 (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27500/" [puppet] - https://gerrit.wikimedia.org/r/656531 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[02:51:59] <wikibugs>	 (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27501/" [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[03:58:13] <wikibugs>	 (PS1) Andrew Bogott: designate nova_fixed_multi: update to catch up with upstream changes [puppet] - https://gerrit.wikimedia.org/r/656533 (https://phabricator.wikimedia.org/T261135)
[03:59:54] <wikibugs>	 (PS2) Andrew Bogott: designate nova_fixed_multi: update to catch up with upstream changes [puppet] - https://gerrit.wikimedia.org/r/656533 (https://phabricator.wikimedia.org/T261135)
[04:03:22] <wikibugs>	 (CR) Andrew Bogott: [C: +2] designate nova_fixed_multi: update to catch up with upstream changes [puppet] - https://gerrit.wikimedia.org/r/656533 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210116T0800)
[09:02:18] <icinga-wm>	 PROBLEM - HP RAID on ms-be1032 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:02:20] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1032 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T272209 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:02:24] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (ops-monitoring-bot)
[09:02:32] <icinga-wm>	 PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-02-15 09:02:12 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[09:02:58] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[09:04:14] <icinga-wm>	 PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-02-15 09:02:12 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[13:29:13] <wikibugs>	 (PS1) QChris: Add .gitreview [debs/phalerts] - https://gerrit.wikimedia.org/r/656565
[13:29:15] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/phalerts] - https://gerrit.wikimedia.org/r/656565 (owner: QChris)
[13:42:52] <icinga-wm>	 PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 62370 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[15:21:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:24:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:30:40] <icinga-wm>	 RECOVERY - Disk space on maps1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[16:39:40] <wikibugs>	 (PS1) Esanders: DiscussionTools: Enable new topic tool by default on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077)
[16:41:12] <wikibugs>	 (CR) Esanders: "See task for deployment date" [mediawiki-config] - https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077) (owner: Esanders)
[19:08:04] <wikibugs>	 (CR) ArielGlenn: "I'll test the script itself in deployment-prep, unless anyone else would like to do it (snapshot02 instance, as the dumpsgen user). Probab" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)