[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210305T0000).
[00:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:00:19] <Urbanecm>	 the easiest deployment ever :)
[00:00:49] <wikibugs>	 (CR) Legoktm: [C: +2] install_server: Add registry1004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668570 (https://phabricator.wikimedia.org/T276380) (owner: Legoktm)
[00:06:50] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host registry2004.eqiad.wmnet
[00:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:15] <wikibugs>	 (PS4) Ryan Kemper: wdqs: expose wdqs1009 externally [puppet] - https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470)
[00:14:38] <wikibugs>	 SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve  parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (aaron) Playing around with  ` mwscript shell.php aawiki `  ...I noticed that SHOW SLAVE STATUS is empty in...
[00:14:40] <wikibugs>	 (CR) Ryan Kemper: wdqs: expose wdqs1009 externally (1 comment) [puppet] - https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:23:33] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2004.eqiad.wmnet
[00:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:57] <legoktm>	 ah crap
[00:24:09] <legoktm>	 that was supposed to be .codfw.wmnet
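A hostname/site mismatch like this one can be caught before a VM is created. The sketch below is only an illustration, not part of any cookbook: `site_mismatch` is a hypothetical helper, and the digit-to-site mapping (1 for eqiad, 2 for codfw) is an assumption inferred from the hostnames that appear in this log.

```python
import re

# Assumption: the first digit of the numeric suffix encodes the site,
# as the hostnames in this log suggest (registry1004 -> eqiad,
# backup2003 -> codfw). Other sites are deliberately not covered.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw"}

# name + 4-digit number, then ".<site>.wmnet"
FQDN_RE = re.compile(r"^[a-z-]+?(\d)\d{3}\.([a-z0-9]+)\.wmnet$")


def site_mismatch(fqdn):
    """Return (expected_site, actual_site) when they disagree, else None.

    Raises ValueError for names that do not look like <name>NNNN.<site>.wmnet.
    """
    match = FQDN_RE.match(fqdn)
    if match is None:
        raise ValueError(f"unrecognised hostname: {fqdn}")
    digit, actual = match.groups()
    expected = SITE_BY_DIGIT.get(digit)
    if expected is None or expected == actual:
        return None
    return (expected, actual)


if __name__ == "__main__":
    for name in ("registry1004.eqiad.wmnet", "registry2004.eqiad.wmnet"):
        print(name, site_mismatch(name))
```

A malformed selector such as `registry1004.eqiad.codfw` does not end in `.wmnet`, so this helper raises `ValueError` for it instead of returning a result.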
[00:24:45] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28390/console" [puppet] - https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:26:17] <wikibugs>	 SRE, vm-requests: codfw: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276381 (Legoktm) Uh, screwed up, accidentally created registry2004 in codfw but called it registry2004.eqiad.wmnet. Going to delete it now and try again...
[00:26:23] <wikibugs>	 SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve  parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (aaron) Ideally the SqlBagOStuff hashing would use HashRing, though any naive transition would involve a lo...
[00:27:14] <legoktm>	 I guess I have to finish setting it up before I can decom it?
[00:31:57] <Reedy>	 heh
[00:32:01] <wikibugs>	 (PS1) Legoktm: install_server: Add registry2004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668571 (https://phabricator.wikimedia.org/T276381)
[00:32:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] install_server: Add registry2004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668571 (https://phabricator.wikimedia.org/T276381) (owner: Legoktm)
[00:34:11] <legoktm>	 or I guess not
[00:35:06] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28391/console" [puppet] - https://gerrit.wikimedia.org/r/668543 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:35:55] <wikibugs>	 (CR) Ryan Kemper: [V: +1 C: +2] wdqs: new query-preview microsite [puppet] - https://gerrit.wikimedia.org/r/668543 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:36:15] <ryankemper>	 !log T266470 Deploying new `query-preview` microsite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668543
[00:36:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:24] <stashbot>	 T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[00:39:28] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@4cc913e]: correct refinery-drop-older-than checksum
[00:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:34] <ryankemper>	 !log T266470 Ran `sudo run-puppet-agent` on `miscweb1002` without issue; `/var/log/apache2/query*.log` looks as expected
[00:39:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:02] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@4cc913e]: correct refinery-drop-older-than checksum (duration: 01m 34s)
[00:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:58] <icinga-wm>	 PROBLEM - Check systemd state on registry1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:36] <wikibugs>	 (CR) Ryan Kemper: [V: +1 C: +2] wdqs: expose wdqs1009 externally [puppet] - https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:47:02] <ryankemper>	 !log T266470 [ats] Deploying new mappings for `query-preview.wikidata.org` microsite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668173/
[00:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:08] <stashbot>	 T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[00:47:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:17] <wikibugs>	 (Abandoned) Legoktm: install_server: Add registry2004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668571 (https://phabricator.wikimedia.org/T276381) (owner: Legoktm)
[00:49:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:44] <wikibugs>	 SRE, vm-requests, Patch-For-Review: codfw: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276381 (Legoktm) ` legoktm@cumin1001:~$ sudo cookbook sre.hosts.decommission registry2004.eqiad.wmnet -t T276381 >>> ATTENTION: the query does not match any host in PuppetDB or failed Host...
[00:50:39] <ryankemper>	 !log T266470 [ats] `sudo cumin 'A:cp-ats' 'sudo run-puppet-agent'`
[00:50:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:38] <wikibugs>	 (PS1) Legoktm: sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572
[00:54:22] <wikibugs>	 (PS1) Legoktm: conftool: Add registry1004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668573 (https://phabricator.wikimedia.org/T276380)
[00:55:02] <wikibugs>	 (CR) Legoktm: [C: +2] conftool: Add registry1004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668573 (https://phabricator.wikimedia.org/T276380) (owner: Legoktm)
[00:55:48] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry1004.eqiad.codfw
[00:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:04] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/weight=10; selector: name=registry1004.eqiad.wmnet
[00:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:04] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=inactive; selector: name=registry1004.eqiad.codfw
[00:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:14] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry1004.eqiad.wmnet
[00:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:08] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: name=registry1004.eqiad.wmnet
[00:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:47] <wikibugs>	 SRE, vm-requests, Patch-For-Review: eqiad: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276380 (Legoktm) Open→Resolved
[00:58:51] <wikibugs>	 SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (Legoktm)
[00:58:55] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry1001.eqiad.wmnet
[00:58:58] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry1002.eqiad.wmnet
[00:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:46] <legoktm>	 !log depooled registry1001/registry1002 (old stretch VMs) - T272550
[00:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:53] <stashbot>	 T272550: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550
[01:02:35] <legoktm>	 I'm going to take a break now, too many typos
[01:04:31] <wikibugs>	 SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (Legoktm) Update: 2 new buster VMs in eqiad are running, and I depooled the 2 stretch ones, will delete them on Monday if no other problems arise.  In codfw 1 buster VM is running alongside the 2...
[01:05:57] <wikibugs>	 (CR) CRusnov: "Well this turned out to be slightly more complicated than I expected, but I took a bit of extra time implementing a generic solution that " [software/netbox] - https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[02:00:32] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 76 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:01:26] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 68 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:06:44] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 49 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:07:48] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 51 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:10:26] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:12:32] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 23610 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:17:22] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:21:44] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:49:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:52:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:47:56] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on thumbor2001 is CRITICAL: 4 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops
[05:18:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:21:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:42:38] <wikibugs>	 SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve  parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Marostegui) >>! In T133523#6884890, @aaron wrote: > Playing around with  > ` > mwscript shell.php aawiki >...
[05:49:49] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db2145,db2146 [puppet] - https://gerrit.wikimedia.org/r/668598 (https://phabricator.wikimedia.org/T275633)
[05:53:19] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui)
[05:54:42] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db2145,db2146 [puppet] - https://gerrit.wikimedia.org/r/668598 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[06:17:16] <legoktm>	 !log uploaded udplog 1.9 (buster-wikimedia) to apt.wikimedia.org (T276421)
[06:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:24] <stashbot>	 T276421: Package udplog for Buster - https://phabricator.wikimedia.org/T276421
[06:18:09] <wikibugs>	 SRE, Packaging: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (Legoktm)
[06:19:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:21:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:51:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2092', diff saved to https://phabricator.wikimedia.org/P14640 and previous config saved to /var/cache/conftool/dbconfig/20210305-065137-marostegui.json
[06:51:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:15] <elukey>	 good morning
[06:58:40] <wikibugs>	 (CR) Hashar: "+1 :]" [puppet] - https://gerrit.wikimedia.org/r/668172 (owner: Hashar)
[06:58:48] <wikibugs>	 (CR) Hashar: [C: +1] logspam-watch: redraw when terminal size changes [puppet] - https://gerrit.wikimedia.org/r/668172 (owner: Hashar)
[07:29:13] <wikibugs>	 (PS4) JMeybohm: Add mwilliams to analytics-privatedata-users with no ssh access [puppet] - https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671)
[07:33:08] <wikibugs>	 SRE, Packaging: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (Legoktm) @Majavah refreshed the packaging (https://gerrit.wikimedia.org/r/c/analytics/udplog/+/668451), switching to dh, a native package and got it to pass lintian with no errors. I uploaded that as 1.9, it works fine...
[07:33:24] <wikibugs>	 (CR) JMeybohm: [C: +2] "PS4 is a rebase, Ottomata approved on phab. Thanks" [puppet] - https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671) (owner: JMeybohm)
[07:33:46] <elukey>	 !log drain + reimage analytics107[0-1] to debian buster
[07:33:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:07] <wikibugs>	 SRE, Product-Analytics, SRE-Access-Requests, Patch-For-Review, Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (JMeybohm) Open→Resolved a:JMeybohm Thanks
[07:57:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1070.eqiad.wmnet with reason: REIMAGE
[07:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1071.eqiad.wmnet with reason: REIMAGE
[07:59:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1070.eqiad.wmnet with reason: REIMAGE
[07:59:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210305T0800)
[08:00:20] <icinga-wm>	 PROBLEM - Check systemd state on analytics1066 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:28] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1071.eqiad.wmnet with reason: REIMAGE
[08:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:45] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (jcrespo) It was not an os issue (the os already saw the disks as sda and sdb), but a bios/boot issue. fixed with T274185#6883969
[08:07:38] <wikibugs>	 SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (JMeybohm) >>! In T272550#6884932, @Legoktm wrote: > In codfw 1 buster VM is running alongside the 2 stretch ones, except I accidentally created the second new VM with the wrong name (`registry20...
[08:17:22] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: deployment-logstash03: UDP listener died EADDRINUSE, logstash port conflict with rsyslogd - https://phabricator.wikimedia.org/T241481 (Majavah)
[08:22:28] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: remove ms-be1034 [puppet] - https://gerrit.wikimedia.org/r/668645 (https://phabricator.wikimedia.org/T276193)
[08:25:33] <wikibugs>	 (PS1) JMeybohm: Switch active kubernetes staging cluster to eqiad [puppet] - https://gerrit.wikimedia.org/r/668646 (https://phabricator.wikimedia.org/T276305)
[08:25:35] <wikibugs>	 (PS1) JMeybohm: Switch staging.svc.eqiad.wmnet back to eqiad [dns] - https://gerrit.wikimedia.org/r/668647 (https://phabricator.wikimedia.org/T276305)
[08:26:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] hieradata: remove ms-be1034 [puppet] - https://gerrit.wikimedia.org/r/668645 (https://phabricator.wikimedia.org/T276193) (owner: Filippo Giunchedi)
[08:27:06] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` backup2003.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2021030...
[08:32:12] <elukey>	 !log drain + reimage an-worker107[8,9] to Debian Buster
[08:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:56] <icinga-wm>	 PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1263863 MB (15% inode=80%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[08:39:48] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (fgiunchedi) >>! In T272209#6884508, @wiki_willy wrote: > Hi @fgiunchedi - let us know when you have the decom task for ms-be1034 submitted per our conversation on IRC....then we can pull one of the drives for this.  T...
[08:42:20] <logmsgbot>	 !log jynus@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE
[08:42:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:37] <wikibugs>	 (PS1) QChris: Add .gitreview [alerts] - https://gerrit.wikimedia.org/r/668649
[08:43:39] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Add .gitreview [alerts] - https://gerrit.wikimedia.org/r/668649 (owner: QChris)
[08:44:29] <logmsgbot>	 !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE
[08:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:05] <wikibugs>	 (CR) David Caro: [C: +1] sre.hosts.decomission: Don't error if all hosts aren't found (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[08:51:27] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2003.codfw.wmnet'] `  and were **ALL** successful.
[08:55:18] <wikibugs>	 (PS1) Elukey: sre.hosts.decommission: fix use of self [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524)
[08:56:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM (to my untrained eye anyways)" [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:57:46] <wikibugs>	 (CR) David Caro: [C: -1] sre.hosts.decommission: fix use of self (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:57:51] <wikibugs>	 (CR) Elukey: [C: +2] "@Volans: I think it is my fault when I moved the cookbook to the class api, going to merge so Filippo can retry, lemme know if you have do" [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:58:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre.hosts.decommission: fix use of self [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:58:39] <wikibugs>	 (CR) Elukey: sre.hosts.decommission: fix use of self (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:59:11] <wikibugs>	 (PS2) Elukey: sre.hosts.decommission: fix use of self [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524)
[08:59:23] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Switch staging.svc.eqiad.wmnet back to eqiad [dns] - https://gerrit.wikimedia.org/r/668647 (https://phabricator.wikimedia.org/T276305) (owner: JMeybohm)
[08:59:42] <wikibugs>	 (CR) Elukey: sre.hosts.decommission: fix use of self (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:59:49] <wikibugs>	 (CR) David Caro: "This is the same as https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/668572 I think" [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[09:00:09] <wikibugs>	 (CR) David Caro: [C: +1] "This fixes the same as https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/668650 I think" [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:01:23] <wikibugs>	 (CR) Elukey: "It is yes, we can merge Lego's one, if it is better. No problem :)" [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[09:02:58] <wikibugs>	 (CR) Elukey: "@Legoktm: I moved the cookbook to the class API a while ago and probably missed these variables, hopefully it is the only bug, my bad!" [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:03:17] <wikibugs>	 (Abandoned) Elukey: sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:03:23] <wikibugs>	 (Restored) Elukey: sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:03:44] <wikibugs>	 (Abandoned) Elukey: sre.hosts.decommission: fix use of self [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[09:04:19] <wikibugs>	 (CR) Elukey: [C: +1] "Friday, I mistakenly closed this instead of mine, +1 from my side, I think that we can merge to let Filippo test?" [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:04:51] <elukey>	 dcaro: o/ ok if we merge --^ so Filippo can test?
[09:04:55] <elukey>	 seems ready to go
[09:05:40] <dcaro>	 👍
[09:05:52] <elukey>	 super, thanks a lot for the -1 and the pings :)
[09:06:05] <wikibugs>	 (CR) Elukey: [C: +2] sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:06:16] <dcaro>	 glad to help :)
[09:08:14] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:12:10] <wikibugs>	 (CR) Jakob: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/668651 (https://phabricator.wikimedia.org/T268640) (owner: Jakob)
[09:12:11] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be1034.eqiad.wmnet
[09:12:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:03] <wikibugs>	 (CR) JMeybohm: [C: +2] Switch staging.svc.eqiad.wmnet back to eqiad [dns] - https://gerrit.wikimedia.org/r/668647 (https://phabricator.wikimedia.org/T276305) (owner: JMeybohm)
[09:16:06] <wikibugs>	 (PS1) Filippo Giunchedi: install_server: remove ms-be1034 [puppet] - https://gerrit.wikimedia.org/r/668652 (https://phabricator.wikimedia.org/T276522)
[09:16:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] install_server: remove ms-be1034 [puppet] - https://gerrit.wikimedia.org/r/668652 (https://phabricator.wikimedia.org/T276522) (owner: Filippo Giunchedi)
[09:17:36] <wikibugs>	 (CR) Tarrow: [C: +2] Update termbox to 2021-03-01-112916-production [deployment-charts] - https://gerrit.wikimedia.org/r/668651 (https://phabricator.wikimedia.org/T268640) (owner: Jakob)
[09:18:44] <wikibugs>	 (CR) JMeybohm: [C: +2] Switch active kubernetes staging cluster to eqiad [puppet] - https://gerrit.wikimedia.org/r/668646 (https://phabricator.wikimedia.org/T276305) (owner: JMeybohm)
[09:18:49] <wikibugs>	 (Merged) jenkins-bot: Update termbox to 2021-03-01-112916-production [deployment-charts] - https://gerrit.wikimedia.org/r/668651 (https://phabricator.wikimedia.org/T268640) (owner: Jakob)
[09:19:47] <logmsgbot>	 !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[09:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:05] <icinga-wm>	 ACKNOWLEDGEMENT - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1252491 MB (15% inode=80%): David Caro Tracking here: https://phabricator.wikimedia.org/T276525 - The acknowledgement expires at: 2021-03-11 10:30:00. https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-sto
[09:20:05] <icinga-wm>	 orgId=1
[09:21:10] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ms-be1034.eqiad.wmnet
[09:21:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1079.eqiad.wmnet with reason: REIMAGE
[09:24:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:08] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1079.eqiad.wmnet with reason: REIMAGE
[09:26:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:39] <wikibugs>	 ops-eqiad, decommission-hardware: decommission ms-be1034 - https://phabricator.wikimedia.org/T276522 (fgiunchedi) a:Cmjohnson
[09:26:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1078.eqiad.wmnet with reason: REIMAGE
[09:26:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
[09:28:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:51] <jayme>	 !log switched back active kubernetes staging cluster to eqiad
[09:28:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1078.eqiad.wmnet with reason: REIMAGE
[09:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[09:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:45] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[09:31:45] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[09:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:42] <RhinosF1>	 godog: I noticed the b'' on https://phabricator.wikimedia.org/T276522#6885279 and looking took me to https://docs.python.org/3/library/locale.html#locale.getpreferredencoding but I'm not sure if that's safe to simply do .decode('utf-8') on the string when adding to phab
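Note on the question above: a minimal Python sketch of the b'' issue, assuming the text posted to the task was a bytes object formatted directly into a string. The helper name to_text and the sample bytes are hypothetical; they are not taken from the cookbook code. Decoding with an explicit encoding and errors="replace" avoids both the b'...' repr and a possible UnicodeDecodeError. Whether UTF-8 is the right choice depends on what produced the bytes.

```python
import locale


def to_text(raw, encoding=None):
    """Return a str for bytes or str input, never a b'...' repr.

    Hypothetical helper for illustration only.
    """
    if isinstance(raw, str):
        return raw
    # Assumption: with no explicit encoding, use the locale's preferred one.
    enc = encoding or locale.getpreferredencoding(False)
    # errors="replace" substitutes U+FFFD instead of raising on bad bytes.
    return raw.decode(enc, errors="replace")


# Hypothetical command output captured as bytes.
output = b"decommission done \xe2\x9c\x93"

# Formatting bytes directly produces the b'...' repr seen in the comment.
print("Result: {}".format(output))
# Decoding first gives plain text.
print("Result: {}".format(to_text(output, "utf-8")))
```

Passing the encoding explicitly keeps the result independent of the host's locale settings; locale.getpreferredencoding is only a fallback here.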
[09:37:47] <wikibugs>	 (03PS1) 10JMeybohm: Fix a bunch of dst_net entries to actually be CIDR [deployment-charts] - 10https://gerrit.wikimedia.org/r/668654 (https://phabricator.wikimedia.org/T276268)
[09:41:07] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 121 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:42:48] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Fix a bunch of dst_net entries to actually be CIDR [deployment-charts] - 10https://gerrit.wikimedia.org/r/668654 (https://phabricator.wikimedia.org/T276268) (owner: 10JMeybohm)
[09:43:30] <wikibugs>	 (03Merged) 10jenkins-bot: Fix a bunch of dst_net entries to actually be CIDR [deployment-charts] - 10https://gerrit.wikimedia.org/r/668654 (https://phabricator.wikimedia.org/T276268) (owner: 10JMeybohm)
[09:45:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[09:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:40] <wikibugs>	 (03PS1) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657
[09:48:51] <wikibugs>	 (03PS2) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657
[09:50:17] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[09:50:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[09:52:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657 (owner: 10Jcrespo)
[09:52:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657 (owner: 10Jcrespo)
[09:52:20] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:52:56] <wikibugs>	 (03CR) 10Jcrespo: "Adding everyone that reviewed our patches, sometimes more than 4 people per patch! 😳" [puppet] - 10https://gerrit.wikimedia.org/r/668657 (owner: 10Jcrespo)
[09:54:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[09:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:59] <wikibugs>	 (03PS3) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657
[10:00:00] <wikibugs>	 (03PS4) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657
[10:00:21] <wikibugs>	 (03PS5) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657
[10:04:37] <wikibugs>	 (03CR) 10Jbond: "This looks good to me but im not familiar enough with Django to give a +1" [software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov)
[10:05:38] <wikibugs>	 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10jbond) Just a note that i think we will be able to re-use this code in debmonitor
[10:09:27] <wikibugs>	 (03PS6) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657
[10:09:52] <wikibugs>	 (03PS7) 10Jcrespo: base: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657
[10:12:06] <wikibugs>	 (03PS3) 10Filippo Giunchedi: pontoon: stop symlinking puppet client ssl directory [puppet] - 10https://gerrit.wikimedia.org/r/666668 (https://phabricator.wikimedia.org/T276501)
[10:12:08] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add httpd image for MediaWiki (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 (owner: 10Giuseppe Lavagetto)
[10:12:15] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add httpd image for MediaWiki [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886
[10:12:17] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add memcached image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100
[10:14:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Andrew, what do you think ? My understanding is that /var/lib/puppet/client isn't created anymore nor needed on newly-provisioned cloud vp" [puppet] - 10https://gerrit.wikimedia.org/r/666668 (https://phabricator.wikimedia.org/T276501) (owner: 10Filippo Giunchedi)
[10:17:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add httpd image for MediaWiki [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 (owner: 10Giuseppe Lavagetto)
[10:18:16] <wikibugs>	 (03PS1) 10Elukey: hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659
[10:19:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add memcached image (035 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 (owner: 10Giuseppe Lavagetto)
[10:22:31] <wikibugs>	 10SRE, 10serviceops, 10User-jijiki: Modernise memcached systemd unit / sync to current buster setup - https://phabricator.wikimedia.org/T273950 (10Joe) `systemd-memcached-wrapper` is a perl script, an evolution of the old wrapper script debian always used and that caused me more headaches than it solved. I'd...
[10:23:05] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "yeah I agree" [puppet] - 10https://gerrit.wikimedia.org/r/668465 (owner: 10Jbond)
[10:23:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:memcached: drop admin_groups as has no meening here [puppet] - 10https://gerrit.wikimedia.org/r/668465 (owner: 10Jbond)
[10:24:29] <wikibugs>	 (03CR) 10David Caro: "Neat, got a question though" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[10:25:06] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1033.eqiad.wmnet
[10:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:54] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes/helmfile_log_sal: Allow to suppress logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/668660
[10:31:11] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes/helmfile_log_sal: Allow to suppress logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/668660
[10:32:56] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1033.eqiad.wmnet
[10:33:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:34] <logmsgbot>	 !log jakob@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
[10:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: wmcs-drain-hypervisor: add timeout and retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[10:43:02] <logmsgbot>	 !log jakob@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
[10:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:35] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1034.eqiad.wmnet
[10:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:51:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: kubernetes/helmfile_log_sal: Allow to suppress logging to SAL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668660 (owner: 10JMeybohm)
[10:54:51] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1034.eqiad.wmnet
[10:54:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:56:44] <wikibugs>	 (03PS3) 10JMeybohm: kubernetes/helmfile_log_sal: Allow to suppress logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/668660
[10:57:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes/helmfile_log_sal: Allow to suppress logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/668660 (owner: 10JMeybohm)
[10:58:28] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] kubernetes/helmfile_log_sal: Allow to suppress logging to SAL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668660 (owner: 10JMeybohm)
[11:09:50] <marostegui>	 !log Run check table on db2092, db2116, db2145, db2146 (there will be lag)
[11:09:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:23] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add memcached image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100
[11:18:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add memcached image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 (owner: 10Giuseppe Lavagetto)
[11:23:41] <wikibugs>	 10SRE, 10Traffic: Wikipedia not opening images in any browser except Opera. - https://phabricator.wikimedia.org/T275211 (10Aklapper) @Mrcoolabhishek: Hi, have you received some answer from Tata sky broadband, by any chance?
[11:26:32] <wikibugs>	 (03PS1) 10Marostegui: parsercache.my.cnf: innodb_change_buffering = none [puppet] - 10https://gerrit.wikimedia.org/r/668669 (https://phabricator.wikimedia.org/T263443)
[11:26:59] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/668669 (https://phabricator.wikimedia.org/T263443) (owner: 10Marostegui)
[11:28:21] <marostegui>	 !log Temporarily set  innodb_change_buffering = none on db1134 (s1) - T263443
[11:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:36] <stashbot>	 T263443: Evaluate the impact of changing innodb_change_buffering to inserts  - https://phabricator.wikimedia.org/T263443
[11:48:18] <icinga-wm>	 PROBLEM - puppet last run on cuminunpriv1001 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: jmm, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:56:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:58:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:15:48] <wikibugs>	 (03PS1) 10Jbond: P:pki::root_ca: correct ocsp responder location [puppet] - 10https://gerrit.wikimedia.org/r/668678
[12:15:50] <wikibugs>	 (03PS1) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679
[12:18:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: correct ocsp responder location [puppet] - 10https://gerrit.wikimedia.org/r/668678 (owner: 10Jbond)
[12:23:12] <wikibugs>	 (03PS2) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679
[12:24:49] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1035.eqiad.wmnet
[12:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:28] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1035.eqiad.wmnet
[12:31:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:41] <wikibugs>	 (03PS3) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679
[12:38:09] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 (owner: 10Elukey)
[12:40:31] <wikibugs>	 (03PS4) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679
[12:40:33] <wikibugs>	 (03PS1) 10Jbond: P:pki::root_ca: use standard size key for ocsp [puppet] - 10https://gerrit.wikimedia.org/r/668682
[12:42:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: use standard size key for ocsp [puppet] - 10https://gerrit.wikimedia.org/r/668682 (owner: 10Jbond)
[12:47:39] <wikibugs>	 (03PS5) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679
[12:53:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679 (owner: 10Jbond)
[12:58:17] <wikibugs>	 (03PS1) 10Jbond: P:pki-int: add default nets [puppet] - 10https://gerrit.wikimedia.org/r/668684
[12:59:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:pki-int: add default nets [puppet] - 10https://gerrit.wikimedia.org/r/668684 (owner: 10Jbond)
[13:03:54] <wikibugs>	 (03PS1) 10Jbond: P:pki:multirootca: fix types [puppet] - 10https://gerrit.wikimedia.org/r/668686
[13:05:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:pki:multirootca: fix types [puppet] - 10https://gerrit.wikimedia.org/r/668686 (owner: 10Jbond)
[13:12:00] <wikibugs>	 (03PS1) 10Jbond: hieradata: cloud pki add client config [puppet] - 10https://gerrit.wikimedia.org/r/668688
[13:12:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] hieradata: cloud pki add client config [puppet] - 10https://gerrit.wikimedia.org/r/668688 (owner: 10Jbond)
[13:20:13] <wikibugs>	 (03PS1) 10Jbond: P:pki: gaurd loading cfssl class [puppet] - 10https://gerrit.wikimedia.org/r/668689
[13:21:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:pki: gaurd loading cfssl class [puppet] - 10https://gerrit.wikimedia.org/r/668689 (owner: 10Jbond)
[13:24:51] <marostegui>	 !log Check tables on db1134
[13:24:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "+1 on premise, makes me wonder whether we need wmfjson for anything after this is merged." [puppet] - 10https://gerrit.wikimedia.org/r/668231 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[13:33:07] <wikibugs>	 (03CR) 10Joal: [C: 03+1] hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 (owner: 10Elukey)
[13:34:16] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 (owner: 10Elukey)
[13:35:06] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:38:16] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:38:23] <marostegui>	 uh, that's me?
[13:38:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'DEpool db1134', diff saved to https://phabricator.wikimedia.org/P14644 and previous config saved to /var/cache/conftool/dbconfig/20210305-133833-marostegui.json
[13:38:36] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters
[13:38:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:52] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:38:58] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:40:15] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: wmcs-drain-hypervisor: add timeout and retries [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[13:42:30] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: wmcs-drain-hypervisor: add timeout and retries [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[13:45:15] <wikibugs>	 (03PS1) 10Jbond: cfssl::ocsp: create an empty response file [puppet] - 10https://gerrit.wikimedia.org/r/668692
[13:47:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cfssl::ocsp: create an empty response file [puppet] - 10https://gerrit.wikimedia.org/r/668692 (owner: 10Jbond)
[13:52:06] <marostegui>	 !log Rebuild some indexes on db2102
[13:52:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:23] <wikibugs>	 (03PS1) 10Jbond: PKI: add new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/668694
[13:59:13] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99)
[13:59:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] PKI: add new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/668694 (owner: 10Jbond)
[14:02:05] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 (owner: 10Elukey)
[14:08:54] <wikibugs>	 (03PS1) 10Jbond: pki cloud: add deployment-prep_eqiad1_wikimedia_cloud-key [puppet] - 10https://gerrit.wikimedia.org/r/668696
[14:10:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki cloud: add deployment-prep_eqiad1_wikimedia_cloud-key [puppet] - 10https://gerrit.wikimedia.org/r/668696 (owner: 10Jbond)
[14:16:06] <wikibugs>	 (03CR) 10Ottomata: "Installing even just r-base causes the conda env (can I start calling this cenv for short? :p ) from 472M to 1.2G.  It looks like the r-ba" [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668425 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata)
[14:16:43] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Don't install a copy of R in a stacked user conda env [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668425 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata)
[14:18:37] <wikibugs>	 (03PS2) 10Ottomata: Add activate.d and deactivate.d env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668566 (https://phabricator.wikimedia.org/T272313)
[14:20:35] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[14:20:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:58] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add activate.d and deactivate.d env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668566 (https://phabricator.wikimedia.org/T272313) (owner: 10Ottomata)
[14:21:00] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add activate.d and deactivate.d env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668566 (https://phabricator.wikimedia.org/T272313) (owner: 10Ottomata)
[14:23:10] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:51] * Urbanecm staging on mwdebug1001
[14:24:21] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:24:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:22] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:04] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:36:46] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[14:38:54] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:39:22] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:42:14] <wikibugs>	 (03PS1) 10Jbond: pki: update to use include [puppet] - 10https://gerrit.wikimedia.org/r/668700
[14:42:37] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::etcd::v3: use puppet certs for standalone cluster [puppet] - 10https://gerrit.wikimedia.org/r/668701
[14:44:36] <wikibugs>	 (03PS1) 10Jbond: pki: add a second test intermediate [puppet] - 10https://gerrit.wikimedia.org/r/668704
[14:44:42] <wikibugs>	 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) helm test annotations changed a bit:  > Note that until Helm v3, the job definition needed to contain one of these helm test hook annotations: `helm.sh/hook: test-success` or `...
[14:45:50] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28393/console" [puppet] - 10https://gerrit.wikimedia.org/r/668701 (owner: 10Giuseppe Lavagetto)
[14:46:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki: update to use include [puppet] - 10https://gerrit.wikimedia.org/r/668700 (owner: 10Jbond)
[14:48:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28394/console" [puppet] - 10https://gerrit.wikimedia.org/r/668701 (owner: 10Giuseppe Lavagetto)
[14:48:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki: add a second test intermediate [puppet] - 10https://gerrit.wikimedia.org/r/668704 (owner: 10Jbond)
[14:49:44] <icinga-wm>	 RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 1876701 MB (23% inode=80%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[15:00:18] <wikibugs>	 (03PS1) 10Zabe: Enable flood flag on hrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668708 (https://phabricator.wikimedia.org/T276560)
[15:00:26] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: profile::etcd::v3: use puppet certs for standalone cluster [puppet] - 10https://gerrit.wikimedia.org/r/668701
[15:02:05] <wikibugs>	 (03CR) 10Andrew Bogott: wmcs-drain-hypervisor: add timeout and retries (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[15:07:30] <elukey>	 !log drain + reimage analytics1073 and an-worker1086 to Debian Buster
[15:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:21:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:22:49] <wikibugs>	 (03PS1) 10Andrew Bogott: nova.conf: Adjust live migration timing values [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344)
[15:27:22] <wikibugs>	 (03PS1) 10Mholloway: WikimediaEvents: Create data QA group/right on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668716 (https://phabricator.wikimedia.org/T276515)
[15:29:48] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[15:29:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:52] <icinga-wm>	 RECOVERY - HP RAID on ms-be1032 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:35:24] <wikibugs>	 (03CR) 10Mholloway: [C: 04-2] "Hold until Monday 3/8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668716 (https://phabricator.wikimedia.org/T276515) (owner: 10Mholloway)
[15:37:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1073.eqiad.wmnet with reason: REIMAGE
[15:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:14] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1073.eqiad.wmnet with reason: REIMAGE
[15:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:35] <wikibugs>	 (03PS4) 10Ottomata: Jupyter - never use webproxy for *.wmnet URLs and make Java use system cacerts [puppet] - 10https://gerrit.wikimedia.org/r/668466 (https://phabricator.wikimedia.org/T224658)
[15:42:23] <wikibugs>	 (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668708 (https://phabricator.wikimedia.org/T276560) (owner: 10Zabe)
[15:43:48] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Jupyter - never use webproxy for *.wmnet URLs and make Java use system cacerts [puppet] - 10https://gerrit.wikimedia.org/r/668466 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata)
[15:47:56] <wikibugs>	 (03PS3) 10Ottomata: Spark JVMs inherit system http settings [puppet] - 10https://gerrit.wikimedia.org/r/668485 (https://phabricator.wikimedia.org/T224658)
[15:48:52] <wikibugs>	 (03PS11) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211)
[15:48:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs-drain-hypervisor: add timeout and retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[15:49:26] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Spark JVMs inherit system http settings [puppet] - 10https://gerrit.wikimedia.org/r/668485 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata)
[15:50:32] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] shared-storage: enable project NFS for wikipathways [puppet] - 10https://gerrit.wikimedia.org/r/668234 (https://phabricator.wikimedia.org/T276141) (owner: 10Bstorm)
[15:52:24] <wikibugs>	 (03CR) 10Andrew Bogott: wmcs-drain-hypervisor: add timeout and retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[15:52:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[15:54:15] <wikibugs>	 (03PS2) 10Andrew Bogott: nova.conf: Adjust live migration timing values [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344)
[15:54:19] <wikibugs>	 (03CR) 10Andrew Bogott: nova.conf: Adjust live migration timing values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[15:54:52] <wikibugs>	 (03PS1) 10Ottomata: otto/.bash_aliases - set no_proxy too [puppet] - 10https://gerrit.wikimedia.org/r/668720
[15:55:47] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] otto/.bash_aliases - set no_proxy too [puppet] - 10https://gerrit.wikimedia.org/r/668720 (owner: 10Ottomata)
[15:56:27] <razzi>	 !log stop mariadb on labsdb1012 to reimage and rename to clouddb1021: T269211
[15:56:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: Adjust live migration timing values [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[15:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:34] <stashbot>	 T269211: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211
[15:56:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] wikireplicas: Add basic configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi)
[16:03:44] <wikibugs>	 (03PS1) 10Jbond: P:pki: server bundles from web server [puppet] - 10https://gerrit.wikimedia.org/r/668722
[16:07:20] <wikibugs>	 (03PS2) 10Jbond: P:pki: server bundles from web server [puppet] - 10https://gerrit.wikimedia.org/r/668722
[16:07:40] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.decommission for hosts labsdb1012.eqiad.wmnet
[16:07:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28395/console" [puppet] - 10https://gerrit.wikimedia.org/r/668722 (owner: 10Jbond)
[16:09:36] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1086.eqiad.wmnet with reason: REIMAGE
[16:09:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:17] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi)
[16:11:44] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1086.eqiad.wmnet with reason: REIMAGE
[16:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:20] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki: server bundles from web server [puppet] - 10https://gerrit.wikimedia.org/r/668722 (owner: 10Jbond)
[16:13:57] <wikibugs>	 (03PS1) 10Klausman: ml-ctrl: Add dummy keys for ML k8s control plane [labs/private] - 10https://gerrit.wikimedia.org/r/668723 (https://phabricator.wikimedia.org/T272918)
[16:14:48] <wikibugs>	 (03PS1) 10Jbond: P:pki::multirootca: correct typo wmflib::dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/668724
[16:16:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: correct typo wmflib::dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/668724 (owner: 10Jbond)
[16:16:48] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-ctrl: Add dummy keys for ML k8s control plane [labs/private] - 10https://gerrit.wikimedia.org/r/668723 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman)
[16:17:02] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-ctrl: Add dummy keys for ML k8s control plane [labs/private] - 10https://gerrit.wikimedia.org/r/668723 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman)
[16:17:43] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labsdb1012.eqiad.wmnet
[16:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:19:35] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:21:22] <wikibugs>	 (03PS2) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075
[16:22:58] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1036.eqiad.wmnet
[16:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:19] <wikibugs>	 (03PS3) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075
[16:28:54] <razzi>	 !log rename https://netbox.wikimedia.org/ipam/ip-addresses/734/ DNS name from labsdb1012.mgmt.eqiad.wmnet to clouddb1021.mgmt.eqiad.wmnet
[16:28:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:47] <razzi>	 !log delete non-mgmt interfaces for labsdb1012 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/
[16:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:00] <wikibugs>	 (03PS4) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075
[16:36:01] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1036.eqiad.wmnet
[16:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:22] <wikibugs>	 (03PS5) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075
[16:36:45] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be1032 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1032&var-datasource=eqiad+prometheus/ops
[16:37:24] <wikibugs>	 (03PS1) 10Jbond: cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734
[16:38:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734 (owner: 10Jbond)
[16:39:47] <wikibugs>	 (03PS2) 10Jbond: cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734
[16:40:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28397/console" [puppet] - 10https://gerrit.wikimedia.org/r/668734 (owner: 10Jbond)
[16:40:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734 (owner: 10Jbond)
[16:42:07] <wikibugs>	 (03PS3) 10Jbond: cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734
[16:42:33] <effie>	 !depool mw1276 and pool back 
[16:42:33] <wm-bot>	 for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done
[16:42:38] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] logspam-watch: redraw when terminal size changes [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar)
[16:47:07] <bblack>	 what's the cp1053 bit here?
[16:47:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734 (owner: 10Jbond)
[16:48:22] <razzi>	 !log edit https://netbox.wikimedia.org/dcim/devices/2078/ device name from labsdb1012 to clouddb1021
[16:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:02] <bblack>	 I mean, that server doesn't really exist anyways
[16:49:06] <wikibugs>	 (03PS4) 10Ahmon Dancy: env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243
[16:49:10] <bblack>	 but that's an odd wm-bot line
[16:49:39] <bblack>	 (it also seems to know about varnish-be, which doesn't exist anymore either)
[16:50:04] <wikibugs>	 (03PS5) 10Ahmon Dancy: env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243
[16:50:31] <wikibugs>	 (03CR) 10Ahmon Dancy: "> > For mediawiki image builds, WMF_DATACENTER will be set to 'eqiad' by the image building script." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 (owner: 10Ahmon Dancy)
[16:52:07] <rzl>	 bblack: it looks like it's a response to !depool for whatever reason (which I think effie typed by mistake instead of "!log depool")
[16:52:29] <effie>	 ?
[16:52:39] <bblack>	 rzl: oh this is like, a bot help message telling us how to depool things? :)
[16:52:45] <effie>	 lol wot?
[16:52:45] <cdanis>	 looks like it
[16:52:47] <rzl>	 11:42:34 AM <effie> !depool mw1276 and pool back 
[16:52:56] <rzl>	 (forgive my UTC-5)
[16:52:57] <cdanis>	 11:52:47	<rzl>	11:42:34 AM <effie> !depool mw1276 and pool back 
[16:52:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:53:00] <effie>	 hahahah
[16:53:03] <bblack>	 !depool foo
[16:53:03] <wm-bot>	 for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done
[16:53:07] <bblack>	 well ok :)
[16:53:09] <effie>	 so i wonder what else this bot does 
[16:53:18] <effie>	 !cook dinner 
[16:53:24] <rzl>	 !help
[16:53:24] <wm-bot>	 want docs? ask for "!wm-bot". all keywords? try "@regsearch .*"
[16:53:25] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[16:53:26] <effie>	 pfff
[16:53:29] <bd808>	 effie: https://meta.wikimedia.org/wiki/Wm-bot
[16:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:51] <effie>	 bd808: I should have guess you were behind this 
[16:53:55] <rzl>	 aha: https://wm-bot.wmflabs.org/dump/%23wikimedia-operations.htm
[16:53:55] <effie>	 guessed*
[16:54:12] <bd808>	 effie: :) Not my bot, but a handy one that is in a lot of wikimedia channels
[16:54:14] <wm-bot>	 http://wm-bot.wmflabs.org/dump/%23wikimedia-operations.htm
[16:54:14] <Majavah>	 @info
[16:54:35] <effie>	 bd808: yeah I was talking about its secret hidden features
[16:54:44] <effie>	 !log depool mw1276 and pool back 
[16:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
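Note on the exchange above: the canned wm-bot reply to !depool is a shell loop that ignores its argument and always names cp1053 and the same four cache services. A minimal Python sketch of that expansion follows, mainly to make the hardcoded parts visible. The command shape is copied from the wm-bot line in this log; the function name `depool_commands` and its parameters are hypothetical, and the sketch only builds strings rather than running confctl. Whether this invocation is still the right way to depool a host is not confirmed by the log.

```python
# Sketch only: builds the command strings, it does not execute confctl.
# The command shape is copied from the wm-bot reply in the log;
# depool_commands and its parameters are hypothetical names.

def depool_commands(host, services, dc="eqiad", cluster="cache_text", pooled="no"):
    """Return one confctl command line per service for the given host."""
    commands = []
    for service in services:
        tags = f"dc={dc},cluster={cluster},service={service}"
        commands.append(
            f"confctl --tags {tags} --action set/pooled={pooled} {host}"
        )
    return commands


# The values wm-bot has baked in, per the log (bblack notes that cp1053 and
# varnish-be no longer exist).
stale = depool_commands(
    "cp1053.eqiad.wmnet",
    ["nginx", "varnish-fe", "varnish-be", "varnish-be-rand"],
)
for line in stale:
    print(line)
```

Passing the host and service list as arguments is the point of the sketch: the bot's stored text has both fixed, which is why "!depool foo" produced the same output as "!depool mw1276 and pool back".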
[16:57:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:58:09] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:00:38] <wikibugs>	 (03PS11) 10Razzi: wikireplicas: Add basic configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211)
[17:01:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:03:52] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] wikireplicas: Add basic configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi)
[17:04:28] <wikibugs>	 (03CR) 10CRusnov: "> Patch Set 1:" [software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov)
[17:05:20] <wikibugs>	 (03PS3) 10Cwhite: logstash: ingest logstash logs as json and convert to ECS [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919)
[17:06:33] <wikibugs>	 (03PS4) 10Cwhite: logstash: ingest logstash logs as json and convert to ECS [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919)
[17:07:39] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/668505 (https://phabricator.wikimedia.org/T274689) (owner: 10Volans)
[17:16:29] <wikibugs>	 (03PS2) 10Razzi: wikireplicas: give analytics_multiinstance role to clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/668494 (https://phabricator.wikimedia.org/T269211)
[17:17:29] <wikibugs>	 (03PS1) 10Jbond: P:pki::client: add bunldes source [puppet] - 10https://gerrit.wikimedia.org/r/668744
[17:18:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28401/console" [puppet] - 10https://gerrit.wikimedia.org/r/668744 (owner: 10Jbond)
[17:19:40] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: add bunldes source [puppet] - 10https://gerrit.wikimedia.org/r/668744 (owner: 10Jbond)
[17:24:07] <wikibugs>	 (03PS1) 10Elukey: ssh-client-config: use wmcloud bastion [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/668746
[17:24:51] <wikibugs>	 (03CR) 10Elukey: "No idea if it is the right one or not, lemme know :)" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/668746 (owner: 10Elukey)
[17:27:44] <wikibugs>	 (03CR) 10Cwhite: "Testing in pontoon surfaced a couple bugs now fixed in the latest patchset.  Barring any additional concerns, this will go out next week." [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) (owner: 10Cwhite)
[17:35:22] <wikibugs>	 (03CR) 10Dzahn: phabricator::tools: replace cron jobs with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[17:45:38] <mutante>	 ottomata: yea, it makes sense to me. or let's say.. at least I am not aware of a reason why direct connections between .wmnet hosts would have to go via proxies, but also I never ran into the need to actively disable it
[17:46:59] <mutante>	 individual users can puppetize their .profile as needed via admin/files/home
[17:48:57] <mutante>	 syntax seems right, domain extensions or full IP, afaict
[17:49:33] <wikibugs>	 (03PS1) 10Ottomata: eventgate-analytics-external - Bump replicas to 6 for increase in mediawiki.client.session_tick [deployment-charts] - 10https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502)
[17:51:56] <wikibugs>	 (03CR) 10Ottomata: "Alex & Janis, can we do this?  Product would like to increase sampling rate for session_tick events on all wikis, up to around 2.5K per se" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: 10Ottomata)
[17:55:10] <ottomata>	 ok gr8 thanks mutante 
[17:55:15] <wikibugs>	 (03PS1) 10Majavah: betacluster: switch etcd to deployment-etcd02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668751 (https://phabricator.wikimedia.org/T276462)
[17:55:22] <mutante>	 yep, np
[17:57:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:02:20] <wikibugs>	 (03CR) 10Brennen Bearnes: "Minor convenience change here, one line and already tested by a few regular users." [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar)
[18:05:34] <wikibugs>	 (03CR) 10Herron: [C: 03+2] elk: send icinga events to a separate partition/index [puppet] - 10https://gerrit.wikimedia.org/r/667917 (owner: 10Herron)
[18:08:18] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov)
[18:14:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:18:33] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[18:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:23:44] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:59] <wikibugs>	 (03PS2) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849)
[18:25:24] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511)
[18:25:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:26:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov)
[18:29:05] <wikibugs>	 (03PS2) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511)
[18:29:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:32:12] <wikibugs>	 (03PS3) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849)
[18:33:07] <wikibugs>	 (03PS3) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511)
[18:33:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:35:56] <wikibugs>	 (03PS4) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511)
[18:36:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:37:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:37:27] <wikibugs>	 (03PS5) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511)
[18:37:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:39:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:40:06] <wikibugs>	 (03PS4) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849)
[18:40:10] <wikibugs>	 (03PS6) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511)
[18:40:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:42:19] <wikibugs>	 (03PS7) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511)
[18:43:07] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1021.eqiad.wmnet with reason: REIMAGE
[18:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:30] <mutante>	 Krinkle: do you still need a restart of something on doc1001?
[18:45:02] <Krinkle>	 mutante: nope, all good now
[18:45:02] <legoktm>	 mutante: r.zl took care of it I believe
[18:45:13] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1021.eqiad.wmnet with reason: REIMAGE
[18:45:17] <Krinkle>	 https://phabricator.wikimedia.org/T275468
[18:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:24] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts registry2004.eqiad.wmnet
[18:45:26] <Krinkle>	 mutante: we do still have the inability to deploy
[18:45:28] <Krinkle>	 but it's not urgent
[18:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:12] <mutante>	 Krinkle: ok, ack
[18:47:19] <mutante>	 wasn't here yesterday
[18:50:32] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry2004.eqiad.wmnet
[18:50:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:45] <wikibugs>	 (03PS5) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849)
[19:04:36] <mutante>	 !log phab1001 - running public_task_dump.py (from cron job) manually
[19:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:03] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] "Beta cluster-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668751 (https://phabricator.wikimedia.org/T276462) (owner: 10Majavah)
[19:06:58] <wikibugs>	 (03Merged) 10jenkins-bot: betacluster: switch etcd to deployment-etcd02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668751 (https://phabricator.wikimedia.org/T276462) (owner: 10Majavah)
[19:07:24] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] logspam-watch: redraw when terminal size changes [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar)
[19:07:37] <wikibugs>	 (03CR) 10Ladsgroup: phabricator::tools: replace cron jobs with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[19:11:15] <wikibugs>	 (03CR) 10Dzahn: "currently running the command manually .. and YES.. it does output a lot. it's full of warnings and Tracebacks ..hrmm" [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[19:13:44] <mutante>	 Amir1: the reason that phab task dump script is so "chatty" is "ProgrammingError: not all arguments converted during string formatting"
[19:14:22] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host registry2004.codfw.wmnet
[19:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:40] <Amir1>	 *headdesks*
[19:18:04] <Amir1>	 I assume it'll be small if you compress them?
[19:18:18] * Amir1 pretends to know about compression
[19:18:40] <tabbycat>	 sit in your suitcase until the zip closes Amir1 ?
[19:18:58] <tabbycat>	 :)
[19:19:15] <Amir1>	 I don't mind about myself, I just don't want to kill phab1001 :D
[19:20:26] <mutante>	 Amir1: yea, same here. just want to get rid of cron
[19:20:31] <wikibugs>	 (03CR) 10CRusnov: "Compiler output looks satisfactory:" [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov)
[19:21:10] <mutante>	 Amir1: seems easier to fix it within the dump script itself ..instead of adding new parameters to systemd classes
[19:21:19] <Amir1>	 fixing it wouldn't be hard, is there a way for me to reproduce it locally somewhere?
[19:23:20] <mutante>	 Amir1: I wouldn't know how.. unless phab db was sanitized and replicated to cloud .. :(
[19:23:39] <mutante>	 but you could change the python code to stay silent
[19:23:47] <mutante>	 about errors in general
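Note on reproducing the error locally: the message mutante quotes is the text Python's % operator raises when a format string receives more arguments than it has placeholders. A minimal sketch that triggers it without the Phabricator database is below. It assumes, which the log does not confirm, that the dump script's database driver substitutes query parameters with %-formatting and re-raises the failure as ProgrammingError; the query text and the helper names `render` and `try_render` are hypothetical.

```python
# Sketch only: reproduces the error text with plain %-formatting.
# Assumption: the real driver does a similar substitution internally and
# wraps the TypeError in its own ProgrammingError.

def render(query, args):
    """Mimic %-style parameter substitution into a query string."""
    return query % args


def try_render(query, args):
    """Return the rendered query, or a short error string instead of a traceback."""
    try:
        return render(query, args)
    except TypeError as exc:
        return f"error: {exc}"


# One placeholder, one argument: substitution succeeds.
ok = try_render("SELECT * FROM task WHERE id = %s", (42,))

# One placeholder, two arguments: the mismatch raises the quoted message.
bad = try_render("SELECT * FROM task WHERE id = %s", (42, "extra"))

print(ok)
print(bad)
```

Catching the exception and emitting one short line, as `try_render` does, is one way to make the script less chatty, but it hides the mismatch rather than fixing it; matching the placeholder count to the argument count is the actual fix.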
[19:30:05] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2004.codfw.wmnet
[19:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:04] <wikibugs>	 (03PS1) 10Legoktm: install_server: Add registry2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668763 (https://phabricator.wikimedia.org/T276381)
[19:35:13] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] install_server: Add registry2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668763 (https://phabricator.wikimedia.org/T276381) (owner: 10Legoktm)
[19:42:03] <wikibugs>	 (03PS1) 10Ottomata: Remove overrides for EL migration for Growth team schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668764 (https://phabricator.wikimedia.org/T267333)
[19:50:38] <wikibugs>	 (03PS1) 10Ottomata: Remove overrides for EL migration for WMDE Technical Wishes schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668766 (https://phabricator.wikimedia.org/T275005)
[19:51:20] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-drain-hypervisor: add timeout and retries [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott)
[19:56:16] <wikibugs>	 (03PS2) 10Dzahn: site: remove deploy1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/668168 (https://phabricator.wikimedia.org/T275831)
[19:56:48] <wikibugs>	 (03PS1) 10Ottomata: Migrate PrefUpdate to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668769 (https://phabricator.wikimedia.org/T267348)
[20:02:06] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2004.codfw.wmnet
[20:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:02] <wikibugs>	 (03PS1) 10Legoktm: conftool: Add registry2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668771 (https://phabricator.wikimedia.org/T276381)
[20:03:34] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] conftool: Add registry2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668771 (https://phabricator.wikimedia.org/T276381) (owner: 10Legoktm)
[20:04:15] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2004.codfw.wmnet
[20:04:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:57] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/weight=10; selector: name=registry2004.codfw.wmnet
[20:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:06] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: name=registry2004.codfw.wmnet
[20:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:00] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2001.codfw.wmnet
[20:15:04] <logmsgbot>	 !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2002.codfw.wmnet
[20:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:28:41] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28406/" [puppet] - 10https://gerrit.wikimedia.org/r/668168 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn)
[20:29:54] <wikibugs>	 (03PS1) 10Andrew Bogott: Keystone: remove check_keystone_projects.py [puppet] - 10https://gerrit.wikimedia.org/r/668774 (https://phabricator.wikimedia.org/T274385)
[20:29:56] <wikibugs>	 (03PS1) 10Andrew Bogott: Keystone: remove some wkfkeystonehooks config flags [puppet] - 10https://gerrit.wikimedia.org/r/668775 (https://phabricator.wikimedia.org/T274385)
[20:33:41] <wikibugs>	 (03PS1) 10RobH: kafka-logging updates [puppet] - 10https://gerrit.wikimedia.org/r/668776 (https://phabricator.wikimedia.org/T273778)
[20:33:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:34:05] <wikibugs>	 (03PS2) 10RobH: kafka-logging updates [puppet] - 10https://gerrit.wikimedia.org/r/668776 (https://phabricator.wikimedia.org/T273778)
[20:34:26] <wikibugs>	 (03CR) 10RobH: [C: 03+2] kafka-logging updates [puppet] - 10https://gerrit.wikimedia.org/r/668776 (https://phabricator.wikimedia.org/T273778) (owner: 10RobH)
[20:34:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts deploy1001.eqiad.wmnet
[20:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:14] <icinga-wm>	 PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=syslog file=nel-kafkacat.prom instance=centrallog1001 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[20:48:51] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts deploy1001.eqiad.wmnet
[20:48:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:28] <legoktm>	 !log updated udplog to 1.9 on mwlog1002.eqiad.wmnet and mwlog2002.codfw.wmnet
[21:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:34] <wikibugs>	 (PS1) Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831)
[21:21:33] <wikibugs>	 (PS2) Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831)
[21:23:49] <wikibugs>	 (PS3) Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831)
[21:26:33] <wikibugs>	 (PS4) Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831)
[21:28:47] <wikibugs>	 (CR) Dzahn: "git checkout dc5b21b83c5d1e3caf" [puppet] - https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: Dzahn)
[21:30:21] <wikibugs>	 SRE, serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn)
[21:31:04] <wikibugs>	 SRE, serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn)
[21:32:03] <wikibugs>	 SRE, serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn)
[21:32:13] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[21:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:16] <wikibugs>	 (PS2) Andrew Bogott: Keystone: remove check_keystone_projects.py [puppet] - https://gerrit.wikimedia.org/r/668774 (https://phabricator.wikimedia.org/T274385)
[21:34:18] <wikibugs>	 (PS2) Andrew Bogott: Keystone: remove some wkfkeystonehooks config flags [puppet] - https://gerrit.wikimedia.org/r/668775 (https://phabricator.wikimedia.org/T274385)
[21:34:20] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps rename 'observer' role to 'reader' [puppet] - https://gerrit.wikimedia.org/r/668789 (https://phabricator.wikimedia.org/T276018)
[21:34:37] <wikibugs>	 SRE, serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn)
[21:35:05] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Keystone: remove check_keystone_projects.py [puppet] - https://gerrit.wikimedia.org/r/668774 (https://phabricator.wikimedia.org/T274385) (owner: Andrew Bogott)
[21:35:31] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloud-vps rename 'observer' role to 'reader' [puppet] - https://gerrit.wikimedia.org/r/668789 (https://phabricator.wikimedia.org/T276018) (owner: Andrew Bogott)
[21:36:01] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:36:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:13] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Keystone: remove some wkfkeystonehooks config flags [puppet] - https://gerrit.wikimedia.org/r/668775 (https://phabricator.wikimedia.org/T274385) (owner: Andrew Bogott)
[21:36:43] <wikibugs>	 (PS3) Andrew Bogott: Keystone: remove some wkfkeystonehooks config flags [puppet] - https://gerrit.wikimedia.org/r/668775 (https://phabricator.wikimedia.org/T274385)
[21:37:06] <wikibugs>	 SRE, vm-requests, GitLab (Initialization), Patch-For-Review, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Dzahn) p:High→Medium
[21:38:19] <wikibugs>	 ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (RobH)
[21:38:31] <wikibugs>	 ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (RobH)
[21:42:06] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: forward external traffic to gitlab VMs (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) I think this ticket now just turned into "open port 80/443 to the public" but it seems best t...
[21:42:47] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn)
[21:43:27] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) Open→Stalled
[21:43:31] <wikibugs>	 SRE, DNS, Traffic, serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn)
[21:57:15] <wikibugs>	 (CR) Phamhi: [C: +2] wikireplica: depool clouddb1016 [puppet] - https://gerrit.wikimedia.org/r/668563 (owner: Phamhi)
[22:06:22] <wikibugs>	 (PS1) Dzahn: icinga/releases: exclude /run/docker from disk space checks, avoid alerts [puppet] - https://gerrit.wikimedia.org/r/668798
[22:09:16] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
[22:10:30] <wikibugs>	 SRE, Wikimedia-Logstash, observability, Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (colewhite)
[22:11:36] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[22:12:49] <wikibugs>	 (CR) Dzahn: [C: +2] icinga/releases: exclude /run/docker from disk space checks, avoid alerts [puppet] - https://gerrit.wikimedia.org/r/668798 (owner: Dzahn)
[22:12:55] <wikibugs>	 (PS2) Dzahn: icinga/releases: exclude /run/docker from disk space checks, avoid alerts [puppet] - https://gerrit.wikimedia.org/r/668798
[22:14:48] <wikibugs>	 (PS1) Phamhi: Revert "wikireplica: depool clouddb1016" [puppet] - https://gerrit.wikimedia.org/r/668806
[22:15:43] <wikibugs>	 (CR) Phamhi: [C: +2] Revert "wikireplica: depool clouddb1016" [puppet] - https://gerrit.wikimedia.org/r/668806 (owner: Phamhi)
[22:16:54] <icinga-wm>	 RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops
[22:17:46] <wikibugs>	 (CR) Dzahn: "<+icinga-wm> RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space" [puppet] - https://gerrit.wikimedia.org/r/668798 (owner: Dzahn)
[22:24:35] <wikibugs>	 (PS1) Phamhi: wikireplica: depool clouddb1017 [puppet] - https://gerrit.wikimedia.org/r/668803
[22:30:15] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (thcipriani) >>! In T276144#6887868, @Dzahn wrote: > I think this ticket now just turned int...
[22:42:19] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Sergey.Trofimovsky.SF) From our perspective Gitlab-managed auto-renewed Let's Encrypt certi...
[22:43:06] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) > much of production (i.e., phab) is now using envoy for termination and meanwhile G...
[22:45:43] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) >>! In T276144#6887970, @Sergey.Trofimovsky.SF wrote: > From our perspective Gitlab-...
[22:49:06] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) @Sergey.Trofimovsky.SF Do you agree we should keep the firewall closed until you say...
[22:53:20] <wikibugs>	 (PS1) Legoktm: aptrepo: Add "pygments" component [puppet] - https://gerrit.wikimedia.org/r/668831 (https://phabricator.wikimedia.org/T276298)
[22:54:17] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28407/console" [puppet] - https://gerrit.wikimedia.org/r/668831 (https://phabricator.wikimedia.org/T276298) (owner: Legoktm)
[22:55:25] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28408/console" [puppet] - https://gerrit.wikimedia.org/r/668831 (https://phabricator.wikimedia.org/T276298) (owner: Legoktm)
[23:00:09] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] aptrepo: Add "pygments" component [puppet] - https://gerrit.wikimedia.org/r/668831 (https://phabricator.wikimedia.org/T276298) (owner: Legoktm)
[23:07:22] <wikibugs>	 SRE: Disable man-db in pbuilder in package_builder on deneb - https://phabricator.wikimedia.org/T276632 (Legoktm)
[23:16:22] <legoktm>	 !log imported pygments 2.8.0+dfsg-1 to apt.wm.o buster-wikimedia component/pygments (T276298)
[23:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:29] <stashbot>	 T276298: Package latest python3-pygments for apt.wikimedia.org - https://phabricator.wikimedia.org/T276298
[23:18:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:20:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:38:50] <wikibugs>	 ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (RobH)
[23:39:00] <wikibugs>	 ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (RobH)
[23:39:03] <wikibugs>	 (PS1) Bstorm: wikireplicas: improve filtering on the actor_user view [puppet] - https://gerrit.wikimedia.org/r/668843 (https://phabricator.wikimedia.org/T276628)
[23:40:38] <wikibugs>	 (PS1) Legoktm: codesearch: Add port for shouthow [puppet] - https://gerrit.wikimedia.org/r/668844 (https://phabricator.wikimedia.org/T253597)
[23:42:16] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28409/console" [puppet] - https://gerrit.wikimedia.org/r/668844 (https://phabricator.wikimedia.org/T253597) (owner: Legoktm)
[23:42:45] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] codesearch: Add port for shouthow [puppet] - https://gerrit.wikimedia.org/r/668844 (https://phabricator.wikimedia.org/T253597) (owner: Legoktm)
[23:59:35] <wikibugs>	 ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (RobH) a:Papaul
[23:59:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets