[00:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210305T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:19] the easiest deployment ever :)
[00:00:49] (CR) Legoktm: [C: +2] install_server: Add registry1004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668570 (https://phabricator.wikimedia.org/T276380) (owner: Legoktm)
[00:06:50] !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host registry2004.eqiad.wmnet
[00:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:15] (PS4) Ryan Kemper: wdqs: expose wdqs1009 externally [puppet] - https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470)
[00:14:38] SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (aaron) Playing around with ` mwscript shell.php aawiki ` ...I noticed that SHOW SLAVE STATUS is empty in...
[00:14:40] (CR) Ryan Kemper: wdqs: expose wdqs1009 externally (1 comment) [puppet] - https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:23:33] !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2004.eqiad.wmnet
[00:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:57] ah crap
[00:24:09] that was supposed to be .codfw.wmnet
[00:24:45] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28390/console" [puppet] - https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:26:17] SRE, vm-requests: codfw: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276381 (Legoktm) Uh, screwed up, accidentally created registry2004 in codfw but called it registry2004.eqiad.wmnet. Going to delete it now and try again...
[00:26:23] SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (aaron) Ideally the SqlBagOStuff hashing would use HashRing, though any naive transition would involve a lo...
[00:27:14] I guess I have to finish setting it up before I can decom it?
[00:31:57] heh
[00:32:01] (PS1) Legoktm: install_server: Add registry2004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668571 (https://phabricator.wikimedia.org/T276381)
[00:32:42] (CR) jerkins-bot: [V: -1] install_server: Add registry2004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668571 (https://phabricator.wikimedia.org/T276381) (owner: Legoktm)
[00:34:11] or I guess not
[00:35:06] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28391/console" [puppet] - https://gerrit.wikimedia.org/r/668543 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:35:55] (CR) Ryan Kemper: [V: +1 C: +2] wdqs: new query-preview microsite [puppet] - https://gerrit.wikimedia.org/r/668543 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:36:15] !log T266470 Deploying new `query-preview` microsite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668543
[00:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:24] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[00:39:28] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@4cc913e]: correct refinery-drop-older-than checksum
[00:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:34] !log T266470 Ran `sudo run-puppet-agent` on `miscweb1002` without issue; `/var/log/apache2/query*.log` looks as expected
[00:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:02] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@4cc913e]: correct refinery-drop-older-than checksum (duration: 01m 34s)
[00:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:58] PROBLEM - Check systemd state on registry1004 is CRITICAL: CRITICAL - degraded: The
system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:36] (CR) Ryan Kemper: [V: +1 C: +2] wdqs: expose wdqs1009 externally [puppet] - https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[00:47:02] !log T266470 [ats] Deploying new mappings for `query-preview.wikidata.org` microsite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668173/
[00:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:08] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[00:47:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:17] (Abandoned) Legoktm: install_server: Add registry2004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668571 (https://phabricator.wikimedia.org/T276381) (owner: Legoktm)
[00:49:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:44] SRE, vm-requests, Patch-For-Review: codfw: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276381 (Legoktm) ` legoktm@cumin1001:~$ sudo cookbook sre.hosts.decommission registry2004.eqiad.wmnet -t T276381 >>> ATTENTION: the query does not match any host in PuppetDB or failed Host...
[00:50:39] !log T266470 [ats] `sudo cumin 'A:cp-ats' 'sudo run-puppet-agent'`
[00:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:38] (PS1) Legoktm: sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572
[00:54:22] (PS1) Legoktm: conftool: Add registry1004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668573 (https://phabricator.wikimedia.org/T276380)
[00:55:02] (CR) Legoktm: [C: +2] conftool: Add registry1004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668573 (https://phabricator.wikimedia.org/T276380) (owner: Legoktm)
[00:55:48] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry1004.eqiad.codfw
[00:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:04] !log legoktm@deploy1002 conftool action : set/weight=10; selector: name=registry1004.eqiad.wmnet
[00:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:04] !log legoktm@deploy1002 conftool action : set/pooled=inactive; selector: name=registry1004.eqiad.codfw
[00:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:14] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry1004.eqiad.wmnet
[00:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:08] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: name=registry1004.eqiad.wmnet
[00:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:47] SRE, vm-requests, Patch-For-Review: eqiad: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276380 (Legoktm) Open→Resolved
[00:58:51] SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (Legoktm)
[00:58:55] !log legoktm@deploy1002 conftool action :
set/pooled=no; selector: name=registry1001.eqiad.wmnet
[00:58:58] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry1002.eqiad.wmnet
[00:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:46] !log depooled registry1001/registry1002 (old stretch VMs) - T272550
[00:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:53] T272550: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550
[01:02:35] I'm going to take a break now, too many typos
[01:04:31] SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (Legoktm) Update: 2 new buster VMs in eqiad are running, and I depooled the 2 stretch ones, will delete them on Monday if no other problems arise. In codfw 1 buster VM is running alongside the 2...
[01:05:57] (CR) CRusnov: "Well this turned out to be slightly more complicated than I expected, but I took a bit of extra time implementing a generic solution that " [software/netbox] - https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[02:00:32] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 76 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:01:26] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 68 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:06:44] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 49 probes of 599
(alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:07:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 51 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:10:26] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:12:32] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 23610 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:17:22] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:21:44] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:49:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:52:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:47:56] PROBLEM - Memory correctable errors -EDAC- on thumbor2001 is CRITICAL: 4 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops
[05:18:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:21:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:42:38] SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (Marostegui) >>! In T133523#6884890, @aaron wrote: > Playing around with > ` > mwscript shell.php aawiki >...
[05:49:49] (PS1) Marostegui: install_server: Do not reimage db2145,db2146 [puppet] - https://gerrit.wikimedia.org/r/668598 (https://phabricator.wikimedia.org/T275633)
[05:53:19] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui)
[05:54:42] (CR) Marostegui: [C: +2] install_server: Do not reimage db2145,db2146 [puppet] - https://gerrit.wikimedia.org/r/668598 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[06:17:16] !log uploaded udplog 1.9 (buster-wikimedia) to apt.wikimedia.org (T276421)
[06:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:24] T276421: Package udplog for Buster - https://phabricator.wikimedia.org/T276421
[06:18:09] SRE, Packaging: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (Legoktm)
[06:19:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:21:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:51:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2092', diff saved to https://phabricator.wikimedia.org/P14640 and previous config saved to /var/cache/conftool/dbconfig/20210305-065137-marostegui.json
[06:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:15] good morning
[06:58:40] (CR) Hashar: "+1 :]" [puppet] - https://gerrit.wikimedia.org/r/668172 (owner: Hashar)
[06:58:48] (CR) Hashar: [C: +1] logspam-watch: redraw when terminal size changes [puppet] - https://gerrit.wikimedia.org/r/668172 (owner: Hashar)
[07:29:13] (PS4) JMeybohm: Add mwilliams to analytics-privatedata-users with no ssh access [puppet] - https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671)
[07:33:08] SRE, Packaging: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (Legoktm) @Majavah refreshed the packaging (https://gerrit.wikimedia.org/r/c/analytics/udplog/+/668451), switching to dh, a native package and got it to pass lintian with no errors. I uploaded that as 1.9, it works fine...
[07:33:24] (CR) JMeybohm: [C: +2] "PS4 is a rebase, Ottomata approved on phab.
Thanks" [puppet] - https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671) (owner: JMeybohm)
[07:33:46] !log drain + reimage analytics107[0-1] to debian buster
[07:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:07] SRE, Product-Analytics, SRE-Access-Requests, Patch-For-Review, Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (JMeybohm) Open→Resolved a:JMeybohm Thanks
[07:57:12] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1070.eqiad.wmnet with reason: REIMAGE
[07:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:08] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1071.eqiad.wmnet with reason: REIMAGE
[07:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1070.eqiad.wmnet with reason: REIMAGE
[07:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210305T0800)
[08:00:20] PROBLEM - Check systemd state on analytics1066 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1071.eqiad.wmnet with reason: REIMAGE
[08:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:45] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (jcrespo) It was not an os issue (the os already saw the disks as sda and sdb), but a bios/boot issue. fixed with T274185#6883969
[08:07:38] SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (JMeybohm) >>! In T272550#6884932, @Legoktm wrote: > In codfw 1 buster VM is running alongside the 2 stretch ones, except I accidentally created the second new VM with the wrong name (`registry20...
[08:17:22] SRE, Beta-Cluster-Infrastructure: deployment-logstash03: UDP listener died EADDRINUSE, logstash port conflict with rsyslogd - https://phabricator.wikimedia.org/T241481 (Majavah)
[08:22:28] (PS1) Filippo Giunchedi: hieradata: remove ms-be1034 [puppet] - https://gerrit.wikimedia.org/r/668645 (https://phabricator.wikimedia.org/T276193)
[08:25:33] (PS1) JMeybohm: Switch active kubernetes staging cluster to eqiad [puppet] - https://gerrit.wikimedia.org/r/668646 (https://phabricator.wikimedia.org/T276305)
[08:25:35] (PS1) JMeybohm: Switch staging.svc.eqiad.wmnet back to eqiad [dns] - https://gerrit.wikimedia.org/r/668647 (https://phabricator.wikimedia.org/T276305)
[08:26:11] (CR) Filippo Giunchedi: [C: +2] hieradata: remove ms-be1034 [puppet] - https://gerrit.wikimedia.org/r/668645 (https://phabricator.wikimedia.org/T276193) (owner: Filippo Giunchedi)
[08:27:06] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on
cumin2001.codfw.wmnet for hosts: ` backup2003.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2021030...
[08:32:12] !log drain + reimage an-worker107[8,9] to Debian Buster
[08:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:56] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1263863 MB (15% inode=80%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[08:39:48] SRE, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (fgiunchedi) >>! In T272209#6884508, @wiki_willy wrote: > Hi @fgiunchedi - let us know when you have the decom task for ms-be1034 submitted per our conversation on IRC....then we can pull one of the drives for this. T...
[08:42:20] !log jynus@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE
[08:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:37] (PS1) QChris: Add .gitreview [alerts] - https://gerrit.wikimedia.org/r/668649
[08:43:39] (CR) QChris: [V: +2 C: +2] Add .gitreview [alerts] - https://gerrit.wikimedia.org/r/668649 (owner: QChris)
[08:44:29] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE
[08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:05] (CR) David Caro: [C: +1] sre.hosts.decomission: Don't error if all hosts aren't found (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[08:51:27] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (ops-monitoring-bot) Completed auto-reimage of hosts: `
['backup2003.codfw.wmnet'] ` and were **ALL** successful.
[08:55:18] (PS1) Elukey: sre.hosts.decommission: fix use of self [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524)
[08:56:45] (CR) Filippo Giunchedi: [C: +1] "LGTM (to my untrained eye anyways)" [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:57:46] (CR) David Caro: [C: -1] sre.hosts.decommission: fix use of self (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:57:51] (CR) Elukey: [C: +2] "@Volans: I think it is my fault when I moved the cookbook to the class api, going to merge so Filippo can retry, lemme know if you have do" [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:58:03] (CR) jerkins-bot: [V: -1] sre.hosts.decommission: fix use of self [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:58:39] (CR) Elukey: sre.hosts.decommission: fix use of self (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:59:11] (PS2) Elukey: sre.hosts.decommission: fix use of self [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524)
[08:59:23] (CR) Alexandros Kosiaris: [C: +1] Switch staging.svc.eqiad.wmnet back to eqiad [dns] - https://gerrit.wikimedia.org/r/668647 (https://phabricator.wikimedia.org/T276305) (owner: JMeybohm)
[08:59:42] (CR) Elukey: sre.hosts.decommission: fix use of self (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[08:59:49] (CR) David Caro: "This is the same as
https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/668572 I think" [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[09:00:09] (CR) David Caro: [C: +1] "This fixes the same as https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/668650 I think" [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:01:23] (CR) Elukey: "It is yes, we can merge Lego's one, if it is better. No problem :)" [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[09:02:58] (CR) Elukey: "@Legoktm: I moved the cookbook to the class API a while ago and probably missed these variables, hopefully it is the only bug, my bad!" [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:03:17] (Abandoned) Elukey: sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:03:23] (Restored) Elukey: sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:03:44] (Abandoned) Elukey: sre.hosts.decommission: fix use of self [cookbooks] - https://gerrit.wikimedia.org/r/668650 (https://phabricator.wikimedia.org/T276524) (owner: Elukey)
[09:04:19] (CR) Elukey: [C: +1] "Friday, I mistakenly closed this instead of mine, +1 from my side, I think that we can merge to let Filippo test?" [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:04:51] dcaro: o/ ok if we merge --^ so Filippo can test?
[09:04:55] seems ready to go
[09:05:40] 👍
[09:05:52] super, thanks a lot for the -1 and the pings :)
[09:06:05] (CR) Elukey: [C: +2] sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:06:16] glad to help :)
[09:08:14] (Merged) jenkins-bot: sre.hosts.decomission: Don't error if all hosts aren't found [cookbooks] - https://gerrit.wikimedia.org/r/668572 (owner: Legoktm)
[09:12:10] (CR) Jakob: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/668651 (https://phabricator.wikimedia.org/T268640) (owner: Jakob)
[09:12:11] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be1034.eqiad.wmnet
[09:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:03] (CR) JMeybohm: [C: +2] Switch staging.svc.eqiad.wmnet back to eqiad [dns] - https://gerrit.wikimedia.org/r/668647 (https://phabricator.wikimedia.org/T276305) (owner: JMeybohm)
[09:16:06] (PS1) Filippo Giunchedi: install_server: remove ms-be1034 [puppet] - https://gerrit.wikimedia.org/r/668652 (https://phabricator.wikimedia.org/T276522)
[09:16:52] (CR) Filippo Giunchedi: [C: +2] install_server: remove ms-be1034 [puppet] - https://gerrit.wikimedia.org/r/668652 (https://phabricator.wikimedia.org/T276522) (owner: Filippo Giunchedi)
[09:17:36] (CR) Tarrow: [C: +2] Update termbox to 2021-03-01-112916-production [deployment-charts] - https://gerrit.wikimedia.org/r/668651 (https://phabricator.wikimedia.org/T268640) (owner: Jakob)
[09:18:44] (CR) JMeybohm: [C: +2] Switch active kubernetes staging cluster to eqiad [puppet] - https://gerrit.wikimedia.org/r/668646 (https://phabricator.wikimedia.org/T276305) (owner: JMeybohm)
[09:18:49] (Merged) jenkins-bot: Update termbox to 2021-03-01-112916-production [deployment-charts] - https://gerrit.wikimedia.org/r/668651
(https://phabricator.wikimedia.org/T268640) (owner: Jakob)
[09:19:47] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[09:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:05] ACKNOWLEDGEMENT - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1252491 MB (15% inode=80%): David Caro Tracking here: https://phabricator.wikimedia.org/T276525 - The acknowledgement expires at: 2021-03-11 10:30:00. https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-sto
[09:20:05] orgId=1
[09:21:10] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ms-be1034.eqiad.wmnet
[09:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1079.eqiad.wmnet with reason: REIMAGE
[09:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1079.eqiad.wmnet with reason: REIMAGE
[09:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:39] ops-eqiad, decommission-hardware: decommission ms-be1034 - https://phabricator.wikimedia.org/T276522 (fgiunchedi) a:Cmjohnson
[09:26:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1078.eqiad.wmnet with reason: REIMAGE
[09:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:32] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
[09:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:51] !log switched back active kubernetes staging cluster to eqiad
[09:28:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1078.eqiad.wmnet with reason: REIMAGE
[09:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:09] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[09:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:45] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[09:31:45] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[09:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:42] godog: I noticed the b'' on https://phabricator.wikimedia.org/T276522#6885279 and looking took me to https://docs.python.org/3/library/locale.html#locale.getpreferredencoding but I'm not sure if that's safe to simply do .decode('utf-8') on the string when adding to phab
[09:37:47] (PS1) JMeybohm: Fix a bunch of dst_net entries to actually be CIDR [deployment-charts] - https://gerrit.wikimedia.org/r/668654 (https://phabricator.wikimedia.org/T276268)
[09:41:07] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 121 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:42:48] (CR) JMeybohm: [C: +2] Fix a bunch of dst_net entries to actually be CIDR [deployment-charts] -
https://gerrit.wikimedia.org/r/668654 (https://phabricator.wikimedia.org/T276268) (owner: JMeybohm)
[09:43:30] (Merged) jenkins-bot: Fix a bunch of dst_net entries to actually be CIDR [deployment-charts] - https://gerrit.wikimedia.org/r/668654 (https://phabricator.wikimedia.org/T276268) (owner: JMeybohm)
[09:45:09] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[09:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:40] (PS1) Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - https://gerrit.wikimedia.org/r/668657
[09:48:51] (PS2) Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - https://gerrit.wikimedia.org/r/668657
[09:50:17] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[09:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:06] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[09:52:09] (CR) jerkins-bot: [V: -1] Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - https://gerrit.wikimedia.org/r/668657 (owner: Jcrespo)
[09:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:13] (CR) jerkins-bot: [V: -1] Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - https://gerrit.wikimedia.org/r/668657 (owner: Jcrespo)
[09:52:20] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:52:56] (CR) Jcrespo: "Adding everyone that reviewed our patches, sometimes more than 4 people per patch!
😳" [puppet] - 10https://gerrit.wikimedia.org/r/668657 (owner: 10Jcrespo) [09:54:08] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [09:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:59] (03PS3) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657 [10:00:00] (03PS4) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657 [10:00:21] (03PS5) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657 [10:04:37] (03CR) 10Jbond: "This looks good to me but im not familiar enough with Django to give a +1" [software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [10:05:38] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10jbond) Just a note that i think we will be able to re-use this code in debmonitor [10:09:27] (03PS6) 10Jcrespo: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657 [10:09:52] (03PS7) 10Jcrespo: base: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657 [10:12:06] (03PS3) 10Filippo Giunchedi: pontoon: stop symlinking puppet client ssl directory [puppet] - 10https://gerrit.wikimedia.org/r/666668 (https://phabricator.wikimedia.org/T276501) [10:12:08] (03CR) 10Giuseppe Lavagetto: Add httpd image for MediaWiki (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 (owner: 10Giuseppe Lavagetto) [10:12:15] (03PS3) 10Giuseppe Lavagetto: Add httpd image for MediaWiki [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 [10:12:17] (03PS2) 10Giuseppe Lavagetto: Add memcached image 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 [10:14:07] (03CR) 10Filippo Giunchedi: "Andrew, what do you think ? My understanding is that /var/lib/puppet/client isn't created anymore nor needed on newly-provisioned cloud vp" [puppet] - 10https://gerrit.wikimedia.org/r/666668 (https://phabricator.wikimedia.org/T276501) (owner: 10Filippo Giunchedi) [10:17:04] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add httpd image for MediaWiki [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 (owner: 10Giuseppe Lavagetto) [10:18:16] (03PS1) 10Elukey: hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 [10:19:43] (03CR) 10Giuseppe Lavagetto: Add memcached image (035 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 (owner: 10Giuseppe Lavagetto) [10:22:31] 10SRE, 10serviceops, 10User-jijiki: Modernise memcached systemd unit / sync to current buster setup - https://phabricator.wikimedia.org/T273950 (10Joe) `systemd-memcached-wrapper` is a perl script, an evolution of the old wrapper script debian always used and that caused me more headaches than it solved. I'd... 
[10:23:05] (03CR) 10Effie Mouzeli: [C: 03+1] "yeah I agree" [puppet] - 10https://gerrit.wikimedia.org/r/668465 (owner: 10Jbond) [10:23:57] (03CR) 10Jbond: [C: 03+2] P:memcached: drop admin_groups as has no meening here [puppet] - 10https://gerrit.wikimedia.org/r/668465 (owner: 10Jbond) [10:24:29] (03CR) 10David Caro: "Neat, got a question though" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [10:25:06] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1033.eqiad.wmnet [10:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:54] (03PS1) 10JMeybohm: kubernetes/helmfile_log_sal: Allow to suppress logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/668660 [10:31:11] (03PS2) 10JMeybohm: kubernetes/helmfile_log_sal: Allow to suppress logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/668660 [10:32:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1033.eqiad.wmnet [10:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:34] !log jakob@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [10:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:04] (03CR) 10Arturo Borrero Gonzalez: wmcs-drain-hypervisor: add timeout and retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [10:43:02] !log jakob@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . 
[10:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:35] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1034.eqiad.wmnet [10:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:51:56] (03CR) 10Alexandros Kosiaris: kubernetes/helmfile_log_sal: Allow to suppress logging to SAL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668660 (owner: 10JMeybohm) [10:54:51] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1034.eqiad.wmnet [10:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:56:44] (03PS3) 10JMeybohm: kubernetes/helmfile_log_sal: Allow to suppress logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/668660 [10:57:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes/helmfile_log_sal: Allow to suppress logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/668660 (owner: 10JMeybohm) [10:58:28] (03CR) 10JMeybohm: [C: 03+2] kubernetes/helmfile_log_sal: Allow to suppress logging to SAL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668660 (owner: 10JMeybohm) [11:09:50] !log Run check table on db2092, db2116, db2145, db2146 (there will be lag) [11:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:23] (03PS3) 10Giuseppe Lavagetto: Add memcached image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 [11:18:05] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add memcached image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 (owner: 10Giuseppe Lavagetto) [11:23:41] 10SRE, 10Traffic: Wikipedia not opening images in any browser except Opera. - https://phabricator.wikimedia.org/T275211 (10Aklapper) @Mrcoolabhishek: Hi, have you received some answer from Tata sky broadband, by any chance? 
[11:26:32] (03PS1) 10Marostegui: parsercache.my.cnf: innodb_change_buffering = none [puppet] - 10https://gerrit.wikimedia.org/r/668669 (https://phabricator.wikimedia.org/T263443) [11:26:59] (03CR) 10Marostegui: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/668669 (https://phabricator.wikimedia.org/T263443) (owner: 10Marostegui) [11:28:21] !log Temporarily set innodb_change_buffering = none on db1134 (s1) - T263443 [11:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:36] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 [11:48:18] PROBLEM - puppet last run on cuminunpriv1001 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: jmm, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:56:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:58:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:15:48] (03PS1) 10Jbond: P:pki::root_ca: correct ocsp responder location [puppet] - 10https://gerrit.wikimedia.org/r/668678 [12:15:50] (03PS1) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679 [12:18:40] (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: correct ocsp responder location [puppet] - 10https://gerrit.wikimedia.org/r/668678 (owner: 10Jbond) [12:23:12] (03PS2) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679 [12:24:49] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1035.eqiad.wmnet [12:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:28] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1035.eqiad.wmnet [12:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:41] (03PS3) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679 [12:38:09] (03CR) 10Klausman: [C: 03+1] hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 (owner: 10Elukey) [12:40:31] (03PS4) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679 [12:40:33] (03PS1) 10Jbond: P:pki::root_ca: use standard size key for ocsp [puppet] - 10https://gerrit.wikimedia.org/r/668682 [12:42:26] (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: use standard size key for ocsp [puppet] - 10https://gerrit.wikimedia.org/r/668682 (owner: 10Jbond) [12:47:39] (03PS5) 10Jbond: P:pki::multirooca: add multirootca profile and config [puppet] - 10https://gerrit.wikimedia.org/r/668679 [12:53:51] (03CR) 10Jbond: [C: 03+2] P:pki::multirooca: add multirootca profile 
and config [puppet] - 10https://gerrit.wikimedia.org/r/668679 (owner: 10Jbond) [12:58:17] (03PS1) 10Jbond: P:pki-int: add default nets [puppet] - 10https://gerrit.wikimedia.org/r/668684 [12:59:11] (03CR) 10Jbond: [C: 03+2] P:pki-int: add default nets [puppet] - 10https://gerrit.wikimedia.org/r/668684 (owner: 10Jbond) [13:03:54] (03PS1) 10Jbond: P:pki:multirootca: fix types [puppet] - 10https://gerrit.wikimedia.org/r/668686 [13:05:17] (03CR) 10Jbond: [C: 03+2] P:pki:multirootca: fix types [puppet] - 10https://gerrit.wikimedia.org/r/668686 (owner: 10Jbond) [13:12:00] (03PS1) 10Jbond: hieradata: cloud pki add client config [puppet] - 10https://gerrit.wikimedia.org/r/668688 [13:12:53] (03CR) 10Jbond: [C: 03+2] hieradata: cloud pki add client config [puppet] - 10https://gerrit.wikimedia.org/r/668688 (owner: 10Jbond) [13:20:13] (03PS1) 10Jbond: P:pki: gaurd loading cfssl class [puppet] - 10https://gerrit.wikimedia.org/r/668689 [13:21:30] (03CR) 10Jbond: [C: 03+2] P:pki: gaurd loading cfssl class [puppet] - 10https://gerrit.wikimedia.org/r/668689 (owner: 10Jbond) [13:24:51] !log Check tables on db1134 [13:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:13] (03CR) 10Alexandros Kosiaris: "+1 on premise, makes me wonder whether we need wmfjson for anything after this is merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/668231 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [13:33:07] (03CR) 10Joal: [C: 03+1] hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 (owner: 10Elukey) [13:34:16] (03CR) 10Elukey: [C: 03+2] hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 (owner: 10Elukey) [13:35:06] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:38:16] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:38:23] uh, that's me? [13:38:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'DEpool db1134', diff saved to https://phabricator.wikimedia.org/P14644 and previous config saved to /var/cache/conftool/dbconfig/20210305-133833-marostegui.json [13:38:36] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [13:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:52] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:38:58] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:40:15] (03PS2) 10Arturo Borrero Gonzalez: wmcs-drain-hypervisor: add timeout and retries [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew 
Bogott) [13:42:30] (03PS3) 10Arturo Borrero Gonzalez: wmcs-drain-hypervisor: add timeout and retries [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [13:45:15] (03PS1) 10Jbond: cfssl::ocsp: create an empty response file [puppet] - 10https://gerrit.wikimedia.org/r/668692 [13:47:06] (03CR) 10Jbond: [C: 03+2] cfssl::ocsp: create an empty response file [puppet] - 10https://gerrit.wikimedia.org/r/668692 (owner: 10Jbond) [13:52:06] !log Rebuild some indexes on db2102 [13:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] (03PS1) 10Jbond: PKI: add new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/668694 [13:59:13] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) [13:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:22] (03CR) 10Jbond: [C: 03+2] PKI: add new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/668694 (owner: 10Jbond) [14:02:05] (03CR) 10Ottomata: [C: 03+1] hadoop: bump HDFS Namenode heap size [puppet] - 10https://gerrit.wikimedia.org/r/668659 (owner: 10Elukey) [14:08:54] (03PS1) 10Jbond: pki cloud: add deployment-prep_eqiad1_wikimedia_cloud-key [puppet] - 10https://gerrit.wikimedia.org/r/668696 [14:10:59] (03CR) 10Jbond: [C: 03+2] pki cloud: add deployment-prep_eqiad1_wikimedia_cloud-key [puppet] - 10https://gerrit.wikimedia.org/r/668696 (owner: 10Jbond) [14:16:06] (03CR) 10Ottomata: "Installing even just r-base causes the conda env (can I start calling this cenv for short? :p ) from 472M to 1.2G. 
It looks like the r-ba" [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668425 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:16:43] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Don't install a copy of R in a stacked user conda env [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668425 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:18:37] (03PS2) 10Ottomata: Add activate.d and deactivate.d env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668566 (https://phabricator.wikimedia.org/T272313) [14:20:35] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] (03CR) 10Ottomata: [C: 03+2] Add activate.d and deactivate.d env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668566 (https://phabricator.wikimedia.org/T272313) (owner: 10Ottomata) [14:21:00] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add activate.d and deactivate.d env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668566 (https://phabricator.wikimedia.org/T272313) (owner: 10Ottomata) [14:23:10] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:51] * Urbanecm stagging on mwdebug1001 [14:24:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:36:46] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [14:38:54] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:39:22] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:42:14] (03PS1) 10Jbond: pki: update to use include [puppet] - 10https://gerrit.wikimedia.org/r/668700 [14:42:37] (03PS1) 10Giuseppe Lavagetto: profile::etcd::v3: use puppet certs for standalone cluster [puppet] - 10https://gerrit.wikimedia.org/r/668701 [14:44:36] (03PS1) 10Jbond: pki: add a second test intermediate [puppet] - 10https://gerrit.wikimedia.org/r/668704 [14:44:42] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) helm test annotations changed a bit: > Note that until Helm v3, the job definition needed to contain one of these helm test hook annotations: `helm.sh/hook: test-success` or `... 
[14:45:50] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28393/console" [puppet] - 10https://gerrit.wikimedia.org/r/668701 (owner: 10Giuseppe Lavagetto) [14:46:43] (03CR) 10Jbond: [C: 03+2] pki: update to use include [puppet] - 10https://gerrit.wikimedia.org/r/668700 (owner: 10Jbond) [14:48:43] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28394/console" [puppet] - 10https://gerrit.wikimedia.org/r/668701 (owner: 10Giuseppe Lavagetto) [14:48:52] (03CR) 10Jbond: [C: 03+2] pki: add a second test intermediate [puppet] - 10https://gerrit.wikimedia.org/r/668704 (owner: 10Jbond) [14:49:44] RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 1876701 MB (23% inode=80%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [15:00:18] (03PS1) 10Zabe: Enable flood flag on hrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668708 (https://phabricator.wikimedia.org/T276560) [15:00:26] (03PS2) 10Giuseppe Lavagetto: profile::etcd::v3: use puppet certs for standalone cluster [puppet] - 10https://gerrit.wikimedia.org/r/668701 [15:02:05] (03CR) 10Andrew Bogott: wmcs-drain-hypervisor: add timeout and retries (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [15:07:30] !log drain + reimage analytics1073 and an-worker1086 to Debian Buster [15:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:21:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:49] (03PS1) 10Andrew Bogott: nova.conf: Adjust live migration timing values [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344) [15:27:22] (03PS1) 10Mholloway: WikimediaEvents: Create data QA group/right on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668716 (https://phabricator.wikimedia.org/T276515) [15:29:48] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:52] RECOVERY - HP RAID on ms-be1032 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:35:24] (03CR) 10Mholloway: [C: 04-2] "Hold until Monday 3/8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668716 (https://phabricator.wikimedia.org/T276515) (owner: 10Mholloway) [15:37:09] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1073.eqiad.wmnet with reason: REIMAGE [15:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1073.eqiad.wmnet with reason: REIMAGE [15:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:35] 
(03PS4) 10Ottomata: Jupyter - never use webproxy for *.wmnet URLs and make Java use system cacerts [puppet] - 10https://gerrit.wikimedia.org/r/668466 (https://phabricator.wikimedia.org/T224658) [15:42:23] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668708 (https://phabricator.wikimedia.org/T276560) (owner: 10Zabe) [15:43:48] (03CR) 10Ottomata: [C: 03+2] Jupyter - never use webproxy for *.wmnet URLs and make Java use system cacerts [puppet] - 10https://gerrit.wikimedia.org/r/668466 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:47:56] (03PS3) 10Ottomata: Spark JVMs inherit system http settings [puppet] - 10https://gerrit.wikimedia.org/r/668485 (https://phabricator.wikimedia.org/T224658) [15:48:52] (03PS11) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [15:48:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs-drain-hypervisor: add timeout and retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [15:49:26] (03CR) 10Ottomata: [C: 03+2] Spark JVMs inherit system http settings [puppet] - 10https://gerrit.wikimedia.org/r/668485 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:50:32] (03CR) 10Bstorm: [C: 03+2] shared-storage: enable project NFS for wikipathways [puppet] - 10https://gerrit.wikimedia.org/r/668234 (https://phabricator.wikimedia.org/T276141) (owner: 10Bstorm) [15:52:24] (03CR) 10Andrew Bogott: wmcs-drain-hypervisor: add timeout and retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [15:52:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [15:54:15] (03PS2) 
10Andrew Bogott: nova.conf: Adjust live migration timing values [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344) [15:54:19] (03CR) 10Andrew Bogott: nova.conf: Adjust live migration timing values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [15:54:52] (03PS1) 10Ottomata: otto/.bash_aliases - set no_proxy too [puppet] - 10https://gerrit.wikimedia.org/r/668720 [15:55:47] (03CR) 10Ottomata: [C: 03+2] otto/.bash_aliases - set no_proxy too [puppet] - 10https://gerrit.wikimedia.org/r/668720 (owner: 10Ottomata) [15:56:27] !log stop mariadb on labsdb1012 to reimage and rename to clouddb1021: T269211 [15:56:32] (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: Adjust live migration timing values [puppet] - 10https://gerrit.wikimedia.org/r/668715 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [15:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:34] T269211: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 [15:56:34] (03CR) 10Elukey: [C: 03+1] wikireplicas: Add basic configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [16:03:44] (03PS1) 10Jbond: P:pki: server bundles from web server [puppet] - 10https://gerrit.wikimedia.org/r/668722 [16:07:20] (03PS2) 10Jbond: P:pki: server bundles from web server [puppet] - 10https://gerrit.wikimedia.org/r/668722 [16:07:40] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission for hosts labsdb1012.eqiad.wmnet [16:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28395/console" [puppet] - 10https://gerrit.wikimedia.org/r/668722 (owner: 10Jbond) 
[16:09:36] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1086.eqiad.wmnet with reason: REIMAGE [16:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:17] (03CR) 10Razzi: [C: 03+2] Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [16:11:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1086.eqiad.wmnet with reason: REIMAGE [16:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki: server bundles from web server [puppet] - 10https://gerrit.wikimedia.org/r/668722 (owner: 10Jbond) [16:13:57] (03PS1) 10Klausman: ml-ctrl: Add dummy keys for ML k8s control plane [labs/private] - 10https://gerrit.wikimedia.org/r/668723 (https://phabricator.wikimedia.org/T272918) [16:14:48] (03PS1) 10Jbond: P:pki::multirootca: correct typo wmflib::dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/668724 [16:16:42] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: correct typo wmflib::dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/668724 (owner: 10Jbond) [16:16:48] (03CR) 10Klausman: [C: 03+2] ml-ctrl: Add dummy keys for ML k8s control plane [labs/private] - 10https://gerrit.wikimedia.org/r/668723 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:17:02] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-ctrl: Add dummy keys for ML k8s control plane [labs/private] - 10https://gerrit.wikimedia.org/r/668723 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:17:43] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labsdb1012.eqiad.wmnet [16:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is 
CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:19:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:21:22] (03PS2) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 [16:22:58] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1036.eqiad.wmnet [16:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:19] (03PS3) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 [16:28:54] !log rename https://netbox.wikimedia.org/ipam/ip-addresses/734/ DNS name from labsdb1012.mgmt.eqiad.wmnet to clouddb1021.mgmt.eqiad.wmnet [16:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:47] !log delete non-mgmt interfaces for labsdb1012 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/ [16:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:00] (03PS4) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 [16:36:01] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1036.eqiad.wmnet [16:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:22] (03PS5) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 [16:36:45] RECOVERY - Device not healthy -SMART- on ms-be1032 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1032&var-datasource=eqiad+prometheus/ops [16:37:24] (03PS1) 10Jbond: cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734 [16:38:43] (03CR) 10jerkins-bot: [V: 04-1] cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734 (owner: 10Jbond) [16:39:47] (03PS2) 10Jbond: cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734 [16:40:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28397/console" [puppet] - 10https://gerrit.wikimedia.org/r/668734 (owner: 10Jbond) [16:40:54] (03CR) 10jerkins-bot: [V: 04-1] cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734 (owner: 10Jbond) [16:42:07] (03PS3) 10Jbond: cfssl:cert: provide method to fetch CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/668734 [16:42:33] !depool mw1276 and pool back [16:42:33] for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done [16:42:38] (03CR) 10Ahmon Dancy: [C: 03+1] logspam-watch: redraw when terminal size changes [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar) [16:47:07] what's the cp1053 bit here? 
[16:47:52] (CR) Jbond: [C: +2] cfssl:cert: provide method to fetch CA bundle [puppet] - https://gerrit.wikimedia.org/r/668734 (owner: Jbond)
[16:48:22] !log edit https://netbox.wikimedia.org/dcim/devices/2078/ device name from labsdb1012 to clouddb1021
[16:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:02] I mean, that server doesn't really exist anyways
[16:49:06] (PS4) Ahmon Dancy: env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. [mediawiki-config] - https://gerrit.wikimedia.org/r/667243
[16:49:10] but that's an odd wm-bot line
[16:49:39] (it also seems to know about varnish-be, which doesn't exist anymore either)
[16:50:04] (PS5) Ahmon Dancy: env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. [mediawiki-config] - https://gerrit.wikimedia.org/r/667243
[16:50:31] (CR) Ahmon Dancy: "> > For mediawiki image builds, WMF_DATACENTER will be set to 'eqiad' by the image building script." [mediawiki-config] - https://gerrit.wikimedia.org/r/667243 (owner: Ahmon Dancy)
[16:52:07] bblack: it looks like it's a response to !depool for whatever reason (which I think effie typed by mistake instead of "!log depool")
[16:52:29] ?
[16:52:39] rzl: oh this is like, a bot help message telling us how to depool things? :)
[16:52:45] lol wot?
[16:52:45] looks like it
[16:52:47] 11:42:34 AM !depool mw1276 and pool back
[16:52:56] (forgive my UTC-5)
[16:52:57] 11:52:47 11:42:34 AM !depool mw1276 and pool back
[16:52:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:53:00] hahahah
[16:53:03] !depool foo
[16:53:03] for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done
[16:53:07] well ok :)
[16:53:09] so i wonder what else tis bot does
[16:53:18] !cook dinner
[16:53:24] !help
[16:53:24] want docs? ask for "!wm-bot". all keywords? try "@regsearch .*"
[16:53:25] !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[16:53:26] pfff
[16:53:29] effie: https://meta.wikimedia.org/wiki/Wm-bot
[16:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:51] bd808: I should have guess you were behind this
[16:53:55] aha: https://wm-bot.wmflabs.org/dump/%23wikimedia-operations.htm
[16:53:55] guessed*
[16:54:12] effie: :) Not my bot, but a handy one that is in a lot of wikimedia channels
[16:54:14] http://wm-bot.wmflabs.org/dump/%23wikimedia-operations.htm
[16:54:14] @info
[16:54:35] bd808: yeah I was talking about its secret hidden features
[16:54:44] !log depool mw1276 and pool back
[16:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:58:09] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:00:38] (03PS11) 10Razzi: wikireplicas: Add basic configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) [17:01:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:03:52] (03CR) 10Razzi: [C: 03+2] wikireplicas: Add basic configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [17:04:28] (03CR) 10CRusnov: "> Patch Set 1:" [software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [17:05:20] (03PS3) 10Cwhite: logstash: ingest logstash logs as json and convert to ECS [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) [17:06:33] (03PS4) 10Cwhite: logstash: ingest logstash logs as json and convert to ECS [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) [17:07:39] (03CR) 10CRusnov: [C: 03+1] "looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/668505 (https://phabricator.wikimedia.org/T274689) (owner: 10Volans) [17:16:29] (03PS2) 10Razzi: wikireplicas: give analytics_multiinstance role to clouddb1021 [puppet] - 
10https://gerrit.wikimedia.org/r/668494 (https://phabricator.wikimedia.org/T269211) [17:17:29] (03PS1) 10Jbond: P:pki::client: add bunldes source [puppet] - 10https://gerrit.wikimedia.org/r/668744 [17:18:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28401/console" [puppet] - 10https://gerrit.wikimedia.org/r/668744 (owner: 10Jbond) [17:19:40] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: add bunldes source [puppet] - 10https://gerrit.wikimedia.org/r/668744 (owner: 10Jbond) [17:24:07] (03PS1) 10Elukey: ssh-client-config: use wmcloud bastion [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/668746 [17:24:51] (03CR) 10Elukey: "No idea if it is the right one or not, lemme know :)" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/668746 (owner: 10Elukey) [17:27:44] (03CR) 10Cwhite: "Testing in pontoon surfaced a couple bugs now fixed in the latest patchset. Barring any additional concerns, this will go out next week." [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) (owner: 10Cwhite) [17:35:22] (03CR) 10Dzahn: phabricator::tools: replace cron jobs with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:45:38] ottomata: yea, it makes sense to me. or let's say.. 
at least I am not aware of a reason why direct connections between .wmnet hosts would have to go via proxies, but also I never ran into the need to actively disable it
[17:46:59] individual users can puppetize their .profile as needed via admin/files/home
[17:48:57] syntax seems right, domain extensions or full IP, afaict
[17:49:33] (PS1) Ottomata: eventgate-analytics-external - Bump replicas to 6 for increase in mediawiki.client.session_tick [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502)
[17:51:56] (CR) Ottomata: "Alex & Janis, can we do this? Product would like to increase sampling rate for session_tick events on all wikis, up to around 2.5K per se" [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: Ottomata)
[17:55:10] ok gr8 thanks mutante
[17:55:15] (PS1) Majavah: betacluster: switch etcd to deployment-etcd02 [mediawiki-config] - https://gerrit.wikimedia.org/r/668751 (https://phabricator.wikimedia.org/T276462)
[17:55:22] yep, np
[17:57:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:02:20] (CR) Brennen Bearnes: "Minor convenience change here, one line and already tested by a few regular users." [puppet] - https://gerrit.wikimedia.org/r/668172 (owner: Hashar)
[18:05:34] (CR) Herron: [C: +2] elk: send icinga events to a separate partition/index [puppet] - https://gerrit.wikimedia.org/r/667917 (owner: Herron)
[18:08:18] (CR) CRusnov: "This change is ready for review."
[puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [18:14:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:18:33] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [18:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:23:44] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:59] (03PS2) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) [18:25:24] (03PS1) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [18:25:52] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:26:12] (03CR) 10jerkins-bot: [V: 04-1] netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [18:29:05] (03PS2) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [18:29:33] (03CR) 
10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:32:12] (03PS3) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) [18:33:07] (03PS3) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [18:33:34] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:35:56] (03PS4) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [18:36:22] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:37:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:37:27] (03PS5) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [18:37:54] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:39:38] RECOVERY - Prometheus 
jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:40:06] (03PS4) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) [18:40:10] (03PS6) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [18:40:41] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:42:19] (03PS7) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [18:43:07] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1021.eqiad.wmnet with reason: REIMAGE [18:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:30] Krinkle: do you still need a restart of something on doc1001? 
[18:45:02] mutante: nope, all good now [18:45:02] mutante: r.zl took care of it I believe [18:45:13] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1021.eqiad.wmnet with reason: REIMAGE [18:45:17] https://phabricator.wikimedia.org/T275468 [18:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:24] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts registry2004.eqiad.wmnet [18:45:26] mutante: we do still have the inability to deploy [18:45:28] but it's not urgent [18:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:12] Krinkle: ok, ack [18:47:19] wasnt here yesterday [18:50:32] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry2004.eqiad.wmnet [18:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:45] (03PS5) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) [19:04:36] !log phab1001 - running public_task_dump.py (from cron job) manually [19:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:03] (03CR) 10Legoktm: [C: 03+2] "Beta cluster-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668751 (https://phabricator.wikimedia.org/T276462) (owner: 10Majavah) [19:06:58] (03Merged) 10jenkins-bot: betacluster: switch etcd to deployment-etcd02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668751 (https://phabricator.wikimedia.org/T276462) (owner: 10Majavah) [19:07:24] (03CR) 10RLazarus: [C: 03+2] logspam-watch: redraw when terminal size changes [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar) [19:07:37] (03CR) 10Ladsgroup: phabricator::tools: replace cron jobs with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666979 
(https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[19:11:15] (CR) Dzahn: "currently running the command manually .. and YES.. it does output a lot. it's full of warnings and Tracebacks ..hrmm" [puppet] - https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[19:13:44] Amir1: the reason that phab task dump script is so .."chatty" is "ProgrammingError: not all arguments converted during string formatting
[19:14:22] !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host registry2004.codfw.wmnet
[19:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:40] *headdesks*
[19:18:04] I assume it'll be small if you compress them?
[19:18:18] * Amir1 pretends to know about compression
[19:18:40] sit in your suitcase until the zip closes Amir1 ?
[19:18:58] :)
[19:19:15] I don't mind about myself, I just don't want to kill phab1001 :D
[19:20:26] Amir1: yea, same here. just want to get rid of cron
[19:20:31] (CR) CRusnov: "Compiler output looks satisfactory:" [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[19:21:10] Amir1: seems easier to fix it within the dump script itself ..instead of adding new parameters to systemd classes
[19:21:19] fixing it wouldn't be hard, is there a way for me to reproduce it locally somewhere?
[19:23:20] Amir1: I wouldn't know how.. unless phab db was sanitized and replicated to cloud ..
:( [19:23:39] but you could change the python code to stay silent [19:23:47] about errors in general [19:30:05] !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2004.codfw.wmnet [19:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:04] (03PS1) 10Legoktm: install_server: Add registry2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668763 (https://phabricator.wikimedia.org/T276381) [19:35:13] (03CR) 10Legoktm: [C: 03+2] install_server: Add registry2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668763 (https://phabricator.wikimedia.org/T276381) (owner: 10Legoktm) [19:42:03] (03PS1) 10Ottomata: Remove overrides for EL migration for Growth team schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668764 (https://phabricator.wikimedia.org/T267333) [19:50:38] (03PS1) 10Ottomata: Remove overrides for EL migration for WMDE Technical Wishes schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668766 (https://phabricator.wikimedia.org/T275005) [19:51:20] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-drain-hypervisor: add timeout and retries [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) (owner: 10Andrew Bogott) [19:56:16] (03PS2) 10Dzahn: site: remove deploy1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/668168 (https://phabricator.wikimedia.org/T275831) [19:56:48] (03PS1) 10Ottomata: Migrate PrefUpdate to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668769 (https://phabricator.wikimedia.org/T267348) [20:02:06] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2004.codfw.wmnet [20:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:02] (03PS1) 10Legoktm: conftool: Add registry2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668771 (https://phabricator.wikimedia.org/T276381) [20:03:34] (03CR) 
10Legoktm: [C: 03+2] conftool: Add registry2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668771 (https://phabricator.wikimedia.org/T276381) (owner: 10Legoktm) [20:04:15] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2004.codfw.wmnet [20:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:57] !log legoktm@deploy1002 conftool action : set/weight=10; selector: name=registry2004.codfw.wmnet [20:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:06] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: name=registry2004.codfw.wmnet [20:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:00] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2001.codfw.wmnet [20:15:04] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2002.codfw.wmnet [20:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:28:41] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28406/" [puppet] - 10https://gerrit.wikimedia.org/r/668168 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [20:29:54] (03PS1) 10Andrew Bogott: Keystone: remove check_keystone_projects.py [puppet] - 10https://gerrit.wikimedia.org/r/668774 (https://phabricator.wikimedia.org/T274385) [20:29:56] (03PS1) 10Andrew Bogott: Keystone: remove some wkfkeystonehooks config flags [puppet] - 10https://gerrit.wikimedia.org/r/668775 (https://phabricator.wikimedia.org/T274385) [20:33:41] (03PS1) 10RobH: 
kafka-logging updates [puppet] - 10https://gerrit.wikimedia.org/r/668776 (https://phabricator.wikimedia.org/T273778) [20:33:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:05] (03PS2) 10RobH: kafka-logging updates [puppet] - 10https://gerrit.wikimedia.org/r/668776 (https://phabricator.wikimedia.org/T273778) [20:34:26] (03CR) 10RobH: [C: 03+2] kafka-logging updates [puppet] - 10https://gerrit.wikimedia.org/r/668776 (https://phabricator.wikimedia.org/T273778) (owner: 10RobH) [20:34:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts deploy1001.eqiad.wmnet [20:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:14] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=syslog file=nel-kafkacat.prom instance=centrallog1001 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [20:48:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts deploy1001.eqiad.wmnet [20:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:28] !log updated udplog to 1.9 on mwlog1002.eqiad.wmnet and mwlog2002.codfw.wmnet [21:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:34] (03PS1) 10Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) [21:21:33] (03PS2) 10Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) [21:23:49] (03PS3) 10Dzahn: 
mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) [21:26:33] (03PS4) 10Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) [21:28:47] (03CR) 10Dzahn: "git checkout dc5b21b83c5d1e3caf" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [21:30:21] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) [21:31:04] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) [21:32:03] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) [21:32:13] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [21:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:16] (03PS2) 10Andrew Bogott: Keystone: remove check_keystone_projects.py [puppet] - 10https://gerrit.wikimedia.org/r/668774 (https://phabricator.wikimedia.org/T274385) [21:34:18] (03PS2) 10Andrew Bogott: Keystone: remove some wkfkeystonehooks config flags [puppet] - 10https://gerrit.wikimedia.org/r/668775 (https://phabricator.wikimedia.org/T274385) [21:34:20] (03PS1) 10Andrew Bogott: cloud-vps rename 'observer' role to 'reader' [puppet] - 10https://gerrit.wikimedia.org/r/668789 (https://phabricator.wikimedia.org/T276018) [21:34:37] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) [21:35:05] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: remove check_keystone_projects.py [puppet] - 10https://gerrit.wikimedia.org/r/668774 (https://phabricator.wikimedia.org/T274385) (owner: 10Andrew Bogott) [21:35:31] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps rename 'observer' role to 'reader' 
[puppet] - 10https://gerrit.wikimedia.org/r/668789 (https://phabricator.wikimedia.org/T276018) (owner: 10Andrew Bogott) [21:36:01] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:13] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: remove some wkfkeystonehooks config flags [puppet] - 10https://gerrit.wikimedia.org/r/668775 (https://phabricator.wikimedia.org/T274385) (owner: 10Andrew Bogott) [21:36:43] (03PS3) 10Andrew Bogott: Keystone: remove some wkfkeystonehooks config flags [puppet] - 10https://gerrit.wikimedia.org/r/668775 (https://phabricator.wikimedia.org/T274385) [21:37:06] 10SRE, 10vm-requests, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) p:05High→03Medium [21:38:19] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) [21:38:31] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) [21:42:06] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: forward external traffic to gitlab VMs (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) I think this ticket now just turned into "open port 80/443 to the public" but it seems best t... 
[21:42:47] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) [21:43:27] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) 05Open→03Stalled [21:43:31] 10SRE, 10DNS, 10Traffic, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) [21:57:15] (03CR) 10Phamhi: [C: 03+2] wikireplica: depool clouddb1016 [puppet] - 10https://gerrit.wikimedia.org/r/668563 (owner: 10Phamhi) [22:06:22] (03PS1) 10Dzahn: icinga/releases: exclude /run/docker from disk space checks, avoid alerts [puppet] - 10https://gerrit.wikimedia.org/r/668798 [22:09:16] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [22:10:30] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) [22:11:36] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [22:12:49] (03CR) 10Dzahn: [C: 03+2] icinga/releases: exclude /run/docker from disk space checks, avoid alerts [puppet] - 10https://gerrit.wikimedia.org/r/668798 (owner: 10Dzahn) [22:12:55] (03PS2) 10Dzahn: icinga/releases: exclude /run/docker from disk space checks, avoid alerts [puppet] - 10https://gerrit.wikimedia.org/r/668798 [22:14:48] (03PS1) 10Phamhi: Revert "wikireplica: depool clouddb1016" [puppet] - 10https://gerrit.wikimedia.org/r/668806 [22:15:43] (03CR) 10Phamhi: [C: 03+2] Revert "wikireplica: depool clouddb1016" [puppet] - 10https://gerrit.wikimedia.org/r/668806 (owner: 10Phamhi) [22:16:54] RECOVERY - 
Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [22:17:46] (03CR) 10Dzahn: "<+icinga-wm> RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space" [puppet] - 10https://gerrit.wikimedia.org/r/668798 (owner: 10Dzahn) [22:24:35] (03PS1) 10Phamhi: wikireplica: depool clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/668803 [22:30:15] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10thcipriani) >>! In T276144#6887868, @Dzahn wrote: > I think this ticket now just turned int... [22:42:19] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Sergey.Trofimovsky.SF) From our perspective Gitlab-managed auto-renewed Let's Encrypt certi... [22:43:06] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) > much of production (i.e., phab) is now using envoy for termination and meanwhile G... [22:45:43] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) >>! In T276144#6887970, @Sergey.Trofimovsky.SF wrote: > From our perspective Gitlab-... 
[22:49:06] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) @Sergey.Trofimovsky.SF Do you agree we should keep the firewall closed until you say... [22:53:20] (03PS1) 10Legoktm: aptrepo: Add "pygments" component [puppet] - 10https://gerrit.wikimedia.org/r/668831 (https://phabricator.wikimedia.org/T276298) [22:54:17] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28407/console" [puppet] - 10https://gerrit.wikimedia.org/r/668831 (https://phabricator.wikimedia.org/T276298) (owner: 10Legoktm) [22:55:25] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28408/console" [puppet] - 10https://gerrit.wikimedia.org/r/668831 (https://phabricator.wikimedia.org/T276298) (owner: 10Legoktm) [23:00:09] (03CR) 10Legoktm: [V: 03+1 C: 03+2] aptrepo: Add "pygments" component [puppet] - 10https://gerrit.wikimedia.org/r/668831 (https://phabricator.wikimedia.org/T276298) (owner: 10Legoktm) [23:07:22] 10SRE: Disable man-db in pbuilder in package_builder on deneb - https://phabricator.wikimedia.org/T276632 (10Legoktm) [23:16:22] !log imported pygments 2.8.0+dfsg-1 to apt.wm.o buster-wikimedia component/pygments (T276298) [23:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:29] T276298: Package latest python3-pygments for apt.wikimedia.org - https://phabricator.wikimedia.org/T276298 [23:18:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:20:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics 
within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:38:50] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10RobH) [23:39:00] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10RobH) [23:39:03] (03PS1) 10Bstorm: wikireplicas: improve filtering on the actor_user view [puppet] - 10https://gerrit.wikimedia.org/r/668843 (https://phabricator.wikimedia.org/T276628) [23:40:38] (03PS1) 10Legoktm: codesearch: Add port for shouthow [puppet] - 10https://gerrit.wikimedia.org/r/668844 (https://phabricator.wikimedia.org/T253597) [23:42:16] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28409/console" [puppet] - 10https://gerrit.wikimedia.org/r/668844 (https://phabricator.wikimedia.org/T253597) (owner: 10Legoktm) [23:42:45] (03CR) 10Legoktm: [V: 03+1 C: 03+2] codesearch: Add port for shouthow [puppet] - 10https://gerrit.wikimedia.org/r/668844 (https://phabricator.wikimedia.org/T253597) (owner: 10Legoktm) [23:59:35] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10RobH) a:03Papaul [23:59:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets