[00:02:14] <DannyS712>	 also, anyone remember the context for https://bash.toolforge.org/quip/TNR_cHEBv7KcG9M-Utqd ?
[00:04:25] <mutante>	 hehe
[00:05:21] <mutante>	 https://bash.toolforge.org/quip/AU8FCPz66snAnmqnLHDj
[00:05:45] <Platonides>	 xDD
[00:06:46] <legoktm>	 DannyS712: it's about the top of https://wikitech.wikimedia.org/wiki/How_to_do_a_schema_change
[00:08:04] <Reedy>	 legoktm: Wasn't it when we thought update.php had been run and dropped that wikibase table?
[00:08:38] <Platonides>	 DannyS712: it was on April 6th 2020
[00:08:42] <legoktm>	 Oh I think you're right 
[00:08:53] <wikibugs>	 (03PS9) 10Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656
[00:09:27] <Platonides>	 https://wm-bot.wmflabs.org/logs/%23wikimedia-operations/20200406.txt
[00:09:42] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn)
[00:10:21] <mutante>	 lol, that icinga-wm wall 
[00:12:10] <mutante>	 things you don't want to read: " One of the biggest tables in production has just disappeared." 
[00:14:17] <Platonides>	 *shudders*
[00:17:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2163683760 and 42289 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:23:28] <wikibugs>	 (03PS1) 10Dzahn: parsoid::testing: remove vd_client/server and rt_client/server [puppet] - 10https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906)
[00:24:59] <wikibugs>	 (03CR) 10Dzahn: "I will not merge this during the ongoing test but would tomorrow, before reimaging again." [puppet] - 10https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[00:29:31] <wikibugs>	 (03PS2) 10Dzahn: parsoid::testing: remove vd_client/server and rt_client/server [puppet] - 10https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906)
[00:33:19] <wikibugs>	 (03PS3) 10Dzahn: parsoid::testing: remove vd_client/server and rt_client/server [puppet] - 10https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906)
[00:35:38] <wikibugs>	 (03CR) 10Dzahn: "thanks, currently does not rebase on the parent change that adds data types to puppetmaster.. I'll care about it after that is merged" [puppet] - 10https://gerrit.wikimedia.org/r/635658 (owner: 10Dzahn)
[00:47:08] <wikibugs>	 (03PS2) 10Dzahn: mailman: replace cron with systemd timer and move to profile [puppet] - 10https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138)
[00:48:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mailman: replace cron with systemd timer and move to profile [puppet] - 10https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[00:48:57] <wikibugs>	 (03PS3) 10Dzahn: mailman: replace cron with systemd timer and move to profile [puppet] - 10https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138)
[00:50:13] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:04] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/26222/" [puppet] - 10https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[00:53:07] <wikibugs>	 (03CR) 10Dzahn: "Should I link all changes to that ticket or is it too many updates and just a topic branch is better? Then it can be a single URL listing " [puppet] - 10https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[00:54:44] <wikibugs>	 (03PS5) 10Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138)
[00:55:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[00:56:39] <wikibugs>	 (03CR) 10Dzahn: "This isn't a real review but to be clear, i LOVE that you created a script for this purpose. That's exactly what we need once the corp LDA" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond)
[01:00:44] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "waiting for the final decision what the target URL should be, stalled" [puppet] - 10https://gerrit.wikimedia.org/r/636755 (https://phabricator.wikimedia.org/T264367) (owner: 10Dzahn)
[01:05:10] <wikibugs>	 (03PS6) 10Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138)
[01:05:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[01:12:00] <wikibugs>	 (03CR) 10Dzahn: "not sure why this  Failed to parse template systemd/systemd.timer.erb:" [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[01:18:17] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[01:19:13] <icinga-wm>	 PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:20:45] <icinga-wm>	 RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:20:49] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[01:21:31] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[01:22:23] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.036 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[01:30:43] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:30:47] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:53] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[02:34:55] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[02:36:35] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:41:24] <wikibugs>	 (03PS1) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[03:45:34] <wikibugs>	 (03PS1) 10Dzahn: admin: add AnneT to deployers [puppet] - 10https://gerrit.wikimedia.org/r/637588 (https://phabricator.wikimedia.org/T266718)
[03:45:42] <wikibugs>	 (03CR) 10Razzi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[03:48:48] <wikibugs>	 (03PS2) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[03:50:27] <wikibugs>	 (03PS3) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[04:02:30] <wikibugs>	 (03PS4) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[04:08:05] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[04:09:41] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[04:20:03] <wikibugs>	 (03PS5) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[04:29:22] <wikibugs>	 (03CR) 10Razzi: "I have added the admin group setting to an-test-coord1001, as we tested on the old test cluster. There are changes in the puppet compiler " [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[05:21:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:22:27] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:26:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:27:03] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:27:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:27:33] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:32:11] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:32:13] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:46] <wikibugs>	 (03PS1) 10Marostegui: orchestrator.conf: Enable Audit messages to syslog. [puppet] - 10https://gerrit.wikimedia.org/r/637596 (https://phabricator.wikimedia.org/T265990)
[06:32:01] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:32:37] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:35:53] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:36:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:17] <wikibugs>	 (03CR) 10Elukey: oozie: Add admin groups for authorization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[06:44:13] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:45:19] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201030T0700)
[07:28:01] <wikibugs>	 (03PS1) 10Elukey: Add role::analytics_test_cluster::hadoop::ui to an-test-ui1001 [puppet] - 10https://gerrit.wikimedia.org/r/637607 (https://phabricator.wikimedia.org/T260411)
[07:31:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add role::analytics_test_cluster::hadoop::ui to an-test-ui1001 [puppet] - 10https://gerrit.wikimedia.org/r/637607 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[07:32:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:14] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:38:24] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:39:53] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:57:17] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:57:19] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:25] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:00:27] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:15] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:53:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:11] <elukey>	 !log decom an-tool1006 (old analytics test vm) - T255139
[08:54:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:17] <stashbot>	 T255139: Create the new Hadoop test cluster - https://phabricator.wikimedia.org/T255139
[08:54:32] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] "LGTM, it even has a trailing , :)" [puppet] - 10https://gerrit.wikimedia.org/r/637596 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui)
[08:55:58] <wikibugs>	 (03PS1) 10Elukey: Decommission an-tool1006 [puppet] - 10https://gerrit.wikimedia.org/r/637638 (https://phabricator.wikimedia.org/T255139)
[08:58:59] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:51] <elukey>	 interesting
[09:05:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Decommission an-tool1006 [puppet] - 10https://gerrit.wikimedia.org/r/637638 (https://phabricator.wikimedia.org/T255139) (owner: 10Elukey)
[09:05:02] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] orchestrator.conf: Enable Audit messages to syslog. [puppet] - 10https://gerrit.wikimedia.org/r/637596 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui)
[09:05:16] <elukey>	 marostegui: go ahead with my change too pls :)
[09:05:20] <marostegui>	 elukey: haha about to ask
[09:05:36] <marostegui>	 elukey: merged
[09:05:39] <elukey>	 <3
[09:07:11] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[09:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:13:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:30] <elukey>	 goood
[09:22:39] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Test charts/deployments for compatibility with k8s 1.19 [deployment-charts] - 10https://gerrit.wikimedia.org/r/636881 (https://phabricator.wikimedia.org/T266032) (owner: 10JMeybohm)
[09:25:20] <wikibugs>	 (03Merged) 10jenkins-bot: Test charts/deployments for compatibility with k8s 1.19 [deployment-charts] - 10https://gerrit.wikimedia.org/r/636881 (https://phabricator.wikimedia.org/T266032) (owner: 10JMeybohm)
[09:41:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn)
[09:42:08] <wikibugs>	 (03PS3) 10Jbond: puppetmaster: add data type for server type and use it [puppet] - 10https://gerrit.wikimedia.org/r/635660 (owner: 10Dzahn)
[09:42:33] <wikibugs>	 (03PS10) 10Jbond: puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn)
[09:43:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add data type for server type and use it [puppet] - 10https://gerrit.wikimedia.org/r/635660 (owner: 10Dzahn)
[09:49:33] <wikibugs>	 (03PS4) 10Jbond: httpd/puppetmaster: add data type for SSLVerifyClient and use it [puppet] - 10https://gerrit.wikimedia.org/r/635658 (owner: 10Dzahn)
[09:51:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/635658 (owner: 10Dzahn)
[09:57:35] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "Change logo of Wikidata for the eighth birthday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637649
[09:58:49] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Change logo of Wikidata for the eighth birthday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637649 (owner: 10Ladsgroup)
[10:00:09] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Change logo of Wikidata for the eighth birthday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637649 (owner: 10Ladsgroup)
[10:02:00] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized static/images/project-logos: Revert: Changing logo of Wikidata for the birthday (duration: 01m 12s)
[10:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:48] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[10:08:49] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:08:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:26] <wikibugs>	 (03PS1) 10Klausman: analytics: Switch stat1005 to use rocm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/637671 (https://phabricator.wikimedia.org/T264408)
[10:21:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] analytics: Switch stat1005 to use rocm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/637671 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman)
[10:21:50] <wikibugs>	 (03PS1) 10Kormat: Initial (re)packaging [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763)
[10:22:53] <wikibugs>	 (03CR) 10Kormat: "This is the initial packaging. It builds in pbuilder in deneb. It does not yet contain the creation of user+group for orchestrator, nor th" [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat)
[10:22:59] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] analytics: Switch stat1005 to use rocm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/637671 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman)
[10:24:21] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] admin: add AnneT to deployers [puppet] - 10https://gerrit.wikimedia.org/r/637588 (https://phabricator.wikimedia.org/T266718) (owner: 10Dzahn)
[10:27:21] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Move db1077 to test-pc1 [puppet] - 10https://gerrit.wikimedia.org/r/637673
[10:29:08] <wikibugs>	 (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1001/26227/" [puppet] - 10https://gerrit.wikimedia.org/r/637673 (owner: 10Marostegui)
[10:39:15] <wikibugs>	 (03PS1) 10Klausman: aptrepo: add more rocm 3.8 dependencies [puppet] - 10https://gerrit.wikimedia.org/r/637676 (https://phabricator.wikimedia.org/T264408)
[10:40:15] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] aptrepo: add more rocm 3.8 dependencies [puppet] - 10https://gerrit.wikimedia.org/r/637676 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman)
[10:41:02] <wikibugs>	 (03CR) 10Kormat: "Hmm. These `test-` sections are a bit problematic. `check-cumin-aliases` complains when they are unused. I'll send a CR to not generate al" [puppet] - 10https://gerrit.wikimedia.org/r/637673 (owner: 10Marostegui)
[10:41:35] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] mariadb: Set both db_inventory nodes read-write [puppet] - 10https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: 10Kormat)
[10:42:11] <wikibugs>	 (03CR) 10Marostegui: "I can totally abandon this patch, this isn't really needed. It was just more for "informative" issues, but the host can replicate from pc1" [puppet] - 10https://gerrit.wikimedia.org/r/637673 (owner: 10Marostegui)
[10:42:37] <wikibugs>	 (03CR) 10Jcrespo: "If you wanted, it can be changed on zarcillo too, but it is not a big deal except for the mysql aggregated grouping." [puppet] - 10https://gerrit.wikimedia.org/r/637673 (owner: 10Marostegui)
[10:42:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Enable report_host [puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) (owner: 10Kormat)
[10:43:22] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] mariadb: Enable report_host [puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) (owner: 10Kormat)
[10:50:14] <wikibugs>	 (03PS1) 10Kormat: cumin: Exclude test-* mariadb sections. [puppet] - 10https://gerrit.wikimedia.org/r/637678
[10:50:57] <wikibugs>	 (03CR) 10Ema: [C: 03+2] admin: add AnneT to deployers [puppet] - 10https://gerrit.wikimedia.org/r/637588 (https://phabricator.wikimedia.org/T266718) (owner: 10Dzahn)
[10:51:41] <wikibugs>	 (03PS2) 10Kormat: cumin: Exclude test-* mariadb sections. [puppet] - 10https://gerrit.wikimedia.org/r/637678
[10:54:15] <wikibugs>	 (03CR) 10Kormat: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/26230/" [puppet] - 10https://gerrit.wikimedia.org/r/637678 (owner: 10Kormat)
[10:56:02] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] cumin: Exclude test-* mariadb sections. [puppet] - 10https://gerrit.wikimedia.org/r/637678 (owner: 10Kormat)
[10:56:15] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] cumin: Exclude test-* mariadb sections. [puppet] - 10https://gerrit.wikimedia.org/r/637678 (owner: 10Kormat)
[10:57:05] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] mariadb: Move db1077 to test-pc1 [puppet] - 10https://gerrit.wikimedia.org/r/637673 (owner: 10Marostegui)
[10:57:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1077 to test-pc1 [puppet] - 10https://gerrit.wikimedia.org/r/637673 (owner: 10Marostegui)
[11:08:01] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: turn bash scripts without bashisms into sh scripts [puppet] - 10https://gerrit.wikimedia.org/r/631891 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn)
[11:12:05] <icinga-wm>	 RECOVERY - MariaDB read only db_inventory on db2093 is OK: Version 10.4.15-MariaDB-log, Uptime 777041s, read_only: False, event_scheduler: False, 68.03 QPS, connection latency: 0.003045s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:12:16] <marostegui>	 ^ \o/ kormat 
[11:12:29] <kormat>	 🤘
[11:15:26] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324)
[11:15:28] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324)
[11:24:47] <wikibugs>	 (03PS1) 10Klausman: amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - 10https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408)
[11:25:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - 10https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman)
[11:28:54] <wikibugs>	 (03PS2) 10Klausman: amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - 10https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408)
[11:29:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - 10https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman)
[11:31:27] <wikibugs>	 (03PS3) 10Klausman: amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - 10https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408)
[11:32:58] <wikibugs>	 (03PS4) 10Klausman: amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - 10https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408)
[11:35:24] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - 10https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman)
[11:37:47] <wikibugs>	 (03PS1) 10Kormat: debian: add user/group + systemd service [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763)
[11:44:21] <wikibugs>	 (03PS1) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[11:47:17] <wikibugs>	 (03CR) 10Kormat: "Depends on https://gerrit.wikimedia.org/r/c/operations/debs/orchestrator/+/637683 getting merged, and packages built+uploaded." [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat)
[11:48:17] <wikibugs>	 (03CR) 10Kormat: "Ah, i added you slightly too soon, @marostegui. I need to futz with hiera a bit to test this in pcc." [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat)
[11:58:59] <wikibugs>	 (03PS2) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[11:59:20] <wikibugs>	 (03PS3) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:01:25] <wikibugs>	 (03PS4) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:06:51] <wikibugs>	 (03CR) 10Marostegui: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat)
[12:23:30] <wikibugs>	 (03PS5) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:25:30] <wikibugs>	 (03PS6) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:26:55] <wikibugs>	 (03PS7) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:27:49] <wikibugs>	 (03CR) 10Kormat: "Ready for review now :)" [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat)
[12:29:18] <XioNoX>	 !log set normal VRRP balancing on cr2-eqiad
[12:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:44] <wikibugs>	 (03CR) 10Marostegui: "Will Puppet or sqlite itself take care of creating /var/lib/orchestrator/?" [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat)
[12:59:47] <wikibugs>	 (03CR) 10Kormat: [C: 04-2] "Blocking until orchestrator packaging is published." [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat)
[13:06:42] <wikibugs>	 (03PS2) 10Kormat: debian: add user/group + systemd service [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763)
[13:18:27] <wikibugs>	 (03PS8) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[13:18:29] <wikibugs>	 (03PS1) 10Kormat: orchestrator: Support running as non-root [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763)
[13:19:08] <wikibugs>	 (03CR) 10Kormat: [C: 04-2] "-2 until new packages released." [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat)
[13:53:18] <wikibugs>	 (03PS10) 10ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[13:53:50] <wikibugs>	 (03PS1) 10Kormat: mariadb: Drop old mysql and percona my.cnf templates. [puppet] - 10https://gerrit.wikimedia.org/r/637699
[14:10:21] <wikibugs>	 (03PS1) 10Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635)
[14:14:16] <cmjohnson1>	 !log moving mw1267 and mw1268 to rack A8 eqiad T266164
[14:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:24] <stashbot>	 T266164: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164
[14:20:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: Stop memory reporting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637557 (owner: 10Ladsgroup)
[14:20:31] <icinga-wm>	 PROBLEM - Host mw1267.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:32] <icinga-wm>	 PROBLEM - Host mw1268.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:21:46] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] api-gateway: use envoy 1.16.0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/637431 (owner: 10Hnowlan)
[14:24:37] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps instance backups: ignore clouddb-services project [puppet] - 10https://gerrit.wikimedia.org/r/637704 (https://phabricator.wikimedia.org/T260692)
[14:27:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps instance backups: ignore clouddb-services project [puppet] - 10https://gerrit.wikimedia.org/r/637704 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[14:28:53] <icinga-wm>	 RECOVERY - Host mw1268.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms
[14:30:55] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "🚀" [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli)
[14:42:48] <elukey>	 !log stop kafka-jumbo1006 to swap NICs (1g -> 10g, d1 -> d4 rack)
[14:42:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:46] <wikibugs>	 (03PS2) 10Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635)
[14:52:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: 10Kormat)
[14:52:57] <icinga-wm>	 PROBLEM - Host kafka-jumbo1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:53:06] <elukey>	 this is expected --^
[14:53:24] <wikibugs>	 (03PS3) 10Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635)
[14:58:13] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 97 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[14:58:28] <elukey>	 yep yep expected
[14:58:29] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 106 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[14:58:33] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 98 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[14:58:41] <elukey>	 sorry for the spam
[14:59:09] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 72 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[14:59:15] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 98 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[15:04:20] <wikibugs>	 (03CR) 10Marostegui: "Not fully sure about dropping these. We might want to keep them around as we _might_ be getting either mysql or percona serving as a slave" [puppet] - 10https://gerrit.wikimedia.org/r/637699 (owner: 10Kormat)
[15:04:29] <icinga-wm>	 RECOVERY - Host kafka-jumbo1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[15:05:25] <wikibugs>	 (03CR) 10Kormat: "PCC from hell (but looks good): https://puppet-compiler.wmflabs.org/compiler1002/26236/" [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: 10Kormat)
[15:09:37] <rzl>	 !log downtiming mc2036 for buster reimage
[15:09:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:00] <wikibugs>	 (03PS1) 10Effie Mouzeli: hiera: remove shard18 from redis.yaml [puppet] - 10https://gerrit.wikimedia.org/r/637708 (https://phabricator.wikimedia.org/T252391)
[15:11:29] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.hosts.downtime
[15:11:30] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:57] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] hiera: remove shard18 from redis.yaml [puppet] - 10https://gerrit.wikimedia.org/r/637708 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli)
[15:19:08] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "I like this as a step in the right direction, but I'd go further and merge the sections." [puppet] - 10https://gerrit.wikimedia.org/r/637577 (owner: 10Dzahn)
[15:19:10] <wikibugs>	 (03PS1) 10Cmjohnson: adding new mac address for update 10G nic kafka-jumbo1006 [puppet] - 10https://gerrit.wikimedia.org/r/637711 (https://phabricator.wikimedia.org/T236327)
[15:19:12] <wikibugs>	 (03PS4) 10Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635)
[15:20:11] <wikibugs>	 (03Abandoned) 10Kormat: mariadb: Drop old mysql and percona my.cnf templates. [puppet] - 10https://gerrit.wikimedia.org/r/637699 (owner: 10Kormat)
[15:26:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Add apache httpd base image (038 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto)
[15:27:02] <cdanis>	 is wikibugs not posting from phab?
[15:27:56] <Reedy>	 looks a bit quiet
[15:28:11] <Reedy>	 probably needs a service reboot
[15:29:21] <ema>	 looks like it hasn't worked for ~1 day?
[15:29:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Let's deploy on Monday? I have deployed this manually on pc1 and it worked, but as we are touching many files...and we are not in a rush.." [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: 10Kormat)
[15:30:06] <ema>	 in my log I see "16:58 <+wikibugs> Operations, fundraising-tech-ops: Ensure all disaster recover documentation is in one central location" as the last phab-like update
[15:30:22] <marostegui>	 cdanis: yeah, it has been out for a day, I wanted to restart it earlier today but I got buried on other things
[15:30:29] <ema>	 after which:
[15:30:29] <ema>	 17:11 < arturo> we just had a network outage in wmcs
[15:30:30] <wikibugs>	 (03CR) 10Kormat: [C: 04-2] "Delay to monday." [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: 10Kormat)
[15:30:40] <cdanis>	 cool, coolcool
[15:32:38] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] hiera: remove shard18 from redis.yaml [puppet] - 10https://gerrit.wikimedia.org/r/637708 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli)
[15:34:42] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs instance backups: move more projects from cloudvirt1024 to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/637713 (https://phabricator.wikimedia.org/T260692)
[15:36:29] <effie>	 !log stopping puppet on mediawiki and mc* hosts 
[15:36:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs instance backups: move more projects from cloudvirt1024 to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/637713 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[15:41:03] <wikibugs>	 (03PS1) 10Marostegui: orchestrator.conf: Add DetectDataCenterQuery to detect DC [puppet] - 10https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635)
[15:41:36] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Let's wait for https://gerrit.wikimedia.org/r/637702 to be merged first." [puppet] - 10https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635) (owner: 10Marostegui)
[15:42:18] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] orchestrator.conf: Add DetectDataCenterQuery to detect DC [puppet] - 10https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635) (owner: 10Marostegui)
[15:42:23] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Overall pretty good work, various small inline comments" (039 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto)
[15:43:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Add apache httpd base image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto)
[15:58:31] <wikibugs>	 (03CR) 10JMeybohm: Add apache httpd base image (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto)
[16:03:43] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[16:06:55] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[16:07:27] <wikibugs>	 (03PS2) 10Cwhite: Add HTTP request and response headers fields as object[field:keyword] [software/ecs] - 10https://gerrit.wikimedia.org/r/636515
[16:07:29] <wikibugs>	 (03PS2) 10Cwhite: Add CSP Report fields. [software/ecs] - 10https://gerrit.wikimedia.org/r/636516
[16:07:31] <wikibugs>	 (03PS2) 10Cwhite: Enable search slowlog by default for ECS indices. [software/ecs] - 10https://gerrit.wikimedia.org/r/636685
[16:07:33] <wikibugs>	 (03PS1) 10Cwhite: First attempt at a JSONSchema template generator utility. [software/ecs] - 10https://gerrit.wikimedia.org/r/637719
[16:13:17] <wikibugs>	 (03PS2) 10Cwhite: First attempt at a JSONSchema template generator utility. [software/ecs] - 10https://gerrit.wikimedia.org/r/637719
[16:18:57] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Cmjohnson) 05Open→03Resolved New PSU arrived and swapped. System reports healthy.
[16:19:38] <elukey>	 !log kafka-jumbo1006 still running with 1g NIC
[16:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:46] <wikibugs>	 (03PS1) 10Bstorm: toolsdb: Make socket a parameter so new servers might work right [puppet] - 10https://gerrit.wikimedia.org/r/637726 (https://phabricator.wikimedia.org/T266587)
[16:22:24] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10Cmjohnson) 05Open→03Resolved done
[16:24:54] <wikibugs>	 (03PS1) 10Ayounsi: PuppetDB import: don't do empty saves [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/637728 (https://phabricator.wikimedia.org/T266767)
[16:25:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[16:27:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] phabricator: replace require_package with ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[16:28:19] <wikibugs>	 (03PS1) 10Cmjohnson: adding new production IP for frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/637729 (https://phabricator.wikimedia.org/T265086)
[16:29:33] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] adding new production IP for frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/637729 (https://phabricator.wikimedia.org/T265086) (owner: 10Cmjohnson)
[16:29:51] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1006 is CRITICAL: 2.641e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[16:30:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson)
[16:31:29] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) @Jgreen  all the on-site work has been completed.  idrac password is a temporary password
[16:32:06] <wikibugs>	 (03CR) 10Jbond: profile::sre::check_mail: new script for checking user emails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond)
[16:32:09] <wikibugs>	 10Operations, 10Patch-For-Review: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) a:03Jgreen
[16:41:53] <wikibugs>	 (03PS1) 10Dwisehaupt: Point fundraisingdb-read back at frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/637733 (https://phabricator.wikimedia.org/T266815)
[16:41:55] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) We had to rollback the NIC on 1006, we need to install `firmware-bnx2x` on all nodes before doing any work (checked with Fa...
[16:43:11] <wikibugs>	 10Operations, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10akosiaris) Couple of points  > We create a directory on the k8s node that works as a hostpath in all apache containers, and we make apache write its logs there, w...
[16:48:29] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[16:48:47] <wikibugs>	 (03PS1) 10Ayounsi: PuppetDB import, set interface type when renaming ##PRIMARY## [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/637734 (https://phabricator.wikimedia.org/T265340)
[16:48:55] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Point fundraisingdb-read back at frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/637733 (https://phabricator.wikimedia.org/T266815) (owner: 10Dwisehaupt)
[16:51:07] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[16:51:12] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[16:51:55] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[16:54:09] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[16:57:03] <wikibugs>	 (03CR) 10Razzi: oozie: Add admin groups for authorization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[16:58:26] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata Query Builder, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore)
[16:59:20] <wikibugs>	 (03PS6) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[17:01:15] <wikibugs>	 (03CR) 10Bstorm: toolsdb: Make socket a parameter so new servers might work right (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637726 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm)
[17:05:31] <wikibugs>	 (03PS6) 10Effie Mouzeli: Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391)
[17:06:12] <wikibugs>	 (03PS7) 10Effie Mouzeli: Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391)
[17:09:06] <wikibugs>	 10Operations, 10Analytics: Augment NEL reports with a computed timestamp-of-generation - https://phabricator.wikimedia.org/T266886 (10CDanis)
[17:15:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] parsoid::testing: remove vd_client/server and rt_client/server [puppet] - 10https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[17:16:53] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Scap, 10Datacenter-Switchover, and 3 others: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 (10akosiaris) 05Open→03Resolved This was done, resolving.
[17:18:55] <effie>	 !log enable puppet on all mediawiki and mc* hosts 
[17:18:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:37] <effie>	 !log disable puppet on mc1036 and mc2036 - T252391
[17:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:44] <stashbot>	 T252391: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391
[17:20:44] <wikibugs>	 (03PS1) 10Dzahn: site/parsoid-testing: update comments, apply insetup role [puppet] - 10https://gerrit.wikimedia.org/r/637740 (https://phabricator.wikimedia.org/T257906)
[17:21:03] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1006 is OK: (C)5e+06 ge (W)1e+06 ge 2.464e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[17:21:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site/parsoid-testing: update comments, apply insetup role [puppet] - 10https://gerrit.wikimedia.org/r/637740 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[17:22:31] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[17:22:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:22] <wikibugs>	 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` scandium.eqi...
[17:23:45] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli)
[17:26:29] <wikibugs>	 (03CR) 10Dzahn: Set debian buster for mc2036 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli)
[17:26:46] <mutante>	 effie: the os_version check should be for buster instead of jessie?
[17:27:14] <mutante>	 oh.. i see.. nevermind
[17:27:27] <effie>	 we are using this profile only for this cluster 
[17:27:52] <effie>	 so I don't think there is any need to do anything more than that 
[17:27:57] <wikibugs>	 (03CR) 10Dzahn: Set debian buster for mc2036 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli)
[17:28:33] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 2.084e+04 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:30:10] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:30:48] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli)
[17:31:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, all mc* hosts are jessie except gutter pool and those don't use this role" [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli)
[17:31:32] <mutante>	 effie: I understand now. +1   (also role inheritance is weird :)
[17:31:43] <effie>	 yeah it sucks :)
[17:34:01] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10JMeybohm)
[17:36:14] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[17:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:14] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:32] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` mc2036.codfw.wmnet ` The...
[17:40:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventgate_main_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:41:05] <wikibugs>	 (03PS1) 10Dzahn: mediawiki/memcached: stop using role inheritance [puppet] - 10https://gerrit.wikimedia.org/r/637742
[17:41:37] <mutante>	 effie: seems it's the only role globally (still?) doing that. then ... ^ for some other time 
[17:42:46] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10netops, 10Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Shawn)
[17:42:47] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:43:03] <effie>	 we had a brief issue with the jobqueue
[17:43:37] <effie>	 https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=1604078236373&to=1604079748837
[17:43:42] <wikibugs>	 (03CR) 10Jeena Huneidi: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[17:43:49] <effie>	 just keep an eye
[17:44:26] <mutante>	 the insertion rate did not seem to go up though?  wasn't that just the prometheus side?
[17:44:38] <wikibugs>	 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] `  and were **ALL** successful.
[17:45:08] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] "PCC output is good enough for a puppet check. I'll merge this." [puppet] - 10https://gerrit.wikimedia.org/r/637726 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm)
[17:46:47] <mutante>	 the brief "prometheus ..reduced availability" seems to happen sometimes regardless of which targets are involved
[17:52:05] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10serviceops, 10Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10CDanis) This isn't limited to just esams; it is in fact happening across all cache clusters.  All of my requests took at least 19 seco...
[17:52:26] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10wiki_willy) Latest update - looks like we need to order a minimum of 100 of the Enconnex (because it's customized), so let's scrap that one.  Some additional details I gathered for the Chatswort...
[17:59:17] <wikibugs>	 (03PS1) 10Dzahn: site/parsoid-testing: reapply testing role to scandium [puppet] - 10https://gerrit.wikimedia.org/r/637745 (https://phabricator.wikimedia.org/T257906)
[18:02:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site/parsoid-testing: reapply testing role to scandium [puppet] - 10https://gerrit.wikimedia.org/r/637745 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[18:07:56] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[18:09:54] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1036 site=eqiad tunnel=mc2036_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[18:10:46] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[18:15:52] <Spookreeeno>	 musikanimal: which channel were you in with Jdlrobson?
[18:16:31] <musikanimal>	 they were direct messages
[18:16:47] <Spookreeeno>	 Ah
[18:18:17] <Spookreeeno>	 musikanimal: see pm
[18:24:05] <wikibugs>	 (03CR) 10Hashar: "recheck after CI got configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/636513 (owner: 10Cwhite)
[18:25:38] <wikibugs>	 (03PS1) 10Bstorm: toolsdb: Fix the my.cnf template to include parameter for socket [puppet] - 10https://gerrit.wikimedia.org/r/637751 (https://phabricator.wikimedia.org/T266587)
[18:27:38] <logmsgbot>	 !log hashar@deploy1001 Started deploy [integration/docroot@c35e5e9]: Add ECS to doc.wikimedia.org index
[18:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:45] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [integration/docroot@c35e5e9]: Add ECS to doc.wikimedia.org index (duration: 00m 06s)
[18:27:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:25] <wikibugs>	 (03PS3) 10Jeena Huneidi: linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[18:30:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[18:30:40] <wikibugs>	 (03PS2) 10Bstorm: toolsdb: Fix the my.cnf template to include parameter for socket [puppet] - 10https://gerrit.wikimedia.org/r/637751 (https://phabricator.wikimedia.org/T266587)
[18:31:32] <wikibugs>	 (03CR) 10Hashar: "recheck since CI has been configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/636515 (owner: 10Cwhite)
[18:31:35] <wikibugs>	 (03CR) 10Hashar: "recheck since CI has been configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/636516 (owner: 10Cwhite)
[18:31:37] <wikibugs>	 (03CR) 10Hashar: "recheck since CI has been configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/636685 (owner: 10Cwhite)
[18:31:40] <wikibugs>	 (03CR) 10Hashar: "recheck since CI has been configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite)
[18:31:41] <wikibugs>	 10Operations: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) a:05Jgreen→03None
[18:31:43] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10hashar) The spam above is @colewhite and I setting up CI to automatically generate https://doc.wikimedia.org/ecs/   . From now on, whenever a patch is merge...
[18:32:07] <wikibugs>	 10Operations, 10fundraising-tech-ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen)
[18:33:38] <wikibugs>	 (03CR) 10Bstorm: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26240/console" [puppet] - 10https://gerrit.wikimedia.org/r/637751 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm)
[18:33:41] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolsdb: Fix the my.cnf template to include parameter for socket [puppet] - 10https://gerrit.wikimedia.org/r/637751 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm)
[18:37:00] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10serviceops, 10Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10jijiki) p:05High→03Unbreak!
[18:39:34] <wikibugs>	 (03PS1) 10CDanis: Reduce size of frwiki featuredfeeds to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865)
[18:40:44] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] Reduce size of frwiki featuredfeeds to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) (owner: 10CDanis)
[18:40:55] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Reduce size of frwiki featuredfeeds to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) (owner: 10CDanis)
[18:40:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "the only other exception seems to be lawiki (180) and dewiki (7)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) (owner: 10CDanis)
[18:41:42] <wikibugs>	 (03Merged) 10jenkins-bot: Reduce size of frwiki featuredfeeds to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) (owner: 10CDanis)
[18:45:14] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2036.codfw.wmnet'] `  Of which those **FAILED**: ` ['mc2036.codfw.wmn...
[18:45:50] <cdanis>	 sigh
[18:46:04] <cdanis>	 I wish scap logged at the start of a sync-file run as well
[18:46:19] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10jijiki)
[18:47:58] <cdanis>	 !log ✔️ cdanis@deploy1001.eqiad.wmnet /srv/mediawiki-staging 🕝☕ scap sync-file wmf-config/InitialiseSettings.php 'lower frwiki featured feeds limit 1a41ef634 T266865'
[18:47:58] <logmsgbot>	 !log cdanis@deploy1001 Synchronized wmf-config/InitialiseSettings.php: lower frwiki featured feeds limit 1a41ef634 T266865 (duration: 05m 14s)
[18:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:06] <stashbot>	 T266865: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865
[18:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:23] <cdanis>	 !log the above scap began (and mostly finished) several minutes ago but is hanging on a couple hosts down for maintenance
[18:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:27] <wikibugs>	 (03PS1) 10Jeena Huneidi: Scaffold improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753
[18:52:13] <wikibugs>	 (03CR) 10Jeena Huneidi: Scaffold improvements (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 (owner: 10Jeena Huneidi)
[18:56:00] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10CDanis) 05Open→03Resolved a:03CDanis Approx 23:00 on 28 Oct, the size of the featured feed for frwiki started to become too large to be s...
[18:56:56] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:58:04] <wikibugs>	 (03PS1) 10Bstorm: toolsdb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/637757
[19:00:41] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolsdb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/637757 (owner: 10Bstorm)
[19:01:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1026.eqiad.wmnet ` The log can be found in `/v...
[19:01:48] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:07:02] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Legoktm) >>! In T266865#6592372, @CDanis wrote: > Long ago, frwiki's default feed length (in days) was set to 60, well above the default of 10....
[19:07:57] <wikibugs>	 (03PS1) 10Bstorm: toolsdb: actually use the read_only parameter from the profiles [puppet] - 10https://gerrit.wikimedia.org/r/637763 (https://phabricator.wikimedia.org/T266587)
[19:11:10] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolsdb: actually use the read_only parameter from the profiles [puppet] - 10https://gerrit.wikimedia.org/r/637763 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm)
[19:14:41] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1026.eqiad.wmnet'] `  Of which those **FAILED**: ` ['es1026.eqiad.wmnet'] `
[19:33:43] <wikibugs>	 (03PS1) 10Jcrespo: [QIP] Add second prototype to handle File metadata directly from the db [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/637769 (https://phabricator.wikimedia.org/T264189)
[19:36:03] <wikibugs>	 10Operations, 10fundraising-tech-ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Dwisehaupt)
[19:36:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [QIP] Add second prototype to handle File metadata directly from the db [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/637769 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo)
[19:42:18] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) Further exploration of the existing metadata has been done a...
[19:46:23] <wikibugs>	 10Operations, 10fundraising-tech-ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Dwisehaupt)
[19:50:38] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1026.eqiad.wmnet ` The log can be found in `/v...
[19:54:10] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to prod cluster for annet - https://phabricator.wikimedia.org/T266718 (10AnneT) Looks like all is working, thanks @ema and @Dzahn!
[19:54:49] <wikibugs>	 (03CR) 10Dzahn: "yay, I am seeing now you already merged it. awesome :)" [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn)
[19:56:44] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:04:35] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:35] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:08] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1026.eqiad.wmnet'] `  and were **ALL** successful.
[20:16:36] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH)
[20:17:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['es1027.eqiad.wmnet', 'es1028.eqiad.wmnet', 'es...
[20:31:29] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:31:29] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:31:29] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:31:29] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:31:29] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:31:30] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:31:30] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:31:31] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:31:32] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:31:32] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:31:32] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:31:33] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:31:33] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:31:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:34] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:31:34] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:29] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:22] <robh>	 =P
[20:35:32] <robh>	 known issue
[20:35:41] <mutante>	 :)
[20:36:06] <robh>	 multi reimage script tends to bork when it's --new and shouldn't downtime
[20:36:13] <robh>	 but then it fails sometimes and echoes that cruft
[20:36:42] <mutante>	 gotcha
[20:38:16] <robh>	 the reimage script doesn't fail out, it just notes that failure condition and keeps going
[20:38:20] <robh>	 so it doesn't hurt the installs.
[20:40:39] <wikibugs>	 (03CR) 10Dzahn: phabricator: replace require_package with ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[20:41:02] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` mc2036.codfw.wmnet ` The...
[20:41:06] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2036.codfw.wmnet'] `  Of which those **FAILED**: ` ['mc2036.codfw.wmn...
[20:42:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1027.eqiad.wmnet', 'es1029.eqiad.wmnet', 'es1028.eqiad.wmnet', 'es1033.eqiad.wmnet', 'es1031...
[20:44:21] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` mc2036.codfw.wmnet ` The...
[20:49:13] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH)
[20:51:11] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) a:05RobH→03None
[20:51:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) 05Open→03Resolved All installations complete and hosts are calling into puppet.  They've all been set to staged in netbox, and the #DBA team...
[20:56:11] <mutante>	 !log mw1267,mw1268 - scap pull 
[20:56:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10wiki_willy) a:03Cmjohnson Thanks @Cmjohnson and @RobH for prioritizing this one.  Nice work getting it turned over in time.   Thanks, Willy
[20:57:50] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1268.eqiad.wmnet
[20:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1267.eqiad.wmnet
[20:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:43] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) a:05Cmjohnson→03Dzahn
[20:59:17] <mutante>	 !log mw1267,mw1268 - scap pull and repool - back to prod - T266164
[20:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:25] <stashbot>	 T266164: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164
[21:00:14] <logmsgbot>	 !log jiji@cumin2001 START - Cookbook sre.hosts.downtime
[21:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:22] <logmsgbot>	 !log jiji@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:02:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:47] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2036.codfw.wmnet'] `  and were **ALL** successful.
[21:17:37] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki)
[21:20:51] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) We removed `shard18` from `redis.yaml` so to be able to avoid installing redis-server on this server pair (mc1036...
[21:23:38] <wikibugs>	 (03PS1) 10Matthias Mullie: Fix array depth for properties array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637778 (https://phabricator.wikimedia.org/T266835)
[21:26:57] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:28:17] <wikibugs>	 (03CR) 10Anne Tomasevich: [C: 03+1] Fix array depth for properties array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637778 (https://phabricator.wikimedia.org/T266835) (owner: 10Matthias Mullie)
[21:30:27] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:38:40] <wikibugs>	 10Operations, 10observability: update logging ES's template index to type the 'age' field as an integer - https://phabricator.wikimedia.org/T266906 (10CDanis)
[21:38:50] <wikibugs>	 10Operations, 10observability: update logging ES's template index to type the 'age' field as an integer - https://phabricator.wikimedia.org/T266906 (10CDanis)
[21:38:53] <wikibugs>	 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis)
[22:17:22] <wikibugs>	 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) scandium has been reimaged.  It is now just an mw appserver plus: git clone of parsoid repo, nginx for test s...
[22:18:52] <wikibugs>	 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) 05Open→03Resolved @ssastry @Muehlenhoff Let me know if you see anything missing. Claiming resolved for now.
[22:19:19] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to prod cluster for annet - https://phabricator.wikimedia.org/T266718 (10Dzahn) 05Open→03Resolved a:03Dzahn @AnneT Thanks for confirming that. I'll call the ticket resolved.
[22:21:59] <wikibugs>	 10Operations, 10LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (10Dzahn) Hi @MNovotny_WMF Do we have an expiration date meanwhile? The ticket is still open on our side since a while due to that missing date.
[22:23:03] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Dzahn) a:03Rmaung
[22:23:17] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw1267 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[22:23:53] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10Dzahn) a:03DNdubane_WMF
[22:24:39] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) a:03JAnstee_WMF
[22:25:12] <wikibugs>	 10Operations, 10LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (10Dzahn) 05Open→03Stalled
[22:42:13] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26242/miscweb1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/637552 (https://phabricator.wikimedia.org/T266702) (owner: 10Dzahn)
[22:45:28] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] "Looks great! Not sure if there's an easy way to add either the JDK vs JRE distinction or the headless vs not headless, but if we think of " [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond)
[22:47:41] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] "Ah, just noticed this patch is tied to the parent rspec patch. Both patches look fine to me but I'll let you ship these when you're ready " [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond)
[22:48:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['clouddb1013.eqiad.wmnet', 'cloud...
[22:50:55] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) Hi @Addshore Puppet has an issue cloning from the deployment repo:   `fatal: Remote branch master not found in upstream origin`
[22:54:23] <wikibugs>	 (03PS1) 10Dzahn: WDQS microsite: use branch production instead of master [puppet] - 10https://gerrit.wikimedia.org/r/637807 (https://phabricator.wikimedia.org/T266702)
[23:01:42] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[23:01:46] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[23:01:46] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[23:01:46] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[23:01:46] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[23:01:46] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[23:01:46] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[23:01:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:47] <wikibugs>	 (03PS1) 10Ryan Kemper: cirrus: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266912)
[23:03:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] WDQS microsite: use branch production instead of master [puppet] - 10https://gerrit.wikimedia.org/r/637807 (https://phabricator.wikimedia.org/T266702) (owner: 10Dzahn)
[23:03:46] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:03:48] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:03:48] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:03:48] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:03:48] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:03:48] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:02] <wikibugs>	 (03CR) 10Bstorm: "Every physical server that's a client of labstore1006/7 mount as NFSv4, and the server *shouldn't* even support v3. So I'm going to merge " [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm)
[23:05:34] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:05:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:36] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:28] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1055 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[23:32:41] <mutante>	 !log adding query.wikidata.org to TLS cert for webserver-misc-apps.discovery.wmnet T266702
[23:32:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:49] <stashbot>	 T266702: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702
[23:33:01] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on clouddb1019 - https://phabricator.wikimedia.org/T266912 (10ops-monitoring-bot)
[23:35:36] <foks>	 !log removing two files for legal compliance 
[23:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:50] <wikibugs>	 (03PS1) 10Dzahn: ssl: add query.wikidata.org to TLS cert for webserver-misc-apps [puppet] - 10https://gerrit.wikimedia.org/r/637811 (https://phabricator.wikimedia.org/T266702)
[23:39:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "openssl x509 -in webserver-misc-apps.discovery.wmnet.crt -noout -text | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/637811 (https://phabricator.wikimedia.org/T266702) (owner: 10Dzahn)
[23:47:07] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) I added Apache config and git cloning to the miscweb backends.  Then added query.wikidata.org to the TLS cert they are using.  Now you can a...
[23:47:21] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn)
[23:49:04] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['clouddb1020.eqiad.wmnet'] `  Of which those **FAILED**: ` ['clouddb1020.eqiad.wm...
[23:51:24] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:57:10] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1055 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[23:58:23] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10RobH)
[23:59:19] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10RobH) All but clouddb1020 are set to staged in netbox, and calling into puppet.  I'll investigate whats up with clouddb1020.
[23:59:50] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10RobH) a:05RobH→03None