[00:02:14] also, anyone remember the context for https://bash.toolforge.org/quip/TNR_cHEBv7KcG9M-Utqd ?
[00:04:25] hehe
[00:05:21] https://bash.toolforge.org/quip/AU8FCPz66snAnmqnLHDj
[00:05:45] xDD
[00:06:46] DannyS712: it's about the top of https://wikitech.wikimedia.org/wiki/How_to_do_a_schema_change
[00:08:04] legoktm: Wasn't it when we thought update.php had been run and dropped that wikibase table?
[00:08:38] DannyS712: it was on April 6th 2020
[00:08:42] Oh I think you're right
[00:08:53] (PS9) Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - https://gerrit.wikimedia.org/r/635656
[00:09:27] https://wm-bot.wmflabs.org/logs/%23wikimedia-operations/20200406.txt
[00:09:42] (CR) Dzahn: "> Patch Set 8:" [puppet] - https://gerrit.wikimedia.org/r/635656 (owner: Dzahn)
[00:10:21] lol, that icinga-wm wall
[00:12:10] things you don't want to read: " One of the biggest tables in production has just disappeared."
[00:14:17] *shudders*
[00:17:39] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2163683760 and 42289 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:23:28] (PS1) Dzahn: parsoid::testing: remove vd_client/server and rt_client/server [puppet] - https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906)
[00:24:59] (CR) Dzahn: "I will not merge this during the ongoing test but would tomorrow, before reimaging again."
[puppet] - https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906) (owner: Dzahn)
[00:29:31] (PS2) Dzahn: parsoid::testing: remove vd_client/server and rt_client/server [puppet] - https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906)
[00:33:19] (PS3) Dzahn: parsoid::testing: remove vd_client/server and rt_client/server [puppet] - https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906)
[00:35:38] (CR) Dzahn: "thanks, currently does not rebase on the parent change that adds data types to puppetmaster.. I'll care about it after that is merged" [puppet] - https://gerrit.wikimedia.org/r/635658 (owner: Dzahn)
[00:47:08] (PS2) Dzahn: mailman: replace cron with systemd timer and move to profile [puppet] - https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138)
[00:48:27] (CR) jerkins-bot: [V: -1] mailman: replace cron with systemd timer and move to profile [puppet] - https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[00:48:57] (PS3) Dzahn: mailman: replace cron with systemd timer and move to profile [puppet] - https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138)
[00:50:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:04] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/26222/" [puppet] - https://gerrit.wikimedia.org/r/637037 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[00:53:07] (CR) Dzahn: "Should I link all changes to that ticket or is it too many updates and just a topic branch is better?
Then it can be a single URL listing " [puppet] - https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:54:44] (PS5) Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138)
[00:55:16] (CR) jerkins-bot: [V: -1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[00:56:39] (CR) Dzahn: "This isn't a real review but to be clear, i LOVE that you created a script for this purpose. That's exactly what we need once the corp LDA" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[01:00:44] (CR) Dzahn: [C: -2] "waiting for the final decision what the target URL should be, stalled" [puppet] - https://gerrit.wikimedia.org/r/636755 (https://phabricator.wikimedia.org/T264367) (owner: Dzahn)
[01:05:10] (PS6) Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138)
[01:05:48] (CR) jerkins-bot: [V: -1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:12:00] (CR) Dzahn: "not sure why this Failed to parse template systemd/systemd.timer.erb:" [puppet] - https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:18:17] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[01:19:13] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:20:45] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:20:49] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[01:21:31] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[01:22:23] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.036 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[01:30:43] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:30:47] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:53] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[02:34:55] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[02:36:35] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:41:24] (PS1) Razzi: oozie: Add admin groups for authorization [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[03:45:34] (PS1) Dzahn: admin: add AnneT to deployers [puppet] - https://gerrit.wikimedia.org/r/637588 (https://phabricator.wikimedia.org/T266718)
[03:45:42] (CR) Razzi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[03:48:48] (PS2) Razzi: oozie: Add admin groups for authorization [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[03:50:27] (PS3) Razzi: oozie: Add admin groups for authorization [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[04:02:30] (PS4) Razzi: oozie: Add admin groups for authorization [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[04:08:05] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[04:09:41] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/RESTBase
[04:20:03] (PS5) Razzi: oozie: Add admin groups for authorization [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)
[04:29:22] (CR) Razzi: "I have added the admin group setting to an-test-coord1001, as we tested on the old test cluster. There are changes in the puppet compiler " [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[05:21:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:22:27] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:26:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:27:03] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:27:05] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:27:33] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:32:11] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:32:13] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:46] (PS1) Marostegui: orchestrator.conf: Enable Audit messages to syslog. [puppet] - https://gerrit.wikimedia.org/r/637596 (https://phabricator.wikimedia.org/T265990)
[06:32:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:32:37] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:35:53] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:36:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:17] (CR) Elukey: oozie: Add admin groups for authorization (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[06:44:13] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:45:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201030T0700)
[07:28:01] (PS1) Elukey: Add role::analytics_test_cluster::hadoop::ui to an-test-ui1001 [puppet] - https://gerrit.wikimedia.org/r/637607 (https://phabricator.wikimedia.org/T260411)
[07:31:07] (CR) Elukey: [C: +2] Add role::analytics_test_cluster::hadoop::ui to an-test-ui1001 [puppet] - https://gerrit.wikimedia.org/r/637607 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[07:32:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:14] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:38:24] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:39:53] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:57:17] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:57:19] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:25] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:00:27] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:15] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:11] !log decom an-tool1006 (old analytics test vm) - T255139
[08:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:17] T255139: Create the new Hadoop test cluster - https://phabricator.wikimedia.org/T255139
[08:54:32] (CR) Kormat: [C: +1] "LGTM, it even has a trailing , :)" [puppet] - https://gerrit.wikimedia.org/r/637596 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[08:55:58] (PS1) Elukey: Decommission an-tool1006 [puppet] - https://gerrit.wikimedia.org/r/637638 (https://phabricator.wikimedia.org/T255139)
[08:58:59] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:51] interesting
[09:05:00] (CR) Elukey: [C: +2] Decommission an-tool1006 [puppet] - https://gerrit.wikimedia.org/r/637638 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[09:05:02] (CR) Marostegui: [C: +2]
orchestrator.conf: Enable Audit messages to syslog. [puppet] - https://gerrit.wikimedia.org/r/637596 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[09:05:16] marostegui: go ahead with my change to pls :)
[09:05:20] elukey: haha about to ask
[09:05:36] elukey: merged
[09:05:39] <3
[09:07:11] !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[09:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:30] goood
[09:22:39] (CR) JMeybohm: [C: +2] Test charts/deployments for compatibility with k8s 1.19 [deployment-charts] - https://gerrit.wikimedia.org/r/636881 (https://phabricator.wikimedia.org/T266032) (owner: JMeybohm)
[09:25:20] (Merged) jenkins-bot: Test charts/deployments for compatibility with k8s 1.19 [deployment-charts] - https://gerrit.wikimedia.org/r/636881 (https://phabricator.wikimedia.org/T266032) (owner: JMeybohm)
[09:41:58] (CR) Jbond: [C: +2] "LGTM will merge" [puppet] - https://gerrit.wikimedia.org/r/635656 (owner: Dzahn)
[09:42:08] (PS3) Jbond: puppetmaster: add data type for server type and use it [puppet] - https://gerrit.wikimedia.org/r/635660 (owner: Dzahn)
[09:42:33] (PS10) Jbond: puppetmaster: add data types to all remaining parameters [puppet] - https://gerrit.wikimedia.org/r/635656 (owner: Dzahn)
[09:43:00] (CR) jerkins-bot: [V: -1] puppetmaster: add data type for server type and use it [puppet] - https://gerrit.wikimedia.org/r/635660 (owner: Dzahn)
[09:49:33] (PS4) Jbond: httpd/puppetmaster: add data type for SSLVerifyClient and use it [puppet] - https://gerrit.wikimedia.org/r/635658 (owner: Dzahn)
[09:51:32] (CR) Jbond: [C: +2] "LGTM will merge" [puppet] - https://gerrit.wikimedia.org/r/635658 (owner: Dzahn)
[09:57:35] (PS1) Ladsgroup: Revert "Change logo of Wikidata for the eighth birthday" [mediawiki-config] - https://gerrit.wikimedia.org/r/637649
[09:58:49] (CR) Ladsgroup: [C: +2] Revert "Change logo of Wikidata for the eighth birthday" [mediawiki-config] - https://gerrit.wikimedia.org/r/637649 (owner: Ladsgroup)
[10:00:09] (Merged) jenkins-bot: Revert "Change logo of Wikidata for the eighth birthday" [mediawiki-config] - https://gerrit.wikimedia.org/r/637649 (owner: Ladsgroup)
[10:02:00] !log ladsgroup@deploy1001 Synchronized static/images/project-logos: Revert: Changing logo of Wikidata for the brithday (duration: 01m 12s)
[10:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:48] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[10:08:49] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:26] (PS1) Klausman: analytics: Switch stat1005 to use rocm 3.8 [puppet] - https://gerrit.wikimedia.org/r/637671 (https://phabricator.wikimedia.org/T264408)
[10:21:46] (CR) Elukey: [C: +1] analytics: Switch stat1005 to use rocm 3.8 [puppet] - https://gerrit.wikimedia.org/r/637671 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[10:21:50] (PS1) Kormat: Initial (re)packaging [debs/orchestrator] - https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763)
[10:22:53] (CR) Kormat: "This is the initial packaging. It builds in pbuilder in deneb.
It does not yet contain the creation of user+group for orchestrator, nor th" [debs/orchestrator] - https://gerrit.wikimedia.org/r/637672 (https://phabricator.wikimedia.org/T266763) (owner: Kormat)
[10:22:59] (CR) Klausman: [C: +2] analytics: Switch stat1005 to use rocm 3.8 [puppet] - https://gerrit.wikimedia.org/r/637671 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[10:24:21] (CR) Hashar: [C: +1] admin: add AnneT to deployers [puppet] - https://gerrit.wikimedia.org/r/637588 (https://phabricator.wikimedia.org/T266718) (owner: Dzahn)
[10:27:21] (PS1) Marostegui: mariadb: Move db1077 to test-pc1 [puppet] - https://gerrit.wikimedia.org/r/637673
[10:29:08] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler1001/26227/" [puppet] - https://gerrit.wikimedia.org/r/637673 (owner: Marostegui)
[10:39:15] (PS1) Klausman: aptrepo: add mor rocm 3.8 dependencies [puppet] - https://gerrit.wikimedia.org/r/637676 (https://phabricator.wikimedia.org/T264408)
[10:40:15] (CR) Klausman: [C: +2] aptrepo: add mor rocm 3.8 dependencies [puppet] - https://gerrit.wikimedia.org/r/637676 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[10:41:02] (CR) Kormat: "Hmm. These `test-` sections are a bit problematic. `check-cumin-aliases` complains when they are unused. I'll send a CR to not generate al" [puppet] - https://gerrit.wikimedia.org/r/637673 (owner: Marostegui)
[10:41:35] (CR) Kormat: [C: +2] mariadb: Set both db_inventory nodes read-write [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: Kormat)
[10:42:11] (CR) Marostegui: "I can totally abandon this patch, this isn't really needed.
It was just more for "informative" issues, but the host can replicate from pc1" [puppet] - https://gerrit.wikimedia.org/r/637673 (owner: Marostegui)
[10:42:37] (CR) Jcrespo: "If you wanted, it can be changed on zarcillo too, but it is not a big deal except for the mysql aggregated grouping." [puppet] - https://gerrit.wikimedia.org/r/637673 (owner: Marostegui)
[10:42:38] (CR) Marostegui: [C: +1] mariadb: Enable report_host [puppet] - https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) (owner: Kormat)
[10:43:22] (CR) Kormat: [C: +2] mariadb: Enable report_host [puppet] - https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) (owner: Kormat)
[10:50:14] (PS1) Kormat: cumin: Exclude test-* mariadb sections. [puppet] - https://gerrit.wikimedia.org/r/637678
[10:50:57] (CR) Ema: [C: +2] admin: add AnneT to deployers [puppet] - https://gerrit.wikimedia.org/r/637588 (https://phabricator.wikimedia.org/T266718) (owner: Dzahn)
[10:51:41] (PS2) Kormat: cumin: Exclude test-* mariadb sections. [puppet] - https://gerrit.wikimedia.org/r/637678
[10:54:15] (CR) Kormat: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/26230/" [puppet] - https://gerrit.wikimedia.org/r/637678 (owner: Kormat)
[10:56:02] (CR) Marostegui: [C: +1] cumin: Exclude test-* mariadb sections. [puppet] - https://gerrit.wikimedia.org/r/637678 (owner: Kormat)
[10:56:15] (CR) Kormat: [C: +2] cumin: Exclude test-* mariadb sections.
[puppet] - https://gerrit.wikimedia.org/r/637678 (owner: Kormat)
[10:57:05] (CR) Kormat: [C: +1] mariadb: Move db1077 to test-pc1 [puppet] - https://gerrit.wikimedia.org/r/637673 (owner: Marostegui)
[10:57:53] (CR) Marostegui: [C: +2] mariadb: Move db1077 to test-pc1 [puppet] - https://gerrit.wikimedia.org/r/637673 (owner: Marostegui)
[11:08:01] (CR) Arturo Borrero Gonzalez: [C: +1] openstack: turn bash scripts without bashisms into sh scripts [puppet] - https://gerrit.wikimedia.org/r/631891 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[11:12:05] RECOVERY - MariaDB read only db_inventory on db2093 is OK: Version 10.4.15-MariaDB-log, Uptime 777041s, read_only: False, event_scheduler: False, 68.03 QPS, connection latency: 0.003045s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:12:16] ^ \o/ kormat
[11:12:29] 🤘
[11:15:26] (PS6) Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324)
[11:15:28] (PS4) Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324)
[11:24:47] (PS1) Klausman: amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408)
[11:25:09] (CR) jerkins-bot: [V: -1] amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[11:28:54] (PS2) Klausman: amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408)
[11:29:16] (CR) jerkins-bot: [V: -1] amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] -
https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[11:31:27] (PS3) Klausman: amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408)
[11:32:58] (PS4) Klausman: amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408)
[11:35:24] (CR) Klausman: [C: +2] amd_rocm: Only add DKMS+firmware for rocm33 installs [puppet] - https://gerrit.wikimedia.org/r/637682 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[11:37:47] (PS1) Kormat: debian: add user/group + systemd service [debs/orchestrator] - https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763)
[11:44:21] (PS1) Kormat: orchestrator: Support sqlite backend [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[11:47:17] (CR) Kormat: "Depends on https://gerrit.wikimedia.org/r/c/operations/debs/orchestrator/+/637683 getting merged, and packages built+uploaded." [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: Kormat)
[11:48:17] (CR) Kormat: "Ah, i added you slightly too soon, @marostegui. I need to futz with heira a bit to test this in pcc."
[puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: Kormat)
[11:58:59] (PS2) Kormat: orchestrator: Support sqlite backend [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[11:59:20] (PS3) Kormat: orchestrator: Support sqlite backend [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:01:25] (PS4) Kormat: orchestrator: Support sqlite backend [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:06:51] (CR) Marostegui: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: Kormat)
[12:23:30] (PS5) Kormat: orchestrator: Support sqlite backend [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:25:30] (PS6) Kormat: orchestrator: Support sqlite backend [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:26:55] (PS7) Kormat: orchestrator: Support sqlite backend [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[12:27:49] (CR) Kormat: "Ready for review now :)" [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: Kormat)
[12:29:18] !log set normal VRRP balancing on cr2-eqiad
[12:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:44] (CR) Marostegui: "Will Puppet or sqllite itself take care of creating /var/lib/orchestrator/?" [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: Kormat)
[12:59:47] (CR) Kormat: [C: -2] "Blocking until orchestrator packaging is published."
[puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: Kormat)
[13:06:42] (PS2) Kormat: debian: add user/group + systemd service [debs/orchestrator] - https://gerrit.wikimedia.org/r/637683 (https://phabricator.wikimedia.org/T266763)
[13:18:27] (PS8) Kormat: orchestrator: Support sqlite backend [puppet] - https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[13:18:29] (PS1) Kormat: orchestrator: Support running as non-root [puppet] - https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763)
[13:19:08] (CR) Kormat: [C: -2] "-2 until new packages released." [puppet] - https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: Kormat)
[13:53:18] (PS10) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[13:53:50] (PS1) Kormat: mariadb: Drop old mysql and percona my.cnf templates.
[puppet] - https://gerrit.wikimedia.org/r/637699
[14:10:21] (PS1) Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635)
[14:14:16] !log moving mw1267 and mw168 to rack A8 eqiad T266164
[14:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:24] T266164: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164
[14:20:17] (CR) Alexandros Kosiaris: [C: -1] ores: Stop memory reporting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637557 (owner: Ladsgroup)
[14:20:31] PROBLEM - Host mw1267.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:32] PROBLEM - Host mw1268.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:21:46] (CR) Ppchelko: [C: +1] api-gateway: use envoy 1.16.0 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/637431 (owner: Hnowlan)
[14:24:37] (PS1) Andrew Bogott: cloud-vps instance backups: ignore clouddb-services project [puppet] - https://gerrit.wikimedia.org/r/637704 (https://phabricator.wikimedia.org/T260692)
[14:27:10] (CR) Andrew Bogott: [C: +2] cloud-vps instance backups: ignore clouddb-services project [puppet] - https://gerrit.wikimedia.org/r/637704 (https://phabricator.wikimedia.org/T260692) (owner: Andrew Bogott)
[14:28:53] RECOVERY - Host mw1268.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms
[14:30:55] (CR) RLazarus: [C: +1] "🚀" [puppet] - https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: Effie Mouzeli)
[14:42:48] !log stop kafka-jumbo1006 to swap NICs (1g -> 10g, d1 -> d4 rack)
[14:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:46] (PS2) Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635)
[14:52:08] (CR)
jerkins-bot: [V: -1] mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: Kormat)
[14:52:57] PROBLEM - Host kafka-jumbo1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:53:06] this is expected --^
[14:53:24] (PS3) Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635)
[14:58:13] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 97 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[14:58:28] yep yep expected
[14:58:29] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 106 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[14:58:33] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 98 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[14:58:41] sorry for the spam
[14:59:09] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 72 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[14:59:15] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 98 ge 10
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [15:04:20] (03CR) 10Marostegui: "Not fully sure about dropping these. We might want to keep them around as we _might_ be getting either mysql or percona serving as a slave" [puppet] - 10https://gerrit.wikimedia.org/r/637699 (owner: 10Kormat) [15:04:29] RECOVERY - Host kafka-jumbo1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [15:05:25] (03CR) 10Kormat: "PCC from hell (but looks good): https://puppet-compiler.wmflabs.org/compiler1002/26236/" [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: 10Kormat) [15:09:37] !log downtiming mc2036 for buster reimage [15:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:00] (03PS1) 10Effie Mouzeli: hiera: remove shard18 from redis.yaml [puppet] - 10https://gerrit.wikimedia.org/r/637708 (https://phabricator.wikimedia.org/T252391) [15:11:29] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [15:11:30] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:57] (03CR) 10RLazarus: [C: 03+1] hiera: remove shard18 from redis.yaml [puppet] - 10https://gerrit.wikimedia.org/r/637708 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [15:19:08] (03CR) 10RLazarus: [C: 03+1] "I like this as a step in the right direction, but I'd go further and merge the sections." 
[puppet] - 10https://gerrit.wikimedia.org/r/637577 (owner: 10Dzahn) [15:19:10] (03PS1) 10Cmjohnson: adding new mac address for update 10G nic kafka-jumbo1006 [puppet] - 10https://gerrit.wikimedia.org/r/637711 (https://phabricator.wikimedia.org/T236327) [15:19:12] (03PS4) 10Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) [15:20:11] (03Abandoned) 10Kormat: mariadb: Drop old mysql and percona my.cnf templates. [puppet] - 10https://gerrit.wikimedia.org/r/637699 (owner: 10Kormat) [15:26:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add apache httpd base image (038 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto) [15:27:02] is wikibugs not posting from phab? [15:27:56] loooks a bit quiet [15:28:11] probably needs a service reboot [15:29:21] looks like it hasn't worked for ~1 day? [15:29:57] (03CR) 10Marostegui: [C: 03+1] "Let's deploy on Monday? I have deployed this manually on pc1 and it worked, but as we are touching many files...and we are not in a rush.." [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: 10Kormat) [15:30:06] in my log I see "16:58 <+wikibugs> Operations, fundraising-tech-ops: Ensure all disaster recover documentation is in one central location" as the last phab-like update [15:30:22] cdanis: yeah, it has been out for a day, I wanted to restart it earlier today but I got buried on other things [15:30:29] after which: [15:30:29] 17:11 < arturo> we just had a network outage in wmcs [15:30:30] (03CR) 10Kormat: [C: 04-2] "Delay to monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: 10Kormat) [15:30:40] cool, coolcool [15:32:38] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: remove shard18 from redis.yaml [puppet] - 10https://gerrit.wikimedia.org/r/637708 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [15:34:42] (03PS1) 10Andrew Bogott: wmcs instance backups: move more projects from cloudvirt1024 to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/637713 (https://phabricator.wikimedia.org/T260692) [15:36:29] !log stopping puppet on mediawiki and mc* hosts [15:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:33] (03CR) 10Andrew Bogott: [C: 03+2] wmcs instance backups: move more projects from cloudvirt1024 to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/637713 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [15:41:03] (03PS1) 10Marostegui: orchestrator.conf: Add DetectDataCenterQuery to detect DC [puppet] - 10https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635) [15:41:36] (03CR) 10Marostegui: [C: 04-2] "Let's wait for https://gerrit.wikimedia.org/r/637702 to be merged first." 
[puppet] - 10https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635) (owner: 10Marostegui) [15:42:18] (03CR) 10Kormat: [C: 03+1] orchestrator.conf: Add DetectDataCenterQuery to detect DC [puppet] - 10https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635) (owner: 10Marostegui) [15:42:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Overall pretty good work, various small inline comments" (039 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto) [15:43:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add apache httpd base image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto) [15:58:31] (03CR) 10JMeybohm: Add apache httpd base image (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto) [16:03:43] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:06:55] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:07:27] (03PS2) 10Cwhite: Add HTTP request and response headers fields as object[field:keyword] [software/ecs] - 10https://gerrit.wikimedia.org/r/636515 [16:07:29] (03PS2) 10Cwhite: Add CSP Report fields. [software/ecs] - 10https://gerrit.wikimedia.org/r/636516 [16:07:31] (03PS2) 10Cwhite: Enable search slowlog by default for ECS indices. [software/ecs] - 10https://gerrit.wikimedia.org/r/636685 [16:07:33] (03PS1) 10Cwhite: First attempt at a JSONSchema template generator utility. 
[software/ecs] - 10https://gerrit.wikimedia.org/r/637719 [16:13:17] (03PS2) 10Cwhite: First attempt at a JSONSchema template generator utility. [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 [16:18:57] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Cmjohnson) 05Open→03Resolved New PSU arrived and swapped. System reports healthy. [16:19:38] !log kafka-jumbo1006 still running with 1g NIC [16:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:46] (03PS1) 10Bstorm: toolsdb: Make socket a parameter so new servers might work right [puppet] - 10https://gerrit.wikimedia.org/r/637726 (https://phabricator.wikimedia.org/T266587) [16:22:24] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10Cmjohnson) 05Open→03Resolved done [16:24:54] (03PS1) 10Ayounsi: PuppetDB import: don't do empty saves [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/637728 (https://phabricator.wikimedia.org/T266767) [16:25:51] (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [16:27:22] (03CR) 10Jbond: [C: 03+1] phabricator: replace require_package with ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [16:28:19] (03PS1) 10Cmjohnson: adding new production IP for frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/637729 (https://phabricator.wikimedia.org/T265086) [16:29:33] (03CR) 10Cmjohnson: [C: 03+2] adding new production IP for frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/637729 (https://phabricator.wikimedia.org/T265086) (owner: 10Cmjohnson) [16:29:51] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1006 is CRITICAL: 2.641e+07 ge 5e+06 
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [16:30:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) [16:31:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) @Jgreen all the on-site work has been completed. idrac password is a temporary password [16:32:06] (03CR) 10Jbond: profile::sre::check_mail: new script for checking user emails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [16:32:09] 10Operations, 10Patch-For-Review: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) a:03Jgreen [16:41:53] (03PS1) 10Dwisehaupt: Point fundraisingdb-read back at frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/637733 (https://phabricator.wikimedia.org/T266815) [16:41:55] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) We had to rollback the NIC on 1006, we need to install `firmware-bnx2x` on all nodes before doing any work (checked with Fa... [16:43:11] 10Operations, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10akosiaris) Couple of points > We create a directory on the k8s node that works as a hostpath in all apache containers, and we make apache write its logs there, w... 
[16:48:29] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [16:48:47] (03PS1) 10Ayounsi: PuppetDB import, set interface type when renaming ##PRIMARY## [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/637734 (https://phabricator.wikimedia.org/T265340) [16:48:55] (03CR) 10Jgreen: [C: 03+2] Point fundraisingdb-read back at frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/637733 (https://phabricator.wikimedia.org/T266815) (owner: 10Dwisehaupt) [16:51:07] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [16:51:12] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [16:51:55] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [16:54:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [16:57:03] (03CR) 10Razzi: oozie: Add admin groups for authorization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [16:58:26] 10Operations, 10Wikidata, 10Wikidata Query Builder, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) [16:59:20] (03PS6) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) [17:01:15] (03CR) 10Bstorm: toolsdb: Make socket a parameter so new servers might work right (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637726 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm) [17:05:31] (03PS6) 10Effie Mouzeli: Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) [17:06:12] (03PS7) 10Effie Mouzeli: Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) [17:09:06] 10Operations, 10Analytics: Augment NEL reports with a computed timestamp-of-generation - https://phabricator.wikimedia.org/T266886 (10CDanis) [17:15:51] (03CR) 10Dzahn: [C: 03+2] parsoid::testing: remove vd_client/server and rt_client/server [puppet] - 10https://gerrit.wikimedia.org/r/637582 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [17:16:53] 10Operations, 10Release-Engineering-Team-TODO, 10Scap, 10Datacenter-Switchover, and 3 others: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 (10akosiaris) 05Open→03Resolved This was done, resolving. 
[17:18:55] !log enable puppet on all mediawiki and mc* hosts [17:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:37] !log disable puppet on mc1036 and mc2036 - T252391 [17:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:44] T252391: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 [17:20:44] (03PS1) 10Dzahn: site/parsoid-testing: update comments, apply insetup role [puppet] - 10https://gerrit.wikimedia.org/r/637740 (https://phabricator.wikimedia.org/T257906) [17:21:03] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1006 is OK: (C)5e+06 ge (W)1e+06 ge 2.464e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [17:21:43] (03CR) 10Dzahn: [C: 03+2] site/parsoid-testing: update comments, apply insetup role [puppet] - 10https://gerrit.wikimedia.org/r/637740 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [17:22:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:22:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:22] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` scandium.eqi... 
[17:23:45] (03CR) 10Effie Mouzeli: [C: 03+1] Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [17:26:29] (03CR) 10Dzahn: Set debian buster for mc2036 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [17:26:46] effie: the os_version check should be for buster instead of jessie? [17:27:14] oh.. i see.. nevermind [17:27:27] we are using this profile only for this cluster [17:27:52] so I don't think there is any need to do anything more than that [17:27:57] (03CR) 10Dzahn: Set debian buster for mc2036 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [17:28:33] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 2.084e+04 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:30:10] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:30:48] (03CR) 10Effie Mouzeli: [C: 03+2] Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [17:31:18] (03CR) 10Dzahn: [C: 03+1] "lgtm, all mc* hosts are jessie except gutter pool and those don't use this role" [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [17:31:32] effie: I understand now. 
+1 (also role inheritance is weird :) [17:31:43] yeah it sucks :) [17:34:01] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10JMeybohm) [17:36:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:32] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` mc2036.codfw.wmnet ` The... [17:40:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventgate_main_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:41:05] (03PS1) 10Dzahn: mediawiki/memcached: stop using role inheritance [puppet] - 10https://gerrit.wikimedia.org/r/637742 [17:41:37] effie: seems it's the only role globally (still?) doing that. then ... ^ for some other time [17:42:46] 10Operations, 10Performance-Team, 10Traffic, 10netops, 10Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Shawn) [17:42:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:43:03] we had a brief issue with the jobqueue [17:43:37] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=1604078236373&to=1604079748837 [17:43:42] (03CR) 10Jeena Huneidi: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [17:43:49] just keep an eye [17:44:26] the insertion rate did not seem to go up though? wasn't that just the prometheus side? [17:44:38] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] ` and were **ALL** successful. [17:45:08] (03CR) 10Bstorm: [C: 03+2] "PCC output is good enough for a puppet check. I'll merge this." [puppet] - 10https://gerrit.wikimedia.org/r/637726 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm) [17:46:47] the brief "prometheus ..reduced availability" seems to happen sometimes separate from which the targets are [17:52:05] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, 10Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10CDanis) This isn't limited to just esams; it is in fact happening across all cache clusters. All of my requests took at least 19 seco... [17:52:26] 10Operations, 10ops-codfw, 10DC-Ops: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10wiki_willy) Latest update - looks like we need to order a minimum of 100 of the Enconnex (because it's customized), so let's scrap that one. Some additional details I gathered for the Chatswort... 
[17:59:17] (03PS1) 10Dzahn: site/parsoid-testing: reapply testing role to scandium [puppet] - 10https://gerrit.wikimedia.org/r/637745 (https://phabricator.wikimedia.org/T257906) [18:02:45] (03CR) 10Dzahn: [C: 03+2] site/parsoid-testing: reapply testing role to scandium [puppet] - 10https://gerrit.wikimedia.org/r/637745 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:07:56] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [18:09:54] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1036 site=eqiad tunnel=mc2036_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [18:10:46] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [18:15:52] musikanimal: which channel were you in with Jdlrobson? 
[18:16:31] they were direct messages [18:16:47] Ah [18:18:17] musikanimal: see pm [18:24:05] (03CR) 10Hashar: "recheck after CI got configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/636513 (owner: 10Cwhite) [18:25:38] (03PS1) 10Bstorm: toolsdb: Fix the my.cnf template to include parameter for socket [puppet] - 10https://gerrit.wikimedia.org/r/637751 (https://phabricator.wikimedia.org/T266587) [18:27:38] !log hashar@deploy1001 Started deploy [integration/docroot@c35e5e9]: Add ECS to doc.wikimedia.org index [18:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:45] !log hashar@deploy1001 Finished deploy [integration/docroot@c35e5e9]: Add ECS to doc.wikimedia.org index (duration: 00m 06s) [18:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:25] (03PS3) 10Jeena Huneidi: linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [18:30:16] (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [18:30:40] (03PS2) 10Bstorm: toolsdb: Fix the my.cnf template to include parameter for socket [puppet] - 10https://gerrit.wikimedia.org/r/637751 (https://phabricator.wikimedia.org/T266587) [18:31:32] (03CR) 10Hashar: "recheck since CI has been configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/636515 (owner: 10Cwhite) [18:31:35] (03CR) 10Hashar: "recheck since CI has been configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/636516 (owner: 10Cwhite) [18:31:37] (03CR) 10Hashar: "recheck since CI has been configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/636685 (owner: 10Cwhite) [18:31:40] (03CR) 10Hashar: "recheck since CI has been configured" [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 
10Cwhite) [18:31:41] 10Operations: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) a:05Jgreen→03None [18:31:43] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10hashar) The spam above is @colewhite and I setting up CI to automatically generate https://doc.wikimedia.org/ecs/ . From now on, whenever a patch is merge... [18:32:07] 10Operations, 10fundraising-tech-ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) [18:33:38] (03CR) 10Bstorm: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26240/console" [puppet] - 10https://gerrit.wikimedia.org/r/637751 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm) [18:33:41] (03CR) 10Bstorm: [C: 03+2] toolsdb: Fix the my.cnf template to include parameter for socket [puppet] - 10https://gerrit.wikimedia.org/r/637751 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm) [18:37:00] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, 10Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10jijiki) p:05High→03Unbreak! 
[18:39:34] (03PS1) 10CDanis: Reduce size of frwiki featuredfeeds to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) [18:40:44] (03CR) 10Jforrester: [C: 03+1] Reduce size of frwiki featuredfeeds to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) (owner: 10CDanis) [18:40:55] (03CR) 10CDanis: [C: 03+2] Reduce size of frwiki featuredfeeds to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) (owner: 10CDanis) [18:40:59] (03CR) 10Dzahn: [C: 03+1] "the only other exception seems to be lawiki (180) and dewiki (7)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) (owner: 10CDanis) [18:41:42] (03Merged) 10jenkins-bot: Reduce size of frwiki featuredfeeds to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637752 (https://phabricator.wikimedia.org/T266865) (owner: 10CDanis) [18:45:14] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2036.codfw.wmnet'] ` Of which those **FAILED**: ` ['mc2036.codfw.wmn... 
[18:45:50] sigh [18:46:04] I wish scap logged at the start of a sync-file run as well [18:46:19] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10jijiki) [18:47:58] !log ✔️ cdanis@deploy1001.eqiad.wmnet /srv/mediawiki-staging 🕝☕ scap sync-file wmf-config/InitialiseSettings.php 'lower frwiki featured feeds limit 1a41ef634 T266865' [18:47:58] !log cdanis@deploy1001 Synchronized wmf-config/InitialiseSettings.php: lower frwiki featured feeds limit 1a41ef634 T266865 (duration: 05m 14s) [18:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:06] T266865: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 [18:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:23] !log the above scap began (and mostly finished) several minutes ago but is hanging on a couple hosts down for maintenance [18:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:27] (03PS1) 10Jeena Huneidi: Scaffold improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 [18:52:13] (03CR) 10Jeena Huneidi: Scaffold improvements (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 (owner: 10Jeena Huneidi) [18:56:00] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10CDanis) 05Open→03Resolved a:03CDanis Approx 23:00 on 28 Oct, the size of the featured feed for frwiki started to become too large to be s... 
[18:56:56] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:58:04] (03PS1) 10Bstorm: toolsdb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/637757 [19:00:41] (03CR) 10Bstorm: [C: 03+2] toolsdb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/637757 (owner: 10Bstorm) [19:01:45] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1026.eqiad.wmnet ` The log can be found in `/v... [19:01:48] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:07:02] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Legoktm) >>! In T266865#6592372, @CDanis wrote: > Long ago, frwiki's default feed length (in days) was set to 60, well above the default of 10.... 
[19:07:57] (03PS1) 10Bstorm: toolsdb: actually use the read_only parameter from the profiles [puppet] - 10https://gerrit.wikimedia.org/r/637763 (https://phabricator.wikimedia.org/T266587) [19:11:10] (03CR) 10Bstorm: [C: 03+2] toolsdb: actually use the read_only parameter from the profiles [puppet] - 10https://gerrit.wikimedia.org/r/637763 (https://phabricator.wikimedia.org/T266587) (owner: 10Bstorm) [19:14:41] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1026.eqiad.wmnet'] ` Of which those **FAILED**: ` ['es1026.eqiad.wmnet'] ` [19:33:43] (03PS1) 10Jcrespo: [QIP] Add second prototype to handle File metadata directly from the db [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/637769 (https://phabricator.wikimedia.org/T264189) [19:36:03] 10Operations, 10fundraising-tech-ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Dwisehaupt) [19:36:16] (03CR) 10jerkins-bot: [V: 04-1] [QIP] Add second prototype to handle File metadata directly from the db [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/637769 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo) [19:42:18] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) Further exploration of the existing metadata has been done a... 
[19:46:23] 10Operations, 10fundraising-tech-ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Dwisehaupt) [19:50:38] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1026.eqiad.wmnet ` The log can be found in `/v... [19:54:10] 10Operations, 10SRE-Access-Requests: Requesting access to prod cluster for annet - https://phabricator.wikimedia.org/T266718 (10AnneT) Looks like all is working, thanks @ema and @Dzahn! [19:54:49] (03CR) 10Dzahn: "yay, I am seeing now you already merged it. awesome :)" [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [19:56:44] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:40] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:35] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:35] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:08] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1026.eqiad.wmnet'] ` and were **ALL** successful. 
[20:16:36] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) [20:17:35] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['es1027.eqiad.wmnet', 'es1028.eqiad.wmnet', 'es... [20:31:29] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:31:29] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:31:29] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:31:29] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:31:29] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:31:30] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:31:30] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:31:31] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:31:32] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:31:32] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:31:32] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:31:33] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:31:33] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:34] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:31:34] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:52] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:29] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:22] =P [20:35:32] known issue [20:35:41] :) [20:36:06] multi reimage script tends to bork when it's --new and shouldn't downtime [20:36:13] but then fails sometimes and echoes that cruft [20:36:42] gotcha [20:38:16] the reimage script doesn't fail out, just notes that failure condition and keeps going [20:38:20] so it doesn't hurt the installs. [20:40:39] (03CR) 10Dzahn: phabricator: replace require_package with ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [20:41:02] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` mc2036.codfw.wmnet ` The...
[20:41:06] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2036.codfw.wmnet'] ` Of which those **FAILED**: ` ['mc2036.codfw.wmn... [20:42:31] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1027.eqiad.wmnet', 'es1029.eqiad.wmnet', 'es1028.eqiad.wmnet', 'es1033.eqiad.wmnet', 'es1031... [20:44:21] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` mc2036.codfw.wmnet ` The... [20:49:13] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) [20:51:11] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) a:05RobH→03None [20:51:16] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) 05Open→03Resolved All installations complete and hosts are calling into puppet. They've all been set to staged in netbox, and the #DBA team... 
[20:56:11] !log mw1267,mw1268 - scap pull [20:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:54] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10wiki_willy) a:03Cmjohnson Thanks @Cmjohnson and @RobH for prioritizing this one. Nice work getting it turned over in time. Thanks, Willy [20:57:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1268.eqiad.wmnet [20:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1267.eqiad.wmnet [20:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) a:05Cmjohnson→03Dzahn [20:59:17] !log mw1267,mw1268 - scap pull and repool - back to prod - T266164 [20:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:25] T266164: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 [21:00:14] !log jiji@cumin2001 START - Cookbook sre.hosts.downtime [21:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:22] !log jiji@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:47] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2036.codfw.wmnet'] ` and were **ALL** successful. 
[21:17:37] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) [21:20:51] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) We removed `shard18` from `redis.yaml` so to be able to avoid installing redis-server on this server pair (mc1036... [21:23:38] (03PS1) 10Matthias Mullie: Fix array depth for properties array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637778 (https://phabricator.wikimedia.org/T266835) [21:26:57] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:28:17] (03CR) 10Anne Tomasevich: [C: 03+1] Fix array depth for properties array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637778 (https://phabricator.wikimedia.org/T266835) (owner: 10Matthias Mullie) [21:30:27] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:38:40] 10Operations, 10observability: update logging ES's template index to type the 'age' field as an integer - https://phabricator.wikimedia.org/T266906 (10CDanis) [21:38:50] 10Operations, 10observability: update logging ES's template index to type the 'age' field as an integer - https://phabricator.wikimedia.org/T266906 (10CDanis) [21:38:53] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [22:17:22] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - 
https://phabricator.wikimedia.org/T257906 (10Dzahn) scandium has been reimaged. It is now just an mw appserver plus: git clone of parsoid repo, nginx for test s... [22:18:52] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) 05Open→03Resolved @ssastry @Muehlenhoff Let me know if you see anything missing. Claiming resolved for now. [22:19:19] 10Operations, 10SRE-Access-Requests: Requesting access to prod cluster for annet - https://phabricator.wikimedia.org/T266718 (10Dzahn) 05Open→03Resolved a:03Dzahn @AnneT Thanks for confirming that. I'll call the ticket resolved. [22:21:59] 10Operations, 10LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (10Dzahn) Hi @MNovotny_WMF Do we have an expiration date meanwhile? The ticket is still open on our side since a while due to that missing date. 
[22:23:03] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Dzahn) a:03Rmaung [22:23:17] PROBLEM - IPMI Sensor Status on mw1267 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [22:23:53] 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10Dzahn) a:03DNdubane_WMF [22:24:39] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) a:03JAnstee_WMF [22:25:12] 10Operations, 10LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (10Dzahn) 05Open→03Stalled [22:42:13] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26242/miscweb1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/637552 (https://phabricator.wikimedia.org/T266702) (owner: 10Dzahn) [22:45:28] (03CR) 10Ryan Kemper: [C: 03+2] "Looks great! Not sure if there's an easy way to add either the JDK vs JRE distinction or the headless vs not headless, but if we think of " [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [22:47:41] (03CR) 10Ryan Kemper: [C: 03+1] "Ah, just noticed this patch is tied to the parent rspec patch. 
Both patches look fine to me but I'll let you ship these when you're ready " [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [22:48:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['clouddb1013.eqiad.wmnet', 'cloud... [22:50:55] 10Operations, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) Hi @Addshore Puppet has an issue cloning from the deployment repo: `fatal: Remote branch master not found in upstream origin` [22:54:23] (03PS1) 10Dzahn: WDQS microsite: use branch production instead of master [puppet] - 10https://gerrit.wikimedia.org/r/637807 (https://phabricator.wikimedia.org/T266702) [23:01:42] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [23:01:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [23:01:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [23:01:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [23:01:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [23:01:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [23:01:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [23:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:47] (03PS1) 10Ryan Kemper: cirrus: temporarily disable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266912) [23:03:22] (03CR) 10Dzahn: [C: 03+2] WDQS microsite: use branch production instead of master [puppet] - 10https://gerrit.wikimedia.org/r/637807 (https://phabricator.wikimedia.org/T266702) (owner: 10Dzahn) [23:03:46] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:03:48] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:03:48] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:03:48] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:03:48] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:03:48] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:02] (03CR) 10Bstorm: "Every physical server that's a client of labstore1006/7 mount as NFSv4, and the server *shouldn't* even support v3. 
So I'm going to merge " [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [23:05:34] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:36] PROBLEM - Check systemd state on ms-be1055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:28] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1055 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:32:41] !log adding query.wikidata.org to TLS cert for webserver-misc-apps.discovery.wmnet T266702 [23:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:49] T266702: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 [23:33:01] 10Operations, 10ops-eqiad: Degraded RAID on clouddb1019 - https://phabricator.wikimedia.org/T266912 (10ops-monitoring-bot) [23:35:36] !log removing two files for legal compliance [23:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:50] (03PS1) 10Dzahn: ssl: add query.wikidata.org to TLS cert for webserver-misc-apps [puppet] - 10https://gerrit.wikimedia.org/r/637811 (https://phabricator.wikimedia.org/T266702) [23:39:42] (03CR) 10Dzahn: [C: 03+2] "openssl x509 -in webserver-misc-apps.discovery.wmnet.crt -noout -text | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/637811 (https://phabricator.wikimedia.org/T266702) (owner: 10Dzahn) [23:47:07] 10Operations, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) I added Apache config and git cloning to the miscweb backends. 
Then added query.wikidata.org to the TLS cert they are using. Now you can a... [23:47:21] 10Operations, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) [23:49:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['clouddb1020.eqiad.wmnet'] ` Of which those **FAILED**: ` ['clouddb1020.eqiad.wm... [23:51:24] RECOVERY - Check systemd state on ms-be1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:10] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1055 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:58:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10RobH) [23:59:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10RobH) All but clouddb1020 are set to staged in netbox, and calling into puppet. I'll investigate whats up with clouddb1020. [23:59:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10RobH) a:05RobH→03None