[00:00:04] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T0000). [00:01:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [00:05:54] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:07:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:20] (03CR) 10Razzi: [C: 03+2] db1125: decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/692984 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [00:13:22] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:28:26] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:31:45] (03PS1) 10Dzahn: install_server: add doh2* to use flat/virtual partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/693025 (https://phabricator.wikimedia.org/T283192) [00:36:00] (03CR) 10Dzahn: [C: 03+2] install_server: add doh2* to use flat/virtual partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/693025 (https://phabricator.wikimedia.org/T283192) (owner: 10Dzahn) [00:39:00] (03PS1) 10Razzi: site: add dbstore1006 to replace db1004 [puppet] - 10https://gerrit.wikimedia.org/r/693046 [00:39:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 
0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:41:52] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:44:22] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:44:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:48:08] (03CR) 10Razzi: "Let me know how the config looks; hardware address taken from its former name as db1125." [puppet] - 10https://gerrit.wikimedia.org/r/693046 (owner: 10Razzi) [00:51:52] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:56:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:01:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:01:45] !log signing puppet certs for doh2001 and doh2002.wikimedia.org (T283192) [01:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:51] T283192: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283192 [01:04:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:11:47] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283192 (10Dzahn) 05Open→03Resolved VMs have been created, added to site.pp with "insetup", added to DHCP and partman. OS has been installed (buster) and puppet certs signed. You can now SSH to... [01:31:46] 10SRE, 10SRE-Access-Requests: Superset Access for Cooltey Feng - https://phabricator.wikimedia.org/T283189 (10Ottomata) Approved. Verifying that this is a case of https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Dashboards_in_Superset_/_Hive_interfaces_(like_Hue)_that_do_access_private_data, and th... [01:34:32] (03PS1) 10Jforrester: PageProps: be prepared that PageIdentity is not proper title [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/693028 (https://phabricator.wikimedia.org/T283170) [01:34:51] (03PS1) 10Jforrester: ActorStore: avoid throwing in case of invalid usernames [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/693029 (https://phabricator.wikimedia.org/T283167) [01:35:20] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10wiki_willy) Hi @Jclark-ctr - are there specific racks that you need the space in? We also have some high priority 740xd2 servers coming in Q1, that we should make room for at... 
[01:35:35] (03PS1) 10Jforrester: UploadFromStash: convert default user from false to null [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/693030 (https://phabricator.wikimedia.org/T283196) [01:44:30] (03CR) 10Ottomata: [C: 03+1] db1125: decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/692984 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [01:48:04] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:50:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:54:24] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:54:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:54:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:56:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:03:08] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:09:44] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:25:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:27:12] 
RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:27:32] (03CR) 10Ottomata: site: add dbstore1006 to replace db1004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/693046 (owner: 10Razzi) [02:33:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:36:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:40:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:42:38] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:42:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:44:32] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:49:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:51:00] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:53:34] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 
3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:55:32] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:55:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:00:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:02:20] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:05:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:42:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:44:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:46:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:49:02] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:01:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:02:18] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:43:26] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:48:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:55:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:58:33] (03PS1) 10Marostegui: Revert "db1141: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/693032 [04:58:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Repool db1141', diff saved to https://phabricator.wikimedia.org/P16107 and previous config saved to /var/cache/conftool/dbconfig/20210520-045852-root.json [04:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P16108 and previous config saved to /var/cache/conftool/dbconfig/20210520-045919-marostegui.json [04:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:49] (03CR) 10Marostegui: [C: 03+2] Revert "db1141: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/693032 (owner: 10Marostegui) [05:00:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P16109 and previous config saved to /var/cache/conftool/dbconfig/20210520-050025-marostegui.json [05:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:11:32] (03PS2) 10Jcrespo: Revert "bacula: Reenable read-write ES 
database backups, disable read-only" [puppet] - 10https://gerrit.wikimedia.org/r/692650 [05:12:29] (03PS1) 10Marostegui: mariadb: Decommission labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/693057 (https://phabricator.wikimedia.org/T282524) [05:13:13] (03CR) 10Jcrespo: [C: 03+2] Revert "bacula: Reenable read-write ES database backups, disable read-only" [puppet] - 10https://gerrit.wikimedia.org/r/692650 (owner: 10Jcrespo) [05:13:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts labsdb1011.eqiad.wmnet [05:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Repool db1141', diff saved to https://phabricator.wikimedia.org/P16110 and previous config saved to /var/cache/conftool/dbconfig/20210520-051355-root.json [05:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:14] (03PS1) 10Jcrespo: Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only"" [puppet] - 10https://gerrit.wikimedia.org/r/693033 [05:18:36] (03PS2) 10Jcrespo: Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only"" [puppet] - 10https://gerrit.wikimedia.org/r/693033 [05:22:53] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/693057 (https://phabricator.wikimedia.org/T282524) (owner: 10Marostegui) [05:22:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labsdb1011.eqiad.wmnet [05:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:51] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission labsdb1011.eqiad.wmnet - https://phabricator.wikimedia.org/T282524 (10Marostegui) This is ready for #dc-ops [05:23:57] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission labsdb1011.eqiad.wmnet - 
https://phabricator.wikimedia.org/T282524 (10Marostegui) a:05Marostegui→03wiki_willy [05:24:07] 10ops-eqiad, 10decommission-hardware: decommission labsdb1011.eqiad.wmnet - https://phabricator.wikimedia.org/T282524 (10Marostegui) [05:24:42] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic reboot - ryankemper@cumin1001 - T283223 [05:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:46] T283223: Reboot cloudelastic* to apply security updates - https://phabricator.wikimedia.org/T283223 [05:25:09] (03PS1) 10Marostegui: maintain_dbusers.pp: Remove reference to labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/693058 (https://phabricator.wikimedia.org/T282662) [05:27:12] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic reboot - ryankemper@cumin1001 - T283223 [05:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:43] (ctrl+c'd, need to set a lower # of nodes at a time) [05:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Repool db1141', diff saved to https://phabricator.wikimedia.org/P16111 and previous config saved to /var/cache/conftool/dbconfig/20210520-052859-root.json [05:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:03] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 245 threshold =0.15 breach: active_shards: 1276, status: yellow, relocating_shards: 0, timed_out: False, active_primary_shards: 759, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, initializing_shards: 20, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 
83.892176199868 [05:29:03] rds: 225, number_of_data_nodes: 5, cluster_name: cloudelastic-chi-eqiad, number_of_nodes: 5, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:30:27] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards: 1490, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, timed_out: False, active_shards_percent_as_number: 97.96186719263642, number_of_data_nodes: 6, unassigned_shards: 27, relocating_shards: 0, number_of_pending_tasks: 2, status: yellow, task_max_waiting_in_queue_ [05:30:27] ializing_shards: 4, active_primary_shards: 759, number_of_nodes: 6, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:33:05] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic reboot - ryankemper@cumin1001 - T283223 [05:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:09] T283223: Reboot cloudelastic* to apply security updates - https://phabricator.wikimedia.org/T283223 [05:33:21] !log T283223 `sudo -i cookbook sre.elasticsearch.rolling-operation cloudelastic "cloudelastic reboot" --reboot --nodes-per-run 1 --start-datetime 2021-05-20T05:16:40 --task-id T283223` on `ryankemper@cumin1001` tmux session `restart_cloudelastic` [05:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:37:49] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:38:15] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 241 threshold =0.15 breach: unassigned_shards: 241, task_max_waiting_in_queue_millis: 49, active_shards: 1207, cluster_name: cloudelastic-psi-eqiad, timed_out: False, number_of_in_flight_fetch: 1002, active_primary_shards: 723, number_of_nodes: 6, status: yellow, delayed_unassigned_shards: 0, number_of [05:38:15] umber_of_pending_tasks: 2, initializing_shards: 0, relocating_shards: 0, active_shards_percent_as_number: 83.35635359116023 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:38:15] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 241 threshold =0.15 breach: number_of_data_nodes: 6, active_shards_percent_as_number: 83.35635359116023, task_max_waiting_in_queue_millis: 195, active_shards: 1207, number_of_in_flight_fetch: 588, delayed_unassigned_shards: 0, cluster_name: cloudelastic-psi-eqiad, unassigned_shards: 241, number_of_node [05:38:15] g_shards: 0, active_primary_shards: 723, timed_out: False, status: yellow, number_of_pending_tasks: 2, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:40:01] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: timed_out: False, unassigned_shards: 0, cluster_name: cloudelastic-psi-eqiad, initializing_shards: 0, relocating_shards: 0, active_shards: 1448, number_of_nodes: 6, delayed_unassigned_shards: 0, active_primary_shards: 723, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 100.0, number [05:40:01] : 0, status: green, task_max_waiting_in_queue_millis: 0, number_of_data_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:40:01] RECOVERY - ElasticSearch health 
check for shards on 9600 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: relocating_shards: 0, timed_out: False, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0, status: green, active_primary_shards: 723, delayed_unassigned_shards: 0, number_of_data_nodes: 6, number_of_nodes: 6, active_shards: 1448, initializing_sh [05:40:01] ed_shards: 0, number_of_in_flight_fetch: 0, cluster_name: cloudelastic-psi-eqiad https://wikitech.wikimedia.org/wiki/Search%23Administration [05:41:19] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:41:51] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:42:05] (03CR) 10Marostegui: [C: 03+2] maintain_dbusers.pp: Remove reference to labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/693058 (https://phabricator.wikimedia.org/T282662) (owner: 10Marostegui) [05:44:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Repool db1141', diff saved to https://phabricator.wikimedia.org/P16112 and previous config saved to /var/cache/conftool/dbconfig/20210520-054402-root.json [05:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:01] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 250 threshold =0.15 breach: relocating_shards: 0, status: yellow, number_of_in_flight_fetch: 0, number_of_pending_tasks: 0, active_shards: 1248, number_of_nodes: 6, task_max_waiting_in_queue_millis: 0, timed_out: 
False, initializing_shards: 0, unassigned_shards: 250, active_primary_shards: 748, number_ [05:47:01] active_shards_percent_as_number: 83.31108144192257, delayed_unassigned_shards: 0, cluster_name: cloudelastic-omega-eqiad https://wikitech.wikimedia.org/wiki/Search%23Administration [05:47:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:47:35] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 253 threshold =0.15 breach: active_shards_percent_as_number: 83.36620644312951, timed_out: False, initializing_shards: 0, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, active_shards: 1268, cluster_name: cloudelastic-chi-eqiad, status: yellow, unassigned_ [05:47:35] er_of_nodes: 6, relocating_shards: 0, number_of_in_flight_fetch: 0, number_of_data_nodes: 6, active_primary_shards: 759 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:47:49] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 246 threshold =0.15 breach: delayed_unassigned_shards: 0, timed_out: False, number_of_nodes: 6, status: yellow, task_max_waiting_in_queue_millis: 6485, number_of_in_flight_fetch: 0, active_primary_shards: 759, unassigned_shards: 242, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number [05:47:49] 13, relocating_shards: 0, initializing_shards: 4, number_of_data_nodes: 6, active_shards: 1275, number_of_pending_tasks: 4 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:49:11] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: number_of_pending_tasks: 0, active_primary_shards: 748, initializing_shards: 0, active_shards: 1498, active_shards_percent_as_number: 
100.0, number_of_data_nodes: 6, unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, status: green, timed_out: False, delayed_unassigned_shards: 0, cluster_n [05:49:11] -omega-eqiad, number_of_in_flight_fetch: 0, number_of_nodes: 6, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:49:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:49:42] (Sorry for the shard health check noise - the cookbook downtimes each host so not sure why those alerts are coming through...will take a look tomorrow) [05:49:45] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, relocating_shards: 0, number_of_pending_tasks: 0, initializing_shards: 0, number_of_nodes: 6, number_of_data_nodes: 6, active_shards_percent_as_number: 100.0, active_primary_shards: 759, delayed_unassigned_shards: 0, timed_out: False, status: green, cluster_name [05:49:45] i-eqiad, number_of_in_flight_fetch: 0, active_shards: 1521, unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:49:57] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, active_primary_shards: 759, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, timed_out: False, number_of_data_nodes: 6, status: green, relocating_shards: 0, number_of_nodes: 6, unassigned_shards: 0, active_shards: 1521, initializing_shards: 0, cluster [05:49:57] ic-chi-eqiad, active_shards_percent_as_number: 100.0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:53:42] PROBLEM - ElasticSearch health check for shards on 9600 on 
cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 241 threshold =0.15 breach: task_max_waiting_in_queue_millis: 0, active_primary_shards: 723, number_of_in_flight_fetch: 0, number_of_nodes: 5, cluster_name: cloudelastic-psi-eqiad, delayed_unassigned_shards: 0, timed_out: False, number_of_data_nodes: 5, unassigned_shards: 241, relocating_shards: 0, ini [05:53:42] 0, active_shards_percent_as_number: 83.35635359116023, number_of_pending_tasks: 1, status: yellow, active_shards: 1207 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:54:44] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: timed_out: False, cluster_name: cloudelastic-psi-eqiad, active_primary_shards: 723, number_of_pending_tasks: 1, unassigned_shards: 35, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, delayed_unassigned_shards: 0, number_of_data_nodes: 6, status: yellow, initializing_shards: 2, number_of_ [05:54:44] of_in_flight_fetch: 0, active_shards_percent_as_number: 97.44475138121547, active_shards: 1411 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:54:58] I deploy this UBN! https://gerrit.wikimedia.org/r/c/mediawiki/core/+/693028 [05:56:17] (03CR) 10Marostegui: [C: 04-1] site: add dbstore1006 to replace db1004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/693046 (owner: 10Razzi) [05:56:29] thanks Amir1 [05:56:30] (03CR) 10Ladsgroup: [C: 03+2] PageProps: be prepared that PageIdentity is not proper title [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/693028 (https://phabricator.wikimedia.org/T283170) (owner: 10Jforrester) [05:56:38] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:58:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:58:33] (03PS1) 10Marostegui: cumin: Remove labsdb* [puppet] - 10https://gerrit.wikimedia.org/r/693059 (https://phabricator.wikimedia.org/T282662) [06:00:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:01:36] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 250 threshold =0.15 breach: number_of_pending_tasks: 0, initializing_shards: 0, number_of_nodes: 5, status: yellow, number_of_data_nodes: 5, active_shards_percent_as_number: 83.31108144192257, delayed_unassigned_shards: 0, timed_out: False, active_shards: 1248, unassigned_shards: 250, relocating_shards [06:01:36] _flight_fetch: 0, active_primary_shards: 748, cluster_name: cloudelastic-omega-eqiad, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:01:36] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: status: yellow, number_of_pending_tasks: 0, unassigned_shards: 254, active_shards: 1267, active_shards_percent_as_number: 83.30046022353714, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, number_of_nodes: 5, relocating_shards: 0, active_primary_shards: 75 [06:01:36] se, cluster_name: cloudelastic-chi-eqiad, number_of_in_flight_fetch: 0, initializing_shards: 0, number_of_data_nodes: 5 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:01:36] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1002 is CRITICAL: 
CRITICAL - elasticsearch inactive shards 242 threshold =0.15 breach: active_primary_shards: 723, cluster_name: cloudelastic-psi-eqiad, active_shards_percent_as_number: 83.28729281767956, timed_out: False, number_of_in_flight_fetch: 0, unassigned_shards: 242, status: yellow, initializing_shards: 0, task_max_waiting_in_queue_millis: 0, number_o [06:01:36] number_of_pending_tasks: 0, relocating_shards: 0, number_of_nodes: 5, active_shards: 1206, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:01:36] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: timed_out: False, active_shards_percent_as_number: 83.30046022353714, number_of_in_flight_fetch: 0, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, number_of_nodes: 5, initializing_shards: 0, unassigned_shards: 254, cluster_name: clou [06:01:37] d, active_shards: 1267, number_of_data_nodes: 5, number_of_pending_tasks: 0, active_primary_shards: 759, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration [06:01:38] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: active_shards: 1267, timed_out: False, status: yellow, task_max_waiting_in_queue_millis: 0, number_of_nodes: 5, initializing_shards: 0, cluster_name: cloudelastic-chi-eqiad, unassigned_shards: 254, number_of_data_nodes: 5, active_primary_shards: 759, relocating_shards: 0, nu [06:01:38] asks: 0, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 83.30046022353714, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:01:42] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 250 threshold =0.15 breach: status: yellow, 
active_shards_percent_as_number: 83.31108144192257, active_shards: 1248, timed_out: False, cluster_name: cloudelastic-omega-eqiad, task_max_waiting_in_queue_millis: 0, active_primary_shards: 748, number_of_data_nodes: 5, number_of_pending_tasks: 0, relocating [06:01:42] r_of_in_flight_fetch: 0, unassigned_shards: 250, initializing_shards: 0, number_of_nodes: 5, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:01:42] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 250 threshold =0.15 breach: number_of_data_nodes: 5, active_shards: 1248, status: yellow, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, active_primary_shards: 748, timed_out: False, number_of_in_flight_fetch: 0, cluster_name: cloudelastic-omega-eqiad, initializing_shards: 0, number_of_node [06:01:42] shards: 250, number_of_pending_tasks: 0, active_shards_percent_as_number: 83.31108144192257, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:01:42] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: task_max_waiting_in_queue_millis: 0, number_of_data_nodes: 5, number_of_pending_tasks: 0, active_primary_shards: 759, status: yellow, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, initializing_shards: 0, relocating_shards: 0, active_shards_percent_as_nu [06:01:42] 353714, number_of_in_flight_fetch: 0, unassigned_shards: 254, timed_out: False, active_shards: 1267, number_of_nodes: 5 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:02:14] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 242 threshold =0.15 breach: active_shards: 1206, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, cluster_name: 
cloudelastic-psi-eqiad, active_primary_shards: 723, unassigned_shards: 242, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, initializing_shards: 0, number_of_in_flight_fet [06:02:14] data_nodes: 5, number_of_nodes: 5, active_shards_percent_as_number: 83.28729281767956, status: yellow, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [06:02:14] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 242 threshold =0.15 breach: unassigned_shards: 242, number_of_data_nodes: 5, active_shards_percent_as_number: 83.28729281767956, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, number_of_nodes: 5, number_of_in_flight_fetch: 0, status: yellow, cluster_name: cloudelastic-psi-eqiad, reloc [06:02:14] delayed_unassigned_shards: 0, initializing_shards: 0, active_shards: 1206, timed_out: False, active_primary_shards: 723 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:02:35] ^ Working on downtiming those manually...guessing the master just restarted (expected) thus why each host is complaining [06:02:36] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 250 threshold =0.15 breach: number_of_data_nodes: 5, number_of_nodes: 5, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, status: yellow, cluster_name: cloudelastic-omega-eqiad, unassigned_shards: 250, number_of_in_flight_fetch: 0, tim [06:02:36] tive_shards_percent_as_number: 83.31108144192257, initializing_shards: 0, active_shards: 1248, active_primary_shards: 748 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:02:36] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 242 threshold =0.15 breach: number_of_data_nodes: 5, delayed_unassigned_shards: 
0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, cluster_name: cloudelastic-psi-eqiad, number_of_pending_tasks: 0, timed_out: False, active_shards: 1206, status: yellow, number_of_nodes: 5, active_shards_percen [06:02:36] 8729281767956, number_of_in_flight_fetch: 0, initializing_shards: 0, unassigned_shards: 242, active_primary_shards: 723 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:03:32] (Downtime set for two hours on `cloudelastic100[1-6]`) [06:03:46] RECOVERY - Check systemd state on cloudelastic1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:06] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: number_of_nodes: 6, cluster_name: cloudelastic-psi-eqiad, task_max_waiting_in_queue_millis: 0, initializing_shards: 2, status: yellow, relocating_shards: 0, active_shards: 1378, active_shards_percent_as_number: 95.1657458563536, number_of_pending_tasks: 0, timed_out: False, number_of_data_nodes: 6, [06:04:06] ght_fetch: 0, delayed_unassigned_shards: 0, active_primary_shards: 723, unassigned_shards: 68 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:34] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: active_shards_percent_as_number: 100.0, unassigned_shards: 0, number_of_data_nodes: 6, relocating_shards: 0, status: green, active_primary_shards: 723, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, cluster_name: cloudelastic-psi-eqiad, number_of_no [06:04:34] ards: 1448, timed_out: False, number_of_pending_tasks: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:34] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1004 is OK: OK - elasticsearch status 
cloudelastic-chi-eqiad: delayed_unassigned_shards: 0, relocating_shards: 0, number_of_pending_tasks: 1, initializing_shards: 4, timed_out: False, task_max_waiting_in_queue_millis: 0, unassigned_shards: 63, status: yellow, active_primary_shards: 759, number_of_data_nodes: 6, cluster_name: cloudelastic-chi-eqiad, active_sha [06:04:34] _shards_percent_as_number: 95.59500328731097, number_of_nodes: 6, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:34] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: active_shards_percent_as_number: 100.0, active_shards: 1498, task_max_waiting_in_queue_millis: 0, initializing_shards: 0, delayed_unassigned_shards: 0, cluster_name: cloudelastic-omega-eqiad, active_primary_shards: 748, number_of_in_flight_fetch: 0, unassigned_shards: 0, number_of_nodes: 6, reloc [06:04:34] number_of_pending_tasks: 0, status: green, number_of_data_nodes: 6, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:36] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: status: yellow, cluster_name: cloudelastic-chi-eqiad, timed_out: False, number_of_data_nodes: 6, active_primary_shards: 759, number_of_in_flight_fetch: 0, active_shards: 1472, unassigned_shards: 45, active_shards_percent_as_number: 96.7784352399737, relocating_shards: 0, initializing_shards: 4, del [06:04:36] hards: 0, number_of_nodes: 6, task_max_waiting_in_queue_millis: 205, number_of_pending_tasks: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:36] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_in_flight_fetch: 0, unassigned_shards: 37, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, 
task_max_waiting_in_queue_millis: 250, timed_out: False, relocating_shards: 0, number_of_data_nodes: 6, number_of_pending_tasks: 3, active_shards: 1480, active_primary_shards: 75 [06:04:36] percent_as_number: 97.30440499671269, initializing_shards: 4, status: yellow, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:42] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: green, active_primary_shards: 748, timed_out: False, number_of_nodes: 6, relocating_shards: 0, number_of_pending_tasks: 0, unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_data_nodes: 6, active_shards_percent_as_number: 100.0, de [06:04:42] shards: 0, active_shards: 1498, number_of_in_flight_fetch: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:42] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: number_of_data_nodes: 6, active_primary_shards: 748, active_shards: 1498, relocating_shards: 0, number_of_nodes: 6, timed_out: False, initializing_shards: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, status: green, delayed_unassigned_shards: 0, [06:04:42] s: 0, cluster_name: cloudelastic-omega-eqiad, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:42] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: initializing_shards: 4, delayed_unassigned_shards: 0, number_of_pending_tasks: 6, unassigned_shards: 19, relocating_shards: 0, active_primary_shards: 759, number_of_nodes: 6, number_of_data_nodes: 6, task_max_waiting_in_queue_millis: 1501, cluster_name: cloudelastic-chi-eqiad, number_of_in_flight_f 
[06:04:42] hards_percent_as_number: 98.4878369493754, active_shards: 1498, timed_out: False, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration [06:04:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:05:14] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: delayed_unassigned_shards: 0, active_shards_percent_as_number: 100.0, relocating_shards: 0, number_of_nodes: 6, active_primary_shards: 723, timed_out: False, number_of_data_nodes: 6, cluster_name: cloudelastic-psi-eqiad, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, initializin [06:05:14] ve_shards: 1448, unassigned_shards: 0, number_of_pending_tasks: 0, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration [06:05:14] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: status: green, delayed_unassigned_shards: 0, initializing_shards: 0, active_shards: 1448, relocating_shards: 0, number_of_nodes: 6, active_primary_shards: 723, unassigned_shards: 0, cluster_name: cloudelastic-psi-eqiad, number_of_data_nodes: 6, task_max_waiting_in_queue_millis: 0, timed_out: False, [06:05:14] ght_fetch: 0, active_shards_percent_as_number: 100.0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:05:36] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 748, relocating_shards: 0, unassigned_shards: 0, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, active_shards: 1498, task_max_waiting_in_queue_millis: 0, initializing_shards: 0, status: 
green, number_of_ [06:05:36] uster_name: cloudelastic-omega-eqiad, number_of_pending_tasks: 0, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:06:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:06:40] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:08:44] !log powercycle ms-be2035 - no ssh available, no metrics since hours ago, I/O errors registered in the main tty on serial console [06:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:40] PROBLEM - Host ms-be2035 is DOWN: PING CRITICAL - Packet loss = 100% [06:12:16] RECOVERY - Host ms-be2035 is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms [06:13:02] PROBLEM - puppet last run on ms-be2035 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:13:35] Cc: godog: --^ [06:13:50] <_joe_> oh well good morning [06:13:51] (not sure if there is anything to follow up on) [06:16:43] (Merged) jenkins-bot: PageProps: be prepared that PageIdentity is not proper title [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/693028 (https://phabricator.wikimedia.org/T283170) (owner: Jforrester) [06:17:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:19:04] RECOVERY - puppet last run on ms-be2035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:20:18] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:23:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:25:10] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.6/includes/PageProps.php: Backport: [[gerrit:693028|PageProps: be prepared that PageIdentity is not proper title (T283170)]] (duration: 01m 06s) [06:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:14] T283170: Special:RecentChanges in it.wikiversity dies with an internal error - https://phabricator.wikimedia.org/T283170 [06:29:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P16113 and previous config saved to /var/cache/conftool/dbconfig/20210520-062921-root.json [06:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:47] (PS1) Marostegui: check_private_data_report: Remove references to labsdb [puppet] - https://gerrit.wikimedia.org/r/693060 (https://phabricator.wikimedia.org/T282662) [06:32:19] (CR) Marostegui: [C: +2] check_private_data_report: Remove references to labsdb [puppet] - https://gerrit.wikimedia.org/r/693060 (https://phabricator.wikimedia.org/T282662) (owner: Marostegui) [06:33:00] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:20] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P16114 and previous config saved to /var/cache/conftool/dbconfig/20210520-064425-root.json [06:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:04] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:47] (CR) Giuseppe Lavagetto: [C: +1] Add tokens and users for mwdebug service [puppet] - https://gerrit.wikimedia.org/r/692667 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli) [06:50:08] !log T283223 Write queue not draining fast enough for the next node to reboot, will finish reboot tomorrow [06:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:11] (CR) Giuseppe Lavagetto: [C: +1] Add tokens for mwdebug service [labs/private] - https://gerrit.wikimedia.org/r/692672 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli) [06:50:12] T283223: Reboot cloudelastic* to apply security updates - https://phabricator.wikimedia.org/T283223 [06:50:12] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic reboot - ryankemper@cumin1001 - T283223 [06:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:21] (CR) Effie Mouzeli: [C: +2] Add tokens for mwdebug service [labs/private] - https://gerrit.wikimedia.org/r/692672 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli) [06:56:36] (CR) Effie Mouzeli: [V: +2 C: +2] Add tokens for mwdebug service [labs/private] - https://gerrit.wikimedia.org/r/692672
(https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli) [06:57:09] (CR) Effie Mouzeli: [C: +2] Add tokens and users for mwdebug service [puppet] - https://gerrit.wikimedia.org/r/692667 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli) [06:57:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:58:22] (PS1) Marostegui: db1143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/693061 [06:59:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P16115 and previous config saved to /var/cache/conftool/dbconfig/20210520-065928-root.json [06:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:01:51] (CR) Marostegui: [C: +2] db1143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/693061 (owner: Marostegui) [07:11:08] elukey: thanks for the reboot!
I'll take a look shortly [07:14:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P16116 and previous config saved to /var/cache/conftool/dbconfig/20210520-071432-root.json [07:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P16117 and previous config saved to /var/cache/conftool/dbconfig/20210520-071723-marostegui.json [07:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:28] yeah looks like the host is back no problem [07:18:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:49] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:21:29] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:23:37] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:23:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:24:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:25:01] (PS1) Marostegui: orchestrator.conf.json: Remove labsdb* [puppet] - https://gerrit.wikimedia.org/r/693123 (https://phabricator.wikimedia.org/T282662) [07:25:57] (PS2) Marostegui: orchestrator.conf.json: Remove labsdb* [puppet] - https://gerrit.wikimedia.org/r/693123 (https://phabricator.wikimedia.org/T282662) [07:33:16] (PS1) Effie Mouzeli: Add a namespace for mwdebug service [deployment-charts] - https://gerrit.wikimedia.org/r/693124 (https://phabricator.wikimedia.org/T283056) [07:34:26] (CR) JMeybohm: [C: +1] Add a namespace for mwdebug service [deployment-charts] - https://gerrit.wikimedia.org/r/693124 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli) [07:34:29] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:36:27] (CR) Effie Mouzeli: [C: +2] Add a namespace for mwdebug service [deployment-charts] - https://gerrit.wikimedia.org/r/693124 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli) [07:37:17] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL:
ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:38:36] (Merged) jenkins-bot: Add a namespace for mwdebug service [deployment-charts] - https://gerrit.wikimedia.org/r/693124 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli) [07:39:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:41:45] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [07:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:32] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [07:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:50:07] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:56:09] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:05] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:03:31] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:06:31] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me [08:07:07] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [08:08:33] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:09:25] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [08:11:09] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:15:51] (03PS1) 10Muehlenhoff: Skip Cumin/Homer/Spicerack on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) [08:22:51] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me [08:23:27] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [08:24:15] mmmm [08:24:49] the only thing that I can think of is restbase-async running on codfw appservers [08:24:52] app/api [08:25:07] but in theory I'd expect only api-appservers [08:25:45] RECOVERY - High average GET latency for mw requests on appserver in codfw on 
alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [08:27:18] (03PS1) 10Andrew-WMDE: [beta] Enable back button in the VisualEditor transclusion dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693131 (https://phabricator.wikimedia.org/T272354) [08:27:23] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:30:14] <_joe_> elukey: wait what? [08:30:25] <_joe_> restbase-async doesn't run on codfw appservers [08:30:33] <_joe_> if it's doing so, It's a huge incident [08:30:38] <_joe_> also, please move over [08:30:53] <_joe_> (to a chat network not controlled by a piece of shit) [08:31:40] _joe_ no no it doesn't seem so, it seems general monitoring being a little slow [08:31:47] I checked on one api appserver in codfw [08:31:53] didn't see anything weird so far [08:31:59] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me [08:32:10] move over where? Operations on libera? 
I am there, but icinga-wm is not :D [08:32:38] <_joe_> yeah I just asked about it [08:32:45] <_joe_> but I'd prefer not to chat here anymore [08:34:17] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:37:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:49] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:47:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P16118 and previous config saved to /var/cache/conftool/dbconfig/20210520-084746-root.json [08:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:03] (CR) Thiemo Kreuz (WMDE): [C: +1] [beta] Enable back button in the VisualEditor transclusion dialog [mediawiki-config] - https://gerrit.wikimedia.org/r/693131 (https://phabricator.wikimedia.org/T272354) (owner: Andrew-WMDE) [08:54:23] (PS1) Filippo Giunchedi: icinga: move icinga-wm to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693132 (https://phabricator.wikimedia.org/T283213) [08:54:56] seeking reviewers for ^ [08:55:07] (CR) Kormat: cumin: Remove labsdb* (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693059 (https://phabricator.wikimedia.org/T282662) (owner: Marostegui) [08:55:09] (CR)
RhinosF1: [C: +1] icinga: move icinga-wm to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693132 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi) [08:55:46] thanks RhinosF1 ! [08:55:55] godog: no [08:55:59] Np* [08:56:03] (CR) Filippo Giunchedi: [C: +2] icinga: move icinga-wm to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693132 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi) [08:56:13] (CR) Kormat: [C: +1] orchestrator.conf.json: Remove labsdb* [puppet] - https://gerrit.wikimedia.org/r/693123 (https://phabricator.wikimedia.org/T282662) (owner: Marostegui) [08:56:50] !log move icinga-wm to libera.chat [08:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:08] ok I'll stop writing here [09:00:40] (CR) Marostegui: [C: +2] orchestrator.conf.json: Remove labsdb* [puppet] - https://gerrit.wikimedia.org/r/693123 (https://phabricator.wikimedia.org/T282662) (owner: Marostegui) [09:02:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P16119 and previous config saved to /var/cache/conftool/dbconfig/20210520-090250-root.json [09:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:41] (PS1) Filippo Giunchedi: alertmanager: move to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693133 (https://phabricator.wikimedia.org/T283213) [09:09:25] (CR) Filippo Giunchedi: [C: +2] alertmanager: move to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693133 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi) [09:11:44] <_joe_> we still need to move stashbot before we can fully migrate [09:12:41] (PS2) Marostegui: cumin: Remove labsdb* [puppet] - https://gerrit.wikimedia.org/r/693059 (https://phabricator.wikimedia.org/T282662) [09:12:58] (CR) Marostegui: cumin: Remove
labsdb* (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693059 (https://phabricator.wikimedia.org/T282662) (owner: Marostegui)
[09:17:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P16120 and previous config saved to /var/cache/conftool/dbconfig/20210520-091754-root.json
[09:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:32] (CR) Kormat: [C: +1] cumin: Remove labsdb* [puppet] - https://gerrit.wikimedia.org/r/693059 (https://phabricator.wikimedia.org/T282662) (owner: Marostegui)
[09:20:44] (CR) Marostegui: [C: +2] cumin: Remove labsdb* [puppet] - https://gerrit.wikimedia.org/r/693059 (https://phabricator.wikimedia.org/T282662) (owner: Marostegui)
[09:28:29] (CR) Ayounsi: [C: +1] Skip Cumin/Homer/Spicerack on cumin2001 [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[09:32:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P16121 and previous config saved to /var/cache/conftool/dbconfig/20210520-093257-root.json
[09:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P16122 and previous config saved to /var/cache/conftool/dbconfig/20210520-093510-marostegui.json
[09:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:49] (CR) Andrew-WMDE: [C: +2] "Deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/693131 (https://phabricator.wikimedia.org/T272354) (owner: Andrew-WMDE)
[09:38:39] (Merged) jenkins-bot: [beta] Enable back button in the VisualEditor transclusion dialog [mediawiki-config] -
https://gerrit.wikimedia.org/r/693131 (https://phabricator.wikimedia.org/T272354) (owner: Andrew-WMDE)
[09:45:54] (PS10) Kormat: mariadb: Convert pt-heartbeat to a systemd service. [puppet] - https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528)
[09:58:05] (CR) Elukey: "I am trying to import a more recent version of Istio (1.9.x series) since it seems supported by kubeflow (and there are also some CVEs fix" [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[10:00:04] mvolz: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T1000).
[10:10:51] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[10:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:54] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[10:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:00] !log Deploy schema change on s1 codfw, lag will appear in codfw T266486 T268392 T273360
[10:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:06] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392
[10:15:07] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360
[10:15:07] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486
[10:20:34] (CR) Hnowlan: maps: DB performance improvements (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685743 (owner: MSantos)
[10:27:51] SRE, wikimedia-irc-freenode: Move SRE-related channels to Libera - https://phabricator.wikimedia.org/T283230 (Joe)
[10:28:53] SRE, CAS-SSO, Patch-For-Review: Kryo memcached transcoder broken in CAS 6.3/6.4 - https://phabricator.wikimedia.org/T273867 (MoritzMuehlenhoff) p:Medium→Low
[10:33:38] SRE, wikimedia-irc-freenode: Move SRE-related channels to Libera - https://phabricator.wikimedia.org/T283230 (Joe)
[10:33:42] SRE, Security-Team, CAS-SSO, User-jbond: CAS Single Logout Flow - https://phabricator.wikimedia.org/T233941 (MoritzMuehlenhoff)
[10:38:01] SRE, wikimedia-irc-freenode: Move SRE-related channels to Libera - https://phabricator.wikimedia.org/T283230 (Marostegui)
[10:39:38] (PS1) Filippo Giunchedi: icinga: move logmsgbot to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693138 (https://phabricator.wikimedia.org/T283213)
[10:39:53] (CR) jerkins-bot: [V: -1] icinga: move logmsgbot to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693138 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi)
[10:40:39] (PS2) Filippo Giunchedi:
icinga: move logmsgbot to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693138 (https://phabricator.wikimedia.org/T283213)
[10:42:02] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29624/console" [puppet] - https://gerrit.wikimedia.org/r/693138 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi)
[10:43:00] (CR) Filippo Giunchedi: [V: +1] "As per task, this will need to be coordinated once stashbot is on libera.chat too" [puppet] - https://gerrit.wikimedia.org/r/693138 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi)
[10:50:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P16123 and previous config saved to /var/cache/conftool/dbconfig/20210520-105018-root.json
[10:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:38] SRE, Security-Team, CAS-SSO, User-jbond: CAS Single Logout Flow - https://phabricator.wikimedia.org/T233941 (MoritzMuehlenhoff) Status update: The Single Logout has been implemented across all applications using mod_cas and tested for all affected applications. It works fine for all application...
[11:00:04] Amir1, Lucas_WMDE, apergos, and duesen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) EU Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T1100).
[11:00:16] o/
[11:00:24] SRE, wikimedia-irc-freenode: Move SRE-related channels to Libera - https://phabricator.wikimedia.org/T283230 (jbond)
[11:02:57] anything to train or deploy?
[11:03:06] (train as in training, not deployment train :D)
[11:03:30] SRE, wikimedia-irc-freenode: Move SRE-related channels to Libera - https://phabricator.wikimedia.org/T283230 (MoritzMuehlenhoff)
[11:04:32] SRE, wikimedia-irc-freenode: Move SRE-related channels to Libera - https://phabricator.wikimedia.org/T283230 (MoritzMuehlenhoff)
[11:05:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P16124 and previous config saved to /var/cache/conftool/dbconfig/20210520-110522-root.json
[11:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:29] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Aklapper)
[11:20:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P16125 and previous config saved to /var/cache/conftool/dbconfig/20210520-112026-root.json
[11:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:55] I am in a colab session for the cpt code jam, but there was no one scheduled for training during this backport window.
[11:21:00] my apologies for not being around.
[11:21:11] SRE, netops: routinator: create gabage collection job - https://phabricator.wikimedia.org/T282469 (jbond) >>! In T282469#7099149, @ayounsi wrote: > @jbond would the upcoming changes in https://github.com/NLnetLabs/routinator/releases/tag/v0.9.0-rc1 solve that issue by using a database instead of the file...
[11:22:27] (PS1) Jcrespo: dbbackups: Switchover codfw s5 backups from db2099 to db2101 (buster) [puppet] - https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235)
[11:31:06] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283192 (ssingh) >>!
In T283192#7099749, @Dzahn wrote: > VMs have been created, added to site.pp with "insetup", added to DHCP and partma. > > OS has been installed (buster) and puppet certs sig...
[11:31:46] (PS7) Jbond: C:admin: add ability to manage home [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:33:38] (CR) Ssingh: "As per the discussion in https://phabricator.wikimedia.org/T252132#7098776, we decided that this should be done manually through operation" [dns] - https://gerrit.wikimedia.org/r/692625 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[11:34:26] (CR) Ssingh: Add zone for wikimedia-dns.org (Wikidough) (2 comments) [dns] - https://gerrit.wikimedia.org/r/692625 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[11:35:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P16126 and previous config saved to /var/cache/conftool/dbconfig/20210520-113529-root.json
[11:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:41] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/692635 (owner: Jbond)
[11:58:06] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/692632 (owner: Jbond)
[12:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T1200)
[12:00:34] (CR) Muehlenhoff: (test) migrate sretest to new role_data profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/692636 (owner: Jbond)
[12:02:29] (PS3) Jbond: (test) migrate sretest to new role_data profile [puppet] - https://gerrit.wikimedia.org/r/692636
[12:02:48] (CR) Volans: [C: +1] "LGTM, I would probably add something to the MOTD too if you don't mind."
[puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[12:03:27] (CR) jerkins-bot: [V: -1] (test) migrate sretest to new role_data profile [puppet] - https://gerrit.wikimedia.org/r/692636 (owner: Jbond)
[12:03:29] (CR) Jbond: (test) migrate sretest to new role_data profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/692636 (owner: Jbond)
[12:11:55] (PS1) Marostegui: site.pp: s3 is no longer default for new wikis. [puppet] - https://gerrit.wikimedia.org/r/693148 (https://phabricator.wikimedia.org/T259438)
[12:12:15] (PS1) Jbond: (WIP) create a logout.d proffile for managing logout scripts [puppet] - https://gerrit.wikimedia.org/r/693149
[12:12:30] (CR) Marostegui: [C: +2] site.pp: s3 is no longer default for new wikis. [puppet] - https://gerrit.wikimedia.org/r/693148 (https://phabricator.wikimedia.org/T259438) (owner: Marostegui)
[12:14:11] (CR) Muehlenhoff: (WIP) create a logout.d proffile for managing logout scripts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693149 (owner: Jbond)
[12:25:23] (CR) Marostegui: "What about dbprov? Does it need to be removed from there too?"
[puppet] - https://gerrit.wikimedia.org/r/692341 (https://phabricator.wikimedia.org/T280751) (owner: Jcrespo)
[12:27:12] (CR) Jcrespo: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/692341 (https://phabricator.wikimedia.org/T280751) (owner: Jcrespo)
[12:28:08] (CR) Marostegui: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/692341 (https://phabricator.wikimedia.org/T280751) (owner: Jcrespo)
[12:30:49] !log Deploying wmfmariadbpy 0.7 T283228
[12:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:55] T283228: Deploy wmfmariadbpy 0.7 - https://phabricator.wikimedia.org/T283228
[12:31:49] (CR) Kormat: [C: +2] "It's Go time :)" [puppet] - https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528) (owner: Kormat)
[12:32:51] (CR) Jcrespo: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/692341 (https://phabricator.wikimedia.org/T280751) (owner: Jcrespo)
[12:33:23] (CR) Marostegui: [C: +1] "Thanks for double checking :-)" [puppet] - https://gerrit.wikimedia.org/r/692341 (https://phabricator.wikimedia.org/T280751) (owner: Jcrespo)
[12:35:44] (CR) Jcrespo: "@marostegui To clarify, this is an example of when it is removed from the dbbackups configuration FYI. Here is when it becomes passive, bu" [puppet] - https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235) (owner: Jcrespo)
[12:36:36] (CR) Marostegui: "thank you - I didn't remember the s6 one" [puppet] - https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235) (owner: Jcrespo)
[12:37:01] (CR) Jcrespo: [C: -1] "Waiting for green light from DBAs (ticket stalled)."
[puppet] - https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235) (owner: Jcrespo)
[12:37:26] (CR) Marostegui: "It will take around 1 month if all goes well" [puppet] - https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235) (owner: Jcrespo)
[12:39:00] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Majavah)
[12:41:59] SRE, MW-on-K8s, serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[12:42:09] SRE, Traffic: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (BBlack) bump for testing purposes
[12:44:42] (PS2) Jbond: (WIP) create a logout.d profile for managing logout scripts [puppet] - https://gerrit.wikimedia.org/r/693149
[12:46:13] (CR) jerkins-bot: [V: -1] (WIP) create a logout.d profile for managing logout scripts [puppet] - https://gerrit.wikimedia.org/r/693149 (owner: Jbond)
[12:50:04] (PS1) Andrew Bogott: Nova vendordata.txt: fix up new VMs that have chrony installed [puppet] - https://gerrit.wikimedia.org/r/693152 (https://phabricator.wikimedia.org/T280801)
[12:55:15] (CR) Andrew Bogott: [C: +2] Nova vendordata.txt: fix up new VMs that have chrony installed [puppet] - https://gerrit.wikimedia.org/r/693152 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[13:00:07] hashar and dancy: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T1300).
[13:01:59] ^ it is still blocked
[13:02:03] bah
[13:03:07] (CR) Hashar: [C: +2] ActorStore: avoid throwing in case of invalid usernames [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/693029 (https://phabricator.wikimedia.org/T283167) (owner: Jforrester)
[13:03:17] (CR) Hashar: [C: +2] UploadFromStash: convert default user from false to null [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/693030 (https://phabricator.wikimedia.org/T283196) (owner: Jforrester)
[13:05:27] (Abandoned) Andrew Bogott: Install systemd-timesyncd on Bullseye and later [puppet] - https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801) (owner: Andrew Bogott)
[13:15:38] (CR) Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[13:16:06] (CR) Hnowlan: [C: +1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/693008 (https://phabricator.wikimedia.org/T261367) (owner: Clarakosi)
[13:17:02] (PS1) Jbond: P::puppetdb::microservice: add pki to acl for puppetdb microservice [puppet] - https://gerrit.wikimedia.org/r/693158
[13:22:44] (Merged) jenkins-bot: ActorStore: avoid throwing in case of invalid usernames [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/693029 (https://phabricator.wikimedia.org/T283167) (owner: Jforrester)
[13:24:22] (Merged) jenkins-bot: UploadFromStash: convert default user from false to null [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/693030 (https://phabricator.wikimedia.org/T283196) (owner: Jforrester)
[13:26:19] (PS1) Jbond: puppetdb: add site specific cnames for puppetdb [dns] - https://gerrit.wikimedia.org/r/693159 (https://phabricator.wikimedia.org/T283185)
[13:26:30] (PS2) Jbond: P::puppetdb::microservice: add pki to acl for puppetdb microservice [puppet] - https://gerrit.wikimedia.org/r/693158
(https://phabricator.wikimedia.org/T283185)
[13:26:34] SRE, Data-Persistence-Backup, SRE-swift-storage, Epic, Goal: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (LSobanski)
[13:26:41] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29625/console" [puppet] - https://gerrit.wikimedia.org/r/685743 (owner: MSantos)
[13:27:00] (CR) jerkins-bot: [V: -1] puppetdb: add site specific cnames for puppetdb [dns] - https://gerrit.wikimedia.org/r/693159 (https://phabricator.wikimedia.org/T283185) (owner: Jbond)
[13:35:42] (CR) Volans: "Should we maybe use a discovery name instead? I'm unsure." [dns] - https://gerrit.wikimedia.org/r/693159 (https://phabricator.wikimedia.org/T283185) (owner: Jbond)
[13:37:20] (CR) Volans: [V: +2 C: +2] Release v0.3.0 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/692932 (owner: Volans)
[13:37:26] SRE, CAS-SSO: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (MoritzMuehlenhoff)
[13:39:52] !log volans@deploy1002 Started deploy [debmonitor/deploy@444b931]: Release v0.3.0
[13:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:02] (PS2) Jbond: puppetdb: add site specific cnames for puppetdb [dns] - https://gerrit.wikimedia.org/r/693159 (https://phabricator.wikimedia.org/T283185)
[13:41:13] !log volans@deploy1002 Finished deploy [debmonitor/deploy@444b931]: Release v0.3.0 (duration: 01m 20s)
[13:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:00] !log hashar@deploy1002 Synchronized php-1.37.0-wmf.6/includes/user/ActorStore.php: ActorStore: avoid throwing in case of invalid usernames T283167 (duration: 01m 05s)
[13:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:08] T283167:
InvalidArgumentException: Unable to normalize the provided actor name x.y.z.v/16 - https://phabricator.wikimedia.org/T283167
[13:52:11] !log hashar@deploy1002 Synchronized php-1.37.0-wmf.6/includes/upload/UploadFromStash.php: UploadFromStash: convert default user from false to null - T283196 (duration: 01m 05s)
[13:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:15] T283196: TypeError: Argument 2 passed to UploadStash::__construct() must implement interface MediaWiki\User\UserIdentity or be null, boolean given, called in /srv/mediawiki/php-1.37.0-wmf.6/includes/upload/UploadFromStash.php on line 66 - https://phabricator.wikimedia.org/T283196
[13:54:03] group 1 promotion!
[13:55:13] (PS1) Hashar: group1 wikis to 1.37.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/693161
[13:55:15] (CR) Hashar: [C: +2] group1 wikis to 1.37.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/693161 (owner: Hashar)
[13:55:57] (Merged) jenkins-bot: group1 wikis to 1.37.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/693161 (owner: Hashar)
[13:56:00] (PS2) Muehlenhoff: Skip Cumin/Homer/Spicerack on cumin2001 [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589)
[13:56:20] (CR) Herron: [C: +1] icinga: move logmsgbot to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693138 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi)
[13:56:30] (CR) jerkins-bot: [V: -1] Skip Cumin/Homer/Spicerack on cumin2001 [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[13:57:13] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.6
[13:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:19] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.6 (duration: 01m 05s)
[13:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:36] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[13:58:53] (PS1) Kormat: mariadb: Use ROW binlog format for heartbeat on dbinventory. [puppet] - https://gerrit.wikimedia.org/r/693162
[13:59:36] (PS2) Kormat: mariadb: Use ROW binlog format for heartbeat on dbinventory. [puppet] - https://gerrit.wikimedia.org/r/693162
[14:00:51] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29626/console" [puppet] - https://gerrit.wikimedia.org/r/693162 (owner: Kormat)
[14:05:16] (PS3) Muehlenhoff: Skip Cumin/Homer/Spicerack on cumin2001 [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589)
[14:05:47] (CR) jerkins-bot: [V: -1] Skip Cumin/Homer/Spicerack on cumin2001 [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[14:07:57] (CR) Marostegui: [C: +1] mariadb: Use ROW binlog format for heartbeat on dbinventory. [puppet] - https://gerrit.wikimedia.org/r/693162 (owner: Kormat)
[14:08:20] (CR) Kormat: [V: +1 C: +2] mariadb: Use ROW binlog format for heartbeat on dbinventory. [puppet] - https://gerrit.wikimedia.org/r/693162 (owner: Kormat)
[14:09:17] (CR) Muehlenhoff: "CI ignores the lint::ignore.
I can either ignore CI's ignorance and +V2 or remove the MOTD again (after all cumin/spicerack won't be prese" [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[14:13:05] (PS1) Herron: librenms: move librenms-wmf to irc.libera.chat [puppet] - https://gerrit.wikimedia.org/r/693164 (https://phabricator.wikimedia.org/T283213)
[14:14:10] (CR) Volans: Skip Cumin/Homer/Spicerack on cumin2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[14:17:02] (PS1) Ayounsi: Add jstep to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/693166 (https://phabricator.wikimedia.org/T282521)
[14:17:43] (CR) jerkins-bot: [V: -1] Add jstep to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/693166 (https://phabricator.wikimedia.org/T282521) (owner: Ayounsi)
[14:20:06] (PS2) Ayounsi: Add jstep to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/693166 (https://phabricator.wikimedia.org/T282521)
[14:20:40] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/693166 (https://phabricator.wikimedia.org/T282521) (owner: Ayounsi)
[14:21:20] (PS1) Jbond: Nova vendordata.txt: delete systemd-coredump user [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801)
[14:22:04] (CR) Ayounsi: [C: +1] "LGTM, but I'm not sure it's still in use."
[puppet] - https://gerrit.wikimedia.org/r/693164 (https://phabricator.wikimedia.org/T283213) (owner: Herron)
[14:22:23] (CR) Ayounsi: [C: +2] Add jstep to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/693166 (https://phabricator.wikimedia.org/T282521) (owner: Ayounsi)
[14:27:03] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for JStephenson1980 - https://phabricator.wikimedia.org/T282521 (ayounsi) Apologies for the delay, you should be good to go! Let me know if you're having any issues.
[14:29:37] (CR) Jbond: "> Patch Set 1:" [dns] - https://gerrit.wikimedia.org/r/693159 (https://phabricator.wikimedia.org/T283185) (owner: Jbond)
[14:29:58] (CR) Volans: [C: +1] "LGTM, thanks!" [cookbooks] - https://gerrit.wikimedia.org/r/692993 (https://phabricator.wikimedia.org/T283204) (owner: Herron)
[14:30:51] (CR) Jbond: [C: +2] P::puppetdb::microservice: add pki to acl for puppetdb microservice [puppet] - https://gerrit.wikimedia.org/r/693158 (https://phabricator.wikimedia.org/T283185) (owner: Jbond)
[14:31:49] (CR) Volans: [C: -1] "Pending agreement/discussion on the related task."
[cookbooks] - https://gerrit.wikimedia.org/r/692992 (https://phabricator.wikimedia.org/T283204) (owner: Herron)
[14:33:54] (CR) Elukey: "Importing version 1.9.x might take time, it requires golang 1.15 (so bullseye golang docker images that we don't have yet) plus https://gi" [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[14:35:33] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Dsharpe)
[14:35:42] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Dsharpe)
[14:36:29] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Dsharpe)
[14:37:35] (CR) Andrew Bogott: [C: +2] Nova vendordata.txt: delete systemd-coredump user [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond)
[14:37:52] (CR) Jbond: (WIP) create a logout.d profile for managing logout scripts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693149 (owner: Jbond)
[14:37:54] (CR) Andrew Bogott: "Should this be coupled with chrony?" [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond)
[14:38:25] (CR) Filippo Giunchedi: [C: -1] "Not in use anymore, we should disable/comment the options though!"
[puppet] - https://gerrit.wikimedia.org/r/693164 (https://phabricator.wikimedia.org/T283213) (owner: Herron)
[14:38:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118', diff saved to https://phabricator.wikimedia.org/P16128 and previous config saved to /var/cache/conftool/dbconfig/20210520-143825-marostegui.json
[14:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:15] (CR) Jbond: "> Patch Set 1: -Code-Review" [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond)
[14:39:42] (PS1) Effie Mouzeli: Add kubernetes mwdebug user [labs/private] - https://gerrit.wikimedia.org/r/693169 (https://phabricator.wikimedia.org/T283056)
[14:40:13] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Dsharpe)
[14:40:30] SRE, SRE-Access-Requests: Allow JStephenson to access Superset - https://phabricator.wikimedia.org/T282515 (ayounsi)
[14:40:42] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for JStephenson1980 - https://phabricator.wikimedia.org/T282521 (ayounsi) Open→Resolved Other two deleted from LDAP.
[14:41:43] (CR) Andrew Bogott: [C: +1] wmcs: add cloudvirt drain cookbook [cookbooks] - https://gerrit.wikimedia.org/r/683370 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[14:48:33] (CR) Muehlenhoff: Nova vendordata.txt: delete systemd-coredump user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond)
[14:52:51] (CR) Jbond: Nova vendordata.txt: delete systemd-coredump user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond)
[14:54:12] (CR) Muehlenhoff: Skip Cumin/Homer/Spicerack on cumin2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[14:57:36] (CR) Muehlenhoff: Nova vendordata.txt: delete systemd-coredump user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693167 (https://phabricator.wikimedia.org/T280801) (owner: Jbond)
[14:57:58] (CR) Alexandros Kosiaris: [C: +1] Add kubernetes mwdebug user [labs/private] - https://gerrit.wikimedia.org/r/693169 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli)
[15:00:14] (CR) JMeybohm: [C: -1] "We tried to migrate everything to using nobody instead of special users, but I see this might not make sense here as you need a dedicated " (10 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[15:05:15] (PS1) Giuseppe Lavagetto: docker::baseimages: drop alpine support [puppet] - https://gerrit.wikimedia.org/r/693174
[15:05:17] (PS1) Giuseppe Lavagetto: docker::baseimages: add script to build debian-slim [puppet] - https://gerrit.wikimedia.org/r/693175 (https://phabricator.wikimedia.org/T281596)
[15:06:04] (CR) jerkins-bot: [V: -1] docker::baseimages: add script to build debian-slim [puppet] -
https://gerrit.wikimedia.org/r/693175 (https://phabricator.wikimedia.org/T281596) (owner: Giuseppe Lavagetto)
[15:06:38] (CR) Effie Mouzeli: [C: +2] Add kubernetes mwdebug user [labs/private] - https://gerrit.wikimedia.org/r/693169 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli)
[15:08:55] SRE, DBA: wmf-auto-reinstall fails on hosts that run pt-heartbeat - https://phabricator.wikimedia.org/T252528 (Kormat) Open→Resolved a:Kormat This is now fixed. Puppet will no longer start/stop heartbeat. That is managed by `db-switchover` when changing masters. This does mean that `pt-heartb...
[15:11:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (Cmjohnson) Received a new PCI card and the error returned immediately. I disabled the PCI-E slot 1 and the server boots fine. I do not see any need for that riser in the...
[15:11:23] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29627/console" [puppet] - https://gerrit.wikimedia.org/r/693174 (owner: Giuseppe Lavagetto)
[15:12:08] (CR) Effie Mouzeli: [V: +2 C: +2] Add kubernetes mwdebug user [labs/private] - https://gerrit.wikimedia.org/r/693169 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli)
[15:13:30] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (RobH)
[15:13:34] (CR) BryanDavis: [C: +1] "a copy of stashbot is now running in #wikimedia-operations@libera.chat" [puppet] - https://gerrit.wikimedia.org/r/693138 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi)
[15:13:47] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (RobH)
[15:15:12] (CR) Alexandros Kosiaris: [C: -1] "minor comment inline, but
otherwise yes, let's drop alpine" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/693174 (owner: 10Giuseppe Lavagetto) [15:15:27] (03PS1) 10Muehlenhoff: Add library hints for graphviz [puppet] - 10https://gerrit.wikimedia.org/r/693178 [15:17:03] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (10Cmjohnson) Thank you, I will get a ticket in with HPE ASAP [15:18:29] (03CR) 10Muehlenhoff: [C: 03+2] Add library hints for graphviz [puppet] - 10https://gerrit.wikimedia.org/r/693178 (owner: 10Muehlenhoff) [15:20:06] (03CR) 10Volans: Skip Cumin/Homer/Spicerack on cumin2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [15:21:22] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:25] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:36] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:39] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:15] !log installing graphviz security updates on buster
[15:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:29] SRE, wikimedia-irc-freenode: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (Bugreporter)
[15:24:03] moritzm: stashbot is now on libera
[15:24:32] SRE, Data-Persistence-Backup, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (faidon) p:Triage→High Given a) this was linked during budgeting in the context of of our cross-DC...
[15:30:28] jouncebot: now
[15:30:28] No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[15:31:05] jouncebot is now on libera.chat as well (and also here, but separate instances)
[15:31:48] !log [cloudelastic] `ryankemper@cloudelastic1003:~$ sudo systemctl restart *search*` to clear `Check systemd state` alert on `cloudelastic1003`
[15:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:06] Majavah: ah, missed that. Thanks!
[15:33:28] SRE, Data-Persistence-Backup, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) One of the things I raised to my manager is that this limitation means that, in the event of a c...
[15:33:53] <_joe_> ryankemper: you should move to the other network, read ops@ :)
[15:34:07] time to update the topic in here yet?
[15:34:12] _joe_: ack, catching up on the email and stuff now :)
[15:34:16] <_joe_> apergos: in a few :)
[15:34:19] :-)
[15:34:25] <_joe_> ryankemper: yeah I figured, it's early morning for you
[15:35:32] thanks for the heads up
[15:43:28] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:31] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:53] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:07] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:23] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:25] (CR) BBlack: [C: +1] Add zone for wikimedia-dns.org (Wikidough) [dns] - https://gerrit.wikimedia.org/r/692625 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[15:47:00] (CR) Giuseppe Lavagetto: [C: +2] icinga: move logmsgbot to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693138 (https://phabricator.wikimedia.org/T283213) (owner: Filippo Giunchedi)
[15:47:15] (PS2) Herron: librenms: remove librenms-wmf irc config [puppet] - https://gerrit.wikimedia.org/r/693164 (https://phabricator.wikimedia.org/T283213)
[15:47:31] (CR) Ssingh: [C: +2] Add zone for wikimedia-dns.org (Wikidough) [dns] - https://gerrit.wikimedia.org/r/692625 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[15:48:01] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:24] (Abandoned) Ssingh: aptrepo: add a component for knot-dnsutils [puppet] - https://gerrit.wikimedia.org/r/685571 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[15:53:05] (PS2) Ssingh: bird: add Wikidough's /24 to vips_filter (accept) [puppet] - https://gerrit.wikimedia.org/r/692367 (https://phabricator.wikimedia.org/T283027)
[15:54:06] (CR) Ssingh: "This is ready for review." [puppet] - https://gerrit.wikimedia.org/r/692367 (https://phabricator.wikimedia.org/T283027) (owner: Ssingh)
[15:56:35] (PS3) Ssingh: WIP: wikidough: update role to work towards anycast support [puppet] - https://gerrit.wikimedia.org/r/692368 (https://phabricator.wikimedia.org/T283027)
[15:56:54] (PS4) Ssingh: wikidough: update role to work towards anycast support [puppet] - https://gerrit.wikimedia.org/r/692368 (https://phabricator.wikimedia.org/T283027)
[15:58:49] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29628/console" [puppet] - https://gerrit.wikimedia.org/r/692368 (https://phabricator.wikimedia.org/T283027) (owner: Ssingh)
[15:59:28] (PS2) Ppchelko: DNM: Changes to use envoyproxy's image of envoy 1.18.3 [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[15:59:31] (CR) Ssingh: [V: +1] "Ready for review as well." [puppet] - https://gerrit.wikimedia.org/r/692368 (https://phabricator.wikimedia.org/T283027) (owner: Ssingh)
[16:00:04] jbond42 and cdanis: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T1600).
[16:01:02] :o two IRC networks means we get TWO jouncebot jokes per deployment window! <3
[16:06:58] <_joe_> Lucas_WMDE: not for long
[16:07:08] aww
[16:08:16] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/691154 (owner: Arturo Borrero Gonzalez)
[16:08:58] (CR) Filippo Giunchedi: "LGTM, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693164 (https://phabricator.wikimedia.org/T283213) (owner: Herron)
[16:11:07] (PS2) Effie Mouzeli: Add kubernetes mwdebug user [labs/private] - https://gerrit.wikimedia.org/r/693169 (https://phabricator.wikimedia.org/T283056)
[16:11:37] (CR) Effie Mouzeli: [V: +2 C: +2] Add kubernetes mwdebug user [labs/private] - https://gerrit.wikimedia.org/r/693169 (https://phabricator.wikimedia.org/T283056) (owner: Effie Mouzeli)
[16:12:09] (PS3) Herron: librenms: remove librenms-wmf irc config [puppet] - https://gerrit.wikimedia.org/r/693164 (https://phabricator.wikimedia.org/T283213)
[16:13:01] SRE, MW-on-K8s, serviceops, Patch-For-Review: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (jijiki)
[16:13:02] (CR) Filippo Giunchedi: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/693164 (https://phabricator.wikimedia.org/T283213) (owner: Herron)
[16:13:39] (CR) Herron: [C: +2] librenms: remove librenms-wmf irc config [puppet] - https://gerrit.wikimedia.org/r/693164 (https://phabricator.wikimedia.org/T283213) (owner: Herron)
[16:16:26] SRE, netops: Lumen 10G Wave (cr2-eqiad to cr2-esams) Down - https://phabricator.wikimedia.org/T283227 (cmooney) Came back up approx 40 mins ago: ` May 20 15:37:52 re0.cr2-eqiad mib2d[13184]: SNMP_TRAP_LINK_UP: ifIndex 660, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-4/1/3 ` ` cmooney@re0.cr2-eqia...
[16:18:37] (PS2) Razzi: site: configure dbstore1006 as insetup [puppet] - https://gerrit.wikimedia.org/r/693046 (https://phabricator.wikimedia.org/T283125)
[16:22:18] (CR) Marostegui: [C: +1] "+1 (I haven't checked if the MAC is correct though)" [puppet] - https://gerrit.wikimedia.org/r/693046 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[16:23:42] (PS1) Herron: librenms: remove librenms-ircbot service [puppet] - https://gerrit.wikimedia.org/r/693182
[16:29:05] (CR) Filippo Giunchedi: [C: +1] "LGTM, though the unit will need to be cleaned up manually on netmon1002" [puppet] - https://gerrit.wikimedia.org/r/693182 (owner: Herron)
[16:29:23] (CR) Herron: [C: +2] librenms: remove librenms-ircbot service [puppet] - https://gerrit.wikimedia.org/r/693182 (owner: Herron)
[16:54:08] (PS20) Elukey: Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[16:54:10] (PS2) Elukey: Add knative serving and net-istio images [docker-images/production-images] - https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194)
[16:54:46] (CR) Elukey: "Followed also what was suggested about the istio-proxy user!" (10 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[17:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T1700). Please do the needful.
[17:02:04] SRE, ops-eqiad, decommission-hardware: decommission labsdb1011.eqiad.wmnet - https://phabricator.wikimedia.org/T282524 (Cmjohnson)
[17:02:16] SRE, Analytics, LDAP-Access-Requests, SRE-Access-Requests: Account setup issues for jmixter-ctr - https://phabricator.wikimedia.org/T283250 (elukey) Had a chat with Jeff on Slack (together with @JAllemandou), and the account `JMixter (WMF)` seems not accessible on meta (maybe the password reset +...
[17:03:23] SRE, ops-eqiad, decommission-hardware: decommission labsdb1011.eqiad.wmnet - https://phabricator.wikimedia.org/T282524 (Cmjohnson) Open→Resolved a:wiki_willy→Cmjohnson
[17:03:56] SRE, ops-eqiad, Data-Services, decommission-hardware: decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (Cmjohnson) Open→Resolved All Decom tasks are complete
[17:04:11] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[17:04:15] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (Cmjohnson) Open→Resolved All decom tasks are complete.
[17:04:35] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[17:04:59] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (Cmjohnson) Open→Resolved All decom tasks are complete.
[17:06:44] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[17:06:51] (CR) Hnowlan: [V: +2 C: +2] ratelimiter: update to new upstream version [docker-images/production-images] - https://gerrit.wikimedia.org/r/692941 (https://phabricator.wikimedia.org/T246278) (owner: Ppchelko)
[17:07:09] SRE, ops-eqiad, decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (Cmjohnson) Open→Resolved All decom tasks are complete.
[17:07:53] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[17:08:16] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (Cmjohnson) Open→Resolved All decom tasks are complete
[17:08:26] SRE, ops-eqiad, decommission-hardware: decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522 (Cmjohnson) Open→Resolved All decom tasks are completed
[17:18:06] SRE, Analytics-Radar, LDAP-Access-Requests, SRE-Access-Requests: Account setup issues for jmixter-ctr - https://phabricator.wikimedia.org/T283250 (odimitrijevic)
[17:26:09] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (Cmjohnson)
[17:32:30] (PS5) Cwhite: logstash: update ES template to patch 2 [puppet] - https://gerrit.wikimedia.org/r/690538
[17:37:07] (CR) Cwhite: [C: +2] logstash: update ES template to patch 2 [puppet] - https://gerrit.wikimedia.org/r/690538 (owner: Cwhite)
[17:38:00] (PS3) Razzi: site: configure dbstore1006 as insetup [puppet] - https://gerrit.wikimedia.org/r/693046 (https://phabricator.wikimedia.org/T283125)
[17:40:31] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (Cmjohnson)
[17:41:29] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (Cmjohnson) a:Jclark-ctr→RobH @robh if you have time to do the installs that would be great, assign back to me if you're busy.
[17:47:33] (CR) Razzi: [C: +2] site: configure dbstore1006 as insetup [puppet] - https://gerrit.wikimedia.org/r/693046 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[17:53:05] (PS1) Kosta Harlan: Check if task is link-recommendation type before showing onboarding [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/693040 (https://phabricator.wikimedia.org/T282826)
[17:53:36] (PS1) Kosta Harlan: Check if task is link-recommendation type before showing onboarding [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/693041 (https://phabricator.wikimedia.org/T282826)
[17:54:01] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (Cmjohnson) a:Cmjohnson→RobH @robh same thing, this server is ready for install if you have the time.
[17:54:07] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (Cmjohnson)
[18:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T1800).
[18:00:04] kostajh: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:01:07] (CR) Urbanecm: [C: +2] Check if task is link-recommendation type before showing onboarding [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/693040 (https://phabricator.wikimedia.org/T282826) (owner: Kosta Harlan)
[18:01:09] (CR) Urbanecm: [C: +2] Check if task is link-recommendation type before showing onboarding [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/693041 (https://phabricator.wikimedia.org/T282826) (owner: Kosta Harlan)
[18:09:17] (PS1) Cwhite: logstash: allocate ecs shards to hdd nodes after one month [puppet] - https://gerrit.wikimedia.org/r/693198
[18:10:04] (PS1) Cwhite: logstash: allocate w3creportingapi shards older than 1 month to hdd nodes [puppet] - https://gerrit.wikimedia.org/r/693199
[18:15:19] (PS1) Hnowlan: New upstream envoy-future version 1.18.3 [docker-images/production-images] - https://gerrit.wikimedia.org/r/693200
[18:17:30] (CR) Clarakosi: [C: +1] New upstream envoy-future version 1.18.3 [docker-images/production-images] - https://gerrit.wikimedia.org/r/693200 (owner: Hnowlan)
[18:17:36] (PS1) Hnowlan: api-gateway: use envoy-future 1.18.3 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/693201
[18:19:44] (CR) Hnowlan: [V: +2 C: +2] New upstream envoy-future version 1.18.3 [docker-images/production-images] - https://gerrit.wikimedia.org/r/693200 (owner: Hnowlan)
[18:20:14] (CR) Clarakosi: [C: +1] api-gateway: use envoy-future 1.18.3 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/693201 (owner: Hnowlan)
[18:21:04] (CR) Hnowlan: [C: +2] api-gateway: use envoy-future 1.18.3 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/693201 (owner: Hnowlan)
[18:23:59] (Merged) jenkins-bot: api-gateway: use envoy-future 1.18.3 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/693201 (owner: Hnowlan)
[18:24:51] (Merged) jenkins-bot: Check if task is link-recommendation type before showing onboarding [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/693040 (https://phabricator.wikimedia.org/T282826) (owner: Kosta Harlan)
[18:25:06] (Merged) jenkins-bot: Check if task is link-recommendation type before showing onboarding [extensions/GrowthExperiments] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/693041 (https://phabricator.wikimedia.org/T282826) (owner: Kosta Harlan)
[18:38:17] (PS4) Clarakosi: api-gateway: Implement new ratelimit configurations from envoy 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/692404 (https://phabricator.wikimedia.org/T260591)
[18:42:12] (PS2) Clarakosi: api-gateway: Add default_value to dynamic_metadata if JWT is not set [deployment-charts] - https://gerrit.wikimedia.org/r/692714 (https://phabricator.wikimedia.org/T261350)
[18:43:23] (PS1) Ryan Kemper: cloudelastic: bump inactive shard alert threshold [puppet] - https://gerrit.wikimedia.org/r/693204 (https://phabricator.wikimedia.org/T283269)
[18:44:14] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/693204 (https://phabricator.wikimedia.org/T283269) (owner: Ryan Kemper)
[18:45:14] (PS2) Ryan Kemper: cloudelastic: bump inactive shard alert threshold [puppet] - https://gerrit.wikimedia.org/r/693204 (https://phabricator.wikimedia.org/T283269)
[18:45:32] (PS3) Ryan Kemper: cloudelastic: bump inactive shard alert threshold [puppet] - https://gerrit.wikimedia.org/r/693204 (https://phabricator.wikimedia.org/T283269)
[18:45:59] (PS2) Clarakosi: api-gateway: Replace echoapi with http-https-echo [deployment-charts] - https://gerrit.wikimedia.org/r/693008 (https://phabricator.wikimedia.org/T261367)
[18:46:07] (CR) jerkins-bot: [V: -1] cloudelastic: bump inactive shard alert threshold [puppet] - https://gerrit.wikimedia.org/r/693204 (https://phabricator.wikimedia.org/T283269) (owner: Ryan Kemper)
[18:47:20] (PS1) Ebernhardson: mjolnir bulk daemon: Add topic for hourly updates [puppet] - https://gerrit.wikimedia.org/r/693205 (https://phabricator.wikimedia.org/T261407)
[18:55:22] (Abandoned) Clarakosi: DNM: Changes to use envoyproxy's image of envoy 1.18.3 [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[18:55:44] (CR) Herron: [C: +2] sre.hosts.decommission: clarify "wipe bootloader" step [cookbooks] - https://gerrit.wikimedia.org/r/692993 (https://phabricator.wikimedia.org/T283204) (owner: Herron)
[18:55:47] (PS2) Clarakosi: Use envoy 1.16 nested json feature for access logging [deployment-charts] - https://gerrit.wikimedia.org/r/692709 (https://phabricator.wikimedia.org/T260820) (owner: Ppchelko)
[18:57:04] SRE, Analytics-Radar, LDAP-Access-Requests, SRE-Access-Requests: Account setup issues for jmixter-ctr - https://phabricator.wikimedia.org/T283250 (Aklapper) On-wiki SUL account on meta, mentioned by elukey: https://meta.wikimedia.org/wiki/Special:Log?page=User:JMixter_(WMF) looks correct to me. W...
[18:57:35] (PS2) Cwhite: logstash: allocate ecs shards to hdd nodes after one month [puppet] - https://gerrit.wikimedia.org/r/693198
[18:58:27] (PS2) Cwhite: logstash: allocate w3creportingapi shards older than 1 month to hdd nodes [puppet] - https://gerrit.wikimedia.org/r/693199
[18:59:10] (CR) Herron: "> Patch Set 1: Code-Review-1" [cookbooks] - https://gerrit.wikimedia.org/r/692992 (https://phabricator.wikimedia.org/T283204) (owner: Herron)
[18:59:41] (Merged) jenkins-bot: sre.hosts.decommission: clarify "wipe bootloader" step [cookbooks] - https://gerrit.wikimedia.org/r/692993 (https://phabricator.wikimedia.org/T283204) (owner: Herron)
[19:00:04] hashar and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T1900).
[19:03:07] (PS1) Hashar: all wikis to 1.37.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/693207
[19:03:09] (CR) Hashar: [C: +2] all wikis to 1.37.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/693207 (owner: Hashar)
[19:04:16] (Merged) jenkins-bot: all wikis to 1.37.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/693207 (owner: Hashar)
[19:07:52] (PS1) Tks4Fish: ptwiki: Add 'flow-delete' to 'eliminator' user group [mediawiki-config] - https://gerrit.wikimedia.org/r/693208 (https://phabricator.wikimedia.org/T283266)
[19:08:10] (CR) Ppchelko: "yeah, but I've added my stuff that is needed for 1.18.3 into this commit" [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:09:32] (CR) Urbanecm: [C: -2] "not now, see task. pending community confirmation." [mediawiki-config] - https://gerrit.wikimedia.org/r/693208 (https://phabricator.wikimedia.org/T283266) (owner: Tks4Fish)
[19:09:34] (CR) Clarakosi: "> Patch Set 2:" [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:09:36] (Restored) Clarakosi: DNM: Changes to use envoyproxy's image of envoy 1.18.3 [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:23:54] SRE, Analytics-Radar, LDAP-Access-Requests, SRE-Access-Requests: Account setup issues for jmixter-ctr - https://phabricator.wikimedia.org/T283250 (jmixter) I was able to create the Developer Account following the instructions. I guess I am confused about all of the various accounts I needed to se...
[19:27:00] (PS3) Ppchelko: DNM: Changes to use envoyproxy's image of envoy 1.18.3 [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:27:33] (PS4) Ppchelko: Changes to use envoyproxy's image of envoy 1.18.3 [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:31:10] (PS1) Ssingh: acme_chief: add certificates for wikimedia-dns.org [puppet] - https://gerrit.wikimedia.org/r/693210 (https://phabricator.wikimedia.org/T252132)
[19:31:59] (CR) Clarakosi: Changes to use envoyproxy's image of envoy 1.18.3 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:33:32] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29629/console" [puppet] - https://gerrit.wikimedia.org/r/693210 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[19:35:09] Lucas_WMDE: hi could use a hand if you are still around :]
[19:35:22] it is about https://phabricator.wikimedia.org/T283240 which is a follow up to a train blocker
[19:35:37] but it does not seem to be a blocker, just wanted to confirm it is indeed just a followup action for later :]
[19:35:52] (CR) Ppchelko: Changes to use envoyproxy's image of envoy 1.18.3 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:36:48] (CR) Clarakosi: [C: +1] Changes to use envoyproxy's image of envoy 1.18.3 [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:38:57] (CR) Ppchelko: [C: +2] Changes to use envoyproxy's image of envoy 1.18.3 [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:41:02] (Merged) jenkins-bot: Changes to use envoyproxy's image of envoy 1.18.3 [deployment-charts] - https://gerrit.wikimedia.org/r/692695 (owner: Clarakosi)
[19:42:29] SRE, Traffic, Patch-For-Review: Offer Wikidough as an anycasted service - https://phabricator.wikimedia.org/T283027 (ssingh)
[19:45:18] Lucas_WMDE: I just assumed it is a follow up and commented on the train blocker task ;] don't worry!
[19:45:19] (PS1) Herron: remove rescue boot dhcp entry for mwlog1001 [puppet] - https://gerrit.wikimedia.org/r/693211
[19:46:23] (CR) Herron: [C: +2] remove rescue boot dhcp entry for mwlog1001 [puppet] - https://gerrit.wikimedia.org/r/693211 (owner: Herron)
[19:49:13] hashar: yes that’s just a followup, shouldn’t block anything
[19:49:27] wasn’t sure how to attach it to which other tasks ^^
[19:49:42] (PS10) DCausse: rdf-streaming-updater: use session mode [deployment-charts] - https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: Mstyles)
[19:51:07] (Abandoned) Herron: sre.hosts.decommssion: use dd to zero the bootloader [cookbooks] - https://gerrit.wikimedia.org/r/692992 (https://phabricator.wikimedia.org/T283204) (owner: Herron)
[19:51:31] (CR) jerkins-bot: [V: -1] rdf-streaming-updater: use session mode [deployment-charts] - https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: Mstyles)
[19:52:42] Lucas_WMDE: it is fine :]
[19:52:53] Lucas_WMDE: thank you for confirming it at this late hour of the day!!!
[19:53:17] I’m just leaving my laptop running because the new IRC channels don’t have logging yet and I don’t want to miss stuff :'D
[19:53:25] hope the rest of the train goes well!
[19:54:40] SRE, decommission-hardware, observability, Patch-For-Review: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: `mwlog1001.eqiad.wmnet` - mwlog1001.eqiad.wmnet (**FAIL**) - **Failed dow...
[19:56:08] Lucas_WMDE: yeah it is all fine I am marking it solved right now! \o/
[20:13:27] \o/
[20:27:02] (PS5) Cwhite: logstash: replace ECS allow list with filter_on_template [puppet] - https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565)
[20:43:48] (PS1) Cwhite: logstash: bugfix filter to exclude hdd-allocated indexes [puppet] - https://gerrit.wikimedia.org/r/693213
[21:02:38] ops-eqiad, DC-Ops, Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (RobH)
[21:03:08] ops-eqiad, DC-Ops, Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (RobH)
[21:07:50] ops-eqiad, DC-Ops, Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (RobH)
[21:07:55] ops-eqiad, DC-Ops, Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (RobH) Please note the original ask for networking was: **Networking/Subnet/VLAN/IP:** Internal vlan, 10G for one host and 1G (for now) for the other. If the 10G connec...
[21:09:05] (PS1) Ottomata: Initial commit [debs/airflow] - https://gerrit.wikimedia.org/r/693216
[21:11:29] (CR) Ottomata: [V: +2 C: +2] Initial commit [debs/airflow] - https://gerrit.wikimedia.org/r/693216 (owner: Ottomata)
[21:15:07] (PS1) Ottomata: Add .gitreview [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/693221
[21:15:50] (PS2) Ottomata: Add .gitreview [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/693221
[21:16:29] (CR) Ottomata: [V: +2 C: +2] Add .gitreview [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/693221 (owner: Ottomata)
[21:17:30] (PS1) Ottomata: Initial debianization and 2.0.2-1~py3.7 release [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/693222 (https://phabricator.wikimedia.org/T277012)
[21:25:31] (PS2) Krinkle: trafficserver: Remove X-Request-Id from response headers unless debug [puppet] - https://gerrit.wikimedia.org/r/676682 (https://phabricator.wikimedia.org/T210484)
[21:27:21] SRE, Traffic, Performance-Team (Radar): Strip new X-Request-Id header from non-debug responses - https://phabricator.wikimedia.org/T283291 (Krinkle)
[21:27:24] (PS3) Krinkle: trafficserver: Remove X-Request-Id from response headers unless debug [puppet] - https://gerrit.wikimedia.org/r/676682 (https://phabricator.wikimedia.org/T283291)
[21:28:31] (PS2) Ottomata: Initial debianization and 2.0.2-1~py3.7 release [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/693222 (https://phabricator.wikimedia.org/T277012)
[21:42:05] (PS3) Ottomata: Initial debianization and 2.0.2-1~py3.7 release [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/693222 (https://phabricator.wikimedia.org/T277012)
[21:43:36] (PS1) Zabe: [doc] switching from freenode to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693223
[21:46:13] (PS4) Ottomata: Initial debianization and 2.0.2-1~py3.7 release [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/693222 (https://phabricator.wikimedia.org/T277012)
[22:11:26] (PS2) Zabe: [doc] switching from freenode to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693223 (https://phabricator.wikimedia.org/T283247)
[22:12:58] (CR) Legoktm: [C: -1] [doc] switching from freenode to libera.chat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693223 (https://phabricator.wikimedia.org/T283247) (owner: Zabe)
[22:15:56] (PS3) Zabe: [doc] switching from freenode to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693223 (https://phabricator.wikimedia.org/T283247)
[22:16:09] (CR) Zabe: [doc] switching from freenode to libera.chat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/693223 (https://phabricator.wikimedia.org/T283247) (owner: Zabe)
[22:28:37] (PS1) Razzi: netboot: Change dbstore1006 netboot.cfg to partman/custom/db.cfg [puppet] - https://gerrit.wikimedia.org/r/693224 (https://phabricator.wikimedia.org/T283125)
[22:30:48] (PS2) Razzi: netboot: Change dbstore1006 netboot.cfg to partman/custom/db.cfg [puppet] - https://gerrit.wikimedia.org/r/693224 (https://phabricator.wikimedia.org/T283125)
[22:35:43] (CR) Razzi: [C: +2] netboot: Change dbstore1006 netboot.cfg to partman/custom/db.cfg [puppet] - https://gerrit.wikimedia.org/r/693224 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[22:59:26] (CR) Krinkle: [C: +1] [doc] switching from freenode to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693223 (https://phabricator.wikimedia.org/T283247) (owner: Zabe)
[23:00:05] brennen: #bothumor My software never has bugs. It just develops random features. Rise for US Backport and Config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210520T2300).
[23:01:06] (CR) Bstorm: [C: +2] "First sync is done. Now I'm going to try this until cut over https://puppet-compiler.wmflabs.org/compiler1001/29630/cloudstore1009.wikimed" [puppet] - https://gerrit.wikimedia.org/r/690783 (https://phabricator.wikimedia.org/T224747) (owner: Bstorm)
[23:07:23] (CR) Legoktm: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/693223 (https://phabricator.wikimedia.org/T283247) (owner: Zabe)
[23:17:57] (PS1) Razzi: site: add role for dbstore1006 [puppet] - https://gerrit.wikimedia.org/r/693230 (https://phabricator.wikimedia.org/T283125)
[23:21:28] (CR) Razzi: [C: +2] site: add role for dbstore1006 [puppet] - https://gerrit.wikimedia.org/r/693230 (https://phabricator.wikimedia.org/T283125) (owner: Razzi)
[23:23:21] (CR) Dzahn: [C: +2] [doc] switching from freenode to libera.chat [puppet] - https://gerrit.wikimedia.org/r/693223 (https://phabricator.wikimedia.org/T283247) (owner: Zabe)
[23:24:11] (CR) Dzahn: "tried to merge without noticing it was already done :)" [puppet] - https://gerrit.wikimedia.org/r/693223 (https://phabricator.wikimedia.org/T283247) (owner: Zabe)
[23:33:54] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RobH) Open→Resolved wdqs2007 raid fully rebuilt and system is online. I set to staged in netbox, when its added back into ful...
[23:41:26] (PS1) Legoktm: codesearch: Use our own hound image [puppet] - https://gerrit.wikimedia.org/r/693233 (https://phabricator.wikimedia.org/T243380)
[23:42:12] (CR) Legoktm: [C: +2] codesearch: Use our own hound image [puppet] - https://gerrit.wikimedia.org/r/693233 (https://phabricator.wikimedia.org/T243380) (owner: Legoktm)