[00:37:48] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:38:57] Operations, MW-on-K8s, TechCom-RFC, serviceops, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) >>! In T260330#6468248, @Legoktm wrote: > I didn't see any shell pipelines in your caller survey and can't think of...
[00:45:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:21] Operations, Wikimedia-Mailing-lists: Mailing list for local development discussion - https://phabricator.wikimedia.org/T263216 (jeena)
[00:50:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:06:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:09:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:11:24] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:22:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:24:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:26:30] (PS1) Ryan Kemper: cloudelastic: use envoy to mitigate tls latency [puppet] - https://gerrit.wikimedia.org/r/628243 (https://phabricator.wikimedia.org/T263073)
[01:26:46] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:28:40] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:29:35] (CR) Ryan Kemper: "OPEN QUESTIONS" [puppet] - https://gerrit.wikimedia.org/r/628243 (https://phabricator.wikimedia.org/T263073) (owner: Ryan Kemper)
[01:30:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:31:11] (CR) Ryan Kemper: "@Giuseppe:" [puppet] - https://gerrit.wikimedia.org/r/628243 (https://phabricator.wikimedia.org/T263073) (owner: Ryan Kemper)
[01:32:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:34:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:36:26] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:46:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:47:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:49:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:51:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 69 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:51:50] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:56:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 65 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:12:34] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 75 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:15:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:16:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:19:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:24:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:36:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:36:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 65 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:36:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:38:34] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:42:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:46:04] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 67 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:57:54] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 64 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:05:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:07:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:18:04] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 66 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:19:08] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:21:04] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:21:36] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:22:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:28:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:36:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:42:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:42:28] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:46:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:47:06] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 62 probes of 568 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:08:19] (PS1) Marostegui: db1131: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/628250 (https://phabricator.wikimedia.org/T262901)
[05:15:52] !log Restart wikibugs
[05:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2018, es2012 after cloning es2029 and es2030 T261717', diff saved to https://phabricator.wikimedia.org/P12641 and previous config saved to /var/cache/conftool/dbconfig/20200918-052608-marostegui.json
[05:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:17] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:28:13] (CR) Marostegui: [C: +2] db1131: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/628250 (https://phabricator.wikimedia.org/T262901) (owner: Marostegui)
[05:29:01] (PS1) Marostegui: instances.yaml: Add es2029 and es2030 [puppet] - https://gerrit.wikimedia.org/r/628251 (https://phabricator.wikimedia.org/T261717)
[05:30:30] (CR) Marostegui: [C: +2] instances.yaml: Add es2029 and es2030 [puppet] - https://gerrit.wikimedia.org/r/628251 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[05:36:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add es2029 and es2030 to dbctl depooled - T261717', diff saved to https://phabricator.wikimedia.org/P12642 and previous config saved to /var/cache/conftool/dbconfig/20200918-053604-marostegui.json
[05:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:09] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:36:46] (CR) Effie Mouzeli: "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/628153 (owner: Dzahn)
[05:37:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2018, es2012 after cloning es2029 and es2030 T261717', diff saved to https://phabricator.wikimedia.org/P12643 and previous config saved to /var/cache/conftool/dbconfig/20200918-053758-marostegui.json
[05:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:17] (PS1) Elukey: profile::hue: add new alarms for Hue 4 [puppet] - https://gerrit.wikimedia.org/r/628252
[05:40:24] (CR) Elukey: [C: +2] profile::hue: add new alarms for Hue 4 [puppet] - https://gerrit.wikimedia.org/r/628252 (owner: Elukey)
[05:55:26] (PS1) Elukey: profile::hadoop::common: create the ssl directory if not present [puppet] - https://gerrit.wikimedia.org/r/628253
[05:56:28] (CR) jerkins-bot: [V: -1] profile::hadoop::common: create the ssl directory if not present [puppet] - https://gerrit.wikimedia.org/r/628253 (owner: Elukey)
[05:57:17] (PS2) Elukey: profile::hadoop::common: create the ssl directory if not present [puppet] - https://gerrit.wikimedia.org/r/628253
[05:57:50] (PS6) Rosalie Perside (WMDE): Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: Guergana Tzatchkova)
[05:58:46] (CR) Elukey: [C: +2] profile::hadoop::common: create the ssl directory if not present [puppet] - https://gerrit.wikimedia.org/r/628253 (owner: Elukey)
[06:01:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2018, es2012 after cloning es2029 and es2030 T261717', diff saved to https://phabricator.wikimedia.org/P12644 and previous config saved to /var/cache/conftool/dbconfig/20200918-060103-marostegui.json
[06:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:13] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:03:34] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) Open→Resolved Closing per the internal email thread. If this happens again we'll reopen and contact Dell again.
[06:03:40] Operations, ops-codfw, DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui) >>! In T262247#6442725, @Papaul wrote: > The log on says "It has been corrected by h/w and requires no further action" so i don't think this will be enough to replace the memory because it is not...
[06:06:33] (PS1) Elukey: role::analytics_test_cluster::hadoop::ui: add missing hiera param [puppet] - https://gerrit.wikimedia.org/r/628254
[06:07:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1131 after rack move', diff saved to https://phabricator.wikimedia.org/P12645 and previous config saved to /var/cache/conftool/dbconfig/20200918-060724-marostegui.json
[06:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:38] (CR) Elukey: [C: +2] role::analytics_test_cluster::hadoop::ui: add missing hiera param [puppet] - https://gerrit.wikimedia.org/r/628254 (owner: Elukey)
[06:08:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106 after MCR changes', diff saved to https://phabricator.wikimedia.org/P12646 and previous config saved to /var/cache/conftool/dbconfig/20200918-060815-marostegui.json
[06:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es2018, es2012 after cloning es2029 and es2030 T261717', diff saved to https://phabricator.wikimedia.org/P12647 and previous config saved to /var/cache/conftool/dbconfig/20200918-062127-marostegui.json
[06:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:32] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:39:24] PROBLEM - kubelet operational latencies on kubernetes2010 is CRITICAL: instance=kubernetes2010.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:42:44] RECOVERY - kubelet operational latencies on kubernetes2010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:48:32] (PS1) Elukey: profile::hue: adjust settings for the new python3.7 alerts (hue 4) [puppet] - https://gerrit.wikimedia.org/r/628290
[06:50:08] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:50:28] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:50:48] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:51:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:52:08] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/628290 (owner: Elukey)
[06:52:30] so the above seems related to the GTT transport from eqiad to knams
[06:53:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:53:13] but I don't see maintenance scheduled
[06:56:10] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:57:52] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:58:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:58:10] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:58:32] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200918T0700)
[07:00:41] (discussion about the links down are in #sre)
[07:05:26] Operations, Wikimedia-Mailing-lists: Wikimedia-RU mailing list page has wrong encoding - https://phabricator.wikimedia.org/T135226 (ssr) Web interface for reading mails. E. g. we take the letter with Cyrillic letters in URL: https://lists.wikimedia.org/pipermail/wikimedia-ru/2020-September/005396.html —...
[07:05:55] (PS2) Muehlenhoff: Add grafana-rw to cache config [puppet] - https://gerrit.wikimedia.org/r/627772 (https://phabricator.wikimedia.org/T262512)
[07:07:14] (CR) Muehlenhoff: "The grafana.discovery.wmnet cert was extended with the grafana-rw.w.o altname yesterday." [puppet] - https://gerrit.wikimedia.org/r/627772 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff)
[07:12:08] !log draining kubestage1001 for kernel upgrade - T262527
[07:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:13] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[07:14:53] !log push pfw policies - T263168
[07:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:21] !log installing xdg-utils security updates
[07:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:59] (CR) Elukey: [C: +2] profile::hue: adjust settings for the new python3.7 alerts (hue 4) [puppet] - https://gerrit.wikimedia.org/r/628290 (owner: Elukey)
[07:36:04] (PS1) Elukey: profile::hadoop::common: add missing config to local puppet ssl [puppet] - https://gerrit.wikimedia.org/r/628294
[07:38:07] (CR) Elukey: [C: +2] profile::hadoop::common: add missing config to local puppet ssl [puppet] - https://gerrit.wikimedia.org/r/628294 (owner: Elukey)
[07:38:40] Operations, netops: Set the same OSPF weight on eqiad/codfw wavelenghts - https://phabricator.wikimedia.org/T263230 (ayounsi) p:Triage→High
[07:39:05] (PS4) Gehel: Extracting obvious reporting code to a Reporter class. [software/cumin] - https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783)
[07:39:12] Operations, netops: Set the same OSPF weight on eqiad/codfw wavelenghts - https://phabricator.wikimedia.org/T263230 (ayounsi)
[07:39:14] (CR) Gehel: Extracting obvious reporting code to a Reporter class. (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:39:28] (CR) Gehel: Extracting obvious reporting code to a Reporter class. (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:41:28] (PS1) Elukey: profile::hadoop::common: remove duplicate variable definition [puppet] - https://gerrit.wikimedia.org/r/628295
[07:46:05] (PS2) Elukey: profile::hadoop::common: remove duplicate variable definition [puppet] - https://gerrit.wikimedia.org/r/628295
[07:46:15] (CR) Elukey: [C: +2] profile::hadoop::common: remove duplicate variable definition [puppet] - https://gerrit.wikimedia.org/r/628295 (owner: Elukey)
[07:49:39] Operations, User-MoritzMuehlenhoff: Review lists of config/sysctl recommendations by "kernel self-protection project" - https://phabricator.wikimedia.org/T142984 (Aklapper)
[07:53:08] Operations, Citoid, Wikimedia-Logstash, serviceops, Platform Engineering (Icebox): Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (Aklapper)
[07:53:11] Operations, ops-codfw, decommission-hardware: decommission wmf6412 - https://phabricator.wikimedia.org/T261968 (Aklapper)
[07:55:57] (PS2) Jcrespo: cli: Make /etc/wmfbackups the config dir for the main backup scripts [software/wmfbackups] - https://gerrit.wikimedia.org/r/628168 (https://phabricator.wikimedia.org/T138562)
[07:57:34] Operations, ops-codfw, DC-Ops, User-jijiki: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (Aklapper)
[07:58:12] (PS3) Jcrespo: cli: Make /etc/wmfbackups the config dir for the main backup scripts [software/wmfbackups] - https://gerrit.wikimedia.org/r/628168 (https://phabricator.wikimedia.org/T138562)
[08:16:36] !log reinstalling stat1004 with Buster
[08:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db2124 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12648 and previous config saved to /var/cache/conftool/dbconfig/20200918-082223-kormat.json
[08:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:29] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[08:22:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:24:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:25:28] !log reboot kubestage1001 for clean state testing - T262527
[08:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:32] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[08:25:33] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:26] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:16] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[08:40:15] (CR) JMeybohm: [C: +2] Use Kernel 4.19 on staging cluster nodes [puppet] - https://gerrit.wikimedia.org/r/627868 (https://phabricator.wikimedia.org/T262527) (owner: JMeybohm)
[08:43:20] !log reboot kubestage1001 for kernel upgrade - T262527
[08:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:25] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[08:43:26] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:55] !log kormat@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 20%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12650 and previous config saved to /var/cache/conftool/dbconfig/20200918-084554-kormat.json
[08:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:59] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[08:47:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:22] (PS1) Elukey: Allow port 443 in term apt for analytics-in4 and in6 [homer/public] - https://gerrit.wikimedia.org/r/628300
[08:48:36] XioNoX: --^ if you have a sec
[08:49:22] elukey: what are you pointing to? :)
[08:49:34] ahhh right you don't see it, I always forget
[08:49:35] sorry
[08:49:39] https://gerrit.wikimedia.org/r/628300
[08:49:41] :)
[08:50:12] (CR) Muehlenhoff: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/628300 (owner: Elukey)
[08:51:33] ok I guess I can proceed, there seems no maintenance on the routers so I am also going to commit
[08:52:03] XioNoX: green light? (sorry we are in the middle of a reimage that is stuck :( )
[08:52:24] elukey: one sec
[08:52:47] elukey: +1
[08:53:09] (CR) Ayounsi: [C: +1] Allow port 443 in term apt for analytics-in4 and in6 [homer/public] - https://gerrit.wikimedia.org/r/628300 (owner: Elukey)
[08:53:11] XioNoX: <3
[08:53:20] (CR) Elukey: [C: +2] Allow port 443 in term apt for analytics-in4 and in6 [homer/public] - https://gerrit.wikimedia.org/r/628300 (owner: Elukey)
[08:54:18] !log change analytics-in4/in6 filters on cr1/cr2 after https://gerrit.wikimedia.org/r/628300
[08:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:53] (PS4) Arturo Borrero Gonzalez: openstack: rocky/buster: use more modern netfilter components [puppet] - https://gerrit.wikimedia.org/r/627773 (https://phabricator.wikimedia.org/T262979)
[08:55:27] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[08:56:14] !log reboot kubestage1001 for clean state - T262527
[08:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:19] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[08:56:21] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:24] godog: ^
[08:59:56] (PS3) Kormat: bsection: Script for binary-searching log files. [puppet] - https://gerrit.wikimedia.org/r/627841
[09:00:47] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25184/" [puppet] - https://gerrit.wikimedia.org/r/627773 (https://phabricator.wikimedia.org/T262979) (owner: Arturo Borrero Gonzalez)
[09:00:48] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 40%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12651 and previous config saved to /var/cache/conftool/dbconfig/20200918-090058-kormat.json
[09:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:02] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[09:02:33] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-10-18 09:02:07 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[09:03:25] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-10-18 09:02:07 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[09:12:57] (PS1) Arturo Borrero Gonzalez: openstack: rocky/buster: also pin other related packages required by modern iptables [puppet] - https://gerrit.wikimedia.org/r/628302 (https://phabricator.wikimedia.org/T262979)
[09:13:35] (CR) jerkins-bot: [V: -1] openstack: rocky/buster: also pin other related packages required by modern iptables [puppet] - https://gerrit.wikimedia.org/r/628302 (https://phabricator.wikimedia.org/T262979) (owner: Arturo Borrero Gonzalez)
[09:14:42] (PS2) Arturo Borrero Gonzalez: openstack: rocky/buster: also pin other related packages required by iptables [puppet] - https://gerrit.wikimedia.org/r/628302 (https://phabricator.wikimedia.org/T262979)
[09:16:02] !log kormat@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 60%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12652 and previous config saved to /var/cache/conftool/dbconfig/20200918-091601-kormat.json
[09:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:07] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[09:18:55] (PS3) Arturo Borrero Gonzalez: openstack: rocky/buster: also pin other related packages required by iptables [puppet] - https://gerrit.wikimedia.org/r/628302 (https://phabricator.wikimedia.org/T262979)
[09:22:20] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[09:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:30] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:05] !log kormat@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 80%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12653 and previous config saved to /var/cache/conftool/dbconfig/20200918-093105-kormat.json
[09:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:10] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[09:34:43] (PS4) Arturo Borrero Gonzalez: openstack: rocky/buster: fixes for iptables updates [puppet] - https://gerrit.wikimedia.org/r/628302 (https://phabricator.wikimedia.org/T262979)
[09:43:47] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1001/25189/" [puppet] - https://gerrit.wikimedia.org/r/628302 (https://phabricator.wikimedia.org/T262979) (owner: Arturo Borrero Gonzalez)
[09:45:58] (PS1) Effie Mouzeli: push-notifications: deploy to production environment [deployment-charts] - https://gerrit.wikimedia.org/r/628304 (https://phabricator.wikimedia.org/T256973)
[09:46:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12654 and previous config saved to /var/cache/conftool/dbconfig/20200918-094608-kormat.json
[09:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:14] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[09:46:57] !log uncordoned kubestage1001 - T262527
[09:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:00] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[09:47:08] !log deleting some random pods in kubernetes staging to rebalance load back on kubestage1001 - T262527
[09:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:46] !log deployed hotfix for T263063 to phab1001
[09:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:50] T263063: Phabricator global search: "Cannot use object of type PhutilSafeHTML as array" error for certain strings - https://phabricator.wikimedia.org/T263063
[09:48:34] (CR) jerkins-bot: [V: -1] push-notifications: deploy to production environment [deployment-charts] - https://gerrit.wikimedia.org/r/628304 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[09:50:46] (CR) Alexandros Kosiaris: [C: +1] urldownloader: convert A record to CNAME (1 comment) [dns] - https://gerrit.wikimedia.org/r/628102 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[09:53:51] (PS2) Effie Mouzeli: push-notifications: deploy to production environment [deployment-charts] -
10https://gerrit.wikimedia.org/r/628304 (https://phabricator.wikimedia.org/T256973) [09:55:29] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [09:55:30] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:55] !log kormat@cumin1001 dbctl commit (dc=all): 'db2087:3316 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12655 and previous config saved to /var/cache/conftool/dbconfig/20200918-095554-kormat.json [09:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:00] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [09:57:06] (03CR) 10Hnowlan: [C: 03+2] api-gateway: allow mwdebug hosts in calico [deployment-charts] - 10https://gerrit.wikimedia.org/r/628127 (https://phabricator.wikimedia.org/T262396) (owner: 10Hnowlan) [09:59:10] (03Merged) 10jenkins-bot: api-gateway: allow mwdebug hosts in calico [deployment-charts] - 10https://gerrit.wikimedia.org/r/628127 (https://phabricator.wikimedia.org/T262396) (owner: 10Hnowlan) [10:03:07] (03CR) 10Effie Mouzeli: [V: 03+2] "We have a bug that sometimes causes build failures, it built fine here: https://integration.wikimedia.org/ci/job/helm-lint/2574/console" [deployment-charts] - 10https://gerrit.wikimedia.org/r/628304 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [10:10:13] (03PS1) 10Kormat: Exclude /cover dir from debian source tarball. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/628306 [10:11:33] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/628304 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [10:11:35] !log kormat@cumin1001 dbctl commit (dc=all): 'db2087:3316 (re)pooling @ 25%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12656 and previous config saved to /var/cache/conftool/dbconfig/20200918-101135-kormat.json [10:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:40] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:12:05] (03CR) 10Kormat: [C: 03+2] Exclude /cover dir from debian source tarball. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/628306 (owner: 10Kormat) [10:13:06] (03Merged) 10jenkins-bot: Exclude /cover dir from debian source tarball. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/628306 (owner: 10Kormat) [10:13:24] 10Operations, 10Domains, 10Traffic: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (10Ijon) Given the answer above, can progress be made on this? @croslof, @akosiaris [10:18:16] 10Operations, 10Domains, 10Traffic: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (10jcrespo) For the SRE side, regarding DNS, @bblack is probably the best person to be notified here. [10:24:19] (03PS1) 10Kormat: Prepare for 0.5 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/628307 [10:25:52] Hey effie! Regarding https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/628304/ can we also bump the version of the image after a couple of patches we merged? 
[10:26:36] whatever works for you, we only want the -production for the version that will end up in production [10:26:39] !log kormat@cumin1001 dbctl commit (dc=all): 'db2087:3316 (re)pooling @ 50%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12657 and previous config saved to /var/cache/conftool/dbconfig/20200918-102638-kormat.json [10:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:47] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:27:16] cool [10:27:18] nemo-yiannis: since there is too much bot traffic here, mind if we move this to #mediawiki-serviceops? [10:27:32] it will be easier to offer future support there as well [10:27:40] * kormat hugs the bots. please don't mind effie [10:27:46] sure [10:27:51] tx tx [10:28:15] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:09] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [10:30:59] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] push-notifications: deploy to production environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/628304 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [10:31:57] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[10:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:25] (03PS1) 10Jgiannelos: Bump push-notifications image to version 2020-09-17-171128-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/628309 [10:33:31] (03Merged) 10jenkins-bot: push-notifications: deploy to production environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/628304 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [10:34:23] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:26] (03PS2) 10Kormat: Prepare for 0.5 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/628307 [10:35:51] !log jiji@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [10:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:52] (03CR) 10Kormat: [C: 03+2] Prepare for 0.5 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/628307 (owner: 10Kormat) [10:39:48] (03Merged) 10jenkins-bot: Prepare for 0.5 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/628307 (owner: 10Kormat) [10:39:58] (03CR) 10Jgiannelos: [C: 04-1] Bump push-notifications image to version 2020-09-17-171128-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/628309 (owner: 10Jgiannelos) [10:41:42] !log kormat@cumin1001 dbctl commit (dc=all): 'db2087:3316 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12658 and previous config saved to /var/cache/conftool/dbconfig/20200918-104141-kormat.json [10:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:47] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:44:54] (03CR) 10Jcrespo: "yay!!! 
:-) Thank you!" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/628307 (owner: 10Kormat) [10:45:12] !log jiji@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [10:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:41] (03PS2) 10Jgiannelos: Bump push-notifications image to version 2020-09-17-171128-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/628309 [10:50:27] (03PS3) 10Jgiannelos: Bump push-notifications image to version 2020-09-18-103454-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/628309 [10:56:45] !log kormat@cumin1001 dbctl commit (dc=all): 'db2087:3316 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12659 and previous config saved to /var/cache/conftool/dbconfig/20200918-105645-kormat.json [10:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:52] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [11:12:40] (03CR) 10Jgiannelos: [C: 03+2] Bump push-notifications image to version 2020-09-18-103454-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/628309 (owner: 10Jgiannelos) [11:12:44] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [11:15:00] (03Merged) 10jenkins-bot: Bump push-notifications image to version 2020-09-18-103454-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/628309 (owner: 10Jgiannelos) [11:15:29] !log kormat@cumin1001 dbctl commit (dc=all): 'db2089:3316 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12660 and previous config saved to /var/cache/conftool/dbconfig/20200918-111529-kormat.json [11:15:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:34] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [11:33:51] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [11:35:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2125', diff saved to https://phabricator.wikimedia.org/P12661 and previous config saved to /var/cache/conftool/dbconfig/20200918-113509-marostegui.json [11:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:24] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:39:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:47:16] (03PS1) 10Marostegui: db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/628312 (https://phabricator.wikimedia.org/T263244) [11:48:00] (03CR) 10Marostegui: [C: 03+2] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/628312 (https://phabricator.wikimedia.org/T263244) (owner: 10Marostegui) [11:54:37] !log kormat@cumin1001 dbctl commit (dc=all): 'db2089:3316 (re)pooling @ 25%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12662 and previous config saved to /var/cache/conftool/dbconfig/20200918-115437-kormat.json [11:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:42] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [11:56:49] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: 
job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [11:57:14] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page [11:58:43] 10Operations, 10netops: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (10MoritzMuehlenhoff) I have upgraded the existing 1.1.4 package to 1.1.7, I needed to fix up all patches for 1.1.7 (except one for luajit, which was obsoleted by upstream dropping support for luajit in 1.1.5 "Disab... [12:01:30] PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary outbound port utilisation over 80% #page (cr2-codfw.wikimedia.org) https://bit.ly/wmf-librenms [12:03:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page [12:03:26] RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://bit.ly/wmf-librenms [12:05:51] (03PS1) 10Gehel: Enable cumin EventHandler to disable output. [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) [12:08:24] (03CR) 10jerkins-bot: [V: 04-1] Enable cumin EventHandler to disable output. [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [12:08:25] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10Patch-For-Review, and 2 others: Beta cluster is down - https://phabricator.wikimedia.org/T178841 (10hashar) [12:09:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Other / Uncategorized): Investigate what caused the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197 (10hashar) 05Open→03Declined Not much we can... 
[12:09:41] !log kormat@cumin1001 dbctl commit (dc=all): 'db2089:3316 (re)pooling @ 50%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12663 and previous config saved to /var/cache/conftool/dbconfig/20200918-120940-kormat.json [12:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:46] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:13:28] (03CR) 10Ema: [C: 03+1] "Excellent!" [puppet] - 10https://gerrit.wikimedia.org/r/627772 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [12:20:27] (03PS2) 10Gehel: Enable cumin EventHandler to disable output. [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) [12:23:13] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page [12:24:44] !log kormat@cumin1001 dbctl commit (dc=all): 'db2089:3316 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12664 and previous config saved to /var/cache/conftool/dbconfig/20200918-122444-kormat.json [12:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:50] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:26:46] PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary inbound port utilisation over 80% #page (cr2-eqiad.wikimedia.org) https://bit.ly/wmf-librenms [12:26:59] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [12:28:14] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page [12:28:48] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Mojibake on Mailman - 
https://phabricator.wikimedia.org/T263248 (10jhsoby) [12:28:51] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Mojibake on Mailman - https://phabricator.wikimedia.org/T263248 (10jhsoby) @Ladsgroup I see in T52864 that you're involved in upgrading the lists. Do you have any idea what's causing this? [12:39:48] !log kormat@cumin1001 dbctl commit (dc=all): 'db2089:3316 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12665 and previous config saved to /var/cache/conftool/dbconfig/20200918-123947-kormat.json [12:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:53] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:41:08] !log reimaging db2125 T263244 [12:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:12] T263244: Reimage and reclone db2125 - https://phabricator.wikimedia.org/T263244 [12:45:39] I'm disabling the inbound/outbound port utilization alert [12:45:46] the circuit is flirting with the 8Gbps [12:46:11] XioNoX: I'm about to take several gpbs of traffic off the link [12:46:14] RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://bit.ly/wmf-librenms [12:47:27] cdanis: re-enabling Swift in eqiad? 
[12:47:46] XioNoX: yes [12:48:59] !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad [12:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:11] !log kormat@cumin2001 START - Cookbook sre.hosts.downtime [13:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:27] !log kormat@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:48] (03CR) 10Jcrespo: [C: 03+2] remote_backup: Instead of using a preassigned port, autoselect one [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/626172 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:08:07] (03CR) 10Jcrespo: "This change is ready for review." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623756 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [13:08:24] (03CR) 10Jcrespo: [C: 03+2] Add WMFBackup package creation [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623756 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [13:08:35] (03CR) 10Jcrespo: [C: 03+2] cli: Make /etc/wmfbackups the config dir for the main backup scripts [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/628168 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:12:11] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [13:15:46] PROBLEM - k8s API server requests latencies on argon is CRITICAL: instance=10.64.32.133 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:17:09] RECOVERY - k8s API server requests latencies on argon is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:24:32] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [13:33:13] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [13:47:57] (03PS1) 10Elukey: profile::hadoop::spark2: use the package resource instead of require_package() [puppet] - 10https://gerrit.wikimedia.org/r/628330 (https://phabricator.wikimedia.org/T255028) [13:50:43] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [13:55:42] 10Operations, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Performance Issue: Issues with purgeUnusedProjects.php cron job on mwmaint1002 (Fri Oct 26, 2018) - https://phabricator.wikimedia.org/T208231 (10Aklapper) [13:56:15] ACKNOWLEDGEMENT - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query Herron prometheus5001 being prepped for cutover (not yet in production) https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [13:56:15] ACKNOWLEDGEMENT - Thanos sidecar cannot connect to Prometheus on icinga1001 is CRITICAL: cluster=prometheus instance=prometheus5001 job=thanos-sidecar prometheus=ops site=eqsin Herron prometheus5001 being prepped for cutover (not yet in production) https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [14:01:55] (03PS1) 10Effie Mouzeli: push-notifications: enable egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/628336 
(https://phabricator.wikimedia.org/T256973) [14:02:13] PROBLEM - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [14:03:13] (03PS1) 10Ladsgroup: labs: Turn on termbox v2 on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628337 (https://phabricator.wikimedia.org/T261488) [14:03:29] RECOVERY - Thanos sidecar cannot connect to Prometheus on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [14:03:32] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25191/" [puppet] - 10https://gerrit.wikimedia.org/r/628330 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey) [14:04:07] RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [14:04:10] (03CR) 10JMeybohm: [C: 03+1] push-notifications: enable egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/628336 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:05:15] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Mholloway) What will be the internal URL for this service? I am guessing `https://push-notifications.... 
[14:05:53] (03CR) 10Ladsgroup: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628337 (https://phabricator.wikimedia.org/T261488) (owner: 10Ladsgroup) [14:06:38] (03Merged) 10jenkins-bot: labs: Turn on termbox v2 on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628337 (https://phabricator.wikimedia.org/T261488) (owner: 10Ladsgroup) [14:07:32] (03PS1) 10Hashar: gerrit: dump heap on out of memory error [puppet] - 10https://gerrit.wikimedia.org/r/628338 (https://phabricator.wikimedia.org/T263008) [14:08:43] ^ rebased on deploy1001 [14:10:50] (03PS5) 10Muehlenhoff: Manage /etc/apt/sources.list via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) [14:10:52] (03PS6) 10Muehlenhoff: Manage /etc/apt/sources.list via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) [14:12:00] (03PS2) 10Hashar: gerrit: dump heap on out of memory error [puppet] - 10https://gerrit.wikimedia.org/r/628338 (https://phabricator.wikimedia.org/T263008) [14:13:14] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: labs: Turn on termbox v2 on desktop for wikidatawiki -- noop for production, sanity sync (T261488) (duration: 01m 00s) [14:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:19] T261488: trial new termbox on desktop on a test system - https://phabricator.wikimedia.org/T261488 [14:13:27] 10Operations, 10netops: cr1-codfw<->cr1-eqiad link saturation - https://phabricator.wikimedia.org/T263206 (10CDanis) for posterity: repooling swift@eqiad took 3.5Gbit/s off of the codfw->eqiad path. there's a much longer discussion (recorded in #wikimedia-sre logs) about discussing edge-egress-to-backhaul byt... [14:13:30] (03CR) 10Hashar: "Note: needs the service to be restarted after puppet ran." 
[puppet] - 10https://gerrit.wikimedia.org/r/628338 (https://phabricator.wikimedia.org/T263008) (owner: 10Hashar) [14:13:36] wikibugs is like [14:13:39] 10 minutes behind the times?? [14:13:47] 12 [14:15:04] (03PS7) 10Hnowlan: api-gateway: migrate to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/627250 [14:15:37] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: labs: Turn on termbox v2 on desktop for wikidatawiki -- noop for production, sanity sync (T261488) (duration: 00m 56s) [14:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:56] cdanis: yeah, it has been misbehaving lately, I had to restart it yesterday and today [14:19:46] (03PS1) 10MSantos: push-notifications: change version tag to -production [deployment-charts] - 10https://gerrit.wikimedia.org/r/628340 (https://phabricator.wikimedia.org/T256973) [14:20:51] (03CR) 10Hnowlan: [C: 03+2] api-gateway: migrate to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/627250 (owner: 10Hnowlan) [14:22:17] (03CR) 10Papaul: [C: 03+2] maps: add partman configuration for newer maps servers. [puppet] - 10https://gerrit.wikimedia.org/r/628089 (https://phabricator.wikimedia.org/T260271) (owner: 10Gehel) [14:22:31] (03CR) 10Ppchelko: "woohooo! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/627250 (owner: 10Hnowlan) [14:22:34] (03PS2) 10Papaul: maps: add partman configuration for newer maps servers. [puppet] - 10https://gerrit.wikimedia.org/r/628089 (https://phabricator.wikimedia.org/T260271) (owner: 10Gehel) [14:22:41] (03CR) 10Papaul: [V: 03+2 C: 03+2] maps: add partman configuration for newer maps servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/628089 (https://phabricator.wikimedia.org/T260271) (owner: 10Gehel) [14:23:17] (03Merged) 10jenkins-bot: api-gateway: migrate to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/627250 (owner: 10Hnowlan) [14:25:38] 10Operations, 10ops-codfw, 10serviceops: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (10Papaul) 05Open→03Declined Declined since it is a duplicate. [14:25:42] 10Operations, 10ops-codfw: ps1-a8-codfw WebUI unresponsive - https://phabricator.wikimedia.org/T263001 (10Papaul) working on getting a local FTP to upload the firmware to the PDU. It will be one sometimes next week [14:27:35] (03PS1) 10Mholloway: Echo: Set up the push notifier type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) [14:27:37] (03PS1) 10Mholloway: Echo: Enable push on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628342 (https://phabricator.wikimedia.org/T262936) [14:27:39] (03PS1) 10Mholloway: Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) [14:28:27] (03CR) 10jerkins-bot: [V: 04-1] Echo: Enable push on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628342 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [14:28:30] (03CR) 10Mholloway: [C: 04-2] "Hold for scheduled deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [14:28:32] (03CR) 10jerkins-bot: [V: 04-1] Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [14:28:36] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy 
push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) @Mholloway it will be accessible on Monday after we deploy the LVS/DNS patches. Meanwhile you... [14:28:38] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Papaul) @Marostegui any day that works for you works for me as well [14:28:46] (03CR) 10Mholloway: [C: 04-2] "Hold for scheduled deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628342 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [14:28:53] (03CR) 10Mholloway: [C: 04-2] "Hold for scheduled deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [14:28:58] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/25190/" [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) (owner: 10Muehlenhoff) [14:29:59] (03PS7) 10Muehlenhoff: Manage /etc/apt/sources.list via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) [14:30:53] RECOVERY - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [14:31:52] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Marostegui) Thank you @Papaul - I will have it ready by Monday [14:34:48] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps, 10Patch-For-Review: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2005.codfw.wmnet ` The l... 
[14:38:08] (03PS2) 10Mholloway: Echo: Set up common push settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) [14:38:10] (03PS2) 10Mholloway: Echo: Enable push on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628342 (https://phabricator.wikimedia.org/T262936) [14:38:12] (03PS2) 10Mholloway: Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) [14:38:41] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10MSantos) [14:42:17] (03CR) 10Mholloway: [C: 04-2] Echo: Set up common push settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628341 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [14:44:51] 10Operations, 10ops-eqiad: Check jumbo1008.eqiad.wmnet PSU redundancy reported as critical - https://phabricator.wikimedia.org/T263262 (10klausman) [14:45:49] 10Operations, 10ops-eqiad: Check jumbo1008.eqiad.wmnet PSU redundancy reported as critical - https://phabricator.wikimedia.org/T263262 (10klausman) [14:47:05] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) On Monday (EU) morning, @JMeybohm and I will push the LVS/DNS patches, so everything will be r... [14:52:37] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on stat1004 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, flush_l1d} https://wikitech.wikimedia.org/wiki/Microcode [14:53:31] (03CR) 10Muehlenhoff: "Obsolete since an-tool1009 has been moved to CAS instead." 
[puppet] - 10https://gerrit.wikimedia.org/r/617385 (owner: 10Muehlenhoff) [14:53:43] (03Abandoned) 10Muehlenhoff: Enable CAS for Hue [puppet] - 10https://gerrit.wikimedia.org/r/617385 (owner: 10Muehlenhoff) [14:53:56] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps, 10Patch-For-Review: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2005.codfw.wmnet'] ` Of which those **FAILED**: ` ['maps2005.codfw.wmne... [14:54:14] (03PS2) 10Muehlenhoff: Retire stub firejail code in service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/622350 [14:57:29] (03PS1) 10Gehel: maps: fix typo in glob exrpession for maps netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/628348 (https://phabricator.wikimedia.org/T260271) [14:57:46] (03CR) 10Dzahn: [C: 03+2] gerrit: dump heap on out of memory error [puppet] - 10https://gerrit.wikimedia.org/r/628338 (https://phabricator.wikimedia.org/T263008) (owner: 10Hashar) [14:58:12] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Mojibake on Mailman - https://phabricator.wikimedia.org/T263248 (10Aklapper) [14:58:15] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10Aklapper) [14:59:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/628348 (https://phabricator.wikimedia.org/T260271) (owner: 10Gehel) [15:00:28] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps, 10Patch-For-Review: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Papaul) Still having partman recipe problem maybe because of this line ` maps[12]00[1-4]*) echo partman/standard.cfg partman/raid10-... 
[15:01:13] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10Aklapper) [15:02:38] jouncebot: next [15:02:38] In 15 hour(s) and 57 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200919T0700) [15:05:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:44] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10JGHowes) Please restore the color highlighting as was in the previous OTRS version. It's removal from OTRS 6.0 makes it more difficult to... [15:09:07] !log restarting gerrit service to apply gerrit::628338 to make it dump heap if out of memory (T263008) [15:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:13] T263008: Gerrit out of heap - https://phabricator.wikimedia.org/T263008 [15:10:21] gerrit back [15:10:53] should now dump heap in /srv/gerrit if it runs out of memory again [15:13:01] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/628338 (https://phabricator.wikimedia.org/T263008) (owner: 10Hashar) [15:21:09] (03CR) 10Elukey: "Answering to comments and sending the first set of fixes, will also check the other ones pointed out in the previous comment from Riccardo" (038 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/626380 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [15:21:35] (03PS2) 10Elukey: Add basic debian packaging [software/pywmflib] - 10https://gerrit.wikimedia.org/r/626380 (https://phabricator.wikimedia.org/T257905) [15:22:05] (03PS2) 10Gehel: maps: fix typo in 
glob exrpession for maps netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/628348 (https://phabricator.wikimedia.org/T260271) [15:22:49] (03CR) 10Gehel: [C: 03+2] maps: fix typo in glob exrpession for maps netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/628348 (https://phabricator.wikimedia.org/T260271) (owner: 10Gehel) [15:26:01] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Mojibake on Mailman - https://phabricator.wikimedia.org/T263248 (10Dzahn) > the problem arose some time between May 2019 and August 2020 – I wish I could be more specific. This would match the upgrading of the mailman server to the newer Debian distro versio... [15:28:00] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10Dzahn) imported comment from T263248: > the problem arose some time between May 2019 and August 2020 – I... [15:31:23] (03CR) 10Effie Mouzeli: [C: 04-2] "Let's wait and make sure we need this before merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/628336 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [15:32:11] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10NoFWDaddress) @JGHowes please sse T263243. Note that the new version was in test for all before the upgrade and that this issue could hav... [15:42:19] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10Aklapper) Wondering if backporting https://gitlab.com/mailman/mailman/-/commit/761c268bb7c7c7b91d3f962e5ca4... 
[15:50:46] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) >>! In T187984#6474618, @JGHowes wrote: > Please restore the color highlighting as was in the previous OTRS version. It's remov... [16:12:22] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) > that will probably not be developed As a small correction, instead of "that will probably not be developed" something more lik... [16:16:36] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2005.codfw.wmnet ` The log can be found in `/v... [16:21:30] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:46] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10NoFWDaddress) @jcrespo : By experience, those kind of features will not see light in our era for OTRS since more urgent "features" (like T... [16:22:49] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) Yeah, not disagreeing, in fact supporting that ticket. My stress was on that it was not Alex's decision to remove it. :-) Cheers. 
[16:23:34] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:02] (03Abandoned) 10Hnowlan: api-gateway: Make JWT issuer configurable. [deployment-charts] - 10https://gerrit.wikimedia.org/r/622799 (https://phabricator.wikimedia.org/T235277) (owner: 10Hnowlan) [16:45:41] (03PS1) 10Hnowlan: api-gateway: remove straggler config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/628403 [16:50:47] (03CR) 10Hnowlan: [C: 03+2] api-gateway: remove straggler config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/628403 (owner: 10Hnowlan) [16:52:55] (03Merged) 10jenkins-bot: api-gateway: remove straggler config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/628403 (owner: 10Hnowlan) [16:56:48] (03PS14) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [16:58:33] (03PS2) 10Dzahn: openstack: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/627966 (https://phabricator.wikimedia.org/T209953) [17:00:52] (03CR) 10Elukey: "So without the dh_auto_clean override I get:" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/626380 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [17:03:03] 👋 just got a librenms page, I guess it's the one from yesterday and the ack expired [17:07:02] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Allow Nicholas Skaggs to issue icinga commands - https://phabricator.wikimedia.org/T263191 (10Dzahn) This is usually done without separate access request for all users who have root shell. The difference here would just be "prod root" vs. "wmcs / cloud... 
[17:09:51] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2005.codfw.wmnet ` The log can be found in `/v... [17:11:55] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for local development discussion - https://phabricator.wikimedia.org/T263216 (10Dzahn) //You have successfully created the mailing list local-dev and notification has been sent to the list owner jhuneidi@wikimedia.org. You can now:// [[ https://lists.wikim... [17:13:52] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2005.codfw.wmnet'] ` and were **ALL** successful. [17:15:41] !log lists1001 - apt-get install pwgen to generate passwords (this was installed on previous list server but apparently not puppetized, puppet patch coming up) [17:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:58] (03PS1) 10Hnowlan: api-gateway: add routing for static and other components [deployment-charts] - 10https://gerrit.wikimedia.org/r/628408 (https://phabricator.wikimedia.org/T263045) [17:22:27] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for local development discussion - https://phabricator.wikimedia.org/T263216 (10Dzahn) Hi @jeena @bbearnes, the new list has been created. The thing is that at list creation you can only enter a single initial admin (Jeena). So what I did was: - create n... 
[17:23:43] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for local development discussion - https://phabricator.wikimedia.org/T263216 (10Dzahn) 05Open→03Resolved a:03Dzahn [17:24:54] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (10Dzahn) a:03Adamant.pwn [17:30:56] 10Operations, 10Wikidata, 10Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (10Dzahn) a:03Lydia_Pintscher assigning to Lydia as she is the list admin and can change it per T262773#6464825 I think that's all that is needed to re... [17:36:48] (03CR) 10Ppchelko: "that's what you get for having the portal site and apis on the same host." [deployment-charts] - 10https://gerrit.wikimedia.org/r/628408 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan) [17:37:15] (03CR) 10Ppchelko: [C: 03+1] api-gateway: add routing for static and other components [deployment-charts] - 10https://gerrit.wikimedia.org/r/628408 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan) [17:42:19] (03PS1) 10Dzahn: mailman: require package pwgen to create random passwords [puppet] - 10https://gerrit.wikimedia.org/r/628412 [17:43:24] (03PS2) 10Dzahn: mailman: require package pwgen to create random passwords [puppet] - 10https://gerrit.wikimedia.org/r/628412 [17:43:55] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/628412 (owner: 10Dzahn) [17:44:23] (03CR) 10RLazarus: [C: 03+1] mailman: require package pwgen to create random passwords [puppet] - 10https://gerrit.wikimedia.org/r/628412 (owner: 10Dzahn) [17:44:50] (03CR) 10Dzahn: [C: 03+2] "fast reviews like that are awesome 😊" [puppet] - 10https://gerrit.wikimedia.org/r/628412 (owner: 10Dzahn) [17:46:40] (03CR) 10Dzahn: "now installed on lists1001 by puppet" [puppet] - 10https://gerrit.wikimedia.org/r/628412 (owner: 10Dzahn) [17:49:30] 10Operations, 10Traffic, 10netops, 10Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (10CDanis) p:05Triage→03Medium [17:50:33] 10Operations, 10netops: cr1-codfw<->cr1-eqiad link saturation - https://phabricator.wikimedia.org/T263206 (10CDanis) This particular issue is resolved for now, and the action items and other ideas spawned in the discussion of it will be tracked as sub-tasks of {T263275} [17:50:50] 10Operations, 10netops: cr1-codfw<->cr1-eqiad link saturation - https://phabricator.wikimedia.org/T263206 (10CDanis) 05Open→03Resolved a:03CDanis [18:00:02] 10Operations, 10ops-eqiad: Check jumbo1008.eqiad.wmnet PSU redundancy reported as critical - https://phabricator.wikimedia.org/T263262 (10wiki_willy) a:03Cmjohnson [18:04:37] (03CR) 10Dzahn: [V: 04-1 C: 04-1] "previous issue fixed but still more here: https://puppet-compiler.wmflabs.org/compiler1003/25192/bast3004.wikimedia.org/change.bast3004.wi" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [18:07:10] 10Operations, 10Traffic, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10CDanis) [18:08:20] 10Operations, 10Traffic, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10CDanis) p:05Triage→03Medium [18:10:29] (03PS1) 10Herron: graphite-carbon: disable internal log rotation and use logrotate [puppet] - 
10https://gerrit.wikimedia.org/r/628423 (https://phabricator.wikimedia.org/T263103) [18:10:56] !log Removed stale `wikidatardf-dumps` crontab entry from `dumpsgen@snapshot1008`, stored backup of previous state of crontab in the (admittedly verbose) `/tmp/dumpsgen_crontab_before_removing_stale_wikidata_dump_entry_see_gerrit_puppet_patch_622342` [18:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:40] 10Operations, 10Domains, 10Traffic: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (10Ijon) Thanks, @jcrespo! [18:26:16] (03CR) 10Herron: "currently fails like this https://puppet-compiler.wmflabs.org/compiler1003/25194/graphite1004.eqiad.wmnet/change.graphite1004.eqiad.wmnet." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628423 (https://phabricator.wikimedia.org/T263103) (owner: 10Herron) [18:26:38] 10Operations, 10ops-codfw, 10decommission-hardware: decommission wmf6412 - https://phabricator.wikimedia.org/T261968 (10wiki_willy) a:03Papaul [18:37:22] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Papaul) hey guys please look at the output below and see if the all looks good on maps2005 so i can resume the install on the other nodes on Monday. Thank... 
[18:38:36] !log `sudo kill 126121 126122 126124 126128 249520 249521 254016 254027` on `snapshot1008` to terminate wikidata dump jobs that are in a bad state [18:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 3.37e+04 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:43:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:44:42] !log `sudo kill 254017 254018 254028 254029` to kill some dangling serdi / gzip processes, all the wikidata cleanup should be complete [18:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:28] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:28] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:41] 10Operations, 10ops-codfw, 10decommission-hardware: decommission wmf6412 - https://phabricator.wikimedia.org/T261968 (10Papaul) [18:53:21] 10Operations, 10ops-codfw, 10decommission-hardware: decommission wmf6412 - https://phabricator.wikimedia.org/T261968 (10Papaul) 05Open→03Resolved [20:17:13] (03CR) 10Ppchelko: [C: 03+1] "heh, I was thinking about https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/listener/v3/listener_components.proto#config-listener-" [deployment-charts] - 10https://gerrit.wikimedia.org/r/628408 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan) [20:20:53] (03CR) 
10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [20:26:07] 10Operations, 10Traffic, 10netops: experiment with reënabling compression between applayer's TLS terminators and edge caches - https://phabricator.wikimedia.org/T263288 (10CDanis) [20:28:52] (03PS2) 10CRusnov: base/check_systemd_state.py: Switch header to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364) [20:32:37] (03PS3) 10CRusnov: modules/service/files/logstash_checker.py: Move to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) [20:33:47] (03Abandoned) 10DannyS712: abusefilter.php: Remove settings that duplicate defaults, and clean up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965) (owner: 10DannyS712) [20:35:36] (03PS3) 10CRusnov: modules/admin/data/nda_audit.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/624112 (https://phabricator.wikimedia.org/T247364) [20:45:07] 10Operations, 10netops: Set the same OSPF weight on eqiad/codfw wavelenghts - https://phabricator.wikimedia.org/T263230 (10CDanis) [20:45:09] 10Operations, 10Traffic, 10netops, 10Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (10CDanis) [20:45:12] 10Operations, 10netops: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (10CDanis) [20:52:16] 10Operations, 10Analytics, 10Traffic, 10netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (10CDanis) [21:00:39] (03CR) 10Dzahn: [C: 03+2] "I compiled this on everything. NOOP on *. 
https://puppet-compiler.wmflabs.org/compiler1002/25193/" [puppet] - 10https://gerrit.wikimedia.org/r/627966 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:06:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 138 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:06:41] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:08:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:08:53] 10Operations, 10Traffic: experiment with a "unified" ATS-BE pool - https://phabricator.wikimedia.org/T263291 (10CDanis) [21:14:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 1222 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:16:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:18:11] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 23597 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:38:35] (03PS9) 10CDanis: WIP: serve NEL headers on group0 [puppet] - 
10https://gerrit.wikimedia.org/r/627629 [21:39:56] (03CR) 10Bstorm: [C: 03+1] ""is_primary_server" is likely to confuse someone in the future who thinks that means the other server is standby (which is not true, it's " [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [21:41:19] (03PS10) 10CDanis: WIP: serve NEL headers on group0 [puppet] - 10https://gerrit.wikimedia.org/r/627629 [21:42:16] (03PS7) 10Dzahn: dumps: rename the do_acme parameter and lookup [puppet] - 10https://gerrit.wikimedia.org/r/624328 [21:43:21] (03CR) 10Dzahn: [C: 03+2] "thank you very much, also for updating docs. i'll merge it then and confirm on the hosts" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [21:45:08] (03PS11) 10CDanis: Serve Network Error Logging headers on group0 [puppet] - 10https://gerrit.wikimedia.org/r/627629 (https://phabricator.wikimedia.org/T257527) [21:45:20] (03CR) 10Bstorm: [C: 03+1] "I think this looks good now. It's honestly a really scary one because a change in the NFS mounts that doesn't go well will cause all NFS m" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [21:45:39] (03CR) 10CDanis: "This is ready for review! I'd like to deploy Monday." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627629 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [21:45:57] (03CR) 10Dzahn: "noop confirmed on labstore1006/1007" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [21:46:30] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [21:47:10] (03CR) 10Dzahn: "ACK, confirmed. this will not merge on a Friday. 
thank you" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [21:48:32] !log changed password for Millennium bug@ptwiki [21:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:07] (03CR) 10Hashar: "Danke Schon ;)" [puppet] - 10https://gerrit.wikimedia.org/r/628338 (https://phabricator.wikimedia.org/T263008) (owner: 10Hashar) [22:13:32] (03CR) 10Dzahn: "reduced the number of hiera() lines across the whole repo by 32%" [puppet] - 10https://gerrit.wikimedia.org/r/627966 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:14:38] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (10Dzahn) The patch above reduced the number of hiera() lines across the whole puppet repo by 32%. [22:26:29] (03PS1) 10Dzahn: wmcs::postgres: hiera->lookup and add data types [puppet] - 10https://gerrit.wikimedia.org/r/628459 [22:30:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Wikimedia-production-error: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount - https://phabricator.wikimedia.org/T263132 (10jeena) Various jobenqueue errors happened today in the past 6 hours with spikes of 1... [22:33:12] (03PS1) 10Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628460 [22:34:11] (03CR) 10jerkins-bot: [V: 04-1] cache::ssl::unified: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628460 (owner: 10Dzahn) [22:34:33] 10Operations, 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Wikimedia-production-error: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount - https://phabricator.wikimedia.org/T263132 (10thcipriani) p:05High→03Unbreak! >>! In T263132#6475784, @jeena wrote: > Various... 
[22:39:48] (03PS1) 10Dzahn: nutcracker: hiera-lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628461 [22:48:08] (03PS1) 10Dzahn: yubiauth: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628462 [23:07:15] (03PS1) 10Dzahn: phabricator: add mysql port to user check script [puppet] - 10https://gerrit.wikimedia.org/r/628464 [23:08:23] (03CR) 10Dzahn: [C: 03+2] "it's just a script run by humans, not influencing prod phabricator in any way" [puppet] - 10https://gerrit.wikimedia.org/r/628464 (owner: 10Dzahn) [23:12:42] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:14:49] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={2,4} prometheus=ops site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster= [23:14:49] -topic=All&var-consumer_group=All [23:20:03] (03CR) 10Krinkle: "Ping :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622863 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [23:21:03] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/628423 (https://phabricator.wikimedia.org/T263103) (owner: 10Herron) [23:51:45] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All