[00:04:24] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.3625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[00:12:32] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:13:44] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:18:22] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 49 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:19:32] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:39:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:40:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:28] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:36] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:00:24] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 49 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:10:10] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:24:10] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (KFrancis) @danshick-wmde Hi Dan, Please provide me with the following information: -Full legal name -Mailing address -Email address -Specifics about the type of se...
[01:27:28] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:48:09] (CR) Jforrester: [C: -2] "We're removing these files from this repo and putting them in scap proper; they'll be in the next release of scap. Let's continue this the" [mediawiki-config] - https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107) (owner: Krinkle)
[01:50:34] PROBLEM - Check systemd state on db1141 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:54:06] RECOVERY - Check systemd state on db1141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:52] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:03:38] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:30:26] RECOVERY - Check the last execution of mediawiki_job_cirrus_build_completion_indices_codfw on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_codfw https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:31:14] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:20] RECOVERY - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:08:30] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:14:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:15:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:17:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:28:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:30:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:00:28] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:01:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:02:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:06:16] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:06:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:08:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:15:44] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:18:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:21:32] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:24:06] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:56:02] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:56:16] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:01:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:04:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:06:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:08:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:52] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:46:41] these --^ should be the Telia maintenance scheduled
[05:59:46] (PS1) Elukey: Prepare druid1005 for reimage [puppet] - https://gerrit.wikimedia.org/r/602552 (https://phabricator.wikimedia.org/T253980)
[06:01:13] (CR) Elukey: [C: +2] Prepare druid1005 for reimage [puppet] - https://gerrit.wikimedia.org/r/602552 (https://phabricator.wikimedia.org/T253980) (owner: Elukey)
[06:15:38] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[06:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:25] (PS3) Elukey: turnilo: move functionalities to the proxy profile [puppet] - https://gerrit.wikimedia.org/r/602440 (https://phabricator.wikimedia.org/T253294)
[06:27:27] (PS7) Elukey: Add Turnilo to the staging environment on an-tool1007 [puppet] - https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294)
[06:37:54] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:44] PROBLEM - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:45:32] (CR) Elukey: [C: +2] turnilo: move functionalities to the proxy profile [puppet] - https://gerrit.wikimedia.org/r/602440 (https://phabricator.wikimedia.org/T253294) (owner: Elukey)
[06:46:14] ACKNOWLEDGEMENT - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ayounsi https://phabricator.wikimedia.org/T254436 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:49] ACKNOWLEDGEMENT - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad Ayounsi https://phabricator.wikimedia.org/T254436 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:50:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:13] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:57:31] (PS1) Elukey: Rename superset staging to ui staging in hiera [labs/private] - https://gerrit.wikimedia.org/r/602594
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200605T0700)
[07:00:45] (CR) Elukey: [V: +2 C: +2] Rename superset staging to ui staging in hiera [labs/private] - https://gerrit.wikimedia.org/r/602594 (owner: Elukey)
[07:06:20] (PS1) Privacybatm: wmfmariadbpy: Remove transferpy package [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256)
[07:07:37] (PS1) Marostegui: install_server: Do not reimage db2071 [puppet] - https://gerrit.wikimedia.org/r/602596
[07:07:53] (PS2) Marostegui: install_server: Do not reimage db1091 [puppet] - https://gerrit.wikimedia.org/r/602596
[07:08:29] (CR) Elukey: [C: +2] Add Turnilo to the staging environment on an-tool1007 [puppet] - https://gerrit.wikimedia.org/r/602371 (https://phabricator.wikimedia.org/T253294) (owner: Elukey)
[07:09:19] (CR) Marostegui: [C: +2] install_server: Do not reimage db1091 [puppet] - https://gerrit.wikimedia.org/r/602596 (owner: Marostegui)
[07:09:47] elukey: can I merge your changes?
[07:11:53] marostegui: yes I was fixing one thing in private, please go ahead
[07:11:58] ok!
[07:12:09] elukey: done!
[07:13:18] thanks!
[07:19:45] (PS1) Elukey: profile::hadoop::yarn_proxy: remove java depedencies [puppet] - https://gerrit.wikimedia.org/r/602597
[07:20:41] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[07:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:11] (PS2) Dzahn: add IPv6 records for people2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/598968
[07:26:02] (CR) Elukey: [C: +2] profile::hadoop::yarn_proxy: remove java depedencies [puppet] - https://gerrit.wikimedia.org/r/602597 (owner: Elukey)
[07:29:29] (PS1) Dzahn: add IPv6 records for peek2001 [dns] - https://gerrit.wikimedia.org/r/602598 (https://phabricator.wikimedia.org/T252210)
[07:31:08] (PS1) Elukey: Include profile::java in analytics::hadoop::ui roles [puppet] - https://gerrit.wikimedia.org/r/602600
[07:33:23] (CR) Dzahn: "@chasemp fyi. i meant to add this right after creating the VM and then forgot. it should be standard nowadays" [dns] - https://gerrit.wikimedia.org/r/602598 (https://phabricator.wikimedia.org/T252210) (owner: Dzahn)
[07:34:03] (PS2) Dzahn: add IPv6 records for peek2001 [dns] - https://gerrit.wikimedia.org/r/602598 (https://phabricator.wikimedia.org/T252210)
[07:40:51] (CR) Marostegui: install_server: Allow reuse of partitions during reimage. (4 comments) [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: Kormat)
[07:41:00] (PS1) Dzahn: peek: add data types [puppet] - https://gerrit.wikimedia.org/r/602602
[07:41:19] (CR) Elukey: [C: +2] Include profile::java in analytics::hadoop::ui roles [puppet] - https://gerrit.wikimedia.org/r/602600 (owner: Elukey)
[07:44:37] (PS2) Dzahn: peek: add data types, comments [puppet] - https://gerrit.wikimedia.org/r/602602
[07:45:59] (CR) Dzahn: [C: +2] add IPv6 records for people2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/598968 (owner: Dzahn)
[07:46:16] (PS3) Dzahn: add IPv6 records for people2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/598968
[07:52:46] !log rolling restart of ats-tls - T249335
[07:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:50] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335
[07:57:48] Operations, Analytics, Analytics-Kanban: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (elukey) >>! In T254125#6193413, @Milimetric wrote: > Can we re-enable reportupdater on the machine now? Already done a couple of days ago :)
[07:59:03] PROBLEM - Check the last execution of mediawiki_job_cirrus_build_completion_indices_codfw on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_codfw https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:59:13] (PS1) Dzahn: smokeping: add data types [puppet] - https://gerrit.wikimedia.org/r/602606
[07:59:15] (PS1) Dzahn: releases: add data types [puppet] - https://gerrit.wikimedia.org/r/602607
[08:00:23] ACKNOWLEDGEMENT - Check the last execution of mediawiki_job_cirrus_build_completion_indices_codfw on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_codfw daniel_zahn https://phabricator.wikimedia.org/T254436 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:01:42] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/23016/ganeti2003.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/589608 (owner: Dzahn)
[08:03:35] (CR) Dzahn: [C: +2] smokeping: add data types [puppet] - https://gerrit.wikimedia.org/r/602606 (owner: Dzahn)
[08:03:36] (PS2) Dzahn: smokeping: add data types [puppet] - https://gerrit.wikimedia.org/r/602606
[08:04:29] PROBLEM - PHP opcache health on mw2253 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:05:13] (CR) Filippo Giunchedi: [C: +1] centrallog: update mtail syslog file locations [puppet] - https://gerrit.wikimedia.org/r/602470 (owner: Herron)
[08:11:57] !log Upgrade db2075 to 10.1.45
[08:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:22] (PS1) Dzahn: rancid: add data types [puppet] - https://gerrit.wikimedia.org/r/602613
[08:16:24] (PS1) Dzahn: statistics: add data types [puppet] - https://gerrit.wikimedia.org/r/602614
[08:18:58] (PS1) Dzahn: statistics: delete empty file password.pp [puppet] - https://gerrit.wikimedia.org/r/602616
[08:19:51] (PS2) Dzahn: statistics: delete empty file password.pp [puppet] - https://gerrit.wikimedia.org/r/602616
[08:20:16] (PS3) Dzahn: statistics: delete empty file password.pp [puppet] - https://gerrit.wikimedia.org/r/602616 (https://phabricator.wikimedia.org/T87450)
[08:20:50] (CR) Filippo Giunchedi: "LGTM overall, see inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[08:20:51] RECOVERY - PHP opcache health on mw2253 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:21:59] (CR) Volans: [C: -1] "Maybe I'm missing context here, but see my comment inline." (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/602318 (owner: Muehlenhoff)
[08:22:24] (CR) Filippo Giunchedi: [C: +2] prometheus: move services instance to profile [puppet] - https://gerrit.wikimedia.org/r/602398 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[08:22:32] (PS3) Filippo Giunchedi: prometheus: move services instance to profile [puppet] - https://gerrit.wikimedia.org/r/602398 (https://phabricator.wikimedia.org/T252186)
[08:25:18] (PS1) Privacybatm: transferpy: Remove wmfmariadbpy package [software/transferpy] - https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256)
[08:26:30] (PS1) Dzahn: httpbb: add test_miscweb snippet to profile [puppet] - https://gerrit.wikimedia.org/r/602619
[08:27:14] (Abandoned) Jcrespo: Remove transfer.py-related files as per migration to new repo [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602315 (owner: Jcrespo)
[08:27:58] !log migrate etherpad1002 to new ganeti nodes
[08:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:09] !log migrate seaborgium.wikimedia.org to new ganeti nodes
[08:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:48] (CR) Filippo Giunchedi: [C: +2] prometheus: move global instance to profile [puppet] - https://gerrit.wikimedia.org/r/602401 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[08:28:58] (PS2) Filippo Giunchedi: prometheus: move global instance to profile [puppet] - https://gerrit.wikimedia.org/r/602401 (https://phabricator.wikimedia.org/T252186)
[08:32:16] (PS1) Dzahn: profile::planet: add data types [puppet] - https://gerrit.wikimedia.org/r/602620
[08:37:55] !log empty ganeti100{1,2,3,4}. Move all VMs to new ganeti nodes
[08:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:03] !log failover master IP from ganeti1003 to ganeti1009
[08:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:38] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (Dzahn) a:danshick-wmde
[08:40:07] !log migrate acrab to new ganeti nodes
[08:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:46] Operations, Citoid, Wikimedia-Logstash, observability, and 3 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (Mvolz) >>! In T219919#6193561, @Mvolz wrote: >>>! In T219919#6192799, @Pchelolo wrote: >> The patch above doesn't change anything in produc...
[08:41:17] Operations, LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (Dzahn) Open→Stalled
[08:41:27] (CR) Jcrespo: "See dependencies, will test locally to check all tests run normally." (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: Privacybatm)
[08:42:43] !log migrate mx2001.wikimedia.org to new ganeti nodes
[08:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:56] (PS2) Jcrespo: Transferer.py: Backport production fixes into HEAD (xtrabackup in path) [software/transferpy] - https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666)
[08:44:04] Operations, LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (Dzahn) Hi @dcipoletti gentle ping. To move this forward we still just need the last steps mentioned by Reuven in the comment above.
[08:44:07] (PS3) Jcrespo: Transferer.py: Backport production fixes into HEAD (xtrabackup in path) [software/transferpy] - https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666)
[08:44:27] (CR) Jcrespo: "Done" (3 comments) [software/transferpy] - https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666) (owner: Jcrespo)
[08:44:29] Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2016.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[08:44:35] !log reimage ganeti2016 for stretch
[08:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:45] (PS4) Jcrespo: Transferer.py: Backport production fixes into HEAD (xtrabackup in path) [software/transferpy] - https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666)
[08:47:48] (CR) Privacybatm: "Please see my comments." (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: Privacybatm)
[08:48:23] (CR) Privacybatm: "Please see my comments" [software/transferpy] - https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: Privacybatm)
[08:50:09] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (danshick-wmde) Email sent to @KFrancis . Thanks all!
[08:50:41] (CR) Privacybatm: "> Patch Set 1:" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: Privacybatm)
[08:51:45] (CR) Alexandros Kosiaris: [C: +2] ganeti: add monitoring for ganeti RAPI [puppet] - https://gerrit.wikimedia.org/r/589608 (owner: Dzahn)
[08:51:54] (PS5) Alexandros Kosiaris: ganeti: add monitoring for ganeti RAPI [puppet] - https://gerrit.wikimedia.org/r/589608 (owner: Dzahn)
[08:54:16] :)
[08:54:41] (PS1) Jcrespo: mariabackup: Stop harcoding xtrabackup path [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602622 (https://phabricator.wikimedia.org/T250666)
[08:54:57] (CR) jerkins-bot: [V: -1] mariabackup: Stop harcoding xtrabackup path [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602622 (https://phabricator.wikimedia.org/T250666) (owner: Jcrespo)
[08:55:07] (CR) Privacybatm: "> Patch Set 1:" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: Privacybatm)
[08:55:54] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (Dzahn) a:danshick-wmde→KFrancis
[08:56:18] (CR) Filippo Giunchedi: [C: +2] thanos: add alerts for Thanos components [puppet] - https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[08:56:28] (PS2) Jcrespo: mariabackup: Stop harcoding xtrabackup path [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602622 (https://phabricator.wikimedia.org/T250666)
[09:00:30] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:02] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:56] (PS1) Jcrespo: mariadb-backups: Port xtrabackup path fixes from wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/602624 (https://phabricator.wikimedia.org/T250666)
[09:08:10] Operations, SRE-tools, Patch-For-Review: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504 (Volans) Open→Resolved Yes indeed this old issue that was tracking the development of Debmonitor can be surely closed. The remaining...
[09:08:12] Operations, SRE-tools, Goal, Patch-For-Review: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4] - https://phabricator.wikimedia.org/T191298 (Volans)
[09:08:59] (PS2) Dzahn: profile::planet: add data types [puppet] - https://gerrit.wikimedia.org/r/602620
[09:11:46] Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2016.codfw.wmnet'] ` and were **ALL** successful.
[09:11:55] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[09:12:17] (PS3) Jcrespo: mariabackup: Stop harcoding xtrabackup path [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/602622 (https://phabricator.wikimedia.org/T250666)
[09:12:19] checking that.. ^ probably the new ganeti RAPI monitoring
[09:12:26] (PS3) Dzahn: profile::planet: add data types [puppet] - https://gerrit.wikimedia.org/r/602620
[09:13:21] ah.. no.. it's thanos
[09:13:54] godog: icinga does not like vi /etc/nagios/nagios_service.cfg +22229
[09:14:44] Could not add object property
[09:15:03] siiighh
[09:15:08] thanks mutante I'll take a look
[09:16:37] (PS12) Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027)
[09:17:37] Operations, SRE-tools, docker-pkg, serviceops: Report image metadata to debmonitor - https://phabricator.wikimedia.org/T241206 (hashar)
[09:17:39] Operations, SRE-tools, Patch-For-Review: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504 (hashar)
[09:18:36] Operations, SRE-tools, Patch-For-Review: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504 (hashar) Excellent, thank you very much ;)
[09:18:53] yeah I'll revert
[09:19:16] (PS13) Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027)
[09:19:35] (CR) Jcrespo: "Hadn't we deleted all RemoteExecution files on a previous commit?" [software/transferpy] - https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: Privacybatm)
[09:20:25] mutante: where did you see that log btw ?
[09:22:25] (03CR) 10Jcrespo: "> Patch Set 1:" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:22:27] (03PS1) 10Filippo Giunchedi: Revert "thanos: add alerts for Thanos components" [puppet] - 10https://gerrit.wikimedia.org/r/602626 [09:22:38] godog: i saw icinga itself alert about PROBLEM - Check correctness of the icinga configuration .. then i ran "sudo icinga -v /etc/icinga/icinga.cfg " and that tells me "Error: Invalid service object directive 'sum' and "in file '/etc/nagios/nagios_service.cfg' on line 22229" [09:23:40] (03PS4) 10Jcrespo: mariabackup: Stop harcoding xtrabackup path [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602622 (https://phabricator.wikimedia.org/T250666) [09:24:03] mutante: thanks! [09:24:14] (03CR) 10jerkins-bot: [V: 04-1] Revert "thanos: add alerts for Thanos components" [puppet] - 10https://gerrit.wikimedia.org/r/602626 (owner: 10Filippo Giunchedi) [09:24:59] (03PS2) 10Filippo Giunchedi: Revert "thanos: add alerts for Thanos components" [puppet] - 10https://gerrit.wikimedia.org/r/602626 [09:25:01] (03CR) 10Jcrespo: [C: 03+2] mariabackup: Stop harcoding xtrabackup path [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602622 (https://phabricator.wikimedia.org/T250666) (owner: 10Jcrespo) [09:25:10] * godog shushes jerkins-bot [09:25:39] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Port xtrabackup path fixes from wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/602624 (https://phabricator.wikimedia.org/T250666) (owner: 10Jcrespo) [09:25:44] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart [09:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:51] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Revert "thanos: add alerts for Thanos components" [puppet] - 10https://gerrit.wikimedia.org/r/602626 (owner: 10Filippo Giunchedi) [09:26:38] let me know if to merge 
that too, godog [09:26:46] jynus: yes please, thank you [09:27:09] ongoing [09:27:17] finished [09:29:27] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [09:30:33] (03PS14) 10Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [09:31:34] ooof still the same problem with the puppet run heh [09:31:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/602598 (https://phabricator.wikimedia.org/T252210) (owner: 10Dzahn) [09:33:14] (investigating) [09:34:42] (03CR) 10Jcrespo: "I see the issue, switchover.py requires RemoteExecution, but we have moved that under trasnferpy. Any suggestion?" 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:38:27] ok manually fixing the icinga configuration did the trick [09:39:10] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 (https://phabricator.wikimedia.org/T254479) (owner: 10JMeybohm) [09:44:15] (03PS7) 10JMeybohm: Readd wmf.chartid (.metadata.labels.chart) to all resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 (https://phabricator.wikimedia.org/T254479) [09:44:23] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [09:45:12] (03CR) 10JMeybohm: [C: 03+2] Readd wmf.chartid (.metadata.labels.chart) to all resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 (https://phabricator.wikimedia.org/T254479) (owner: 10JMeybohm) [09:45:38] (03Merged) 10jenkins-bot: Readd wmf.chartid (.metadata.labels.chart) to all resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 (https://phabricator.wikimedia.org/T254479) (owner: 10JMeybohm) [09:45:52] (03CR) 10Privacybatm: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:46:35] (03CR) 10Privacybatm: "We have not yet moved the switchover.py to transferpy. But we can move that if it is related to transferpy. 
In that case, we will have to " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:46:42] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [09:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:41] (03CR) 10Jcrespo: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:49:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "HAH! Amazing bug :P LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598209 (owner: 10Alexandros Kosiaris) [09:49:08] (03PS4) 10Dzahn: profile::planet: add data types [puppet] - 10https://gerrit.wikimedia.org/r/602620 [09:49:09] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [09:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:24] (03CR) 10Privacybatm: "> Patch Set 1:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:52:31] (03CR) 10Privacybatm: "> Patch Set 1:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:52:40] (03CR) 10Jcrespo: "> Patch Set 1:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [09:53:48] (03CR) 10Dzahn: "checked Icinga and it works.. 
but also realized we will have some duplicate definitions because it adds one check on each machine but we j" [puppet] - 10https://gerrit.wikimedia.org/r/589608 (owner: 10Dzahn) [09:55:39] (03CR) 10DannyS712: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602616 (https://phabricator.wikimedia.org/T87450) (owner: 10Dzahn) [09:59:29] (03PS15) 10Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [10:02:10] (03PS16) 10Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [10:02:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [10:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:50] (03CR) 10Kormat: [C: 03+1] Transferer.py: Backport production fixes into HEAD (xtrabackup in path) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666) (owner: 10Jcrespo) [10:10:03] (03CR) 10Privacybatm: "> Patch Set 1:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [10:10:44] (03PS4) 10Filippo Giunchedi: prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) [10:10:46] (03PS1) 10Filippo Giunchedi: thanos: add alerts for Thanos components [puppet] - 10https://gerrit.wikimedia.org/r/602633 (https://phabricator.wikimedia.org/T252186) [10:11:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [10:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:10] \o/ [10:14:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598209 (owner: 10Alexandros Kosiaris) [10:14:59] (03PS2) 10Alexandros Kosiaris: Rakefile: read stdout/stderr before waiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/598209 [10:15:11] 10Operations, 10Analytics, 10serviceops, 10vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (10elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_C matomo1002.eqiad.wmnet --vcpus 4 --memory 8 --disk 50 START - Cookbook sre.ganeti.makevm Ready to c... [10:15:25] 10Operations, 10Analytics, 10serviceops, 10vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (10elukey) 05Stalled→03Open [10:17:39] (03PS1) 10Dzahn: ganeti: add custom fact for master node and only monitor RAPI there [puppet] - 10https://gerrit.wikimedia.org/r/602635 [10:18:49] (03PS1) 10Jcrespo: mariadb-backups: Disable trasnfer.py logging to systemd [puppet] - 10https://gerrit.wikimedia.org/r/602636 [10:18:56] (03PS1) 10Elukey: Add puppet configuration for matomo1002 [puppet] - 10https://gerrit.wikimedia.org/r/602637 (https://phabricator.wikimedia.org/T252742) [10:18:58] (03CR) 10Privacybatm: "But the tox env contains everything! :" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [10:19:03] (03CR) 10Kormat: install_server: Allow reuse of partitions during reimage. 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [10:19:28] (03CR) 10jerkins-bot: [V: 04-1] ganeti: add custom fact for master node and only monitor RAPI there [puppet] - 10https://gerrit.wikimedia.org/r/602635 (owner: 10Dzahn) [10:21:55] (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Backport production fixes into HEAD (xtrabackup in path) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602359 (https://phabricator.wikimedia.org/T250666) (owner: 10Jcrespo) [10:22:15] (03PS2) 10Dzahn: ganeti: add custom fact for master node and only monitor RAPI there [puppet] - 10https://gerrit.wikimedia.org/r/602635 [10:26:45] (03PS2) 10Jcrespo: mariadb-backups: Disable trasnfer.py logging to systemd [puppet] - 10https://gerrit.wikimedia.org/r/602636 [10:27:37] (03PS3) 10Jcrespo: mariadb-backups: Disable transfer.py logging to systemd [puppet] - 10https://gerrit.wikimedia.org/r/602636 [10:28:59] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [10:29:07] (03PS1) 10Dzahn: cumin/wdqs: fix cumin aliases after wdqs role changes [puppet] - 10https://gerrit.wikimedia.org/r/602639 [10:31:35] (03PS2) 10Dzahn: cumin/wdqs: fix cumin aliases after wdqs role changes [puppet] - 10https://gerrit.wikimedia.org/r/602639 [10:32:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [10:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:43] (03CR) 10Dzahn: "The changes here lead to some broken cumin aliases. 
See https://gerrit.wikimedia.org/r/c/operations/puppet/+/602639" [puppet] - 10https://gerrit.wikimedia.org/r/598884 (owner: 10EBernhardson) [10:32:50] \o/ [10:32:59] (03PS1) 10Alexandros Kosiaris: ganeti: Add ganeti_master fact [puppet] - 10https://gerrit.wikimedia.org/r/602640 [10:33:01] (03PS1) 10Alexandros Kosiaris: ganeti: Utilize ganeti_master fact to deduplicate checks [puppet] - 10https://gerrit.wikimedia.org/r/602641 [10:33:03] (03PS1) 10Alexandros Kosiaris: ganeti: Monitor metad and wconfd as well [puppet] - 10https://gerrit.wikimedia.org/r/602642 [10:33:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I think https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/602641/ and the dependent change should fix that" [puppet] - 10https://gerrit.wikimedia.org/r/589608 (owner: 10Dzahn) [10:34:20] akosiaris: ooh.. i just compiled that and wasn't sure if it works :) https://puppet-compiler.wmflabs.org/compiler1003/23024/ [10:34:40] akosiaris: i mean https://gerrit.wikimedia.org/r/c/operations/puppet/+/602635 heh [10:35:27] oh, you were working on it already? nice! [10:35:32] I 'll abandon mine then [10:35:42] (03CR) 10Dzahn: "duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/602640 ?" [puppet] - 10https://gerrit.wikimedia.org/r/602635 (owner: 10Dzahn) [10:35:49] it's the same thing anyway [10:36:21] akosiaris: well.. is it the same though.. i added it to wmflib .. 
but i have not really added custom facts before [10:36:40] i found the command you used in the motd [10:36:44] (03PS1) 10Jbond: deployment: fix shellcheck issues umask-wikidev-profile-d.sh [puppet] - 10https://gerrit.wikimedia.org/r/602643 (https://phabricator.wikimedia.org/T254480) [10:36:47] (03PS1) 10Jbond: mailmain: fix shellcheck issues remove_from_private.sh [puppet] - 10https://gerrit.wikimedia.org/r/602644 (https://phabricator.wikimedia.org/T254480) [10:36:51] (03PS1) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [10:36:55] (03PS1) 10Jbond: confluent: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602646 (https://phabricator.wikimedia.org/T254480) [10:36:57] (03PS1) 10Jbond: osm: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602647 (https://phabricator.wikimedia.org/T254480) [10:36:59] (03PS1) 10Jbond: statistics: fix shelcheck issues in hardsync [puppet] - 10https://gerrit.wikimedia.org/r/602648 (https://phabricator.wikimedia.org/T254480) [10:37:01] (03PS1) 10Jbond: prometheus: fix shellcheck issues in prometheus-local-crontabs.sh [puppet] - 10https://gerrit.wikimedia.org/r/602649 (https://phabricator.wikimedia.org/T254480) [10:37:36] your fact is probably better.. 
just reading a file instead of running the command [10:38:16] either way i was thinking a custom fact can also be used to change the motd based on it [10:38:43] (03CR) 10Dzahn: [C: 03+1] ganeti: Add ganeti_master fact [puppet] - 10https://gerrit.wikimedia.org/r/602640 (owner: 10Alexandros Kosiaris) [10:39:38] (03CR) 10Dzahn: [C: 03+1] "looks better than my attempt at https://gerrit.wikimedia.org/r/c/operations/puppet/+/602635" [puppet] - 10https://gerrit.wikimedia.org/r/602640 (owner: 10Alexandros Kosiaris) [10:40:12] (03CR) 10Dzahn: [C: 03+1] ganeti: Utilize ganeti_master fact to deduplicate checks [puppet] - 10https://gerrit.wikimedia.org/r/602641 (owner: 10Alexandros Kosiaris) [10:40:54] (03PS2) 10Jbond: misc: fix minor shellcheck issues in a few scripts [puppet] - 10https://gerrit.wikimedia.org/r/602643 (https://phabricator.wikimedia.org/T254480) [10:41:09] (03CR) 10Dzahn: [C: 04-1] ganeti: Monitor metad and wconfd as well (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602642 (owner: 10Alexandros Kosiaris) [10:42:29] (03CR) 10Dzahn: [C: 04-1] ganeti: Monitor metad and wconfd as well (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602642 (owner: 10Alexandros Kosiaris) [10:42:39] (03CR) 10Volans: [C: 03+1] "LGTM as a temporary workaround before we come up with a better solution" [puppet] - 10https://gerrit.wikimedia.org/r/602636 (owner: 10Jcrespo) [10:43:55] (03Abandoned) 10Dzahn: ganeti: add custom fact for master node and only monitor RAPI there [puppet] - 10https://gerrit.wikimedia.org/r/602635 (owner: 10Dzahn) [10:44:13] (03CR) 10Dzahn: [C: 03+2] ganeti: Add ganeti_master fact [puppet] - 10https://gerrit.wikimedia.org/r/602640 (owner: 10Alexandros Kosiaris) [10:47:04] (03PS1) 10Jbond: planet: fix shellcheck issues in check_https [puppet] - 10https://gerrit.wikimedia.org/r/602650 (https://phabricator.wikimedia.org/T254480) [10:50:03] (03CR) 10Volans: [C: 04-1] "I think there is a typo, see inline" (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/602639 (owner: 10Dzahn) [10:57:20] akosiaris: it works now with facter --custom-dir=/var/lib/puppet/lib/facter/ ganeti_master [10:57:24] (03PS3) 10Alexandros Kosiaris: Enable networkpolicy across multiple services [deployment-charts] - 10https://gerrit.wikimedia.org/r/598188 (https://phabricator.wikimedia.org/T249927) [10:57:37] mutante: wow, nice! [10:57:43] (03CR) 10Dzahn: "[ganeti1003:~] $ facter --custom-dir=/var/lib/puppet/lib/facter/ ganeti_master" [puppet] - 10https://gerrit.wikimedia.org/r/602640 (owner: 10Alexandros Kosiaris) [10:57:52] mutante: I think you can also do facter -p [10:58:01] that will instruct facter to load puppet facts as well [10:58:10] 2020-06-05 10:58:03.526142 WARN puppetlabs.facter - skipping external facts for "/home/dzahn/.puppet/cache/facts.d": No such file or directory [10:58:40] that happens with -p as non-root.. but with custom-dir it just works [10:59:10] (03PS2) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [10:59:11] took me a while before when i never saw our custom facts with regular facter [10:59:45] or with "puppet facts" either [10:59:56] they can be pretty useful. Although their API has evolved a bit. I think most of our custom facts are not using the facts.d dir (or at least did not used to) [11:00:20] now that you mentioned it, I remember meeting a weird thing in puppetboard [11:00:50] I think we overpopulate network facts, counting ephemeral veth (e.g. calixXXXXXXX) and tap (i.e. 
ganeti) interfaces [11:01:08] (03CR) 10Dzahn: [C: 03+2] ganeti: Utilize ganeti_master fact to deduplicate checks [puppet] - 10https://gerrit.wikimedia.org/r/602641 (owner: 10Alexandros Kosiaris) [11:01:45] I quickly rebuild the interwiki cache [11:01:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] Enable networkpolicy across multiple services [deployment-charts] - 10https://gerrit.wikimedia.org/r/598188 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [11:02:20] (03Merged) 10jenkins-bot: Enable networkpolicy across multiple services [deployment-charts] - 10https://gerrit.wikimedia.org/r/598188 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [11:03:34] oh, heh, something did not quite work with the second change... the check_command is now: [11:03:37] /usr/lib/nagios/plugins/check_http -H {agent_specified_environment => production, architecture => amd64, augeas => {version => 1.8.0}, .... [11:03:42] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602652 [11:03:44] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602652 (owner: 10Ladsgroup) [11:03:46] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602653 [11:03:48] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602653 (owner: 10Ladsgroup) [11:04:11] (03PS3) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [11:04:38] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602652 (owner: 10Ladsgroup) [11:04:42] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602653 (owner: 10Ladsgroup) [11:04:55] check_http -H ${facts}['ganeti_cluster'] [11:05:03] ah dammit [11:05:14] it 
should be ${facts['ganeti_cluster']} I guess? [11:05:22] * akosiaris doublechecking [11:05:46] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 14s) [11:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:03] we can keep using ${::site} or we can replace all ${::site} with the fact [11:07:19] (03CR) 10Jbond: "I have tried to keep functionality untouched but let me know if that's not the case we can always silence some of the complaints" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [11:07:20] the advantage of not using $site is the number "01" will not be hardcoded.. once there is ganeti02.svc somewhere [11:07:41] yup [11:08:03] we would need to change some of the puppet logic however to achieve that [11:08:15] we 've recently went ahead and did some changes that no longer support it [11:08:24] ok [11:08:32] but then again.. we haven't instantiated a ganeti02.svc.eqiad.wmnet in like .. 5 years now? 
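The interpolation slip discussed above comes down to where the closing brace sits: in `${facts}['ganeti_cluster']` the braces end before the key, so the whole facts hash is rendered and `['ganeti_cluster']` stays behind as literal text, which matches the check_command output pasted at 11:03:37. The sketch below reproduces the same mistake with Python f-strings as an analogy only. It is not Puppet code, and the `facts` values (including the cluster name) are made-up placeholders.

```python
# Analogy only: Python f-strings standing in for Puppet string interpolation.
# The fact values below are hypothetical placeholders.
facts = {
    "architecture": "amd64",
    "ganeti_cluster": "ganeti01.svc.example.wmnet",
}

# Braces close before the key: the whole dict is rendered and the
# key lookup is left behind as literal text.
broken = f"check_http -H {facts}['ganeti_cluster']"

# Key lookup inside the braces: only the single fact value is rendered.
fixed = f"check_http -H {facts['ganeti_cluster']}"

print(broken)
print(fixed)
```

The second form corresponds to the `${facts['ganeti_cluster']}` spelling proposed at 11:05:14.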
[11:09:14] yea, and we can still just grep -r ganeti01 [11:10:14] (03PS2) 10Alexandros Kosiaris: ganeti: Monitor metad and wconfd as well [puppet] - 10https://gerrit.wikimedia.org/r/602642 [11:10:16] (03PS1) 10Alexandros Kosiaris: Correctly reference the ganeti_cluster fact [puppet] - 10https://gerrit.wikimedia.org/r/602655 [11:10:25] mutante: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/602655 [11:10:30] that should fix it [11:11:12] (03CR) 10Dzahn: [C: 03+2] Correctly reference the ganeti_cluster fact [puppet] - 10https://gerrit.wikimedia.org/r/602655 (owner: 10Alexandros Kosiaris) [11:11:17] (03PS2) 10Jbond: planet: fix shellcheck issues in check_https [puppet] - 10https://gerrit.wikimedia.org/r/602650 (https://phabricator.wikimedia.org/T254480) [11:11:46] (03PS2) 10Jbond: statistics: fix shelcheck issues in hardsync [puppet] - 10https://gerrit.wikimedia.org/r/602648 (https://phabricator.wikimedia.org/T254480) [11:11:51] (03PS2) 10Jbond: osm: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602647 (https://phabricator.wikimedia.org/T254480) [11:11:57] (03PS2) 10Jbond: confluent: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602646 (https://phabricator.wikimedia.org/T254480) [11:12:01] (03PS4) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [11:12:06] (03PS2) 10Jbond: mailmain: fix shellcheck issues remove_from_private.sh [puppet] - 10https://gerrit.wikimedia.org/r/602644 (https://phabricator.wikimedia.org/T254480) [11:12:34] (03PS3) 10Jbond: confluent: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602646 (https://phabricator.wikimedia.org/T254480) [11:12:38] (03PS5) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [11:12:42] (03PS3) 10Jbond: mailmain: fix shellcheck issues remove_from_private.sh [puppet] - 
10https://gerrit.wikimedia.org/r/602644 (https://phabricator.wikimedia.org/T254480) [11:13:09] (03PS2) 10Jbond: prometheus: fix shellcheck issues in prometheus-local-crontabs.sh [puppet] - 10https://gerrit.wikimedia.org/r/602649 (https://phabricator.wikimedia.org/T254480) [11:13:25] (03PS3) 10Jbond: statistics: fix shelcheck issues in hardsync [puppet] - 10https://gerrit.wikimedia.org/r/602648 (https://phabricator.wikimedia.org/T254480) [11:13:40] (03PS3) 10Jbond: osm: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602647 (https://phabricator.wikimedia.org/T254480) [11:13:57] (03PS4) 10Jbond: confluent: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602646 (https://phabricator.wikimedia.org/T254480) [11:14:16] !log running puppet on all ganeti nodes [11:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:12] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Ladsgroup) Seconding Krinkle. I investigate more on this. I have lots of ideas on how to improve memcached... [11:18:11] hi ottomata - I'm seeing errors from MobileWikiAppProtectedEditAttempt ('protectionStatus' is a required property) Shall I file a task? 
[11:18:57] PROBLEM - Host ganeti2016 is DOWN: PING CRITICAL - Packet loss = 100% [11:19:04] (03CR) 10Dzahn: [C: 03+2] planet: fix shellcheck issues in check_https [puppet] - 10https://gerrit.wikimedia.org/r/602650 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [11:20:53] akosiaris: Search Results [11:20:53] Web results [11:21:00] oops :), wrong paste [11:21:49] :) [11:21:51] akosiaris: all good now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=RAPI [11:22:03] 👍 [11:22:08] RECOVERY - Host ganeti2016 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [11:23:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add kafka-jumbo100[7-9] to network policy for eventgate-analytics and eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/602087 (https://phabricator.wikimedia.org/T252675) (owner: 10Ottomata) [11:24:27] (03PS3) 10Dzahn: ganeti: Monitor metad and wconfd as well [puppet] - 10https://gerrit.wikimedia.org/r/602642 (owner: 10Alexandros Kosiaris) [11:24:28] PROBLEM - Check systemd state on ganeti2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:08] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:20] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:31] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Applied in all 3 clusters. 
Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602087 (https://phabricator.wikimedia.org/T252675) (owner: 10Ottomata) [11:26:43] (03CR) 10Dzahn: "PS3: fixed command line and descriptions" [puppet] - 10https://gerrit.wikimedia.org/r/602642 (owner: 10Alexandros Kosiaris) [11:27:39] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Joe) The problem we're trying to solve here is not an individual cache key read, or even multiple cache ke... [11:27:50] mutante: great! [11:28:10] I am btw, failing over ganeti01.svc.codfw.wmnet from ganeti2001 to ganeti2019 [11:28:26] !log master-failover from ganeti2001 to ganeti2019 for ganeti01.svc.codfw.wmnet [11:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:10] (03CR) 10Dzahn: [C: 04-1] "ganeti-metad check works on ganeti1009 but not on ganeti1003 (needs to be limited to masters as well?)" [puppet] - 10https://gerrit.wikimedia.org/r/602642 (owner: 10Alexandros Kosiaris) [11:37:45] XioNoX: Hi! would you have a minute for me? [11:37:55] joal: sure! [11:38:32] XioNoX: I have noticed we (more or less regularly) receive late data in the netflow topic [11:38:51] what's late? [11:39:10] XioNoX: volume seems low, and I'd like to set known limits for acceptance [11:39:58] XioNoX: we have copied 1 hour ago data belonging to yesterday hour 22 (all UTC) [11:40:30] so pmacct sent 1h ago data belonging to yesterday? [11:40:40] that sounds wrong [11:40:41] and actually XioNoX we have copied 5 minutes ago data for that same hour :) [11:40:54] it feels wrong to me as well [11:41:39] XioNoX: 2 possible things - pmacct sent data belonging to yesterday, or we have an issue at ingestion [11:42:01] is it possible to look at kafka for that?
[11:42:32] XioNoX: we have currently 2 systems ingesting the data, and both correlate the fact that data belonging to yesterday appears now [11:42:39] XioNoX: We can look at kafka yes [11:43:10] (03CR) 10ArielGlenn: "Some of the double quotes placement can't be right. When you add double quotes around a variable in a for item in "$X", that $X is now seen" [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [11:43:30] it's possible that pmacct sends data that is a few minutes old (when a flow is over) [11:43:41] but I can't imagine a usecase for hours old [11:44:19] so if the issue is from the pmacct side, it's either a misconfig or a bug [11:44:41] so we can review the settings, but unfortunately pmacct doesn't give much visibility into its internals [11:44:56] ack XioNoX [11:45:10] that's why looking at kafka might be helpful here (if possible) [11:45:16] joal: can you open a task? :) [11:45:28] XioNoX: I'm gonna spend some time trying to provide you meaningful info [11:45:39] sure XioNoX - doing that with info when I have it [11:45:49] thanks for the brainbounce XioNoX [11:46:13] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) 05Resolved→03Open a:05QChris→03Dzahn As Chris points out this VM only has 1 vcpu. But requested were 8 and he needs 8. That was my mistake it seems. Reopening. [11:46:18] akosiaris: when you are done i want to assign 8 VCPUS to an existing VM that only had 1 and steal all the resources, lol [11:46:27] joal: thanks!
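One way to put numbers on the "late data" question raised above is to compare each record's own event time with the time it was ingested and count how many records exceed a limit, grouped by the hour they belong to. The following is a minimal Python sketch under stated assumptions: the record layout, the field names `event_ts` and `ingest_ts`, the sample values, and the one-hour threshold are all hypothetical and are not taken from the netflow topic, Kafka, or pmacct.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical acceptance limit; the real limit was still open in the discussion.
MAX_LATENESS = timedelta(hours=1)


def lateness_report(records, max_lateness=MAX_LATENESS):
    """Count late records per event hour.

    Each record is a dict with timezone-aware `event_ts` and `ingest_ts`
    datetimes (hypothetical field names).
    """
    late_per_hour = Counter()
    for rec in records:
        delay = rec["ingest_ts"] - rec["event_ts"]
        if delay > max_lateness:
            # Truncate the event time to the hour it belongs to.
            hour = rec["event_ts"].replace(minute=0, second=0, microsecond=0)
            late_per_hour[hour] += 1
    return late_per_hour


utc = timezone.utc
sample = [
    # Arrives 5 minutes after the event: within the limit.
    {"event_ts": datetime(2020, 6, 5, 11, 0, tzinfo=utc),
     "ingest_ts": datetime(2020, 6, 5, 11, 5, tzinfo=utc)},
    # Belong to hour 22 of the previous day but arrive the next morning.
    {"event_ts": datetime(2020, 6, 4, 22, 10, tzinfo=utc),
     "ingest_ts": datetime(2020, 6, 5, 10, 40, tzinfo=utc)},
    {"event_ts": datetime(2020, 6, 4, 22, 50, tzinfo=utc),
     "ingest_ts": datetime(2020, 6, 5, 11, 35, tzinfo=utc)},
]

report = lateness_report(sample)
for hour, count in sorted(report.items()):
    print(hour.isoformat(), count)
```

A per-hour count like this would only show that late records exist and how many; telling apart the two possibilities named at 11:41:39 (the sender emitting old data versus a problem at ingestion) would still need the broker-side timestamps.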
[11:49:55] (03PS6) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [11:52:26] (03PS1) 10Dzahn: DHCP/site: add people2001 [puppet] - 10https://gerrit.wikimedia.org/r/602661 [11:53:51] (03CR) 10Dzahn: [C: 03+2] mailmain: fix shellcheck issues remove_from_private.sh [puppet] - 10https://gerrit.wikimedia.org/r/602644 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [11:54:03] (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [11:55:24] (03CR) 10Dzahn: [C: 03+1] misc: fix minor shellcheck issues in a few scripts [puppet] - 10https://gerrit.wikimedia.org/r/602643 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [11:57:02] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10QChris) [11:58:04] (03CR) 10Dzahn: "tested..works normal" [puppet] - 10https://gerrit.wikimedia.org/r/602650 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [11:58:28] (03CR) 10Dzahn: "tested with my own email address (except not actually removing myself). works" [puppet] - 10https://gerrit.wikimedia.org/r/602644 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [12:02:35] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10dcipoletti) Thanks for the ping. I have signed L3 and have read Data Access User Responsibilities. 
[12:04:53] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10Dzahn) 05Stalled→03Open a:05dcipoletti→03Dzahn [12:06:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, cc Bryan as original author" [puppet] - 10https://gerrit.wikimedia.org/r/602649 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [12:06:42] PROBLEM - Host ganeti2016 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:04] (03PS5) 10Dzahn: profile::planet: add data types [puppet] - 10https://gerrit.wikimedia.org/r/602620 [12:09:34] RECOVERY - Host ganeti2016 is UP: PING OK - Packet loss = 0%, RTA = 36.32 ms [12:09:52] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23026/" [puppet] - 10https://gerrit.wikimedia.org/r/602620 (owner: 10Dzahn) [12:09:54] mutante: sure, go head. most CPUs aren't doing anything anyway. It's memory and disk we might be dound on [12:10:00] s/dounb/bound/ [12:10:16] akosiaris: great! thanks. it is temporary as well [12:12:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] "You are right. Interestingly, metad runs on master and master candidates (so correctly it doesn't run on ganeti1003, but does run on ganet" [puppet] - 10https://gerrit.wikimedia.org/r/602642 (owner: 10Alexandros Kosiaris) [12:12:52] PROBLEM - Check systemd state on ganeti2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:02] (PS4) Alexandros Kosiaris: ganeti: Monitor metad and wconfd as well [puppet] - https://gerrit.wikimedia.org/r/602642 [12:17:05] !log fix typo in ganeti2016 /etc/network/interfaces and reboot [12:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:25] (CR) Dzahn: [C: +1] "yep, both command lines now working on ganeti1009" [puppet] - https://gerrit.wikimedia.org/r/602642 (owner: Alexandros Kosiaris) [12:17:43] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [12:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:56] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [12:17:57] !log update blubberoid changeprop changeprop-jobqueue citoid cxserver wikifeeds zotero in staging to latest charts [12:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:13] (CR) Alexandros Kosiaris: [C: +2] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/602642 (owner: Alexandros Kosiaris) [12:18:34] PROBLEM - Host ganeti2016 is DOWN: PING CRITICAL - Packet loss = 100% [12:19:16] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [12:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:33] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' . [12:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:45] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[12:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:59] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [12:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:34] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' . [12:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:40] (CR) CDanis: [C: +1] misc: fix minor shellcheck issues in a few scripts [puppet] - https://gerrit.wikimedia.org/r/602643 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [12:21:38] RECOVERY - Host ganeti2016 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [12:22:37] (CR) Jbond: [C: +2] misc: fix minor shellcheck issues in a few scripts [puppet] - https://gerrit.wikimedia.org/r/602643 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [12:23:32] (CR) Gehel: [C: +1] "LGTM, and PCC agrees: https://puppet-compiler.wmflabs.org/compiler1003/23019/" [puppet] - https://gerrit.wikimedia.org/r/599145 (owner: EBernhardson) [12:26:09] (CR) Dzahn: [C: +2] rancid: add data types [puppet] - https://gerrit.wikimedia.org/r/602613 (owner: Dzahn) [12:26:17] (PS2) Dzahn: rancid: add data types [puppet] - https://gerrit.wikimedia.org/r/602613 [12:31:25] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/23027/" [puppet] - https://gerrit.wikimedia.org/r/602613 (owner: Dzahn) [12:31:27] (CR) Gehel: [C: -1] "minor comment inline."
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/599146 (owner: EBernhardson) [12:32:30] (CR) Dzahn: [C: +2] add IPv6 records for peek2001 [dns] - https://gerrit.wikimedia.org/r/602598 (https://phabricator.wikimedia.org/T252210) (owner: Dzahn) [12:32:33] (PS3) Dzahn: add IPv6 records for peek2001 [dns] - https://gerrit.wikimedia.org/r/602598 (https://phabricator.wikimedia.org/T252210) [12:34:30] (PS1) Filippo Giunchedi: monitoring: bail on check_command containing newlines [puppet] - https://gerrit.wikimedia.org/r/602669 (https://phabricator.wikimedia.org/T252186) [12:37:22] (CR) Alexandros Kosiaris: "> Hmm. The pattern I see for other services doesn't seem to be only including values that override the default. Maybe best to wait on furt" (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: Mholloway) [12:37:32] RECOVERY - Check systemd state on ganeti2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:37] (CR) Filippo Giunchedi: "This is take #2, I ran into a problem with newlines/heredoc which should be catched by https://gerrit.wikimedia.org/r/c/operations/puppet/" [puppet] - https://gerrit.wikimedia.org/r/602633 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi) [12:39:38] (CR) Marostegui: install_server: Allow reuse of partitions during reimage.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: Kormat) [12:40:16] (PS3) Gehel: cumin/wdqs: fix cumin aliases after wdqs role changes [puppet] - https://gerrit.wikimedia.org/r/602639 (owner: Dzahn) [12:41:15] !log rebooting gerrit1002 to add more vCPUs, after [ganeti1009:~] $ sudo gnt-instance modify -B vcpus=8 gerrit1002.wikimedia.org T239151 [12:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:20] T239151: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 [12:42:40] (CR) Kormat: install_server: Allow reuse of partitions during reimage. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: Kormat) [12:43:02] (CR) Alexandros Kosiaris: [C: -1] Mobileapps: Add initial helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: Mholloway) [12:43:40] Operations, Phabricator, Security-Team, Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (Reedy) [12:43:58] (CR) Dzahn: [C: +1] "thanks Volans (you were right) and gehel (for fixing it already)" [puppet] - https://gerrit.wikimedia.org/r/602639 (owner: Dzahn) [12:44:00] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/602639 (owner: Dzahn) [12:44:34] (CR) Dzahn: [C: +2] cumin/wdqs: fix cumin aliases after wdqs role changes [puppet] - https://gerrit.wikimedia.org/r/602639 (owner: Dzahn) [12:49:00] (PS1) Ladsgroup: Hotfix for the interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/602674 (https://phabricator.wikimedia.org/T111853) [12:50:59] Operations, Gerrit, vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) ` [gerrit1002:~] $ lscpu ...
CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 8 ... CPU MHz: 2499.998 Bog... [12:51:17] Operations, Gerrit, vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) Open→Resolved @QChris ^ fixed [12:51:20] (CR) Ladsgroup: [C: +2] Hotfix for the interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/602674 (https://phabricator.wikimedia.org/T111853) (owner: Ladsgroup) [12:52:03] (Merged) jenkins-bot: Hotfix for the interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/602674 (https://phabricator.wikimedia.org/T111853) (owner: Ladsgroup) [12:52:37] (CR) Dvorapa: [C: +1] Hotfix for the interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/602674 (https://phabricator.wikimedia.org/T111853) (owner: Ladsgroup) [12:52:59] (PS1) Ladsgroup: Add be-tarask to langlist [mediawiki-config] - https://gerrit.wikimedia.org/r/602675 (https://phabricator.wikimedia.org/T111853) [12:53:55] (PS1) Elukey: Prepare druid1006 for Debian Buster [puppet] - https://gerrit.wikimedia.org/r/602676 (https://phabricator.wikimedia.org/T253980) [12:54:45] (CR) Dvorapa: [C: +1] Add be-tarask to langlist [mediawiki-config] - https://gerrit.wikimedia.org/r/602675 (https://phabricator.wikimedia.org/T111853) (owner: Ladsgroup) [12:55:05] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Hotfix for be-tarask interwiki link being broken (T111853) (duration: 01m 00s) [12:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:09] T111853: The href of be-tarask: interlanguage link points to the be-x-old domain - https://phabricator.wikimedia.org/T111853 [12:55:51] Operations, Phabricator, Security-Team, Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (Dzahn) This could be related to T229620.
[12:56:15] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/23032/" [puppet] - https://gerrit.wikimedia.org/r/602676 (https://phabricator.wikimedia.org/T253980) (owner: Elukey) [12:58:22] (PS1) Alexandros Kosiaris: changeprop-jobqueue: Add codfw to calico policies [deployment-charts] - https://gerrit.wikimedia.org/r/602678 [12:58:40] PROBLEM - PHP opcache health on mw2362 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:59:02] Operations, Analytics, netops: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (JAllemandou) [12:59:13] XioNoX: https://phabricator.wikimedia.org/T254574 [13:01:31] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/602675 (https://phabricator.wikimedia.org/T111853) (owner: Ladsgroup) [13:01:56] (PS4) Dzahn: statistics: delete empty file password.pp [puppet] - https://gerrit.wikimedia.org/r/602616 (https://phabricator.wikimedia.org/T87450) [13:02:46] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/23034/" [puppet] - https://gerrit.wikimedia.org/r/602616 (https://phabricator.wikimedia.org/T87450) (owner: Dzahn) [13:03:38] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (jbond) i have made a first pass at the [[ https://wikitech.wikimedia.org/wiki/Incident_documentation/20200605-cloud-private-repo | incident... [13:04:19] Operations, Gerrit, vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (QChris) Wooohooo \o/ Thanks!
[13:05:10] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/602607 (owner: Dzahn) [13:09:00] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/23031/" [puppet] - https://gerrit.wikimedia.org/r/602607 (owner: Dzahn) [13:09:18] (PS2) Dzahn: releases: add data types [puppet] - https://gerrit.wikimedia.org/r/602607 [13:10:06] RECOVERY - PHP opcache health on mw2362 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:15:00] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [13:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:53] (CR) Alexandros Kosiaris: [C: +2] changeprop-jobqueue: Add codfw to calico policies [deployment-charts] - https://gerrit.wikimedia.org/r/602678 (owner: Alexandros Kosiaris) [13:16:23] (Merged) jenkins-bot: changeprop-jobqueue: Add codfw to calico policies [deployment-charts] - https://gerrit.wikimedia.org/r/602678 (owner: Alexandros Kosiaris) [13:17:14] (CR) TK-999: [C: +1] Add a default Apache 2.0 license [puppet] - https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: Rush) [13:18:25] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [13:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:07] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' . [13:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:50] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[13:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:44] (PS1) Mvolz: Switch from gelf to rsyslog [deployment-charts] - https://gerrit.wikimedia.org/r/602683 (https://phabricator.wikimedia.org/T219919) [13:23:43] (PS1) JMeybohm: common_templates: make envoy admin interface listen to 0.0.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/602684 [13:23:49] (PS2) Mvolz: Switch from gelf to rsyslog in citoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/602683 (https://phabricator.wikimedia.org/T219919) [13:24:27] (PS1) Alexandros Kosiaris: chromium-render: Drop appbase_url_port [deployment-charts] - https://gerrit.wikimedia.org/r/602685 [13:24:49] Operations, Maps, Wikimedia-Logstash, observability, and 4 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (Mholloway) Hi @fgiunchedi, thanks for the reviews! What's the minimum version of service-runner that's required for this? W... [13:25:51] Operations, Analytics, netops: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (elukey) @ayounsi is it ok to set the "Time" field to `stamp_updated` rather than `stamp_inserted` ? As Joseph pointed out we have some weird situations lik... [13:26:08] (CR) Alexandros Kosiaris: [C: -2] "Charts already use the logging pipeline and just need to log to stdout, no need for syslog logging."
[deployment-charts] - https://gerrit.wikimedia.org/r/602683 (https://phabricator.wikimedia.org/T219919) (owner: Mvolz) [13:27:17] (CR) Alexandros Kosiaris: [C: +1] common_templates: make envoy admin interface listen to 0.0.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/602684 (owner: JMeybohm) [13:27:33] (CR) Alexandros Kosiaris: [C: +2] common_templates: make envoy admin interface listen to 0.0.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/602684 (owner: JMeybohm) [13:27:37] (PS2) JMeybohm: eventgate: Update to v0.2 helpers [deployment-charts] - https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) [13:28:00] (Merged) jenkins-bot: common_templates: make envoy admin interface listen to 0.0.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/602684 (owner: JMeybohm) [13:29:17] (Abandoned) Mvolz: Switch from gelf to rsyslog in citoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/602683 (https://phabricator.wikimedia.org/T219919) (owner: Mvolz) [13:29:59] (PS3) JMeybohm: eventstreams: Update to v0.2 helpers [deployment-charts] - https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) [13:30:48] PROBLEM - ganeti-confd running on ganeti2016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:30:50] (PS2) Dzahn: DHCP/site: add people2001 [puppet] - https://gerrit.wikimedia.org/r/602661 [13:31:21] (PS3) JMeybohm: eventgate: Update to v0.2 helpers [deployment-charts] - https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) [13:31:32] (PS2) Alexandros Kosiaris: chromium-render: Drop appbase_url_port [deployment-charts] - https://gerrit.wikimedia.org/r/602685 [13:31:44] (CR) Marostegui: [C: +1] install_server: Allow reuse of partitions during reimage.
[puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: Kormat) [13:31:50] (CR) Huji: [C: +1] Add be-tarask to langlist [mediawiki-config] - https://gerrit.wikimedia.org/r/602675 (https://phabricator.wikimedia.org/T111853) (owner: Ladsgroup) [13:32:19] (CR) Alexandros Kosiaris: [C: +2] chromium-render: Drop appbase_url_port [deployment-charts] - https://gerrit.wikimedia.org/r/602685 (owner: Alexandros Kosiaris) [13:32:37] (CR) Alexandros Kosiaris: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/598280 (owner: Alexandros Kosiaris) [13:32:44] (CR) jerkins-bot: [V: -1] rake: Add kubeyaml validation [deployment-charts] - https://gerrit.wikimedia.org/r/598280 (owner: Alexandros Kosiaris) [13:32:47] (Merged) jenkins-bot: chromium-render: Drop appbase_url_port [deployment-charts] - https://gerrit.wikimedia.org/r/602685 (owner: Alexandros Kosiaris) [13:33:04] PROBLEM - ganeti-mond running on ganeti2016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [13:33:10] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [13:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:48] (CR) Dzahn: [C: +2] DHCP/site: add people2001 [puppet] - https://gerrit.wikimedia.org/r/602661 (owner: Dzahn) [13:36:22] (CR) Dzahn: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/579602 (owner: Thcipriani) [13:38:30] (PS1) Paladox: gerrit: Unset gerrit::service::ipv6 for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/602686 [13:38:47] (PS2) Paladox: gerrit: Unset gerrit::service::ipv6 for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/602686 [13:38:49] (CR) Dzahn: "the commands don't write to logfiles and don't redirect stdout.
seems like that would create mail spam to root@?" [puppet] - https://gerrit.wikimedia.org/r/579602 (owner: Thcipriani) [13:39:18] PROBLEM - PHP opcache health on mw2245 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:35] (PS1) Elukey: alternatives::java: fix java paths [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) [13:45:46] (PS7) Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - https://gerrit.wikimedia.org/r/598280 [13:47:52] (CR) Alexandros Kosiaris: [C: -1] "Minor nitpick comment, the rest LGTM, feel free to merge and deploy after that." (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: Mholloway) [13:47:57] (CR) Ottomata: "> Patch Set 2:" [deployment-charts] - https://gerrit.wikimedia.org/r/602087 (https://phabricator.wikimedia.org/T252675) (owner: Ottomata) [13:49:03] (PS3) Paladox: gerrit: Unset gerrit::service::ipv6 for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/602686 [13:49:28] (PS4) Dzahn: gerrit (cloud): Unset gerrit::service::ipv6 for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/602686 (owner: Paladox) [13:49:47] (CR) Ottomata: "Huh, weird. Maybe we should use the 1.X path for update-alternatives too?"
[puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [13:50:20] (CR) Dzahn: [C: +2] gerrit (cloud): Unset gerrit::service::ipv6 for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/602686 (owner: Paladox) [13:52:15] (CR) Elukey: "elukey@druid1006:~$ /usr/bin/update-alternatives --query java" [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [13:54:03] (PS1) Paladox: gerrit: Make ipv6 optional (for the cloud) [puppet] - https://gerrit.wikimedia.org/r/602690 [13:58:58] (PS4) Mholloway: Mobileapps: Add initial helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) [13:59:12] (CR) Mholloway: Mobileapps: Add initial helmfile stanzas (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: Mholloway) [13:59:31] (CR) Ottomata: [C: +1] "COOL!" [puppet] - https://gerrit.wikimedia.org/r/602648 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [14:02:20] (CR) Alexandros Kosiaris: "Thanks!" (11 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov) [14:03:11] (CR) Ottomata: [C: +1] "OOok!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602646 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [14:04:01] RECOVERY - PHP opcache health on mw2245 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:04:11] (CR) Ottomata: "HM.. weird. ok."
[puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [14:04:16] (CR) Ottomata: [C: +1] alternatives::java: fix java paths [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [14:05:09] (CR) Alexandros Kosiaris: parsoid: added support egress rules parsoid: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in values. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/597789 (owner: Apakhomov) [14:05:12] (PS4) Alexandros Kosiaris: parsoid: added support egress rules parsoid: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in values. [deployment-charts] - https://gerrit.wikimedia.org/r/597789 (owner: Apakhomov) [14:05:44] (CR) Alexandros Kosiaris: [C: +2] parsoid: added support egress rules parsoid: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in values. [deployment-charts] - https://gerrit.wikimedia.org/r/597789 (owner: Apakhomov) [14:06:04] Operations, Maps, Wikimedia-Logstash, observability, and 4 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (fgiunchedi) >>! In T222377#6196289, @Mholloway wrote: > Hi @fgiunchedi, thanks for the reviews! What's the minimum version o... [14:06:06] (PS3) Alexandros Kosiaris: chromium-render: added support egress rules chromium-render: Created symlink _helpers.tpl from common templates [deployment-charts] - https://gerrit.wikimedia.org/r/597785 (owner: Apakhomov) [14:06:10] (Merged) jenkins-bot: parsoid: added support egress rules parsoid: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in values.
[deployment-charts] - https://gerrit.wikimedia.org/r/597789 (owner: Apakhomov) [14:06:28] (CR) RLazarus: [C: +1] "Sorry for not catching this when test_miscweb was added! One nit inline, feel free to merge without another round of review." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602619 (owner: Dzahn) [14:06:40] (CR) Alexandros Kosiaris: [C: +2] "Nice! Thanks" [deployment-charts] - https://gerrit.wikimedia.org/r/597785 (owner: Apakhomov) [14:07:06] (Merged) jenkins-bot: chromium-render: added support egress rules chromium-render: Created symlink _helpers.tpl from common templates [deployment-charts] - https://gerrit.wikimedia.org/r/597785 (owner: Apakhomov) [14:07:59] (PS3) Alexandros Kosiaris: mathoid: added support egress rules mathoid: deleted _policy_helper.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/597777 (owner: Apakhomov) [14:08:15] (CR) jerkins-bot: [V: -1] mathoid: added support egress rules mathoid: deleted _policy_helper.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/597777 (owner: Apakhomov) [14:11:37] (PS1) Jbond: CI: add CI to check shell scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) [14:11:39] (PS1) Jbond: CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) [14:12:03] (CR) Jbond: [C: +2] statistics: fix shelcheck issues in hardsync [puppet] - https://gerrit.wikimedia.org/r/602648 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [14:12:46] (CR) jerkins-bot: [V: -1] CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [14:12:55] (CR) Jbond: [C: +2] confluent: fix shellcheck issues [puppet] - https://gerrit.wikimedia.org/r/602646
(https://phabricator.wikimedia.org/T254480) (owner: Jbond) [14:13:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_puppetdb site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:14:51] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:02] Operations, Maps, Wikimedia-Logstash, observability, and 4 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (Mholloway) Great, thanks @fgiunchedi! It looks like we currently have service-runner@2.7.3 in production for both services. [14:19:06] Operations, observability, User-MoritzMuehlenhoff, Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (fgiunchedi) Yeah MCE got it, similar logs than parent task but from centrallog ` May 27 20:19:45 db1138 kernel: [33134188.608450] mce: [Hardware Error]... [14:19:17] PROBLEM - PHP opcache health on mw2251 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:19:37] Operations, CommRel-Specialists-Support (Jul-Sep-2020): CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (RLazarus) Thanks for checking -- not sure yet, but as we're planning out Q1 on our side too, I'm starting to take everyone's temperature about it. I'll let...
[14:23:04] (PS1) Paladox: gerrit: Set default for value for ipv6 (undef) [puppet] - https://gerrit.wikimedia.org/r/602696 [14:23:32] (PS2) Paladox: gerrit: Set default for value for ipv6 (undef) [puppet] - https://gerrit.wikimedia.org/r/602696 [14:23:54] (CR) Gehel: alternatives::java: fix java paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [14:24:19] (PS3) Paladox: gerrit: Set default for value for ipv6 (undef) [puppet] - https://gerrit.wikimedia.org/r/602696 [14:24:23] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/602696 (owner: Paladox) [14:24:45] RECOVERY - PHP opcache health on mw2251 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:26:56] (PS4) Paladox: gerrit: Set default for value for ipv6 (undef) [puppet] - https://gerrit.wikimedia.org/r/602696 [14:27:01] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/602696 (owner: Paladox) [14:27:17] RECOVERY - Ensure legal html en.wb on en.wikibooks.org is OK: all html is present. https://phabricator.wikimedia.org/project/members/28/ [14:27:47] \o/ [14:27:56] (CR) Alexandros Kosiaris: [C: -1] mathoid: added support egress rules mathoid: deleted _policy_helper.tpl (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/597777 (owner: Apakhomov) [14:28:14] (PS4) Alexandros Kosiaris: mediawiki-dev: added support egress rules mediawiki-dev: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in values.
[deployment-charts] - https://gerrit.wikimedia.org/r/597787 (owner: Apakhomov) [14:30:00] (CR) Elukey: alternatives::java: fix java paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [14:30:57] (CR) Elukey: "I am going to add some documentation about the assumptions so it will be clearer in the future why this was done." [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [14:32:58] (CR) Jbond: [C: +2] gerrit: Set default for value for ipv6 (undef) [puppet] - https://gerrit.wikimedia.org/r/602696 (owner: Paladox) [14:33:40] (CR) Alexandros Kosiaris: [C: +2] mediawiki-dev: added support egress rules mediawiki-dev: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in [deployment-charts] - https://gerrit.wikimedia.org/r/597787 (owner: Apakhomov) [14:33:46] (CR) Alexandros Kosiaris: [C: +2] "thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/597787 (owner: Apakhomov) [14:34:02] (CR) Gehel: [C: +1] "good enough!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [14:34:09] (Merged) jenkins-bot: mediawiki-dev: added support egress rules mediawiki-dev: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in values.
[deployment-charts] - https://gerrit.wikimedia.org/r/597787 (owner: Apakhomov) [14:34:15] (PS2) Elukey: alternatives::java: fix java paths [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) [14:36:31] (CR) Elukey: [C: +2] alternatives::java: fix java paths [puppet] - https://gerrit.wikimedia.org/r/602689 (https://phabricator.wikimedia.org/T253553) (owner: Elukey) [14:38:14] (PS3) Mholloway: Chromium-render: Add initial helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) [14:39:20] (CR) Mholloway: Chromium-render: Add initial helmfile stanzas (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: Mholloway) [14:41:28] (PS3) Alexandros Kosiaris: eventstreams: added support egress rules eventstreams: deleted networkpolicy field from values-canary.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/597774 (owner: Apakhomov) [14:43:08] (PS1) Jbond: hiera: use ~ not undef which results in a string [puppet] - https://gerrit.wikimedia.org/r/602703 [14:44:33] (CR) Jbond: [C: +2] hiera: use ~ not undef which results in a string [puppet] - https://gerrit.wikimedia.org/r/602703 (owner: Jbond) [14:46:28] (CR) Alexandros Kosiaris: [C: +2] eventstreams: added support egress rules eventstreams: deleted networkpolicy field from values-canary.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/597774 (owner: Apakhomov) [14:46:55] (Merged) jenkins-bot: eventstreams: added support egress rules eventstreams: deleted networkpolicy field from values-canary.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/597774 (owner: Apakhomov) [14:47:32] (PS1) Mholloway: Include ::profile::rsyslog::udp_localhost_compat in OSM common role [puppet] - https://gerrit.wikimedia.org/r/602704
(https://phabricator.wikimedia.org/T222377) [14:47:44] (03PS4) 10Alexandros Kosiaris: eventgate: added support egress rules eventgate: Deleted networkpolicy field from values-canary.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/597772 (owner: 10Apakhomov) [14:48:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate: added support egress rules eventgate: Deleted networkpolicy field from values-canary.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/597772 (owner: 10Apakhomov) [14:49:23] (03Merged) 10jenkins-bot: eventgate: added support egress rules eventgate: Deleted networkpolicy field from values-canary.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/597772 (owner: 10Apakhomov) [14:50:54] (03PS1) 10Alexandros Kosiaris: Bump chart versions for netpol bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/602706 [14:51:27] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10Andrew) Thank you @jbond! 
[15:07:09] PROBLEM - PHP opcache health on mw2220 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:13:36] (03CR) 10Elukey: [C: 03+2] Add puppet configuration for matomo1002 [puppet] - 10https://gerrit.wikimedia.org/r/602637 (https://phabricator.wikimedia.org/T252742) (owner: 10Elukey) [15:13:44] (03PS2) 10Elukey: Add puppet configuration for matomo1002 [puppet] - 10https://gerrit.wikimedia.org/r/602637 (https://phabricator.wikimedia.org/T252742) [15:17:39] (03PS2) 10Privacybatm: transferpy: Remove wmfmariadbpy package [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) [15:22:53] 10Operations, 10Analytics, 10netops: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (10CDanis) Hopefully Arzhel understands this better than I do, but here's my rough understanding: * 'Normally' routers send netflow just on the end of a flow... [15:23:09] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10jbond) Originally i wanted to do this checking in the Rake checks as that feels like the right place for them. however i cant... 
[15:24:33] (03PS1) 10Filippo Giunchedi: prometheus: enable Thanos upload for k8s [puppet] - 10https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186) [15:24:35] (03PS1) 10Filippo Giunchedi: prometheus: enable Thanos upload for services [puppet] - 10https://gerrit.wikimedia.org/r/602716 (https://phabricator.wikimedia.org/T252186) [15:24:39] (03PS1) 10Filippo Giunchedi: prometheus: enable Thanos upload for ops in esams [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186) [15:25:11] RECOVERY - PHP opcache health on mw2220 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:26:22] (03PS1) 10CDanis: pmacct: set nfacctd_time_new for accurate long-lived flow times [puppet] - 10https://gerrit.wikimedia.org/r/602718 (https://phabricator.wikimedia.org/T254574) [15:27:29] (03PS1) 10Privacybatm: Write documentation using Sphinx [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) [15:30:50] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10hashar) [15:31:04] 10Operations, 10SRE-tools, 10Continuous-Integration-Config, 10Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10hashar) [15:31:53] 10Operations, 10SRE-tools, 10Continuous-Integration-Config, 10Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10hashar) A few years later, this an effort to run shell scripts validation via T254480 including scripts generated through .erb templates. 
[15:32:31] (03Abandoned) 10Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [15:35:08] (03CR) 10Ayounsi: [C: 03+1] "great topic." [puppet] - 10https://gerrit.wikimedia.org/r/602718 (https://phabricator.wikimedia.org/T254574) (owner: 10CDanis) [15:37:14] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10jbond) Another option is we could just ban templated scripts and add CI to reject any erb file with a shebang in it. This wo... [15:37:40] (03PS2) 10Privacybatm: Write documentation using Sphinx [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) [15:38:15] PROBLEM - PHP opcache health on mw2364 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:39:11] (03CR) 10CDanis: [C: 03+2] pmacct: set nfacctd_time_new for accurate long-lived flow times [puppet] - 10https://gerrit.wikimedia.org/r/602718 (https://phabricator.wikimedia.org/T254574) (owner: 10CDanis) [15:39:41] !log disabling puppet on netflow* and trying I6598d8f8 on netflow3001 first [15:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:48] !log disabling puppet on netflow* and trying I6598d8f8 on netflow3001 first T254574 [15:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:51] T254574: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 [15:44:40] (03PS3) 10Privacybatm: Write documentation using Sphinx [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) [15:50:51] RECOVERY - PHP opcache health on mw2364 is 
OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:54:07] !log enabling & rerunning puppet on netflow* T254574 [15:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:11] T254574: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 [15:54:44] (03PS1) 10Paladox: letsencrypt: Sync acme-tiny upstream [puppet] - 10https://gerrit.wikimedia.org/r/602722 [15:55:48] (03PS2) 10Paladox: letsencrypt: Sync acme-tiny upstream [puppet] - 10https://gerrit.wikimedia.org/r/602722 [16:02:53] (03CR) 10Paladox: "@ Vgutierrez or @Krenair could you review this please? I'm not entirely sure if we added hacks over the year... when i used this for anoth" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [16:09:31] 10Operations, 10Analytics, 10netops, 10Patch-For-Review: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (10CDanis) 05Open→03Resolved The change looks effective: previously, GRE traffic was predominantly reported as ginormous-to-the-poin... 
[16:09:38] !log Updated puppet CI jobs to add shellcheck to the environment https://gerrit.wikimedia.org/r/#/c/integration/config/+/602728/ # T254480 [16:10:09] (03CR) 10Eevans: [C: 04-1] Extend Cassandra cookbook to also cover maps (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [16:13:06] (03PS1) 10Cwhite: profile: add loki_event filter script [puppet] - 10https://gerrit.wikimedia.org/r/602729 (https://phabricator.wikimedia.org/T222826) [16:13:16] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10hashar) [16:13:52] 10Operations, 10Analytics, 10netops, 10Patch-For-Review: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (10CDanis) As a last note: this should improve the accuracy of //all// long-lived flows (any with a duration longer than a minute); the... [16:14:22] is stashbot down? hashar’s message doesn’t seem to have made it into the SAL, for instance [16:14:41] do [16:15:15] yeah never on https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2020-06-05 [16:15:20] or in https://sal.toolforge.org/production [16:15:20] :( [16:15:33] !log will it log! 
[16:15:46] yes, it is [16:15:50] looks like stashbot reconnected (as stashbot_) at 1606Z [16:16:15] 10Operations, 10Analytics, 10netops, 10Patch-For-Review: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (10JAllemandou) Great explanation @CDanis - This change should also resolve hour late-data issue - One stone two birds :) [16:16:19] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:16:37] 10Operations, 10MachineVision, 10Product-Infrastructure-Team-Backlog, 10Structured-Data-Backlog, and 3 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Lazy-restless) [16:19:21] (03PS1) 10Jbond: puppet-merge: split dynamic values out of puppet-merge script [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) [16:20:22] (03CR) 10CDanis: [C: 03+1] monitoring: bail on check_command containing newlines [puppet] - 10https://gerrit.wikimedia.org/r/602669 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:20:52] (03PS2) 10Jbond: puppet-merge: split dynamic values out of puppet-merge script [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) [16:21:03] !log roundtrip testing stashdot [16:21:36] well it is broken somehow but I am already multitasking several things and can't look into it [16:23:47] PROBLEM - PHP opcache health on mw2290 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:24:10] hashar: everything on toolforge connected to irc (bar wm-bot2) seems to have died [16:24:30] (03PS2) 10Dzahn: httpbb: add test_miscweb snippet to profile [puppet] - 10https://gerrit.wikimedia.org/r/602619 [16:24:32] (03PS1) 10Dzahn: planet: upgrade feeds from http to https 
detected by script [puppet] - 10https://gerrit.wikimedia.org/r/602733 (https://phabricator.wikimedia.org/T168459) [16:24:34] (03PS1) 10Dzahn: planet: fix detection of working https URLs in check_https.sh [puppet] - 10https://gerrit.wikimedia.org/r/602734 [16:25:14] * bd808 goes to look at stashbot errors [16:26:30] (03CR) 10Dzahn: [C: 03+2] httpbb: add test_miscweb snippet to profile [puppet] - 10https://gerrit.wikimedia.org/r/602619 (owner: 10Dzahn) [16:29:10] (03CR) 10Dzahn: [C: 03+2] planet: upgrade feeds from http to https detected by script [puppet] - 10https://gerrit.wikimedia.org/r/602733 (https://phabricator.wikimedia.org/T168459) (owner: 10Dzahn) [16:29:22] !log Testing stashbot following hard restart of service. It was having LDAP connection failure problems. [16:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:37] back for now at least folks [16:31:19] (03CR) 10Dzahn: httpbb: add test_miscweb snippet to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602619 (owner: 10Dzahn) [16:32:15] (03Abandoned) 10Milimetric: [WIP] Migrate pagecounts-ez to hadoop [puppet] - 10https://gerrit.wikimedia.org/r/562079 (https://phabricator.wikimedia.org/T192474) (owner: 10Milimetric) [16:32:51] (03PS2) 10Dzahn: planet: fix detection of working https URLs in check_https.sh [puppet] - 10https://gerrit.wikimedia.org/r/602734 [16:33:16] (03CR) 10Dzahn: [C: 03+2] planet: fix detection of working https URLs in check_https.sh [puppet] - 10https://gerrit.wikimedia.org/r/602734 (owner: 10Dzahn) [16:33:26] (03CR) 10Herron: [C: 03+1] "Agreed re: the ideal, but breaking the puppet run instead of the icinga config is a big improvement by itself." [puppet] - 10https://gerrit.wikimedia.org/r/602669 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:38:36] (03CR) 10BearND: "Thank you for addressing my questions. I have one more inline." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [16:43:06] 10Operations, 10observability: Making centrallog syslog easier and faster to work with - https://phabricator.wikimedia.org/T254605 (10herron) p:05Triage→03Medium [16:43:36] 10Operations, 10observability, 10User-MoritzMuehlenhoff, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10CDanis) That was for the fatal, I was wondering about the earlier correctable errors. But I just checked on centrallog1001 and I didn't find any eviden... [16:43:41] RECOVERY - PHP opcache health on mw2290 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:43:49] 10Operations, 10observability: Making centrallog syslog easier and faster to work with - https://phabricator.wikimedia.org/T254605 (10herron) Adding this patch retroactively (the per-hostname split is already done) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/601836/ [16:44:24] (03PS2) 10Herron: centrallog: update mtail syslog file locations [puppet] - 10https://gerrit.wikimedia.org/r/602470 (https://phabricator.wikimedia.org/T254605) [16:44:52] (03PS1) 10Jbond: puppet-merge: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) [16:45:13] !log elukey@deploy1001 Started deploy [analytics/turnilo/deploy@f7e4f78]: Upgrade to 1.24.0 [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:23] !log elukey@deploy1001 Finished deploy [analytics/turnilo/deploy@f7e4f78]: Upgrade to 1.24.0 (duration: 00m 11s) [16:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:30] (03CR) 10Herron: [C: 03+2] centrallog: update mtail syslog file locations [puppet] - 10https://gerrit.wikimedia.org/r/602470 (https://phabricator.wikimedia.org/T254605) 
(owner: 10Herron) [16:45:54] (03PS2) 10Jbond: puppet-merge: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) [16:47:16] (03PS2) 10Dzahn: statistics: add data types [puppet] - 10https://gerrit.wikimedia.org/r/602614 [16:47:18] (03CR) 10Jbond: puppet-merge: fix shellcheck issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:48:45] (03PS3) 10Dzahn: statistics: add data types [puppet] - 10https://gerrit.wikimedia.org/r/602614 [16:51:41] PROBLEM - piwik.wikimedia.org on matomo1002 is CRITICAL: connect to address 10.64.32.137 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Piwik [16:51:50] this is me --^ [16:51:53] new host [16:53:55] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:59] (03PS4) 10Dzahn: statistics: add data types [puppet] - 10https://gerrit.wikimedia.org/r/602614 [16:58:39] (03Abandoned) 10Dzahn: statistics: add data types [puppet] - 10https://gerrit.wikimedia.org/r/602614 (owner: 10Dzahn) [17:01:45] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:01:52] (03CR) 10Dzahn: [C: 03+2] "going ahead anyways since this is "slave/labs"-only (needs renaming for 2 reasons at once, btw )" [puppet] - 10https://gerrit.wikimedia.org/r/579602 (owner: 10Thcipriani) [17:02:31] (03PS1) 10Elukey: profile::piwik::webserver: support Buster [puppet] - 10https://gerrit.wikimedia.org/r/602742 (https://phabricator.wikimedia.org/T252742) [17:03:28] (03Abandoned) 10Dzahn: icinga: increase tresholds for check_ssl_http_letsencrypt [puppet] - 
10https://gerrit.wikimedia.org/r/594722 (https://phabricator.wikimedia.org/T251726) (owner: 10Dzahn) [17:03:49] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:26] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23041/matomo1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602742 (https://phabricator.wikimedia.org/T252742) (owner: 10Elukey) [17:10:11] RECOVERY - piwik.wikimedia.org on matomo1002 is OK: HTTP OK: Status line output matched HTTP/1.1 401 - 593 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/Piwik [17:14:31] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10diego) Hi These are the groups we need for @YiJuLu: analytics-privatedata-users & gpu-testers Thanks [17:15:59] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10elukey) gpu-testers is not needed anymore if the user is analytics-privatedata-users :) [17:16:01] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:21:58] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10diego) great! gpu's for everybody, power to the people :) !!! 
Thanks @elukey [17:23:53] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Nuria) I think we need the expiration time for access for @YiJuLu , am I correct this is an internship? [17:25:50] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Krinkle) If the local-memcached's blind ttl is around the time we tolerate purges to be delayed for, and i... [17:26:43] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10diego) Yes @Nuria is an internship. End date August 10th. [17:30:41] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Nuria) Approved on my end [17:52:10] (03PS1) 10Cwhite: profile::base: add hardware_monitoring option and disable on out-of-warranty nodes [puppet] - 10https://gerrit.wikimedia.org/r/602751 [17:53:24] (03CR) 10jerkins-bot: [V: 04-1] profile::base: add hardware_monitoring option and disable on out-of-warranty nodes [puppet] - 10https://gerrit.wikimedia.org/r/602751 (owner: 10Cwhite) [17:53:37] (03PS2) 10Cwhite: profile::base: add hardware_monitoring option and disable on out-of-warranty nodes [puppet] - 10https://gerrit.wikimedia.org/r/602751 [17:54:47] (03PS3) 10Cwhite: profile::base: add hardware_monitoring option and set for out-of-warranty nodes [puppet] - 10https://gerrit.wikimedia.org/r/602751 [17:54:49] (03CR) 10jerkins-bot: [V: 04-1] profile::base: add hardware_monitoring option and set for out-of-warranty nodes [puppet] - 10https://gerrit.wikimedia.org/r/602751 (owner: 10Cwhite) [17:54:55] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:01] (03PS1) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) [18:03:30] 10Operations, 10Analytics, 10serviceops, 10vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (10elukey) 05Open→03Resolved a:03elukey [18:04:09] @seen evanpro [18:04:09] mutante: Last time I saw evanpro they were leaving the channel #wikimedia-office at 5/29/2019 10:03:03 PM (372d20h1m5s ago) [18:04:32] (03PS1) 10Andrew Bogott: Replace cumin public key for cloud VMs [labs/private] - 10https://gerrit.wikimedia.org/r/602755 (https://phabricator.wikimedia.org/T254589) [18:06:11] (03Abandoned) 10Privacybatm: transferpy: Package transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [18:06:25] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) [18:10:16] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Replace cumin public key for cloud VMs [labs/private] - 10https://gerrit.wikimedia.org/r/602755 (https://phabricator.wikimedia.org/T254589) (owner: 10Andrew Bogott) [18:15:54] (03PS1) 10Dzahn: admin: shell account for Yi-Ju Lu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/602756 (https://phabricator.wikimedia.org/T254130) [18:19:55] (03PS4) 10Apakhomov: mathoid: added support egress rules mathoid: deleted _policy_helper.tpl mathoid: Restore tls_helpers template which was accidentally deleted [deployment-charts] - 10https://gerrit.wikimedia.org/r/597777 [18:23:00] (03CR) 10CDanis: [C: 03+1] "A few nits but overall looks great" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [18:27:06] (03CR) 10CDanis: [C: 
03+1] puppet-merge: fix shellcheck issues (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [18:27:34] (03CR) 10CDanis: [C: 03+1] puppet-merge: split dynamic values out of puppet-merge script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [18:33:22] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) [18:38:40] 10Operations, 10Traffic, 10serviceops: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Dzahn) 05Open→03Declined [18:45:01] (03CR) 10Alex Monk: "You could do, I think for wikimedia though the way forward is acme-chief" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [18:51:19] (03CR) 10Paladox: "@Krenair acme-chief seems like a complicated setup. Are there any install docs for doing it in the cloud?" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [18:52:48] paladox: I guess Willy E. Coyote will have some ACME instructions, somewhere [18:53:00] Willy E. Coyote? [18:53:32] Yeah, the one that chased the roadrunner [18:53:54] (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/23043/" [puppet] - 10https://gerrit.wikimedia.org/r/602751 (owner: 10Cwhite) [18:54:17] heh [18:58:36] yea, https://en.wikipedia.org/wiki/Acme_Corporation :) [19:01:01] PROBLEM - PHP opcache health on mw2288 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:04:19] I asked in #wikipedia-en but I figure someone here might know the answer. Does anyone know offhand why this article started failing to render reference templates? 
https://en.wikipedia.org/w/index.php?title=List_of_George_Floyd_protests&oldid=960423521 Obviously it's got something to do with how many references there are, but like, is it a template bug, is it a mediawiki bug? Just curious. [19:04:57] (03PS2) 10Krinkle: logging: Combine the three custom Monolog processors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601813 [19:05:46] Oh, it was Wile not Willy - https://en.wikipedia.org/wiki/Wile_E._Coyote_and_the_Road_Runner [19:16:31] (03CR) 10CDanis: [C: 03+1] thanos: add alerts for Thanos components (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602633 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [19:20:55] RECOVERY - PHP opcache health on mw2288 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:23:16] (03CR) 10Alex Monk: "https://wikitech.wikimedia.org/wiki/Acme-chief/Cloud_VPS_setup" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [19:24:06] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10Dzahn) Thanks @dcipoletti I am preparing a patch to create your account. Looks like we are all done here besides code review. Can we just have a quick reason/justification... [19:25:16] (03PS1) 10Dzahn: admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) [19:31:19] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10bd808) [19:34:38] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10dr0ptp4kt) The justification for the access is for reviewing data related to Product Infrastructure and Web features. 
[19:40:45] (03CR) 10RLazarus: [C: 03+1] admin: shell account for Yi-Ju Lu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/602756 (https://phabricator.wikimedia.org/T254130) (owner: 10Dzahn) [19:40:54] phuzion: my guess is https://en.wikipedia.org/wiki/Wikipedia:Template_limits and explicitly https://en.wikipedia.org/wiki/Help:Template#Expand_limits [19:42:04] phuzion: I can see "Post‐expand include size: 2097127/2097152 bytes" in the raw html source which I think confirms that [19:43:36] (03PS1) 10Herron: centrallog: disable mtail fsnotify, increase fd limit & simplify glob [puppet] - 10https://gerrit.wikimedia.org/r/602764 (https://phabricator.wikimedia.org/T254605) [19:43:41] phuzion: so it really needs subst and/or content splitting to reduce the raw template count for each page render [19:44:07] bd808: I mean, the solution the community seems to be going with is splitting the largest sections into their own individual articles. [19:44:25] I think subst-ing {{cite web}} would be a disaster, tbh. [19:44:45] (03CR) 10jerkins-bot: [V: 04-1] centrallog: disable mtail fsnotify, increase fd limit & simplify glob [puppet] - 10https://gerrit.wikimedia.org/r/602764 (https://phabricator.wikimedia.org/T254605) (owner: 10Herron) [19:45:13] phuzion: *nod* some templates are not a great candidate for subst for sure [19:46:40] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10Volans) Sorry I'm late to this party, just noticed the task. I actually think that this is t... [19:46:47] (03PS2) 10Herron: centrallog: disable mtail fsnotify, increase fd limit & simplify glob [puppet] - 10https://gerrit.wikimedia.org/r/602764 (https://phabricator.wikimedia.org/T254605) [19:48:20] (03CR) 10RLazarus: "Thanks for picking this up! 
One complaint inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn) [19:49:41] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: automated linting/analysis/other CI of Python/shell scripts generated by ERB - https://phabricator.wikimedia.org/T254480 (10CDanis) @Volans >>! In T254480#6196820, @jbond wrote: > Another option is we could just ba... [19:52:25] (03CR) 10Herron: "PCC including mx1001 as a randomly selected system that also runs mtail https://puppet-compiler.wmflabs.org/compiler1001/23046/" [puppet] - 10https://gerrit.wikimedia.org/r/602764 (https://phabricator.wikimedia.org/T254605) (owner: 10Herron) [19:56:09] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602764 (https://phabricator.wikimedia.org/T254605) (owner: 10Herron) [20:01:26] (03CR) 10Herron: [C: 03+2] centrallog: disable mtail fsnotify, increase fd limit & simplify glob [puppet] - 10https://gerrit.wikimedia.org/r/602764 (https://phabricator.wikimedia.org/T254605) (owner: 10Herron) [20:01:29] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10CDanis) [20:03:15] (03PS5) 10Cwhite: profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) [20:03:42] (03PS6) 10Cwhite: profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) [20:05:04] (03CR) 10jerkins-bot: [V: 04-1] profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:08:34] 10Operations, 
10ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T254392 (10wiki_willy) a:03Papaul [20:12:27] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10Volans) @CDanis yeah, sorry I noticed that after replyi... [20:12:51] (03PS7) 10Cwhite: profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) [20:13:18] 10Operations, 10DC-Ops, 10decommission, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (10wiki_willy) Pickup date for remaining Cisco servers at eqiad has been set for June 16. @Jclark-ctr to work with Equinix in prepping for the pickup date. Thanks, Willy [20:14:35] (03CR) 10jerkins-bot: [V: 04-1] profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:18:01] (03PS8) 10Cwhite: profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) [20:18:05] PROBLEM - PHP opcache health on mw2261 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:18:33] (03CR) 10Cwhite: profile: add loki output support to the logstash pipeline (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:19:35] (03CR) 10jerkins-bot: [V: 04-1] profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:22:35] 10Operations, 10Puppet, 
Continuous-Integration-Config, Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (ArielGlenn) I agree. Example: the first attempt at ./mo... [20:34:48] (PS6) Cwhite: wmflib: add systemd.timer OnCalendar support to cron_splay [puppet] - https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) [20:36:11] RECOVERY - PHP opcache health on mw2261 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:38:54] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (jbond) >>! In T254480#6197931, @Volans wrote: > @CDanis... [20:40:49] (PS7) Cwhite: wmflib: add systemd.timer OnCalendar support to cron_splay [puppet] - https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) [20:48:25] (PS8) Cwhite: wmflib: add systemd.timer OnCalendar support to cron_splay [puppet] - https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) [20:52:32] (CR) Cwhite: "updated current use surfaced some bugs" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) (owner: Cwhite) [20:54:50] (PS2) CDanis: Systemd::Servicename: make it reflect reality e.g.
php7.2-fpm [puppet] - https://gerrit.wikimedia.org/r/601460 [20:54:52] (PS5) CDanis: textfile exporter for php-fpm worker pool status [puppet] - https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) [20:55:25] (CR) CDanis: "Made this clean up after itself on failures (reusing code from other textfile exporter scripts), and also made it handle fractional report" [puppet] - https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: CDanis) [20:56:19] (CR) BryanDavis: [C: +1] "The quoting should not matter in practice as this is an echo call not a directory path being traversed or a cli argument that will get con" [puppet] - https://gerrit.wikimedia.org/r/602649 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [20:58:30] (PS6) CDanis: textfile exporter for php-fpm worker pool status [puppet] - https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) [20:58:47] (CR) CDanis: "fixed a few things, will merge Monday if no further comments" [puppet] - https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: CDanis) [20:59:01] (CR) CDanis: "will self-merge Monday if there are still no comments" [puppet] - https://gerrit.wikimedia.org/r/601460 (owner: CDanis) [21:02:34] (PS1) Jbond: Example: build script in line in puppet [puppet] - https://gerrit.wikimedia.org/r/602771 (https://phabricator.wikimedia.org/T254480) [21:02:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:04:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:05:45] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (jbond) >>! In T254480#6197961, @ArielGlenn wrote: > I a... [21:14:37] (PS9) Cwhite: profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) [21:15:30] (PS2) Jbond: Example: build script in line in puppet [puppet] - https://gerrit.wikimedia.org/r/602771 (https://phabricator.wikimedia.org/T254480) [21:15:59] (CR) jerkins-bot: [V: -1] profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite) [21:17:43] (PS3) Jbond: Example: build script in line in puppet [puppet] - https://gerrit.wikimedia.org/r/602771 (https://phabricator.wikimedia.org/T254480) [21:18:03] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (ArielGlenn) @jbond I added the author of the one patch... [21:20:05] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/602771 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [21:28:19] (PS6) Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - https://gerrit.wikimedia.org/r/549222 [21:30:32] (Abandoned) Ppchelko: [RESTRouter] Switch event service to eventgate.
[deployment-charts] - https://gerrit.wikimedia.org/r/524060 (https://phabricator.wikimedia.org/T524055) (owner: Ppchelko) [21:43:23] (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [21:49:10] (PS3) Apakhomov: kask: added support egress rules kask: added support dst_ports rules without adding cidr field. Added missing rule [deployment-charts] - https://gerrit.wikimedia.org/r/597776 [22:11:24] (PS2) Jbond: CI: add CI to check shell scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) [22:11:44] (CR) jerkins-bot: [V: -1] CI: add CI to check shell scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:11:51] (PS2) Jbond: CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) [22:12:12] (CR) jerkins-bot: [V: -1] CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:15:59] (PS3) Jbond: CI: add CI to check shell scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) [22:17:04] (PS3) Jbond: CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) [22:17:21] (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:17:24] (CR) jerkins-bot: [V: -1] CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:21:09] (PS4) Jbond: CI: add CI to check shell
scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) [22:21:38] (PS4) Jbond: CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) [22:21:59] (CR) jerkins-bot: [V: -1] CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:24:40] (PS5) Jbond: CI: add CI to check shell scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) [22:25:21] jbond42: ;))) [22:25:35] (PS5) Jbond: CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) [22:25:49] (PS6) Jbond: CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) [22:25:57] (CR) jerkins-bot: [V: -1] CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:26:21] (CR) jerkins-bot: [V: -1] CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:28:19] (PS6) Jbond: CI: add CI to check shell scripts [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) [22:28:53] (PS7) Jbond: CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694 (https://phabricator.wikimedia.org/T254480) [22:29:15] (CR) jerkins-bot: [V: -1] CI: add some shell scripts to test the new shellcheck CI check [puppet] - https://gerrit.wikimedia.org/r/602694
(https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:34:25] (CR) Hashar: "I like it a lot (notably the blue color, that home dirs are excluded)" [puppet] - https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [22:55:06] Operations, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 3 others: Some (recent?) uploads to Commons are not available on other wikis - https://phabricator.wikimedia.org/T253405 (Krinkle) Open→Resolved PROBLEM - PHP opcache health on mw2366 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:43:07] RECOVERY - PHP opcache health on mw2366 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health