[00:01:02] (CR) Bstorm: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[00:02:27] (PS6) Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300)
[00:03:32] (CR) Dzahn: [V: +1] "This uses 2 different instances in the wikistats project as an example just to show the opt-in works: note how on one instance the timer i" [puppet] - https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: Dzahn)
[00:05:37] (PS1) Dzahn: base::labs: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/635905
[00:08:14] (PS2) Dzahn: base::labs: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/635905
[00:13:13] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:13:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:14:31] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/26090/" [puppet] - https://gerrit.wikimedia.org/r/633835 (owner: Dzahn)
[00:14:51] (CR) Dzahn: "toolsbeta only and noop there as well" [puppet] - https://gerrit.wikimedia.org/r/633835 (owner: Dzahn)
[00:31:03] (PS4) Dzahn: gerrit: replace cron jobs with systemd timers (WIP) [puppet] - https://gerrit.wikimedia.org/r/633857
[00:31:06] (CR) jerkins-bot: [V: -1] gerrit: replace cron jobs with systemd timers (WIP) [puppet] - https://gerrit.wikimedia.org/r/633857 (owner: Dzahn)
[00:32:08] (CR) Dzahn: gerrit: replace cron jobs with systemd timers (WIP) (3 comments) [puppet] -
https://gerrit.wikimedia.org/r/633857 (owner: Dzahn)
[00:34:04] (PS5) Dzahn: gerrit: replace cron jobs with systemd timers (WIP) [puppet] - https://gerrit.wikimedia.org/r/633857
[00:34:51] (CR) Dzahn: [C: -2] "draft, amending tomorrow" [puppet] - https://gerrit.wikimedia.org/r/633857 (owner: Dzahn)
[00:36:52] Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, iOS-app-Bonefish-On-A-Balloon: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Tsevener)
[00:40:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:41:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:09:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:11:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:31:59] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[01:54:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:00:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:06:29] PROBLEM - Check the last execution of package_builder_Clean_up_build_directory on deneb is CRITICAL: CRITICAL: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:07:11] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:09] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:46:25] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 23613 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:04:29] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:16] (CR) Zhuyifei1999: "By wall clock time I mean the time since process started. We are generally concerned with processes in D state too much or R state too muc" [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[03:25:05] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:59] (CR) Bstorm: "Oh! I get what you mean then. I don't want to go after things in D state (unless it is a very long time) because other processes can easil" [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[03:29:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:30:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:11:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:14:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:33:13] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:33:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:40:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_proton_cluster_eqiad,swagger_check_restbase_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:44:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:49:15] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (deploy1002, ...), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:04:33] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:04:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:31] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:18:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:38:31] Operations, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Marostegui) p:Triage→Medium
[05:38:41] Operations, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Marostegui) p:Triage→Medium
[05:45:47] Operations, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Marostegui)
[05:48:30] Operations, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Marostegui)
[05:50:01] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:53:32] Operations, Analytics, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Marostegui)
[05:53:46] Operations, Analytics, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Marostegui) @KFrancis can you confirm if @Rmaung has a valid NDA signed? I cannot see it on the NDA tracking sheet.
[05:54:50] Operations, Analytics, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Marostegui)
[05:55:05] Operations, Analytics, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Marostegui)
[05:59:26] Operations, Analytics, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Marostegui) Confirmed that @rmaung is staff by checking via ldap-corp. @Rmaung we'd also need your manager to sign off this re...
[05:59:38] Operations, Analytics, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Marostegui)
[06:00:51] Operations, Analytics, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Marostegui)
[06:01:02] Operations, Analytics, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Marostegui) Confirmed janstee@wikimedia.org via ldap corp as staff. @JAnstee_WMF we'd need your manager to sign this off. Thanks!
[06:04:33] (PS2) Elukey: hadoop: final clean up after the decommission of old nodes [puppet] - https://gerrit.wikimedia.org/r/635750 (https://phabricator.wikimedia.org/T255140)
[06:05:45] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:06] (CR) Elukey: [C: +2] hadoop: final clean up after the decommission of old nodes [puppet] - https://gerrit.wikimedia.org/r/635750 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[06:24:04] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:52] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:23] (CR) Muehlenhoff: ntp: hiera->lookup, data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635666 (owner: Dzahn)
[06:37:32] (CR) Muehlenhoff: debmonitor::client: hiera->lookup, add data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635665 (owner: Dzahn)
[06:37:59] (PS2) Muehlenhoff: autoinstall: Also use mirrors.wikimedia.org for publi/esams [puppet] - https://gerrit.wikimedia.org/r/630876 (https://phabricator.wikimedia.org/T158562)
[06:38:24] (PS3) Muehlenhoff: autoinstall: Also use mirrors.wikimedia.org for public/esams [puppet] - https://gerrit.wikimedia.org/r/630876 (https://phabricator.wikimedia.org/T158562)
[06:48:31] (CR) Muehlenhoff: [C: +2] autoinstall: Also use mirrors.wikimedia.org for public/esams [puppet] - https://gerrit.wikimedia.org/r/630876 (https://phabricator.wikimedia.org/T158562) (owner: Muehlenhoff)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201023T0700)
[07:04:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:05:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:08:14] (CR) Ayounsi: [C: +1] rpkivalidator: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/635664 (owner: Dzahn)
[07:15:00] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/635356 (owner: Jbond)
[07:21:05] (Abandoned) Muehlenhoff: mariadb::config: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/522496 (owner: Muehlenhoff)
[07:21:35] (Abandoned) Muehlenhoff: Enable Kerberos for Druid/www (WIP) [puppet] - https://gerrit.wikimedia.org/r/474673 (owner: Muehlenhoff)
[07:21:50] (Abandoned) Muehlenhoff: Enable Kerberos for Druid workers (WIP) [puppet] - https://gerrit.wikimedia.org/r/474672 (owner: Muehlenhoff)
[07:25:16] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:26] Operations, Patch-For-Review, User-jbond: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562 (MoritzMuehlenhoff) Open→Resolved /etc/apt/sources.list is managed by Puppet since a few weeks in production, closing the task (for Cloud VPS it's being considered to also enabl...
[07:31:14] (Restored) Giuseppe Lavagetto: Add --force flag to safe-service-restart.py [puppet] - https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009) (owner: Ahmon Dancy)
[07:33:41] (CR) Giuseppe Lavagetto: [C: +1] Add --force flag to safe-service-restart.py [puppet] - https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009) (owner: Ahmon Dancy)
[07:36:38] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:41:30] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:59:46] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 73 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:01:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 68 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:04:34] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 49 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:10:30] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 39 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:17:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:19:08] Operations: Decom cookbook should also remove keytabs - https://phabricator.wikimedia.org/T266314 (MoritzMuehlenhoff)
[08:19:21] Operations, User-MoritzMuehlenhoff: Decom cookbook should also remove keytabs - https://phabricator.wikimedia.org/T266314 (MoritzMuehlenhoff)
[08:19:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:24:18] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:31] (PS2) Alexandros Kosiaris: Add new java images [docker-images/production-images] - https://gerrit.wikimedia.org/r/635743 (https://phabricator.wikimedia.org/T265504)
[08:40:31] (PS1) Marostegui: site.pp: Do not use clouddb1020 yet [puppet] - https://gerrit.wikimedia.org/r/635949
[08:41:29] (CR) Marostegui: [C: +2] site.pp: Do not use clouddb1020 yet [puppet] - https://gerrit.wikimedia.org/r/635949 (owner: Marostegui)
[08:42:29] (CR) Muehlenhoff: "JFTR, we also have Java 8 for Buster if needed (various clusters (ELK, Hadoop) needed it since all nodes of a cluster need the same Java o" [docker-images/production-images] - https://gerrit.wikimedia.org/r/635743 (https://phabricator.wikimedia.org/T265504) (owner: Alexandros Kosiaris)
[08:54:06] (CR) Ema: [C: +1] "Looks great!"
[debs/trafficserver] - https://gerrit.wikimedia.org/r/635842 (https://phabricator.wikimedia.org/T265911) (owner: Vgutierrez)
[08:54:32] Operations, User-MoritzMuehlenhoff: Decom cookbook should also remove keytabs - https://phabricator.wikimedia.org/T266314 (Marostegui) p:Triage→Medium
[08:55:15] (PS1) Muehlenhoff: Remove Stretch-based LDAP replicas from conftool [puppet] - https://gerrit.wikimedia.org/r/635951 (https://phabricator.wikimedia.org/T264388)
[09:09:27] !log upgrading spicerack to 0.0.44 on cumin hosts - T257905
[09:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:41] T257905: Spin off common Spicerack modules into a standalone Python library importable anywhere - https://phabricator.wikimedia.org/T257905
[09:09:46] (CR) Jbond: [C: +2] dns::auth::acmechief_target: hiera->lookup, data type [puppet] - https://gerrit.wikimedia.org/r/635661 (owner: Dzahn)
[09:14:23] (CR) Jbond: rpkivalidator: hiera->lookup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635664 (owner: Dzahn)
[09:18:19] (CR) JMeybohm: [V: +2 C: +2] Initial commit of eventrouter docker image [docker-images/production-images] - https://gerrit.wikimedia.org/r/634985 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[09:18:24] (CR) DCausse: "> Patch Set 2:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/635743 (https://phabricator.wikimedia.org/T265504) (owner: Alexandros Kosiaris)
[09:18:39] (CR) JMeybohm: [C: +2] wikifeeds: Increase envoy CPU and memory ressources [deployment-charts] - https://gerrit.wikimedia.org/r/635753 (https://phabricator.wikimedia.org/T266194) (owner: JMeybohm)
[09:19:06] (PS1) Elukey: hive: set new kerberos principal for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/635955 (https://phabricator.wikimedia.org/T257412)
[09:19:25] Operations, Traffic, HTTPS: HSTS preload for wmfusercontent.org -
https://phabricator.wikimedia.org/T132452 (Nintendofan885)
[09:19:55] (CR) Elukey: [C: +2] hive: set new kerberos principal for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/635955 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[09:21:24] (CR) Vgutierrez: [C: +2] Release 8.0.8-1wm3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/635842 (https://phabricator.wikimedia.org/T265911) (owner: Vgutierrez)
[09:21:50] (Merged) jenkins-bot: wikifeeds: Increase envoy CPU and memory ressources [deployment-charts] - https://gerrit.wikimedia.org/r/635753 (https://phabricator.wikimedia.org/T266194) (owner: JMeybohm)
[09:22:06] (CR) JMeybohm: [C: +1] Add new java images [docker-images/production-images] - https://gerrit.wikimedia.org/r/635743 (https://phabricator.wikimedia.org/T265504) (owner: Alexandros Kosiaris)
[09:23:05] (PS1) Elukey: hive: update remaining setting for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/635956
[09:23:06] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[09:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:32] (CR) Elukey: [C: +2] hive: update remaining setting for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/635956 (owner: Elukey)
[09:26:10] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[09:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:25] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.3 [software/pywmflib] - https://gerrit.wikimedia.org/r/635958
[09:31:45] !log published docker-registry.discovery.wmnet/eventrouter:0.3.0-1
[09:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:04] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[09:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:03] (CR) JMeybohm: [C: +2] Initial commit of eventrouter chart from stable/charts [deployment-charts] - https://gerrit.wikimedia.org/r/635258 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[09:33:27] (PS1) Elukey: Move upload of hive-site.xml from client to hadoop standby in test [puppet] - https://gerrit.wikimedia.org/r/635959 (https://phabricator.wikimedia.org/T255139)
[09:34:01] (CR) Arturo Borrero Gonzalez: toolforge: script to make long-running processes on bastions less good (7 comments) [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[09:34:28] (CR) Elukey: [C: +2] Move upload of hive-site.xml from client to hadoop standby in test [puppet] - https://gerrit.wikimedia.org/r/635959 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[09:36:09] (Merged) jenkins-bot: Initial commit of eventrouter chart from stable/charts [deployment-charts] - https://gerrit.wikimedia.org/r/635258 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[09:37:19] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.3 [software/pywmflib] - https://gerrit.wikimedia.org/r/635958 (owner: Volans)
[09:38:26] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.3 [software/pywmflib] - https://gerrit.wikimedia.org/r/635958 (owner: Volans)
[09:39:45] Operations, netops, cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (aborrero) Announcement sent to the community: https://lists.wikimedia.org/pipermail/cloud-announce/2020-October/000331.html
[09:41:03] (PS1) Elukey: hive: change metastore and server hostnames for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/635961 (https://phabricator.wikimedia.org/T257412)
[09:43:06] (PS2) Elukey: hive:
change metastore and server hostnames for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/635961 (https://phabricator.wikimedia.org/T257412)
[09:43:30] (CR) Elukey: [C: +2] hive: change metastore and server hostnames for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/635961 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[09:43:42] (CR) Muehlenhoff: [C: +2] Remove Stretch-based LDAP replicas from conftool [puppet] - https://gerrit.wikimedia.org/r/635951 (https://phabricator.wikimedia.org/T264388) (owner: Muehlenhoff)
[09:47:02] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:47:04] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:21] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:47:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:28] (CR) Hashar: "recheck After https://gerrit.wikimedia.org/r/c/integration/config/+/635566 the CI image now has python3-ldap."
[puppet] - https://gerrit.wikimedia.org/r/635559 (owner: Filippo Giunchedi)
[09:51:28] !log masking slapd on the old Stretch replicas to uncover potential direct access outside of the LVSes T264388
[09:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:34] T264388: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388
[09:53:55] (PS1) Volans: Upstream release v0.0.3 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/635963
[09:54:07] (PS1) Kormat: aptrepo: Add thirdparty/orchestrator [puppet] - https://gerrit.wikimedia.org/r/635964 (https://phabricator.wikimedia.org/T266023)
[09:56:06] (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: refresh codfw1dev addresses with cloudgw changes [dns] - https://gerrit.wikimedia.org/r/635965 (https://phabricator.wikimedia.org/T261724)
[09:56:50] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/635964 (https://phabricator.wikimedia.org/T266023) (owner: Kormat)
[09:56:59] (CR) Kormat: [C: +2] aptrepo: Add thirdparty/orchestrator [puppet] - https://gerrit.wikimedia.org/r/635964 (https://phabricator.wikimedia.org/T266023) (owner: Kormat)
[10:01:53] (CR) Hashar: [C: -1] "The testenv requires sitepackage=True in order to be able to reach the modules installed via python3-ldap.deb (which is nice as an integr" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/635559 (owner: Filippo Giunchedi)
[10:02:19] (CR) Marostegui: "\o/" [puppet] - https://gerrit.wikimedia.org/r/635964 (https://phabricator.wikimedia.org/T266023) (owner: Kormat)
[10:02:37] (CR) Volans: [C: +1] "I have zero context on the specific changes but as for the interaction with the current Netbox automation all looks good."
[dns] - https://gerrit.wikimedia.org/r/635965 (https://phabricator.wikimedia.org/T261724) (owner: Arturo Borrero Gonzalez)
[10:04:02] (PS1) JMeybohm: eventrouter: always log to stderr [docker-images/production-images] - https://gerrit.wikimedia.org/r/635986 (https://phabricator.wikimedia.org/T262675)
[10:04:32] (CR) JMeybohm: [V: +2 C: +2] eventrouter: always log to stderr [docker-images/production-images] - https://gerrit.wikimedia.org/r/635986 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[10:05:04] (CR) Alexandros Kosiaris: "> Patch Set 2:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/635743 (https://phabricator.wikimedia.org/T265504) (owner: Alexandros Kosiaris)
[10:05:21] Operations, DBA, Patch-For-Review, User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (Kormat) v3.2.3 has been uploaded: ` # apt policy orchestrator orchestrator: Installed: (none) Candidate: 1:3.2.3 Version table: 1:3.2.3 1001 1001...
[10:07:29] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: refresh codfw1dev addresses with cloudgw changes [dns] - https://gerrit.wikimedia.org/r/635965 (https://phabricator.wikimedia.org/T261724) (owner: Arturo Borrero Gonzalez)
[10:09:26] (PS3) JMeybohm: admin: deploy eventrouter to all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/635259 (https://phabricator.wikimedia.org/T262675)
[10:09:35] !log published docker-registry.discovery.wmnet/eventrouter:0.3.0-2
[10:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:19] (PS1) Effie Mouzeli: Set debian buster for mc1019 [puppet] - https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391)
[10:11:22] (PS1) Kormat: orchestrator: Add very basic role/profile/module.
[puppet] - https://gerrit.wikimedia.org/r/635988
[10:11:51] (PS2) Kormat: orchestrator: Add very basic role/profile/module. [puppet] - https://gerrit.wikimedia.org/r/635988 (https://phabricator.wikimedia.org/T265990)
[10:11:56] (CR) jerkins-bot: [V: -1] Set debian buster for mc1019 [puppet] - https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: Effie Mouzeli)
[10:12:22] (CR) JMeybohm: [C: +2] admin: deploy eventrouter to all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/635259 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[10:13:56] (PS3) Kormat: orchestrator: Add very basic role/profile/module. [puppet] - https://gerrit.wikimedia.org/r/635988 (https://phabricator.wikimedia.org/T265990)
[10:15:09] (Merged) jenkins-bot: admin: deploy eventrouter to all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/635259 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[10:15:11] (PS2) Volans: Upstream release v0.0.3 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/635963
[10:16:27] (CR) Kormat: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/26092/" [puppet] - https://gerrit.wikimedia.org/r/635988 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[10:17:22] (PS2) Effie Mouzeli: Set debian buster for mc1019 [puppet] - https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391)
[10:18:21] (CR) Marostegui: "Let's disable notifications for now? (I assume it is just creating the hiera yaml? - if not, we can do it directly on icinga) I don't thin" [puppet] - https://gerrit.wikimedia.org/r/635988 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[10:19:31] (PS4) Kormat: orchestrator: Add very basic role/profile/module.
[puppet] - 10https://gerrit.wikimedia.org/r/635988 (https://phabricator.wikimedia.org/T265990) [10:20:00] (03PS3) 10Effie Mouzeli: mediawiki: Check number of cached keys in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) [10:20:15] (03PS1) 10Arturo Borrero Gonzalez: templates/57.15.185.in-addr.arpa: add missing PTR record for neutron virtual address [dns] - 10https://gerrit.wikimedia.org/r/635990 (https://phabricator.wikimedia.org/T261724) [10:22:47] (03CR) 10Kormat: "> Let's disable notifications for now? (I assume it is just creating the hiera yaml? - if not, we can do it directly on icinga) I don't th" [puppet] - 10https://gerrit.wikimedia.org/r/635988 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [10:24:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] templates/57.15.185.in-addr.arpa: add missing PTR record for neutron virtual address [dns] - 10https://gerrit.wikimedia.org/r/635990 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:25:32] (03CR) 10Marostegui: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/635988 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [10:25:35] (03PS2) 10Giuseppe Lavagetto: Add --force flag to safe-service-restart.py [puppet] - 10https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [10:25:37] (03PS1) 10Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 [10:25:39] (03PS1) 10Giuseppe Lavagetto: poolcounter: add client configuration classes [puppet] - 10https://gerrit.wikimedia.org/r/635992 [10:25:41] (03PS1) 10Giuseppe Lavagetto: profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - 10https://gerrit.wikimedia.org/r/635993 [10:25:43] (03PS1) 10Giuseppe Lavagetto: restbase: add poolcounter support to safe-service-restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/635994 [10:25:45] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.3 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/635963 (owner: 10Volans) [10:25:49] (03CR) 10Kormat: [C: 03+2] orchestrator: Add very basic role/profile/module. 
[puppet] - 10https://gerrit.wikimedia.org/r/635988 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [10:26:58] (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 (owner: 10Giuseppe Lavagetto) [10:27:01] (03CR) 10jerkins-bot: [V: 04-1] restbase: add poolcounter support to safe-service-restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/635994 (owner: 10Giuseppe Lavagetto) [10:27:17] (03CR) 10jerkins-bot: [V: 04-1] profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - 10https://gerrit.wikimedia.org/r/635993 (owner: 10Giuseppe Lavagetto) [10:27:21] (03Merged) 10jenkins-bot: Upstream release v0.0.3 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/635963 (owner: 10Volans) [10:28:40] (03CR) 10Elukey: "mc1019 is a redis lock manager, its ip is in mediawiki config 😞" [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [10:33:01] (03PS3) 10Effie Mouzeli: Set debian buster for mc1019 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) [10:33:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:33:30] (03PS4) 10Effie Mouzeli: Set debian buster for mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) [10:34:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [10:34:29] (03CR) 10Effie Mouzeli: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [10:34:46] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:37:11] 10Operations, 10netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (10ayounsi) a:05ayounsi→03wiki_willy @wiki_willy is that something that could be planned for next Q? It should be trivial enough for the DC remote hands. [10:38:45] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) a:05ayounsi→03None [10:41:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:42:12] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Traffic: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Gilles) [10:43:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:45:22] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Traffic: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Gilles) @ema can you confirm that int-front cache status in the response means that the 429 was em... 
[10:56:14] !log uploaded python3-wmflib_0.0.3 to apt.wikimedia.org buster-wikimedia - T257905 [10:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:22] T257905: Spin off common Spicerack modules into a standalone Python library importable anywhere - https://phabricator.wikimedia.org/T257905 [10:57:08] !log uploaded orchestrator v3.2.3 to apt.wikimedia.org buster-wikimedia - T266023 (forgot to log this earlier) [10:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:15] T266023: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 [11:01:35] (03PS2) 10Jbond: 6.2.4: merge additional upstream changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/635848 [11:01:50] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Check number of cached keys in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [11:02:55] (03PS3) 10Jbond: 6.2.4: merge additional upstream changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/635848 [11:03:36] (03PS2) 10Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 [11:03:38] (03PS2) 10Giuseppe Lavagetto: poolcounter: add client configuration classes [puppet] - 10https://gerrit.wikimedia.org/r/635992 [11:03:40] (03PS2) 10Giuseppe Lavagetto: profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - 10https://gerrit.wikimedia.org/r/635993 [11:03:42] (03PS2) 10Giuseppe Lavagetto: restbase: add poolcounter support to safe-service-restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/635994 [11:04:41] (03CR) 10jerkins-bot: [V: 04-1] restbase: add poolcounter support to safe-service-restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/635994 (owner: 10Giuseppe Lavagetto) [11:05:06] (03CR) 10jerkins-bot: [V: 
04-1] profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - 10https://gerrit.wikimedia.org/r/635993 (owner: 10Giuseppe Lavagetto) [11:13:30] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:20] RECOVERY - Check the last execution of package_builder_Clean_up_build_directory on deneb is OK: OK: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:22:39] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Traffic: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10ema) >>! In T266155#6574047, @Gilles wrote: > @ema can you confirm that int-front cache status in... [11:31:30] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: testvm1001.eqiad.wmnet, ms-be2017.codfw.wmnet, cloudvirt1025.eqiad.wmnet, deploy1002.eqiad.wmnet, stat1007.eqiad.wmnet, cloudvirt1026.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:32:06] (03PS1) 10Volans: Remove modules migrated to wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/636000 (https://phabricator.wikimedia.org/T257905) [11:35:28] (03CR) 10jerkins-bot: [V: 04-1] Remove modules migrated to wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/636000 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [11:55:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:56:49] 10Operations, 10DNS, 10Traffic, 10netbox, and 2 others: 
Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation - https://phabricator.wikimedia.org/T266331 (10aborrero) [11:57:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:57:10] 10Operations, 10DNS, 10Traffic, 10netbox, and 2 others: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation - https://phabricator.wikimedia.org/T266331 (10aborrero) p:05Triage→03Medium [12:00:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! I deployed a local build on idp-test1001 and working fine. As for the question whether we can make our theming more future-pro" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/635848 (owner: 10Jbond) [12:13:29] (03PS1) 10Jcrespo: Add 4 line naive prototype for downloading all images from a wiki [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/636007 (https://phabricator.wikimedia.org/T264189) [12:14:44] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/636000 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [12:28:09] (03PS1) 10JMeybohm: admin: fix eventrouter chart reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/636008 (https://phabricator.wikimedia.org/T262675) [12:30:39] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Gilles) [12:31:27] (03PS2) 10JMeybohm: admin: fix eventrouter chart reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/636008 (https://phabricator.wikimedia.org/T262675) [12:34:55] (03CR) 10JMeybohm: [C: 03+2] admin: fix eventrouter chart reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/636008 
(https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [12:37:27] (03Merged) 10jenkins-bot: admin: fix eventrouter chart reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/636008 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [12:38:33] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Gilles) I've confirmed that it is the PoolCounter throttle from Thumbor, by hitting it myself (that's my own ip... [12:42:23] (03PS1) 10Gilles: Increase timeout of Thumbor per-ip throttling [puppet] - 10https://gerrit.wikimedia.org/r/636012 (https://phabricator.wikimedia.org/T266155) [12:43:33] (03CR) 10Elukey: "I have two remaining doubts:" [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [12:43:44] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Patch-For-Review: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Gilles) p:05Triage→03Medium a:03Gilles [12:47:42] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . 
[12:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:56] (03PS1) 10ArielGlenn: Stash list of known tables once per run per wiki and re-use [dumps] - 10https://gerrit.wikimedia.org/r/636013 (https://phabricator.wikimedia.org/T266333) [12:50:05] (03PS2) 10Gilles: Increase timeout of Thumbor per-ip throttling [puppet] - 10https://gerrit.wikimedia.org/r/636012 (https://phabricator.wikimedia.org/T266155) [12:50:24] (03CR) 10Gilles: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/636012 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [12:53:30] 10Operations, 10Puppet, 10observability, 10Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10Volans) I fear that we are going down another path like the dns generation in which the Netbox API don't really suits our needs in terms of perform... [12:53:37] (03CR) 10Ema: [C: 03+2] Increase timeout of Thumbor per-ip throttling [puppet] - 10https://gerrit.wikimedia.org/r/636012 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [12:54:07] (03CR) 10Volans: [C: 04-1] "I've added more generic feedback in the task, specific things here inline" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:59:19] (03Abandoned) 10Elukey: Remove mc1036/mc2036 from the Redis Nutcracker config [puppet] - 10https://gerrit.wikimedia.org/r/595810 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [13:04:01] !log rolling thumbor-instances restart to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/636012/ T266155 [13:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:07] T266155: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 [13:10:22] 10Operations, 10CAS-SSO, 10User-jbond: Apereo CAS expose 
CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff I've built an updated mod_cas package with SameSite cookie support for buster-wikimedia (not imported yet to apt.... [13:11:14] (03CR) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:11:20] (03PS17) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [13:11:35] (03PS1) 10Kormat: hiera: Add fake db backend password for orchestrator [labs/private] - 10https://gerrit.wikimedia.org/r/636014 (https://phabricator.wikimedia.org/T265990) [13:13:45] (03PS1) 10Kormat: orchestrator: Add basic config [puppet] - 10https://gerrit.wikimedia.org/r/636015 (https://phabricator.wikimedia.org/T265990) [13:14:20] (03CR) 10Kormat: [V: 03+2 C: 03+2] hiera: Add fake db backend password for orchestrator [labs/private] - 10https://gerrit.wikimedia.org/r/636014 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [13:14:57] (03CR) 10jerkins-bot: [V: 04-1] orchestrator: Add basic config [puppet] - 10https://gerrit.wikimedia.org/r/636015 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [13:15:37] (03PS3) 10Jcrespo: ssh: Remove deprecated option UsePrivilegeSeparation sandbox [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) [13:15:39] (03PS1) 10Jcrespo: cumin: Fix wrong alias as reported by check-cumin-aliases [puppet] - 10https://gerrit.wikimedia.org/r/636016 [13:16:07] (03PS2) 10Jcrespo: cumin: Fix wrong alias as reported by check-cumin-aliases [puppet] - 10https://gerrit.wikimedia.org/r/636016 [13:16:14] (03PS2) 10Kormat: orchestrator: Add basic config [puppet] - 10https://gerrit.wikimedia.org/r/636015 
(https://phabricator.wikimedia.org/T265990) [13:17:31] 10Operations, 10Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (10elukey) [13:17:41] (03PS3) 10Kormat: orchestrator: Add basic config [puppet] - 10https://gerrit.wikimedia.org/r/636015 (https://phabricator.wikimedia.org/T265990) [13:18:20] (03CR) 10Marostegui: [C: 03+1] cumin: Fix wrong alias as reported by check-cumin-aliases [puppet] - 10https://gerrit.wikimedia.org/r/636016 (owner: 10Jcrespo) [13:19:31] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/636017 [13:20:11] (03CR) 10Muehlenhoff: [C: 03+1] cumin: Fix wrong alias as reported by check-cumin-aliases [puppet] - 10https://gerrit.wikimedia.org/r/636016 (owner: 10Jcrespo) [13:22:08] (03CR) 10Jcrespo: [C: 03+2] "Thanks for the check, in this case was super-useful to detect this typo- hopefully nothing broke on labsdb1011 for not being selectable wi" [puppet] - 10https://gerrit.wikimedia.org/r/636016 (owner: 10Jcrespo) [13:24:48] 10Operations, 10Puppet, 10observability, 10Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10akosiaris) >>! In T229397#6574323, @Volans wrote: > I fear that we are going down another path like the dns generation in which the Netbox API don'... [13:27:05] (03PS1) 10DCausse: [wdqs-data-reload] load all lexemes chunks [cookbooks] - 10https://gerrit.wikimedia.org/r/636018 [13:28:23] (03PS1) 10Kormat: hiera: move orchestrator.yaml to correct dir. 
[labs/private] - 10https://gerrit.wikimedia.org/r/636019 (https://phabricator.wikimedia.org/T265990) [13:29:08] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) [13:29:47] (03PS1) 10Andrew Bogott: Define backy2::backup_time for cloudvirt102[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/636020 (https://phabricator.wikimedia.org/T260692) [13:31:08] (03CR) 10Andrew Bogott: [C: 03+2] Define backy2::backup_time for cloudvirt102[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/636020 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [13:33:12] (03CR) 10Kormat: [V: 03+2 C: 03+2] hiera: move orchestrator.yaml to correct dir. [labs/private] - 10https://gerrit.wikimedia.org/r/636019 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [13:34:20] (03CR) 10Elukey: [C: 03+1] Remove modules migrated to wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/636000 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [13:36:25] (03PS4) 10Kormat: orchestrator: Add basic config [puppet] - 10https://gerrit.wikimedia.org/r/636015 (https://phabricator.wikimedia.org/T265990) [13:36:36] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Gilles) The new timeout is in place. It seems to help Special:NewFiles on Commons to a degree but still doesn't... 
[13:36:51] (03PS5) 10Effie Mouzeli: Set debian buster for mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) [13:39:26] (03PS5) 10Kormat: orchestrator: Add basic config [puppet] - 10https://gerrit.wikimedia.org/r/636015 (https://phabricator.wikimedia.org/T265990) [13:41:01] (03CR) 10Kormat: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26100/" [puppet] - 10https://gerrit.wikimedia.org/r/636015 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [13:42:22] 10Operations, 10DBA, 10User-Kormat: orchestrator: Add service monitoring - https://phabricator.wikimedia.org/T266338 (10Kormat) [13:43:51] (03CR) 10Effie Mouzeli: [C: 04-2] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/635987 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [13:45:54] (03PS1) 10JMeybohm: eventrouter: don't deploy to production clusters by now [deployment-charts] - 10https://gerrit.wikimedia.org/r/636023 (https://phabricator.wikimedia.org/T262675) [13:46:10] (03PS1) 10Gilles: Switch Thumbor haproxy load balancing to IP hash [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) [13:47:50] (03CR) 10Marostegui: [C: 03+1] "Looks good, we might need to tackle things as we go, but this looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/636015 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [13:48:30] (03PS2) 10Gilles: Switch Thumbor haproxy load balancing to IP hash [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) [13:48:34] (03CR) 10Kormat: [C: 03+2] orchestrator: Add basic config [puppet] - 10https://gerrit.wikimedia.org/r/636015 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [13:48:47] (03CR) 10Gilles: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [13:51:24] (03PS2) 10JMeybohm: eventrouter: don't deploy to production clusters by now [deployment-charts] - 10https://gerrit.wikimedia.org/r/636023 (https://phabricator.wikimedia.org/T262675) [13:52:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] eventrouter: don't deploy to production clusters by now [deployment-charts] - 10https://gerrit.wikimedia.org/r/636023 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:53:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:53:51] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Patch-For-Review: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Gilles) The other drawback of that workaround I can think of is that it will re-introduce... [13:55:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:57:57] (03PS1) 10ArielGlenn: update sql/xml dump config settings for stashing revision info [puppet] - 10https://gerrit.wikimedia.org/r/636026 (https://phabricator.wikimedia.org/T263319) [13:58:55] (03CR) 10JMeybohm: [C: 03+2] eventrouter: don't deploy to production clusters by now [deployment-charts] - 10https://gerrit.wikimedia.org/r/636023 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [14:01:32] (03Merged) 10jenkins-bot: eventrouter: don't deploy to production clusters by now [deployment-charts] - 10https://gerrit.wikimedia.org/r/636023 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [14:06:46] (03PS1) 10Kormat: mariadb: Document orchestrator backend db grants [puppet] - 10https://gerrit.wikimedia.org/r/636031 (https://phabricator.wikimedia.org/T265990) [14:07:57] (03CR) 10Marostegui: [C: 03+1] "works for me, later we should move it to unix socket auth" [puppet] - 10https://gerrit.wikimedia.org/r/636031 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:08:40] (03CR) 10Kormat: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/636031 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:09:21] (03PS2) 10Kormat: mariadb: Document orchestrator backend db grants [puppet] - 10https://gerrit.wikimedia.org/r/636031 (https://phabricator.wikimedia.org/T265990) [14:10:25] (03CR) 10Marostegui: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636031 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:10:43] (03CR) 10Kormat: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636031 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:11:46] (03CR) 10Volans: [C: 04-1] "Add quite some complexity but all seems reasonable. Minor nits/typos inline." 
(037 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) (owner: 10Ayounsi) [14:12:31] (03CR) 10Marostegui: [C: 03+1] mariadb: Document orchestrator backend db grants [puppet] - 10https://gerrit.wikimedia.org/r/636031 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:12:43] (03CR) 10Kormat: [C: 03+2] mariadb: Document orchestrator backend db grants [puppet] - 10https://gerrit.wikimedia.org/r/636031 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:14:12] (03PS1) 10DCausse: [wdqs] fix StreamingUpdate package name after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/636034 (https://phabricator.wikimedia.org/T255399) [14:15:05] (03PS1) 10Kormat: orchestrator: Drop trailing comma [puppet] - 10https://gerrit.wikimedia.org/r/636035 (https://phabricator.wikimedia.org/T265990) [14:15:41] (03CR) 10Kormat: [C: 03+2] orchestrator: Drop trailing comma [puppet] - 10https://gerrit.wikimedia.org/r/636035 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:18:17] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "One detail that escaped me at first read - you also need to add something like "$@" at the end of the command-line in modules/conftool/tem" [puppet] - 10https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [14:18:19] (03PS1) 10Kormat: orchestrator: Fix credentials config path [puppet] - 10https://gerrit.wikimedia.org/r/636036 (https://phabricator.wikimedia.org/T265990) [14:19:40] (03CR) 10Kormat: [C: 03+2] orchestrator: Fix credentials config path [puppet] - 10https://gerrit.wikimedia.org/r/636036 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:32:32] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635399 (owner: 10Dzahn) [14:36:42] (03PS1) 10Kormat: orchestrator: Set ReadOnly [puppet] - 10https://gerrit.wikimedia.org/r/636039 
(https://phabricator.wikimedia.org/T265990) [14:37:23] (03CR) 10Kormat: [C: 03+2] orchestrator: Set ReadOnly [puppet] - 10https://gerrit.wikimedia.org/r/636039 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:39:22] (03PS1) 10Muehlenhoff: Add debian-debug repository [puppet] - 10https://gerrit.wikimedia.org/r/636040 (https://phabricator.wikimedia.org/T164819) [14:41:08] (03PS1) 10Marostegui: orchestrator.conf.json.erb: Do not listen on localhost [puppet] - 10https://gerrit.wikimedia.org/r/636041 (https://phabricator.wikimedia.org/T265990) [14:47:54] (03PS8) 10Ayounsi: Add Z side device/interface/vlan and cable to PuppetDB importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) [14:47:56] (03PS2) 10Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) [14:47:58] (03PS3) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [14:48:00] (03PS2) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [14:48:24] (03CR) 10Ayounsi: "Thanks!" 
(7 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) (owner: Ayounsi)
[14:51:30] (PS2) Muehlenhoff: Add debian-debug repository [puppet] - https://gerrit.wikimedia.org/r/636040 (https://phabricator.wikimedia.org/T164819)
[14:53:29] (PS1) Andrew Bogott: encapi (aka 'labspuppet backend'): add reload-on-exception to uwsgi config [puppet] - https://gerrit.wikimedia.org/r/636045 (https://phabricator.wikimedia.org/T264658)
[14:53:50] (CR) jerkins-bot: [V: -1] encapi (aka 'labspuppet backend'): add reload-on-exception to uwsgi config [puppet] - https://gerrit.wikimedia.org/r/636045 (https://phabricator.wikimedia.org/T264658) (owner: Andrew Bogott)
[14:53:53] (CR) Kormat: [C: -2] "I reject this, and i reject your mac usage." [puppet] - https://gerrit.wikimedia.org/r/636041 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[14:54:08] (CR) Marostegui: "hahaha" [puppet] - https://gerrit.wikimedia.org/r/636041 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[14:55:09] (Abandoned) Marostegui: orchestrator.conf.json.erb: Do not listen on localhost [puppet] - https://gerrit.wikimedia.org/r/636041 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[14:55:36] (CR) Jcrespo: "I don't have the context, but shouldn't this ip gathered from the ip fact, rather than hardcoded?" [puppet] - https://gerrit.wikimedia.org/r/636041 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[14:55:57] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26102/" [puppet] - https://gerrit.wikimedia.org/r/636040 (https://phabricator.wikimedia.org/T164819) (owner: Muehlenhoff)
[14:55:59] (PS2) Andrew Bogott: encapi (aka 'labspuppet backend'): add reload-on-exception to uwsgi config [puppet] - https://gerrit.wikimedia.org/r/636045 (https://phabricator.wikimedia.org/T264658)
[14:56:19] (CR) jerkins-bot: [V: -1] encapi (aka 'labspuppet backend'): add reload-on-exception to uwsgi config [puppet] - https://gerrit.wikimedia.org/r/636045 (https://phabricator.wikimedia.org/T264658) (owner: Andrew Bogott)
[14:57:15] (PS1) Effie Mouzeli: mediawiki::php bump opcache.max_accelerated_files [puppet] - https://gerrit.wikimedia.org/r/636047 (https://phabricator.wikimedia.org/T253673)
[14:58:03] (PS3) Andrew Bogott: encapi (aka 'labspuppet backend'): add reload-on-exception to uwsgi config [puppet] - https://gerrit.wikimedia.org/r/636045 (https://phabricator.wikimedia.org/T264658)
[14:59:08] (CR) Effie Mouzeli: [C: -1] "Don't merge yet as it requires a full cluster restart" [puppet] - https://gerrit.wikimedia.org/r/636047 (https://phabricator.wikimedia.org/T253673) (owner: Effie Mouzeli)
[15:03:09] (CR) Andrew Bogott: [C: +2] encapi (aka 'labspuppet backend'): add reload-on-exception to uwsgi config [puppet] - https://gerrit.wikimedia.org/r/636045 (https://phabricator.wikimedia.org/T264658) (owner: Andrew Bogott)
[15:06:10] (PS1) Marostegui: orchestrator.sql: Track orchestrator grants for topology discovery [puppet] - https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990)
[15:07:18] (PS2) Marostegui: orchestrator.sql: Track orchestrator grants for topology discovery [puppet] - https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990)
[15:10:53] Puppet, Cloud-VPS, cloud-services-team (Kanban): labspuppetbackend service can fail to intialize when DNS blip happens at the wrong time - https://phabricator.wikimedia.org/T264658 (Andrew) Open→Resolved I'm not in a hurry to reproduce the exact cause of this failure, but the attached patch s...
[15:37:22] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[15:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:32] Operations, Puppet, observability, Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (Volans) We were discussing this offline with John and there are still various open questions, we plan to discuss them in the next I/F meeting next...
[15:42:14] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:33] (PS9) Jbond: diffscan: pyhotnify [puppet] - https://gerrit.wikimedia.org/r/634572
[15:44:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 76 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:47:39] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 66 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:47:47] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 69 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:50:35] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 56 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:51:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:52:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:53:15] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 38 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:53:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 62 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:57:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:59:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:06:08] (PS3) Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055)
[16:06:10] (PS3) Giuseppe Lavagetto: poolcounter: add client configuration classes [puppet] - https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055)
[16:06:12] (PS3) Giuseppe Lavagetto: profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055)
[16:06:14] (PS3) Giuseppe Lavagetto: restbase: add poolcounter support to safe-service-restart scripts [puppet] - https://gerrit.wikimedia.org/r/635994 (https://phabricator.wikimedia.org/T266055)
[16:07:52] (CR) jerkins-bot: [V: -1] profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[16:08:14] (CR) Dzahn: [V: +1] rpkivalidator: hiera->lookup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635664 (owner: Dzahn)
[16:08:20] (PS2) Dzahn: rpkivalidator: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/635664
[16:10:22] (CR) Dzahn: [V: +1] debmonitor::client: hiera->lookup, add data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635665 (owner: Dzahn)
[16:10:24] (PS6) Dzahn: debmonitor::client: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/635665
[16:10:35] (PS1) Hnowlan: envoy-future: upgrade to Envoy 1.16.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/636062
[16:12:36] (CR) Dzahn: [V: +1 C: +2] docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399 (owner: Dzahn)
[16:13:36] (CR) Dzahn: "noop on registry1001" [puppet] - https://gerrit.wikimedia.org/r/635399 (owner: Dzahn)
[16:14:48] (CR) Dzahn: [V: +1] ntp: hiera->lookup, data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635666 (owner: Dzahn)
[16:14:54] (PS2) Dzahn: ntp: hiera->lookup, data types [puppet] - https://gerrit.wikimedia.org/r/635666
[16:18:16] jouncebot: next
[16:18:16] In 14 hour(s) and 41 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201024T0700)
[16:20:54] (CR) Ppchelko: "Oh mamma mia :) I dono how production-images thing works, so can't review that, but +1 the intention." [docker-images/production-images] - https://gerrit.wikimedia.org/r/636062 (owner: Hnowlan)
[16:28:00] (CR) Effie Mouzeli: [C: +1] "I will test it on a thumbor server next week. If it looks like it works fine, I will merge this" [puppet] - https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: Gilles)
[16:35:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:37:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:49:11] (CR) Ahmon Dancy: [C: -1] safe-service-restart: add optional poolcounter support (4 comments) [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[16:51:15] (CR) Ahmon Dancy: [C: +1] poolcounter: add client configuration classes [puppet] - https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[16:58:38] (PS1) Dzahn: orchestrator: add monitoring for process and TCP port [puppet] - https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338)
[16:59:51] (CR) jerkins-bot: [V: -1] orchestrator: add monitoring for process and TCP port [puppet] - https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: Dzahn)
[17:01:32] (PS2) Dzahn: orchestrator: add monitoring for process and TCP port [puppet] - https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338)
[17:04:15] (PS3) Dzahn: orchestrator: add monitoring for process and TCP port [puppet] - https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338)
[17:04:34] (CR) Ahmon Dancy: [C: +1] Reduce reconnectTimeout for etcd to 0.1 seconds [debs/pybal] (1.15) - https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362) (owner: Giuseppe Lavagetto)
[17:07:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:07:21] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1003/26107/" [puppet] - https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: Dzahn)
[17:08:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:13:58] (CR) Dzahn: "@dborch1001:~# /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array 'orchestrator http'" [puppet] - https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: Dzahn)
[17:28:42] (CR) Ahmon Dancy: [C: -1] safe-service-restart: add optional poolcounter support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[17:48:47] Operations, netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (wiki_willy) a:wiki_willy→ayounsi Sounds good @ayounsi, thanks for the heads up. Just shoot open a dc-ops task with the ops-esams project tag, and @RobH will work with you and Iron Mountain on gett...
[17:49:54] (CR) Ahmon Dancy: profile::lvs::realserver: add ability to configure poolcounter for pools (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[17:54:44] (CR) Ahmon Dancy: [C: -1] profile::lvs::realserver: add ability to configure poolcounter for pools (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[18:03:06] Operations, SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (Dzahn) The offboarding script has an option for "stay volunteer".
[18:11:09] (CR) Dzahn: [C: +2] Update update_parsoid.sh script for use on testreduce1001 [puppet] - https://gerrit.wikimedia.org/r/635613 (https://phabricator.wikimedia.org/T257906) (owner: Subramanya Sastry)
[18:16:02] Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, iOS-app-Bonefish-On-A-Balloon: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) Thanks @Tsevener ! Detailed repl...
[18:22:44] (PS3) Ahmon Dancy: Add --force flag to safe-service-restart.py [puppet] - https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009)
[18:22:46] (PS1) Ahmon Dancy: modules/scap/templates/scap.cfg.erb: Update php_fpm_restart_script [puppet] - https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009)
[18:24:13] (CR) jerkins-bot: [V: -1] modules/scap/templates/scap.cfg.erb: Update php_fpm_restart_script [puppet] - https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: Ahmon Dancy)
[18:31:20] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (RobH)
[18:31:34] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (RobH)
[18:38:13] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (RobH)
[18:38:21] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (RobH)
[18:39:07] (PS2) Ahmon Dancy: modules/scap/templates/scap.cfg.erb: Update php_fpm_restart_script [puppet] - https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009)
[18:44:34] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Cmjohnson) Sorry for the delay, these are in progress. I expect to have them completed by the end of next week.
[18:46:03] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[18:46:09] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:54:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:57:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:09:31] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[19:09:37] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:19:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:26:18] (PS1) Catrope: Add changeprop rules for newcomerTasksCacheRefreshJob [deployment-charts] - https://gerrit.wikimedia.org/r/636078 (https://phabricator.wikimedia.org/T260758)
[20:03:32] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1027 with 10G interfaces - https://phabricator.wikimedia.org/T266369 (Andrew)
[20:04:29] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[20:08:18] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[20:16:47] (PS2) Kaldari: Removing obsolete license definition [mediawiki-config] - https://gerrit.wikimedia.org/r/619880
[20:19:53] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/636081
[20:25:01] (CR) Dzahn: [C: -2] gerrit: replace cron jobs with systemd timers (WIP) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633857 (owner: Dzahn)
[20:26:50] (PS6) Dzahn: gerrit: replace cron jobs with systemd timers (WIP) [puppet] - https://gerrit.wikimedia.org/r/633857
[20:27:59] (CR) RLazarus: gerrit: replace cron jobs with systemd timers (WIP) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633857 (owner: Dzahn)
[20:32:46] (PS1) Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636082
[20:33:11] (CR) jerkins-bot: [V: -1] mirrors: replace cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636082 (owner: Dzahn)
[20:33:48] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/26109/gerrit1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/633857 (owner: Dzahn)
[20:37:41] (PS7) Dzahn: gerrit: replace cron jobs with systemd timers (WIP) [puppet] - https://gerrit.wikimedia.org/r/633857
[20:38:04] (PS1) Legoktm: Remove $wgExtDistListFile, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/636083 (https://phabricator.wikimedia.org/T266024)
[20:38:37] (CR) Dzahn: "also https://phabricator.wikimedia.org/T266024 :p" [puppet] - https://gerrit.wikimedia.org/r/633857 (owner: Dzahn)
[20:42:14] (PS1) Dzahn: gerrit: remove 'list_mediawiki_extensions' cron job [puppet] - https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024)
[20:42:43] (CR) Dzahn: "now first waiting for https://gerrit.wikimedia.org/r/c/operations/puppet/+/636084/" [puppet] - https://gerrit.wikimedia.org/r/633857 (owner: Dzahn)
[20:44:00] (CR) Dzahn: "would be removed from crontabs manually on mwmaint* servers on merge" [puppet] - https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: Dzahn)
[20:45:50] (CR) Dzahn: "duh.. I mean of course "gerrit servers" not mwmaint.. for this one" [puppet] - https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: Dzahn)
[20:49:23] (PS2) Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636082
[20:49:48] (CR) jerkins-bot: [V: -1] mirrors: replace cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636082 (owner: Dzahn)
[20:58:45] (PS3) Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636082
[20:59:10] (CR) jerkins-bot: [V: -1] mirrors: replace cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636082 (owner: Dzahn)
[21:00:50] sigh..not fixing .fixtures.yml
[21:11:48] (CR) Legoktm: [C: +1] "No longer used by anything I'm aware of." [puppet] - https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: Dzahn)
[21:21:45] (PS2) Dzahn: wikistats: allow to 'absent' import/dump crons as well [puppet] - https://gerrit.wikimedia.org/r/633845
[21:22:06] (CR) jerkins-bot: [V: -1] wikistats: allow to 'absent' import/dump crons as well [puppet] - https://gerrit.wikimedia.org/r/633845 (owner: Dzahn)
[21:23:18] (PS1) Dzahn: phabricator: remove absented cron jobs for Bugzilla updates [puppet] - https://gerrit.wikimedia.org/r/636085
[21:25:36] (CR) Dzahn: "These crons are already "absent" and the comments say they require a "Bugzilla migration DB" that is "currently" not present. I am finding" [puppet] - https://gerrit.wikimedia.org/r/636085 (owner: Dzahn)
[21:29:51] (PS1) Dzahn: dumps: rm profile::dumps::distribution::datasets::cleanup_miscdatasets [puppet] - https://gerrit.wikimedia.org/r/636087
[21:34:23] (PS1) Dzahn: toolforge::clush::master: remove absented cron TODO [puppet] - https://gerrit.wikimedia.org/r/636088
[21:36:32] (PS1) Dzahn: wmcs::nfs::primary: remove absented cron TODO [puppet] - https://gerrit.wikimedia.org/r/636089
[21:41:06] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (Papaul)
[21:41:19] (PS1) Dzahn: librenms: remove absented and obsoleted cron [puppet] - https://gerrit.wikimedia.org/r/636091
[21:45:39] (CR) Dzahn: "hello reviewers. this still seems active. I am running across it in code comments that disabling this cron was supposed to be temp but it'" [puppet] - https://gerrit.wikimedia.org/r/475453 (https://phabricator.wikimedia.org/T204993) (owner: Alex Monk)
[21:47:34] Operations, Traffic, Patch-For-Review: Update certspotter - https://phabricator.wikimedia.org/T204993 (Dzahn) What is the status of this nowadays? I ran across it in a different matter while looking for absented crons and found the TODO and link in code that points over here and to the pending patch...
[21:49:41] (PS1) Dzahn: cassandra: remove absented metrics collector cron [puppet] - https://gerrit.wikimedia.org/r/636092
[21:51:14] (CR) Dzahn: "Can this be removed as well?" [puppet] - https://gerrit.wikimedia.org/r/636092 (owner: Dzahn)
[21:52:05] (PS1) Krinkle: mediawiki.util: Use mw.util rather than 'this' [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/635981 (https://phabricator.wikimedia.org/T265809)
[21:53:17] (PS1) Dzahn: openstack::designate::dns_floating_ip_updater: remove absented cron TODO [puppet] - https://gerrit.wikimedia.org/r/636093
[21:53:59] (PS1) Aaron Schulz: Add "mcrouter-with-onhost-tier" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/636094 (https://phabricator.wikimedia.org/T264604)
[21:54:01] (PS1) Aaron Schulz: Switch parser cache to using "mcrouter-with-onhost-tier" [mediawiki-config] - https://gerrit.wikimedia.org/r/636095 (https://phabricator.wikimedia.org/T264604)
[21:55:15] (PS1) Dzahn: openstack::wikitech::web: remove absented cron TODO [puppet] - https://gerrit.wikimedia.org/r/636096
[22:08:33] Operations, netops, Sustainability (Incident Followup): Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 (Krinkle)
[22:08:38] Operations, ops-eqiad, netops, Sustainability (Incident Followup): eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (Krinkle)
[22:08:43] Operations, netops, Sustainability (Incident Followup): Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (Krinkle)
[22:08:52] Operations, Analytics, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (KFrancis) >>! In T266250#6573604, @Marostegui wrote: > @KFrancis can you confirm if @Rmaung has a valid NDA signed? I cannot s...
[22:09:22] (CR) Jdlrobson: [C: +1] mediawiki.util: Use mw.util rather than 'this' [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/635981 (https://phabricator.wikimedia.org/T265809) (owner: Krinkle)
[22:15:18] (PS1) Dzahn: cumin: remove stretch support and move python_version to Hiera [puppet] - https://gerrit.wikimedia.org/r/636101
[22:19:39] (PS1) Dzahn: cumin: replace check-aliases-cron with a systemd timer [puppet] - https://gerrit.wikimedia.org/r/636102
[22:27:34] (PS1) Legoktm: toolforge: Install pack and buildpacks repo on image builder [puppet] - https://gerrit.wikimedia.org/r/636103 (https://phabricator.wikimedia.org/T266270)
[22:28:51] (PS1) Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - https://gerrit.wikimedia.org/r/636104
[22:28:54] (PS2) Legoktm: toolforge: Install pack and buildpacks repo on image builder [puppet] - https://gerrit.wikimedia.org/r/636103 (https://phabricator.wikimedia.org/T266270)
[22:29:32] (CR) jerkins-bot: [V: -1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - https://gerrit.wikimedia.org/r/636104 (owner: Dzahn)
[22:36:06] Operations, Analytics, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Nuria) @Rmaung: can you describe what data are looking to access? This is so we can see what is the appropriate level of acces...
[22:38:02] Operations, Analytics, SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (Nuria) Also, @Rmaung please take a look at https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines and ask any qu...
[22:42:23] Operations, Analytics, SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (Nuria)
[22:43:17] Operations, Analytics, SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (Nuria) NDA signed now but I do not have access to https://phabricator.wikimedia.org/L2?
[22:49:43] Operations, Analytics, SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (Dzahn) @Nuria Try again now, I just added you to the project called "WMF-NDA-Requests" (https://phabricator.wikimedia.org/project/profile/974/) which seems like it's needed to allow you...
[22:52:37] Operations, Analytics, SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (Nuria) done!
[22:52:52] (PS1) Dzahn: planet: replace update cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636105
[22:53:13] (CR) jerkins-bot: [V: -1] planet: replace update cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636105 (owner: Dzahn)
[22:56:29] !log added Nuria to "nda" LDAP group - leaving her in "wmf" until the actual last day - shell account remains so no puppet change needed in ldap_only_admins (T266086)
[22:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:36] T266086: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086
[22:58:29] Operations, Analytics, SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (Dzahn) >>! In T266086#6575705, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/5BKtV3UBpU87LSFJgL3r} [2020-10-23T22:5...
[23:12:31] (PS2) Dzahn: planet: replace update cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636105
[23:12:52] (CR) jerkins-bot: [V: -1] planet: replace update cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636105 (owner: Dzahn)
[23:15:49] (PS3) Dzahn: planet: replace update cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/636105
[23:19:40] Operations, Analytics, SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (KFrancis) @Dzahn because they are an employee of the WMF, the NDA is kept in file by T&C.
[23:42:40] Operations, Desktop Improvements, Readers-Web-Backlog, Traffic, Wikimedia-production-error: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (Urbanecm) Can reproduce from toolforge: ` urbanecm@tools-sgebastion-07 ~/tmp $ wget https://fr.wikiped...