[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T0000). [00:00:21] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 23614 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [00:02:29] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [00:28:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:30:19] (03PS1) 10Andrew Bogott: wmcs instance and image backups: move some jobs one hour later [puppet] - 10https://gerrit.wikimedia.org/r/635676 (https://phabricator.wikimedia.org/T265843) [00:31:06] (03CR) 10Andrew Bogott: [C: 03+2] wmcs instance and image backups: move some jobs one hour later [puppet] - 10https://gerrit.wikimedia.org/r/635676 (https://phabricator.wikimedia.org/T265843) (owner: 10Andrew Bogott) [00:31:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:44:27] (03PS1) 10Ottomata: camus::job - fix --check-java-opts when no $check_java_opts is given [puppet] - 10https://gerrit.wikimedia.org/r/635678 [00:44:40] (03Abandoned) 10Ryan Kemper: cirrussearch: increase up commonswiki_file shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628980 (https://phabricator.wikimedia.org/T260083) (owner: 10Ryan Kemper) [00:44:50] (03CR) 10jerkins-bot: [V: 04-1] camus::job - fix --check-java-opts when no $check_java_opts is given [puppet] - 10https://gerrit.wikimedia.org/r/635678 (owner: 10Ottomata) [00:47:51] (03PS2) 10Ottomata: camus::job - fix --check-java-opts when no $check_java_opts is given [puppet] - 10https://gerrit.wikimedia.org/r/635678 [00:48:25] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/26056/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/635678 (owner: 10Ottomata) [00:48:29] (03CR) 10Ottomata: [C: 03+2] camus::job - fix --check-java-opts when no $check_java_opts is given [puppet] - 10https://gerrit.wikimedia.org/r/635678 (owner: 10Ottomata) [01:01:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:03:11] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@870829c]: 0.3.52 [01:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:26] !log Tests passing on canary `wdqs1003`, proceeding with wdqs deploy for rest of fleet [01:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:10:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 66 probes of 570 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:12:18] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@870829c]: 0.3.52 (duration: 09m 07s) [01:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:29] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 63 probes of 570 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:39:47] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:10:00] (03PS1) 10Andrew Bogott: Cloudvirt1025 and 1029 => Buster [puppet] - 10https://gerrit.wikimedia.org/r/635680 (https://phabricator.wikimedia.org/T259399) [02:11:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1029 with 10G interfaces - https://phabricator.wikimedia.org/T266206 (10Andrew) [02:11:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [02:12:27] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:12:28] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt1025 and 1029 => Buster [puppet] - 10https://gerrit.wikimedia.org/r/635680 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [03:47:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:48:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:52:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:53:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:37:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:42:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:01:47] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:04:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:05:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:11:14] 10Operations, 10LDAP, 10Python3-Porting: Port prometheus-openldap-exporter to Python 3 - https://phabricator.wikimedia.org/T266147 (10Marostegui) p:05Triage→03Medium [05:11:51] (03PS2) 10CDanis: prepend esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/627920 [05:15:10] (03PS1) 10Marostegui: dns: Fix ganeti comments, for consistency. [dns] - 10https://gerrit.wikimedia.org/r/635683 [05:15:46] (03CR) 10Marostegui: [C: 03+2] dns: Fix ganeti comments, for consistency. [dns] - 10https://gerrit.wikimedia.org/r/635683 (owner: 10Marostegui) [05:59:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:02:51] RECOVERY - Check systemd state on analytics-tool1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:03:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:06:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:11:43] (03PS1) 10Elukey: Revert "Revert "geoip: move archive timer from stat1007 to an-launcher1002"" [puppet] - 10https://gerrit.wikimedia.org/r/635601 [06:12:34] (03CR) 10Elukey: [C: 03+2] "Razzi: simply reverting left an inconsistent state on an-launcher1002, so I am rolling it out again. The permission issue was an easy one " [puppet] - 10https://gerrit.wikimedia.org/r/635601 (owner: 10Elukey) [06:15:00] arturo, bstorm o/ - as FYI there was https://gerrit.wikimedia.org/r/635680 waiting for a puppet-merge, I proceeded since it seemed very low risk [06:25:55] (03CR) 10Elukey: [C: 03+2] "Marcel is in CC so he knows about these changes, and it sounds like everybody agrees to proceed for the moment. 
Please let me know otherwi" [puppet] - 10https://gerrit.wikimedia.org/r/634946 (https://phabricator.wikimedia.org/T254332) (owner: 10Faidon Liambotis) [06:26:00] (03PS4) 10Elukey: turnilo: add exporter hostname and region for netflow [puppet] - 10https://gerrit.wikimedia.org/r/634946 (https://phabricator.wikimedia.org/T254332) (owner: 10Faidon Liambotis) [06:28:02] (03CR) 10Elukey: [C: 03+2] turnilo: fix retainMissingValue misconfig [puppet] - 10https://gerrit.wikimedia.org/r/634948 (owner: 10Faidon Liambotis) [06:28:08] (03PS2) 10Elukey: turnilo: fix retainMissingValue misconfig [puppet] - 10https://gerrit.wikimedia.org/r/634948 (owner: 10Faidon Liambotis) [06:36:13] (03CR) 10ArielGlenn: [C: 04-1] "The dumpsdata hosts export with NFSv3, and they need the default-nfs-common template for statd, see https://phabricator.wikimedia.org/T239" [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [06:44:19] PROBLEM - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 846.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:44:44] (03CR) 10Muehlenhoff: [C: 03+2] Add IDP service definition for orchestrator.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/635520 (https://phabricator.wikimedia.org/T266106) (owner: 10Muehlenhoff) [06:50:11] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) ping :) [06:52:27] ^that is a backup source, so no user impact [06:52:29] (We are in a meeting) [06:53:43] PROBLEM - MariaDB Replica Lag: s5 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1629.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:55:48] that is me [06:55:49] ignore [07:02:04] (03PS1) 10Elukey: Decom analytics1057 from the Hadoop cluster [puppet] - 
10https://gerrit.wikimedia.org/r/635742 (https://phabricator.wikimedia.org/T255140) [07:02:55] (03CR) 10Elukey: [C: 03+2] Decom analytics1057 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/635742 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [07:04:47] RECOVERY - MariaDB Replica SQL: s4 on db2099 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:06:51] RECOVERY - MariaDB Replica Lag: s5 on db1145 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:08:29] RECOVERY - MariaDB Replica Lag: s7 on db2100 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:11:18] (03PS1) 10Alexandros Kosiaris: Add new java images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/635743 [07:52:44] !log swift codfw-prod: bump object weight for ms-be2057 - T261633 [07:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:51] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 [07:54:20] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add Pushgateway profile and module [puppet] - 10https://gerrit.wikimedia.org/r/635295 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [07:54:34] (03CR) 10Filippo Giunchedi: [C: 03+2] role: add Pushgateway to Prometheus ops [puppet] - 10https://gerrit.wikimedia.org/r/635296 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [08:04:31] 10Operations, 10Cloud-Services, 10Graphite: Graphite returning 500 @ nagf and graphite url - https://phabricator.wikimedia.org/T198209 (10Marostegui) 05Open→03Resolved @Paladox https://nagf.toolforge.org/?project=tools works for me, so maybe this is already gone. Going to close it. 
If you feel this still... [08:14:30] (03PS3) 10KartikMistry: WIP: Remove wgContentTranslationRESTBase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634956 (https://phabricator.wikimedia.org/T266213) [08:18:43] PROBLEM - Disk space on ms-be2017 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdi1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2017&var-datasource=codfw+prometheus/ops [08:22:56] mhhh I'll take a look [08:23:06] godog: thanks! I was about to create a task :) [08:23:13] PROBLEM - HP RAID on ms-be2017 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:7 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:23:17] ACKNOWLEDGEMENT - HP RAID on ms-be2017 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:7 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T266214 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:23:20] 10Operations, 10ops-codfw: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10ops-monitoring-bot) [08:23:22] ha, there we have the task [08:23:23] XD [08:23:54] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [08:23:55] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:08] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded 
RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10Marostegui) p:05Triage→03Medium [08:24:09] marostegui: heheh indeed! here we go [08:24:44] there is a rebalance in progress, maybe the disk decided to commit seppuku [08:25:43] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10Marostegui) ` [1814258.987868] sd 0:1:0:8: [sdi] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [1814258.987874] sd 0:1:0:8: [sdi] tag#17 Sense Key : Hardware Error [curren... [08:30:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:31:19] !log enabling replication from eqiad to codfw T261914 [08:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:25] T261914: Enable replication eqiad -> codfw and other checks - https://phabricator.wikimedia.org/T261914 [08:32:13] thanks elukey RE: puppet-merge patch [08:32:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:32:47] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10Volans) @jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the `networking` fact because it needs all of them and parses that one, so no... [08:37:10] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10fgiunchedi) a:03Papaul @papaul this disk is busted, it is a 4TB drive, please replace with a spare, the led should be blinking. Thank you! 
` array I logicaldrive 9 (3.6 TB, R... [08:38:17] (03PS1) 10Elukey: hadoop: final clean up after the decommission of old nodes [puppet] - 10https://gerrit.wikimedia.org/r/635750 (https://phabricator.wikimedia.org/T255140) [08:38:19] (03PS1) 10Elukey: Initial configuration of the Hadoop backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) [08:39:21] RECOVERY - Disk space on ms-be2017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2017&var-datasource=codfw+prometheus/ops [08:42:37] (03PS1) 10Filippo Giunchedi: prometheus: honor labels from pushgateway [puppet] - 10https://gerrit.wikimedia.org/r/635752 (https://phabricator.wikimedia.org/T249311) [08:44:01] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:00] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: honor labels from pushgateway [puppet] - 10https://gerrit.wikimedia.org/r/635752 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [08:52:17] 10Operations, 10vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (10elukey) [08:54:17] 10Operations, 10DBA, 10User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (10MoritzMuehlenhoff) > @MoritzMuehlenhoff: Is it acceptable to download the pre-build .debs, and upload them into our apt repo? We do have packages we source from external repos, but they... 
[08:55:08] 10Operations, 10vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (10elukey) @razzi yes this is expected, in the hiera config we have ` # Kerberos config profile::kerberos::keytabs::keytabs_metadata: - role: 'analytics' owner: 'analytics' group: '... [08:55:15] (03PS1) 10JMeybohm: wikifeeds: Increase envoy CPU and memory ressources [deployment-charts] - 10https://gerrit.wikimedia.org/r/635753 (https://phabricator.wikimedia.org/T266194) [09:02:46] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: record for prometheus-pushgateway [dns] - 10https://gerrit.wikimedia.org/r/635536 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [09:02:51] (03PS2) 10Filippo Giunchedi: wmnet: record for prometheus-pushgateway [dns] - 10https://gerrit.wikimedia.org/r/635536 (https://phabricator.wikimedia.org/T249311) [09:03:39] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:06] (03PS1) 10Klausman: analytics: Switch stat1005 to use rocm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/635755 (https://phabricator.wikimedia.org/T264408) [09:10:34] (03CR) 10Elukey: [C: 03+1] analytics: Switch stat1005 to use rocm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/635755 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [09:11:07] (03CR) 10Klausman: [C: 03+2] analytics: Switch stat1005 to use rocm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/635755 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [09:13:08] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6564734, @Gilles wrote: > cp3054 seems to be consistently a little faster for miss and pass, and overall a little slower... 
[09:18:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge k8s: add a PodSecurityPolicy to be used by buildpacks [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) (owner: 10Bstorm) [09:26:06] godog: XioNoX: [09:26:12] sorry ignore [09:27:08] ACKNOWLEDGEMENT - HP RAID on ms-be2017 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:7 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T266221 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:27:11] 10Operations, 10ops-codfw: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266221 (10ops-monitoring-bot) [09:27:53] 10Operations, 10ops-codfw: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266221 (10RhinosF1) [09:27:58] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10RhinosF1) [09:28:14] 10Operations, 10ops-codfw: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266221 (10RhinosF1) Already has a task from earlier today [09:29:34] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:31:52] godog: another disk for ms-be2017? 
^ [09:32:34] marostegui: looks like a duplicate [09:32:50] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:33:00] (03CR) 10Elukey: netbox: Move eqiad private to automation (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [09:33:21] ah #7 disk indeed [09:35:21] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) The telemetry of both ATS-TLS and Varnish is also blind to app-level, OS-level or hardware-level delays and buffering internal to our... [09:35:52] (03CR) 10Ayounsi: [C: 03+1] network: constants: add cloud floating IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/634050 (owner: 10Arturo Borrero Gonzalez) [09:36:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] network: constants: add cloud floating IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/634050 (owner: 10Arturo Borrero Gonzalez) [09:38:22] !log merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/634050 change to network data yaml [09:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_mobileapps_cluster_eqiad,swagger_check_sessionstore_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:43:00] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:45:37] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6570924, @Gilles wrote: > Before we dig into that, the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/635276/ |... [09:46:59] (03PS1) 10Filippo Giunchedi: pontoon: update hiera for o11y stack [puppet] - 10https://gerrit.wikimedia.org/r/635760 [09:47:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={cadvisor,purged,swagger_check_restbase_esams} site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:47:45] (03PS2) 10Ema: ATS: add metric trafficserver_tls_client_total_time [puppet] - 10https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) [09:48:03] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10akosiaris) >>! In T265904#6570744, @Volans wrote: > @jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the `networking` fact because it... 
[09:48:45] ACKNOWLEDGEMENT - HP RAID on ms-be2017 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:7 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T266222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:48:50] 10Operations, 10ops-codfw: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266222 (10ops-monitoring-bot) [09:48:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:48:59] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:40] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: update hiera for o11y stack [puppet] - 10https://gerrit.wikimedia.org/r/635760 (owner: 10Filippo Giunchedi) [09:53:07] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:20] godog: I am going to disable event handler for ms-be2017, otherwise it might keep creating raid tasks [09:54:13] 10Operations, 10ops-codfw: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266222 (10RhinosF1) [09:54:17] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10RhinosF1) [09:54:29] marostegui: ack, sounds good thank you very much [09:54:37] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10RhinosF1) Do those need silencing? [09:55:05] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10Marostegui) Please note that I have disabled event handler for this host for those checks, so it would need to be re-enabled once the disk is swapped [09:56:37] PROBLEM - DPKG on stat1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:00:53] (03PS1) 10Lucas Werkmeister (WMDE): Enable propagatePageDeletion on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635762 [10:00:55] PROBLEM - SSH on ms-be2030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:01:23] (03CR) 10Effie Mouzeli: [C: 03+1] Add new java images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/635743 (owner: 10Alexandros Kosiaris) [10:02:01] (03CR) 10Ema: [C: 03+2] Add _ to the allowed list of short url characters [puppet] - 10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) (owner: 10Ladsgroup) [10:03:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Typo, otherwise LGTM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/635753 (https://phabricator.wikimedia.org/T266194) (owner: 10JMeybohm) [10:04:25] 
PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:05:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/635577 (owner: 10Jbond) [10:06:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:06:19] jouncebot: refresh please [10:06:20] I refreshed my knowledge about deployments. [10:06:26] thank you :) [10:07:31] RECOVERY - SSH on ms-be2030 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:07:39] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:18] (03PS1) 10Klausman: amd-rocm: Add 3.7 to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/635764 (https://phabricator.wikimedia.org/T264408) [10:14:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:41] (03CR) 10Tobias Andersson: [C: 03+1] Enable propagatePageDeletion on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635762 (owner: 10Lucas Werkmeister (WMDE)) [10:18:26] (03CR) 10Elukey: amd-rocm: Add 3.7 to reprepro (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635764 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:21:59] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:17] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10jbond) >>! In T265904#6570744, @Volans wrote: > @jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the `networking` fact because it need... 
[10:22:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,swagger_check_restbase_esams} site={eqsin,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:23:14] (03PS2) 10Klausman: amd-rocm: Add 3.7 to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/635764 (https://phabricator.wikimedia.org/T264408) [10:23:21] (03CR) 10Klausman: amd-rocm: Add 3.7 to reprepro (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635764 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:24:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:24:37] (03CR) 10Elukey: [C: 03+1] amd-rocm: Add 3.7 to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/635764 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:27:21] !log volans@cumin1001 START - Cookbook sre.dns.netbox [10:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:29] (03CR) 10Klausman: [C: 03+2] amd-rocm: Add 3.7 to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/635764 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:28:43] <_joe_> jayme: I'm having problems accessing our docker image manifests right now [10:29:13] _joe_: I've an interview starting right now [10:29:54] <_joe_> ok, I can figure that out later :) [10:31:39] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:03] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10Volans) @jbond yeah I agree we can implement this assuming that all v6 addresses are 
mapped, with a sane fallback into the `networking[ip6]` one. @crusnov could you implement that lo... [10:33:27] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:43] (03PS1) 10Klausman: analytics: switch stat1005 to use rocm 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/635787 (https://phabricator.wikimedia.org/T264408) [10:37:59] (03CR) 10Klausman: [C: 03+2] analytics: switch stat1005 to use rocm 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/635787 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:38:03] (03PS1) 10Filippo Giunchedi: prometheus: proxy pushgateway through apache [puppet] - 10https://gerrit.wikimedia.org/r/635788 (https://phabricator.wikimedia.org/T249311) [10:40:26] (03PS2) 10Arturo Borrero Gonzalez: nftables: change ensure_package parameter datatype to String [puppet] - 10https://gerrit.wikimedia.org/r/635571 [10:42:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: change ensure_package parameter datatype to String [puppet] - 10https://gerrit.wikimedia.org/r/635571 (owner: 10Arturo Borrero Gonzalez) [10:43:25] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:34] !log installing freetype security updates for stretch/buster [10:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:44:26] (03CR) 10Volans: [C: 04-1] "See inline" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:45:02] (03CR) 10Volans: [C: 04-1] "Missing subnet inline" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:45:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:46:12] (03CR) 10Ayounsi: netbox: Move eqiad private to automation (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:47:27] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:14] (03PS7) 10Volans: netbox: Move eqiad private to automation [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:48:16] (03PS9) 10Volans: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:54:28] (03PS1) 10Klausman: analytics: Remove rocm 3.7 overrid for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/635789 
(https://phabricator.wikimedia.org/T264408) [10:54:56] (03PS2) 10Klausman: analytics: Remove rocm 3.7 override for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/635789 (https://phabricator.wikimedia.org/T264408) [10:55:50] (03CR) 10Klausman: [C: 03+2] analytics: Remove rocm 3.7 override for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/635789 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:58:27] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:37] PROBLEM - SSH on ms-be2055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:58:54] sorry for the noise on ms-be hosts -.- [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T1100). [11:00:05] Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:07] RECOVERY - SSH on ms-be2055 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:00:14] o/ [11:00:26] I guess you'll self-deploy then :-) [11:00:29] yeah :) [11:00:30] but I wave anyway! [11:00:33] godog: are we okay to go ahead? [11:01:14] Lucas_WMDE: sure! 
noisy but otherwise harmless alerts above [11:01:18] ok thanks [11:01:26] I need to prepare a test for my change first [11:01:33] thanks for checking in tho [11:01:35] (should’ve done that ahead of time but there’s no other changes anyways) [11:01:53] PROBLEM - MariaDB read only db_inventory on db2093 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.15-MariaDB-log, Uptime 85228s, event_scheduler: False, 12.63 QPS, connection latency: 0.003016s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:03:54] (03CR) 10Vgutierrez: [C: 03+1] ATS: add metric trafficserver_tls_client_total_time [puppet] - 10https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) (owner: 10Ema) [11:04:55] ACKNOWLEDGEMENT - MariaDB read only db_inventory on db2093 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.15-MariaDB-log, Uptime 85329s, event_scheduler: False, 12.62 QPS, connection latency: 0.003041s Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:08:52] (03PS1) 10Faidon Liambotis: turnilo/netflow: make the TCP flags map multi-line [puppet] - 10https://gerrit.wikimedia.org/r/635796 [11:08:54] (03PS1) 10Faidon Liambotis: turnilo/netflow: add more TCP flag combinations [puppet] - 10https://gerrit.wikimedia.org/r/635797 [11:13:26] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) I think that a major caveat with that 6-hour window counter-example is that it's looking at hit rate over everything and RUM data is... [11:17:30] (03PS1) 10Jon Harald Søby: Add ary, avk, awa, lld, shy and smn to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635813 [11:18:58] does anyone know how to gain insights into the eventbus job queue? e. g. how many jobs there are in it? 
[11:19:12] (specifically for testwiki) [11:19:25] showJobs.php is useless (T221224), runJobs.php doesn’t do anything either [11:19:25] T221224: showJobs.php maintenance script useless and misleading in production - https://phabricator.wikimedia.org/T221224 [11:19:42] Am I too late to add a patch to the ongoing deployment window? :) [11:20:08] Jhs: no, go ahead [11:20:19] I can’t deploy my config change if I don’t figure out how to test it [11:20:32] Lucas_WMDE, cool, it's this one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/635813 [11:20:51] I'll add it to [[wikitech:Deployments]] now [11:21:05] * Lucas_WMDE looks [11:21:10] right, seems i forgot that one :/ [11:21:12] !log correction: installing freetype security updates for buster (stretch TBD) [11:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:18] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635813 (owner: 10Jon Harald Søby) [11:21:29] LGTM Lucas_WMDE [11:22:21] Jhs: only one language code for the svwiktionary array? [11:22:27] do the other languages not have wiktionaries? [11:22:36] Lucas_WMDE, yeah, that one only includes the languages which have wiktionaries [11:22:39] ok [11:22:49] oh, but good point [11:22:52] hang on a minute [11:23:05] Lucas_WMDE: and no - I asked about that at https://lists.wikimedia.org/mailman/private/ops/2020-March/055633.html a while ago, but i didn't get any useful answer... [11:23:05] I forgot to check if there were new wiktionaries that aren't included in that one [11:23:22] I’m checking right now [11:23:25] Lucas_WMDE: maybe [11:23:26] https://grafana.wikimedia.org/d/000000105/job-queue-rate?orgId=1 [11:23:31] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:31] re: how many jobs in queue? 
[11:23:41] doesn't show per wiki tho [11:23:50] that looks suspiciously low [11:23:57] there’s also https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1 but a lot of boards there are blank [11:24:01] maybe due to the DC switch [11:24:20] hm, yeah i don't really know what is what with job queue [11:24:22] just found that one [11:24:32] that looks like old graphite metrics tho [11:24:34] so i dunno... [11:24:49] :/ [11:24:51] oo [11:24:52] this looks better [11:24:53] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 [11:24:54] mabye? [11:25:24] !log restarting apache and smokeping* on netmon* to pick up freetype update [11:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:49] hm, that looks better [11:25:57] but still lots of empty panels and no per-wiki info [11:26:26] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add ary, avk, awa, lld, shy and smn to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635813 (owner: 10Jon Harald Søby) [11:27:00] (03CR) 10Ayounsi: [C: 03+1] turnilo/netflow: make the TCP flags map multi-line [puppet] - 10https://gerrit.wikimedia.org/r/635796 (owner: 10Faidon Liambotis) [11:27:06] (03Merged) 10jenkins-bot: Add ary, avk, awa, lld, shy and smn to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635813 (owner: 10Jon Harald Søby) [11:27:45] (03CR) 10Ayounsi: [C: 03+1] turnilo/netflow: add more TCP flag combinations [puppet] - 10https://gerrit.wikimedia.org/r/635797 (owner: 10Faidon Liambotis) [11:28:25] Jhs: the change is on mwdebug2001, can you test it? 
[11:29:54] Lucas_WMDE, works correctly yes 👍 [11:30:20] \o/ [11:30:24] something works at least, yay [11:30:55] syncing [11:31:40] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime [11:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:59] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InterwikiSortOrders.php: Config: [[gerrit:635813|Add ary, avk, awa, lld, shy and smn to InterwikiSortOrders.php]] (duration: 01m 08s) [11:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:39] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:36] !log Compare s1-s8 tables - T261914 [11:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:42] T261914: Enable replication eqiad -> codfw and other checks - https://phabricator.wikimedia.org/T261914 [11:39:23] !log restarting nginx on acmechief*, debmonitor*, schema*, puppetdb* to pick up freetype update [11:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:21] aha, my recent changes finally appeared [11:43:28] so I think the job queue was just slow [11:43:41] but with that kind of delay I don’t think I’ll be able to test the config change before the window is over… [11:43:47] well, we’ll see [11:46:48] (03PS2) 10Lucas Werkmeister (WMDE): Enable propagatePageDeletion on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635762 [11:47:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable propagatePageDeletion on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635762 (owner: 10Lucas Werkmeister (WMDE)) [11:47:50] (03Merged) 10jenkins-bot: Enable propagatePageDeletion on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635762 (owner: 10Lucas Werkmeister (WMDE)) [11:48:19] ok, trying to test on 
mwdebug2001 now… [11:48:34] (03PS4) 10Ottomata: camus - use eventstreamconfig for eventgate-analytics-external streams [puppet] - 10https://gerrit.wikimedia.org/r/635632 (https://phabricator.wikimedia.org/T251609) [11:48:52] Lucas_WMDE: I'm not sure whether mwdebug2001 affects job queue [11:48:57] (it may or may not [11:49:04] the change would affect whether a job is being enqueued or not, I think [11:49:06] so it should still work [11:49:09] worth a try, at least [11:49:14] sure [11:50:14] (03CR) 10Ottomata: [C: 03+2] camus - use eventstreamconfig for eventgate-analytics-external streams [puppet] - 10https://gerrit.wikimedia.org/r/635632 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [11:54:19] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session updateVarDumps at mwmaint2001 (wiki=huwiki; T246539) [11:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:26] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [11:54:31] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10jbond) > Ill take a quick look today to see how to override the structured fact I had a quick look and i can't work out how to override the networking fact. I seem to be hitting the... 
[11:54:50] hm, on second thought, it might not work with mwdebug after all [11:54:54] (03PS1) 10Ottomata: camus - bump camus jar version for eventstreamconfig event jobs to wmf12 [puppet] - 10https://gerrit.wikimedia.org/r/635817 (https://phabricator.wikimedia.org/T251609) [11:55:50] I’ll sync it and then do another test [11:56:20] (03CR) 10Ottomata: [C: 03+2] camus - bump camus jar version for eventstreamconfig event jobs to wmf12 [puppet] - 10https://gerrit.wikimedia.org/r/635817 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [11:56:28] PROBLEM - SSH on ms-be2030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:57:25] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:635762|Enable propagatePageDeletion on Test Wikidata]], 1/2 (duration: 01m 02s) [11:57:26] RECOVERY - SSH on ms-be2030 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:57:28] (03CR) 10Jbond: [C: 03+2] stdlib: Switch to new Stdlib::Yes_no type where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/635577 (owner: 10Jbond) [11:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:10] RECOVERY - DPKG on stat1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:58:53] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:635762|Enable propagatePageDeletion on Test Wikidata]], 2/2 (duration: 01m 04s) [11:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:12] !log EU backport&config window done [11:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:43] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/634363 (owner: 10Dzahn) [12:03:51] !log [urbanecm@deploy1001 /srv/mediawiki-staging 
(master * u=)]$ sudo /usr/local/sbin/fix-staging-perms [12:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:54] (03PS1) 10Urbanecm: Do not log logins at loginwiki via CU [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635819 (https://phabricator.wikimedia.org/T253802) [12:04:57] (03CR) 10Urbanecm: [C: 03+2] Do not log logins at loginwiki via CU [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635819 (https://phabricator.wikimedia.org/T253802) (owner: 10Urbanecm) [12:05:40] (03Merged) 10jenkins-bot: Do not log logins at loginwiki via CU [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635819 (https://phabricator.wikimedia.org/T253802) (owner: 10Urbanecm) [12:06:14] (03PS1) 10Ottomata: camus - take 2 bump camus jar version for eventstreamconfig event jobs to wmf12 [puppet] - 10https://gerrit.wikimedia.org/r/635820 (https://phabricator.wikimedia.org/T251609) [12:07:15] (03CR) 10Ottomata: [V: 03+2 C: 03+2] camus - take 2 bump camus jar version for eventstreamconfig event jobs to wmf12 [puppet] - 10https://gerrit.wikimedia.org/r/635820 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [12:10:06] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 52ad2d4df1164dced684231c12aa64bd028b8ac9: Do not log logins at loginwiki via CU (T253802) (duration: 01m 06s) [12:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:45] T253802: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 [12:12:56] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:04] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:54] 10Operations, 10SRE-swift-storage, 10observability: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10fgiunchedi) 05Resolved→03Open Unfortunately reopening, we've been seeing failures (e.g. systemd, ssh) during latest codfw rebalances [12:19:01] (03PS1) 10Ottomata: Enable canary events for 3 eventgate-analytics bound streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635822 (https://phabricator.wikimedia.org/T251609) [12:19:44] (03CR) 10Ottomata: "FYI, this will emit artificial canary events once per hour into these streams." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635822 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [12:22:28] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:30] (03CR) 10Ottomata: [C: 03+2] Enable canary events for 3 eventgate-analytics bound streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635822 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [12:24:20] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: Enable canary events for 3 eventgate-analytics bound streams - T251609 (duration: 01m 05s) [12:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:26] T251609: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609 [12:33:44] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:07] (03PS1) 10Ottomata: camus - use eventstreamconfig for eventgate-analytics streams [puppet] - 10https://gerrit.wikimedia.org/r/635824 (https://phabricator.wikimedia.org/T251609) [12:37:19] (03CR) 10jerkins-bot: [V: 04-1] camus - use eventstreamconfig for eventgate-analytics streams [puppet] - 10https://gerrit.wikimedia.org/r/635824 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [12:39:56] PROBLEM - Device not healthy -SMART- on ms-be2017 is CRITICAL: cluster=swift device=None instance=ms-be2017 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2017&var-datasource=codfw+prometheus/ops [12:41:30] (03CR) 10Muehlenhoff: "Looks great, a few comments/nits inline!" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [12:42:04] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:54:53] (03PS2) 10CDanis: hieradata: add Swift account for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/635501 (https://phabricator.wikimedia.org/T246004) (owner: 10Filippo Giunchedi) [12:56:12] (03PS2) 10Ottomata: camus - use eventstreamconfig for eventgate-analytics streams [puppet] - 10https://gerrit.wikimedia.org/r/635824 (https://phabricator.wikimedia.org/T251609) [12:56:14] (03CR) 10CDanis: [C: 03+1] "it took me months to stop making the same 'wqds' typo :)" [puppet] - 10https://gerrit.wikimedia.org/r/635501 (https://phabricator.wikimedia.org/T246004) (owner: 10Filippo Giunchedi) [12:56:44] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [12:57:55] (03CR) 10CDanis: [C: 03+1] turnilo/netflow: make the TCP flags map 
multi-line [puppet] - 10https://gerrit.wikimedia.org/r/635796 (owner: 10Faidon Liambotis) [12:58:00] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26060/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/635824 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [12:58:16] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:58:49] (03CR) 10CDanis: [C: 03+1] "thanks! one nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635797 (owner: 10Faidon Liambotis) [13:00:04] longma and liw: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T1300). [13:01:39] !log pooling ldap-replica2004 T264388 [13:01:44] (03CR) 10Elukey: [C: 03+2] turnilo/netflow: make the TCP flags map multi-line [puppet] - 10https://gerrit.wikimedia.org/r/635796 (owner: 10Faidon Liambotis) [13:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:47] T264388: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 [13:02:02] (03PS1) 10Lars Wirzenius: all wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635827 [13:02:04] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635827 (owner: 10Lars Wirzenius) [13:02:41] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635827 (owner: 10Lars Wirzenius) [13:04:01] (03PS2) 10Faidon Liambotis: turnilo/netflow: add more TCP flag combinations [puppet] - 10https://gerrit.wikimedia.org/r/635797 [13:04:13] (03CR) 10jerkins-bot: [V: 04-1] turnilo/netflow: add more TCP flag combinations [puppet] - 10https://gerrit.wikimedia.org/r/635797 
(owner: 10Faidon Liambotis) [13:04:46] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.14 [13:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:23] (03PS3) 10Faidon Liambotis: turnilo/netflow: add more TCP flag combinations [puppet] - 10https://gerrit.wikimedia.org/r/635797 [13:06:42] (03CR) 10Elukey: [C: 03+2] turnilo/netflow: add more TCP flag combinations [puppet] - 10https://gerrit.wikimedia.org/r/635797 (owner: 10Faidon Liambotis) [13:10:15] !log depooling ldap-replica2001/2002 T264388 [13:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:21] T264388: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 [13:12:16] (03PS1) 10Ottomata: Enable canary events for all eventgate-analytics-external bound streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635832 (https://phabricator.wikimedia.org/T251609) [13:13:07] liw: know when train is done plz! i have a config change i'd like to deploy. :) [13:13:11] no hurry tho [13:14:15] ottomata, I've promoted to all servers now; nervously watching logs to see if I need to roll back. 
if you can wait say 45 or 60 minutes, that'd be great in case of rollback [13:14:18] (03PS1) 10Kormat: dbtools: Add error checking to check-master-heartbeat.sh [software] - 10https://gerrit.wikimedia.org/r/635833 [13:14:27] can do [13:14:39] ottomata, thanks [13:15:34] (03PS2) 10Kormat: dbtools: Add error checking to check-master-heartbeat.sh [software] - 10https://gerrit.wikimedia.org/r/635833 [13:16:15] (03CR) 10Volans: "some additional records inline, mostly for arzhel" (036 comments) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [13:16:23] (03PS1) 10Kormat: dbtools: Add master-pos script [software] - 10https://gerrit.wikimedia.org/r/635834 [13:16:58] (03CR) 10Klausman: [C: 03+1] dbtools: Add error checking to check-master-heartbeat.sh [software] - 10https://gerrit.wikimedia.org/r/635833 (owner: 10Kormat) [13:17:19] (03CR) 10Kormat: [C: 03+2] dbtools: Add error checking to check-master-heartbeat.sh [software] - 10https://gerrit.wikimedia.org/r/635833 (owner: 10Kormat) [13:17:49] (03Merged) 10jenkins-bot: dbtools: Add error checking to check-master-heartbeat.sh [software] - 10https://gerrit.wikimedia.org/r/635833 (owner: 10Kormat) [13:19:15] (03PS1) 10Muehlenhoff: Add ldap-replica1001/1002 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/635835 [13:19:18] (03CR) 10Volans: "some additional record for arzhel" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [13:21:02] (03PS1) 10Ottomata: camus - use eventstreamconfig for eventgate-main streams [puppet] - 10https://gerrit.wikimedia.org/r/635836 (https://phabricator.wikimedia.org/T251609) [13:21:04] (03CR) 10Marostegui: [C: 03+1] dbtools: Add master-pos script (031 comment) [software] - 10https://gerrit.wikimedia.org/r/635834 (owner: 10Kormat) [13:23:12] (03CR) 10Klausman: dbtools: Add master-pos script (032 comments) [software] - 10https://gerrit.wikimedia.org/r/635834 
(owner: 10Kormat) [13:23:32] (03CR) 10Filippo Giunchedi: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/635501 (https://phabricator.wikimedia.org/T246004) (owner: 10Filippo Giunchedi) [13:27:16] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be2016 is OK: OK: synced at Thu 2020-10-22 13:27:14 UTC. https://wikitech.wikimedia.org/wiki/NTP [13:27:24] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26061/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/635836 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:31:07] (03CR) 10Muehlenhoff: [C: 03+2] Add ldap-replica1001/1002 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/635835 (owner: 10Muehlenhoff) [13:34:59] (03PS1) 10Ottomata: camus - add extra backslash escapes for regex in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/635837 (https://phabricator.wikimedia.org/T251609) [13:36:55] (03CR) 10Ottomata: [C: 03+2] "I guess!" 
[puppet] - 10https://gerrit.wikimedia.org/r/635837 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:41:06] !log pooling ldap-replica1001/1002 T264388 [13:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:13] T264388: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 [13:44:55] (03PS1) 10Ottomata: camus - remove extra unneeded paren in regex [puppet] - 10https://gerrit.wikimedia.org/r/635838 [13:47:37] (03CR) 10Ottomata: [C: 03+2] camus - remove extra unneeded paren in regex [puppet] - 10https://gerrit.wikimedia.org/r/635838 (owner: 10Ottomata) [13:48:54] (03PS5) 10Filippo Giunchedi: WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 [13:49:23] (03PS2) 10Kormat: dbtools: Add master-pos script [software] - 10https://gerrit.wikimedia.org/r/635834 [13:49:52] (03CR) 10Filippo Giunchedi: "Thanks for the review" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [13:50:48] (03CR) 10jerkins-bot: [V: 04-1] WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [13:52:46] (03CR) 10Kormat: dbtools: Add master-pos script (033 comments) [software] - 10https://gerrit.wikimedia.org/r/635834 (owner: 10Kormat) [13:52:48] (03CR) 10Ayounsi: "All good." 
(035 comments) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [13:53:38] (03CR) 10Muehlenhoff: [C: 03+1] "Sync script looks good to me (sans tests failing)" [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [13:55:13] !log depooling ldap-eqiad-replica01/ldap-eqiad-replica02 T264388 [13:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:19] T264388: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 [13:56:06] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:33] (03CR) 10Ayounsi: netbox: Move eqiad public to automation (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [13:58:14] ottomata, go ahead and deploy your config change, train seems stable for now [13:58:30] k thanks [13:58:37] * liw goes into a meeting [13:59:17] (03CR) 10Ottomata: [C: 03+2] "FYI this will produce artificial canary events into these streams once an hour." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/635832 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:00:45] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: Enable canary events for all eventgate-analytics-external bound streams - T251609 (duration: 01m 02s) [14:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:51] T251609: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609 [14:02:20] (03CR) 10Kormat: [C: 03+1] "LGTM - i like it :)" [puppet] - 10https://gerrit.wikimedia.org/r/635575 (owner: 10Jbond) [14:03:17] (03PS1) 10Ottomata: camus - bump to camus jar version wmf12 for all camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/635841 (https://phabricator.wikimedia.org/T251609) [14:04:53] (03CR) 10Ottomata: [C: 03+2] camus - bump to camus jar version wmf12 for all camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/635841 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:05:27] !log bump camus version to wmf12 for all camus jobs. should be no-op now. 
- T251609 [14:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:37] oops wrong channel ^ [14:06:28] (03PS1) 10Vgutierrez: Release 8.0.8-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/635842 (https://phabricator.wikimedia.org/T265911) [14:06:52] (03PS3) 10Kormat: dbtools: Add master-pos script [software] - 10https://gerrit.wikimedia.org/r/635834 [14:09:26] (03PS1) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [14:09:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635227 (https://phabricator.wikimedia.org/T265969) (owner: 10Elukey) [14:11:22] (03CR) 10Elukey: [C: 03+2] Add sbisson to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/635227 (https://phabricator.wikimedia.org/T265969) (owner: 10Elukey) [14:11:24] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:27] (03PS2) 10Elukey: Add sbisson to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/635227 (https://phabricator.wikimedia.org/T265969) [14:13:02] !log upgrading mariadb on cloudcontrol1003, 1004, 1005 [14:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:41] (03CR) 10Muehlenhoff: [C: 03+1] "Great, looks good :-)" [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [14:17:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:19:12] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) 05Open→03Resolved ` elukey@krb1001:~$ sudo manage_principals.py create sbisson --email_address=sbisson@w... [14:21:12] (03PS2) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [14:22:25] (03CR) 10jerkins-bot: [V: 04-1] profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [14:25:20] (03PS3) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [14:30:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:31:51] (03PS1) 10Ottomata: camus::job - Remove support for 'dynamic-stream-config' [puppet] - 10https://gerrit.wikimedia.org/r/635847 (https://phabricator.wikimedia.org/T251609) [14:32:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:32:18] (03CR) 10jerkins-bot: [V: 04-1] camus::job - Remove support for 'dynamic-stream-config' [puppet] - 10https://gerrit.wikimedia.org/r/635847 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:33:28] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [14:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:02] (03PS2) 10Ottomata: camus::job - Remove support for 'dynamic-stream-config' [puppet] - 10https://gerrit.wikimedia.org/r/635847 (https://phabricator.wikimedia.org/T251609) [14:38:55] (03CR) 10Jbond: [C: 03+2] profile::mariadb: make use of Stdlib::Datasize [puppet] - 10https://gerrit.wikimedia.org/r/635575 (owner: 10Jbond) [14:39:43] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/26067/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/635847 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:41:16] (03PS4) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [14:41:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635562 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [14:42:48] (03CR) 10Volans: [C: 03+1] "[disclaimer] For this one I had to trust my diff scripts, too big to check all records one by one, but LGTM." 
[dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [14:46:50] (03PS5) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [14:47:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:53:11] (03PS6) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [14:54:31] (03CR) 10jerkins-bot: [V: 04-1] profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [14:54:40] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:56:40] !log installing remaining mariadb-10.3 updates for buster (as packaged in Debian, not the wmf-mariadb package) [14:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:36] 10Operations, 10Patch-For-Review: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (10MoritzMuehlenhoff) All new buster replicas are now pooled and the stretch ones have been depooled. I'll keep them around for another week just in case, then they are going to be removed. 
[14:59:17] (03PS7) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [15:01:14] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:01:17] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10SBisson) @elukey All good. Thanks! [15:03:38] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:05:16] (03CR) 10Volans: [C: 04-1] "still something to fix" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [15:06:12] (03CR) 10Filippo Giunchedi: [C: 03+2] tox: move grafana tests to python3 [puppet] - 10https://gerrit.wikimedia.org/r/635562 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [15:06:14] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:07:45] (03PS8) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [15:09:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 52 probes of 569 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:10:01] (03PS9) 10Elukey: 
profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [15:12:32] (03PS1) 10Jbond: 6.2.4: merge additional upstream changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/635848 [15:13:45] (03PS1) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [15:14:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:14:19] (03CR) 10jerkins-bot: [V: 04-1] Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 (owner: 10Ayounsi) [15:14:30] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/26074/" [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:19:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:21:01] (03PS1) 10Elukey: Add CNAME analytics-test-hive [dns] - 10https://gerrit.wikimedia.org/r/635850 (https://phabricator.wikimedia.org/T257412) [15:22:23] (03CR) 10Elukey: [C: 03+2] Add CNAME analytics-test-hive [dns] - 10https://gerrit.wikimedia.org/r/635850 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:28:39] (03CR) 10Ottomata: [C: 03+1] "Couple of naming nits but LGTM otherwise!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:31:46] (03CR) 10Elukey: profile::hive::client: refactor code to have settings only in one place (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:32:43] (03CR) 10Ottomata: [C: 03+1] profile::hive::client: refactor code to have settings only in one place (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:34:21] (03CR) 10Bstorm: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [15:37:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:38:20] (03PS2) 10Bstorm: dumps nfs: remove probably-unused firewall ports and services [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) [15:38:33] (03PS8) 10Volans: netbox: Move eqiad private to automation [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [15:38:36] (03PS10) 10Volans: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [15:38:51] (03CR) 10Volans: "fixed" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [15:39:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:39:43] (03CR) 10Bstorm: "This version should only affect labstore1006/7 and their clients." [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [15:40:19] (03PS2) 10JMeybohm: wikifeeds: Increase envoy CPU and memory ressources [deployment-charts] - 10https://gerrit.wikimedia.org/r/635753 (https://phabricator.wikimedia.org/T266194) [15:47:29] (03CR) 10BBlack: [C: 03+2] partman: document cacheproxy exceptions [puppet] - 10https://gerrit.wikimedia.org/r/635530 (https://phabricator.wikimedia.org/T156955) (owner: 10BBlack) [15:49:09] (03Abandoned) 10BBlack: VCL: use hfm for large_objects_cutoff [puppet] - 10https://gerrit.wikimedia.org/r/635318 (https://phabricator.wikimedia.org/T266040) (owner: 10BBlack) [15:49:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:49:56] (03PS1) 10BBlack: cache_text: large_objects_cutoff == backend limit [puppet] - 10https://gerrit.wikimedia.org/r/635852 (https://phabricator.wikimedia.org/T266040) [15:51:11] (03CR) 10CDanis: [C: 03+1] "I legitimately love the scare quotes around "Temporary"" [puppet] - 10https://gerrit.wikimedia.org/r/635852 (https://phabricator.wikimedia.org/T266040) (owner: 10BBlack) [15:51:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:00:05] jbond42 and cdanis: #bothumor I ♥ Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T1600). 
[16:04:15] (03PS10) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [16:04:18] (03CR) 10BBlack: [C: 03+2] cache_text: large_objects_cutoff == backend limit [puppet] - 10https://gerrit.wikimedia.org/r/635852 (https://phabricator.wikimedia.org/T266040) (owner: 10BBlack) [16:05:26] (03CR) 10jerkins-bot: [V: 04-1] profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [16:07:56] 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) [16:08:02] (03CR) 10Bstorm: "I'm going to merge this and deploy only to Toolsbeta for now, since it won't be needed in tools until the entire workflow is needed. Howev" [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) (owner: 10Bstorm) [16:08:11] (03CR) 10Bstorm: [C: 03+2] toolforge k8s: add a PodSecurityPolicy to be used by buildpacks [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) (owner: 10Bstorm) [16:09:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:09:58] 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Rmaung) [16:11:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:12:18] (03PS2) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [16:12:20] (03PS1) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [16:13:21] (03CR) 10Volans: [C: 04-1] "Some missing deletes inline" (034 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:14:36] (03PS11) 10Volans: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:15:06] (03CR) 10Volans: "addressed" (035 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:17:16] (03PS1) 10Effie Mouzeli: mediawiki: Check number of cached keys in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) [16:19:16] (03PS11) 10Elukey: profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) [16:21:07] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) I've compared 1+ million miss requests on cp3052 and cp3054 and looking at the most frequent miss URLs, there's no distinguishable pa... 
[16:21:15] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) Hi @Tsevener -- wanted to check in about something. Is the version... [16:21:18] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Check number of cached keys in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [16:21:29] (03CR) 10Effie Mouzeli: mediawiki: Check number of cached keys in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [16:21:29] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) p:05High→03Medium [16:24:22] (03PS1) 10Dave Pifke: Expand $wgImagePreconnect to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635856 (https://phabricator.wikimedia.org/T123582) [16:24:24] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/26077" [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [16:24:47] (03CR) 10Effie Mouzeli: "I have tested it on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [16:27:09] (03CR) 10Ahmon Dancy: [C: 03+1] "Just a minor nit. Looks good." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [16:27:14] 10Operations, 10Analytics, 10procurement: Check data currently stored on thorium and drop what it is not needed anymore - https://phabricator.wikimedia.org/T265971 (10elukey) [16:35:02] (03PS1) 10Andrew Bogott: Remaining eqiad1 cloudvirts to Buster [puppet] - 10https://gerrit.wikimedia.org/r/635858 (https://phabricator.wikimedia.org/T259399) [16:35:42] (03CR) 10CDanis: [C: 03+1] mediawiki: Check number of cached keys in php-check-and-restart.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [16:36:07] (03CR) 10Elukey: [C: 03+2] profile::hive::client: refactor code to have settings only in one place [puppet] - 10https://gerrit.wikimedia.org/r/635844 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [16:36:09] (03CR) 10Andrew Bogott: [C: 03+2] Remaining eqiad1 cloudvirts to Buster [puppet] - 10https://gerrit.wikimedia.org/r/635858 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [16:37:36] (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki: Check number of cached keys in php-check-and-restart.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [16:37:38] andrewbogott: shall I merge yours too? (also, hi!) [16:37:38] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386 (10Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces interface-range vlan-private1-a-codfw] - member ge-6/0/19; [edit interfaces interface-range disabled]... [16:37:50] elukey: yes please! thx [16:38:03] hi! 
[16:40:14] (03PS2) 10Ssingh: dnsdist: set and increase the value of setMaxTCPClientThreads [puppet] - 10https://gerrit.wikimedia.org/r/635309 [16:42:14] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:28] (03CR) 10Ssingh: [C: 03+2] dnsdist: set and increase the value of setMaxTCPClientThreads [puppet] - 10https://gerrit.wikimedia.org/r/635309 (owner: 10Ssingh) [16:44:39] (03PS12) 10Volans: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:46:35] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:24] (03CR) 10Effie Mouzeli: [C: 04-2] "Although I understand the reasoning behind what you are trying to do here, this script has the sole purpose to restart php-fpm or clear th" [puppet] - 10https://gerrit.wikimedia.org/r/635635 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [16:48:35] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386 (10Papaul) [16:49:54] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:49:59] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386 (10Papaul) a:05Marostegui→03Papaul [16:54:53] (03PS1) 10Dave Pifke: webperf: add fake keys for WebPageTest [labs/private] - 10https://gerrit.wikimedia.org/r/635859 (https://phabricator.wikimedia.org/T262962) [16:56:00] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:56:02] (03PS13) 10Volans: netbox: Move eqiad public to 
automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:56:38] (03CR) 10Volans: netbox: Move eqiad public to automation (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T1700). Please do the needful. [17:00:59] (03PS1) 10Elukey: profile::oozie::client: make config global [puppet] - 10https://gerrit.wikimedia.org/r/635861 (https://phabricator.wikimedia.org/T257412) [17:02:17] (03CR) 10jerkins-bot: [V: 04-1] profile::oozie::client: make config global [puppet] - 10https://gerrit.wikimedia.org/r/635861 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [17:03:17] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/6 { ... } + member ge-4/0/17; + member ge-4/0/18; + member ge-4/0/19; + membe... 
[17:03:32] (03PS2) 10Elukey: profile::oozie::client: make config global [puppet] - 10https://gerrit.wikimedia.org/r/635861 (https://phabricator.wikimedia.org/T257412) [17:07:58] !log bd808@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [17:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:22] (03PS3) 10Elukey: profile::oozie::client: make config global [puppet] - 10https://gerrit.wikimedia.org/r/635861 (https://phabricator.wikimedia.org/T257412) [17:09:19] (03CR) 10Effie Mouzeli: mediawiki: Check number of cached keys in php-check-and-restart.sh (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [17:09:24] (03PS2) 10Effie Mouzeli: mediawiki: Check number of cached keys in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) [17:11:09] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/26079/" [puppet] - 10https://gerrit.wikimedia.org/r/635861 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [17:11:28] (03CR) 10Ottomata: [C: 03+1] profile::oozie::client: make config global [puppet] - 10https://gerrit.wikimedia.org/r/635861 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [17:17:49] (03PS14) 10Volans: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:20:03] (03PS1) 10Herron: kibana: add kibana_ecs role [puppet] - 10https://gerrit.wikimedia.org/r/635864 [17:21:03] !log bd808@cumin1001 Added views for new wiki: smnwiki T264900 [17:21:03] !log bd808@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [17:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:11] T264900: Prepare and check storage layer for smnwiki - https://phabricator.wikimedia.org/T264900 [17:21:15] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:51] (03CR) 10Elukey: [C: 03+2] profile::oozie::client: make config global [puppet] - 10https://gerrit.wikimedia.org/r/635861 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [17:22:37] jouncebot: now [17:22:37] For the next 0 hour(s) and 37 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T1700) [17:22:42] jouncebot: next [17:22:42] In 0 hour(s) and 37 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T1800) [17:26:42] [HEADS-UP] We're about to migrate eqiad DNS records to the Netbox-generated ones, please do not merge DNS patches for the next ~30 minutes. No impact is expected. [17:27:05] \o/ [17:29:33] (03PS1) 10Volans: Mark eqiad as migrated to Netbox in the DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/635865 (https://phabricator.wikimedia.org/T258729) [17:30:24] (03PS1) 10Volans: dns: mark eqiad as migrated to Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635867 (https://phabricator.wikimedia.org/T258729) [17:30:53] chaomodus: ^^ the 2 above [17:31:33] (03CR) 10Volans: [C: 03+1] "LGTM with 99.3% confidence :)" [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:44:25] (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki: Check number of cached keys in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635854 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [17:45:23] (03PS1) 10Elukey: aptrepo: add bigtop14 to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/635872 [17:45:52] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:47] !log Updating eqiad private network DNS to automation [17:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:55] (03CR) 10CRusnov: [C: 03+2] netbox: Move eqiad private to automation [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:47:20] (03CR) 10Elukey: [C: 03+2] aptrepo: add bigtop14 to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/635872 (owner: 10Elukey) [17:49:42] !log add thirdparty/bigtop14 to buster-wikimedia [17:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:17] Did someone typo something in config? "PHP Notice: Undefined index: mwf1" is the top prod error right now. [17:50:23] !log cumin 'A:dns-rec' 'rec_control wipe-cache eqiad.wmnet$' - T258729 [17:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:29] T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1 - https://phabricator.wikimedia.org/T258729 [17:51:17] Hmm, no, that's coming from Parsoid, never mind. [17:52:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:53:10] (03CR) 10Volans: [C: 03+2] dns: mark eqiad as migrated to Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635867 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:53:46] (03CR) 10Volans: [C: 03+2] Mark eqiad as migrated to Netbox in the DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/635865 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:53:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:55:00] (03Merged) 10jenkins-bot: Mark eqiad as migrated to Netbox in the DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/635865 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T1800). [18:00:05] dpifke: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:55] I can do the deploy, just need someone with +2 in mw-config to approve https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/635856 [18:01:18] dpifke: here you go [18:01:18] (03CR) 10Urbanecm: [C: 03+2] Expand $wgImagePreconnect to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635856 (https://phabricator.wikimedia.org/T123582) (owner: 10Dave Pifke) [18:02:01] Awesome, thanks. [18:03:00] (03CR) 10jerkins-bot: [V: 04-1] Expand $wgImagePreconnect to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635856 (https://phabricator.wikimedia.org/T123582) (owner: 10Dave Pifke) [18:03:03] what? 
[18:03:06] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is WARNING: Test get In the News content responds with unexpected value at path [2]/links[0] [18:03:06] //wikitech.wikimedia.org/wiki/Wikifeeds [18:03:34] (03CR) 10Urbanecm: [C: 03+2] "restart CI" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635856 (https://phabricator.wikimedia.org/T123582) (owner: 10Dave Pifke) [18:04:29] (03PS8) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) [18:04:31] (03Merged) 10jenkins-bot: Expand $wgImagePreconnect to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635856 (https://phabricator.wikimedia.org/T123582) (owner: 10Dave Pifke) [18:04:46] dpifke: seems to be merged [18:04:46] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [18:05:22] Cool, getting set up in Kibana now. Will pull on mwdebug2001 first just to be safe. 
[18:06:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:07:09] !log Updating eqiad public network DNS to automation [18:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:25] (03CR) 10CRusnov: [C: 03+2] netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [18:08:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:11:33] Verified on mwdebug2001, syncing to the rest. [18:12:12] !log cumin 'A:dns-rec' 'rec_control wipe-cache wikimedia.org$' - T258729 [18:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:19] T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1 - https://phabricator.wikimedia.org/T258729 [18:12:35] !log dpifke@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Expand to group1 (T123582) (duration: 00m 56s) [18:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:41] T123582: Use "preconnect" resource hint for thumbnail host - https://phabricator.wikimedia.org/T123582 [18:14:37] dpifke: I recommend to use ' to wrap the message, seems $wgImagePreconnect was interpreted as shell variable :) [18:15:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:15:09] Yup, figured that out too 
late. :) [18:15:30] hehe :) [18:16:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:29:46] dpifke: git repo in `/srv/mediawiki-staging` has dirty status [18:29:52] can you fix it please? [18:30:03] Looking now, yes. [18:30:07] thanks [18:31:40] Fixed. [18:31:50] thanks [18:34:02] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:15] (03PS1) 10Ottomata: camus - run job 5 minutes earlier to not conflict with timing of canary events [puppet] - 10https://gerrit.wikimedia.org/r/635877 (https://phabricator.wikimedia.org/T251609) [18:34:30] !log adding mcrouter cert for deploy1002.eqiad.wmnet T265963 [18:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:36] T265963: replace production deployment servers - https://phabricator.wikimedia.org/T265963 [18:35:58] (03CR) 10Ottomata: [C: 03+2] camus - run job 5 minutes earlier to not conflict with timing of canary events [puppet] - 10https://gerrit.wikimedia.org/r/635877 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:40:05] (03PS1) 10Dzahn: add fake mcrouter certs for deploy1002.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/635879 (https://phabricator.wikimedia.org/T265963) [18:40:58] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for deploy1002.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/635879 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [18:41:01] (03PS2) 10Dzahn: add fake mcrouter certs for deploy1002.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/635879 (https://phabricator.wikimedia.org/T265963) [18:46:16] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for 
deploy1002.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/635879 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [18:47:20] (03PS2) 10Dzahn: site: add deployment_server role on deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635404 (https://phabricator.wikimedia.org/T265963) [18:49:40] the above DNS migration is all done now and you can resume normal operations in the operations/dns repository, thanks for your patience :) [18:49:55] thanks, congrats volans, big switch [18:50:10] chaomodus too, team effort :) [18:50:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:50:22] and also arzhel for all the networking stuff [18:50:25] :) [18:50:40] teamwork! [18:51:27] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/26080/" [puppet] - 10https://gerrit.wikimedia.org/r/635404 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [18:51:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:54:39] !log applying deployment_server role to new server deploy1002 - might show up in monitoring but is not prod yet, deploy1001 still is [18:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:56:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:57:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:41] 10Operations, 10Analytics-Clusters: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10Ottomata) a:03razzi [19:00:04] longma and liw: That opportune time is upon us again. Time for a Mediawiki train - American+European Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T1900). 
[19:01:14] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi disk replaced [19:02:17] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386 (10Papaul) [19:02:32] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386 (10Papaul) 05Open→03Resolved complete [19:13:37] !log deploy1002 currently cloning ALL the deployment repos - new setup [19:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:20] RECOVERY - Device not healthy -SMART- on ms-be2017 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2017&var-datasource=codfw+prometheus/ops [19:28:54] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) I have not received the PDUs yet [19:29:38] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Cmjohnson) Dell denied my request for the part, somehow it was only ordering outside of the US. I will need to call them [19:30:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:31:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:48:55] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Ladsgroup) [19:49:33] PROBLEM - Host checker.tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (checker.tools.wmflabs.org) [19:52:13] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [19:56:03] 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10Cmjohnson) I am not sure why this is not here yet. I am calling Dell to follow up [19:56:55] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Cmjohnson) @elukey I am sorry but I have to push these off to the first week of November. Let's coordinate a schedule next week. 
[19:58:36] (03Abandoned) 10Effie Mouzeli: push-notifications: enable egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/628336 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [19:59:20] (03PS8) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) [20:10:13] mutante: FYI deploy1002's keyholder is not armed [20:22:24] (03PS1) 10Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 [20:29:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:31:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:43] volans: yes, i just added the role a couple minutes ago, will take care of it, thx [20:35:06] just laptop battery low.. arg:) [20:36:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:36:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:36:11] ACKNOWLEDGEMENT - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
daniel_zahn new setup in progress https://wikitech.wikimedia.org/wiki/Keyholder [20:36:11] ACKNOWLEDGEMENT - mediawiki-installation DSH group on deploy1002 is CRITICAL: Host deploy1002 is not in mediawiki-installation dsh group daniel_zahn new setup in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:22] thx [21:01:47] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.5065 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [21:07:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1026 with 10G interfaces - https://phabricator.wikimedia.org/T266281 (10Andrew) [21:08:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [21:08:31] (03PS1) 10QChris: Add .gitreview [debs/kthxbye] - 10https://gerrit.wikimedia.org/r/635892 [21:08:33] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/kthxbye] - 10https://gerrit.wikimedia.org/r/635892 (owner: 10QChris) [21:11:27] (03PS1) 10Andrew Bogott: wmcs: change cloudvirt1025, 26 and 29 to ceph-enabled hypervisors [puppet] - 10https://gerrit.wikimedia.org/r/635893 (https://phabricator.wikimedia.org/T261132) [21:15:22] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: change cloudvirt1025, 26 and 29 to ceph-enabled hypervisors [puppet] - 10https://gerrit.wikimedia.org/r/635893 (https://phabricator.wikimedia.org/T261132) (owner: 10Andrew Bogott) [21:19:50] (03CR) 10ArielGlenn: "Giving you my +0 on this: no impact on dumpsdata boxes, but I did not look around at labstore 1006/7 mounts." 
[puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [21:22:30] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [21:24:33] bstorm: look at it this way, that's a "no blockers" vote :-D [21:28:56] :) [21:44:03] (03CR) 10Bstorm: "Moving this into "review" to make the UI take comments more easily. This needs some work before merge still (like emailing users)." [puppet] - 10https://gerrit.wikimedia.org/r/635888 (owner: 10Bstorm) [21:44:05] "+0" :) [21:46:20] (03PS3) 10Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 [21:50:08] (03CR) 10Dzahn: [C: 03+2] scap/dsh: add deploy1002 to mediawiki_installation hosts [puppet] - 10https://gerrit.wikimedia.org/r/635109 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [21:56:09] !log deploy1002 - scap pull and added to mediawiki-installation "dsh" group - will be part of scap trains but just like any appserver (T265963) [21:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:16] T265963: replace production deployment servers - https://phabricator.wikimedia.org/T265963 [21:59:08] (03PS4) 10Dzahn: add deploy1002 to deployment_hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) [22:03:12] !log deploy1002 - armed keyholder, all deployment keys loaded T265963 [22:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:18] T265963: replace production deployment servers - https://phabricator.wikimedia.org/T265963 [22:05:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:06:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:06:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:06:43] There is a bootstrap issue when adding a new deployment server. puppet will try to run a bunch of "scap deploy --init" commands but they all fail because ""Not the active deployment server" but also we don't want to switch the deployment server before the repos are initialized [22:07:09] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) @CDanis darn - yes, that's the right version, and we limited it to... [22:07:10] don't remember if the same happened when we did "tin -> deploy1001".. probably [22:07:12] (03Abandoned) 10Ahmon Dancy: Add --force flag to safe-service-restart.py [puppet] - 10https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [22:07:21] (03Abandoned) 10Ahmon Dancy: Add --force flag to php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635635 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [22:07:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:11:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:11:24] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) @CDanis Sorry I just noticed you do mention in the description th... [22:12:37] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to v0.13.0-a13 [vendor] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635782 (https://phabricator.wikimedia.org/T266285) [22:13:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:27:35] (03CR) 10Zhuyifei1999: "Does wall clock time make sense? I think initially the most affected processes are old screen sessions. Though, I can't think of a better " [puppet] - 10https://gerrit.wikimedia.org/r/635888 (owner: 10Bstorm) [22:42:16] !log ganeti1001 - adding 2 more vcpus to VM testreduce1001 - T257940 [22:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:23] T257940: eqiad: 1 VM request for testreduce - https://phabricator.wikimedia.org/T257940 [22:53:20] (03PS3) 10Dzahn: debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 [22:54:14] (03CR) 10Cwhite: [C: 03+1] "LGTM! Thanks for doing this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/635864 (owner: 10Herron) [22:54:35] (03CR) 10jerkins-bot: [V: 04-1] debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 (owner: 10Dzahn) [22:55:37] (03CR) 10Cwhite: [C: 03+1] prometheus: proxy pushgateway through apache [puppet] - 10https://gerrit.wikimedia.org/r/635788 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [22:58:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201022T2300). Please do the needful. [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:01:28] (03PS4) 10Dzahn: debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 [23:01:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:02:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:02:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:02:40] (03CR) 10jerkins-bot: [V: 04-1] debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 (owner: 10Dzahn) [23:04:58] (03PS5) 10Dzahn: debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 [23:05:55] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:02] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/26082/" [puppet] - 10https://gerrit.wikimedia.org/r/635665 (owner: 10Dzahn) [23:13:40] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/26083/" [puppet] - 10https://gerrit.wikimedia.org/r/635666 (owner: 10Dzahn) [23:15:23] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/26084/" [puppet] - 10https://gerrit.wikimedia.org/r/635664 (owner: 10Dzahn) [23:17:26] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/26085/" [puppet] - 10https://gerrit.wikimedia.org/r/635661 (owner: 10Dzahn) [23:24:49] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:39] (03PS2) 10Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 [23:27:19] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [23:38:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:39:16] (03PS3) 10Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 [23:39:42] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [23:41:26] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [23:41:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:44:08] (03PS4) 10Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 [23:46:18] (03CR) 10Bstorm: "The email function and the get group members functions I just tested manually on the bastion. I used the old-style paging for ldap3 becaus" [puppet] - 10https://gerrit.wikimedia.org/r/635888 (owner: 10Bstorm) [23:49:24] (03PS4) 10Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 [23:50:12] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [23:58:41] (03PS2) 10Dzahn: base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) [23:58:46] (03PS5) 10Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300)