[00:10:54] (03PS4) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [00:12:32] (03Abandoned) 10Ryan Kemper: relforge: New hosts are relforge100[3,4] [homer/public] - 10https://gerrit.wikimedia.org/r/663054 (https://phabricator.wikimedia.org/T274314) (owner: 10Ryan Kemper) [00:13:28] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [00:25:00] (03PS5) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [00:27:33] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [00:29:48] (03PS1) 10Ryan Kemper: wdqs: switch to raid0 for more space [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) [00:35:30] (03PS6) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [00:38:16] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [00:41:33] (03PS7) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [00:42:35] (03CR) 10Ryan Kemper: "Note: I haven't had time to add the runtime_description @property to change the log message - I'll look into that tomorrow (ran out of tim" [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 
(https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [00:44:39] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [00:51:24] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [00:53:44] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [00:55:44] PROBLEM - SSH on mw1270.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:58:58] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [01:00:02] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Scap, 10serviceops: Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10thcipriani) [01:01:28] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 71.38 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37 [01:01:30] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 72.2 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [01:06:57] 10Puppet, 10SRE, 10Release-Engineering-Team, 10Continuous-Integration-Config, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10thcipriani) [01:07:05] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10thcipriani) [01:20:37] 10SRE, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, 10Goal: Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10thcipriani) [01:24:12] PROBLEM - MariaDB Replica Lag: m2 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1432.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:24:14] PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1443.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:27:56] (03PS1) 10Dzahn: README.md: markdown syntax for code, links, small wording changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/681227 [01:37:16] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:08] (03PS1) 10Dzahn: create_new_service/scaffold.rb: add a missing / in path to generated files [deployment-charts] - 10https://gerrit.wikimedia.org/r/681229 [01:57:04] RECOVERY - SSH on mw1270.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:08:41] (03PS1) 10TrainBranchBot: Branch commit for 
wmf/1.37.0-wmf.2 [core] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681232 [02:08:44] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.2 [core] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681232 (owner: 10TrainBranchBot) [02:27:46] (03PS1) 10Krinkle: Send "0 edits" userEditCountBucket for anons [extensions/WikimediaEvents] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681156 (https://phabricator.wikimedia.org/T210106) [02:32:11] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.2 [core] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681232 (owner: 10TrainBranchBot) [03:27:06] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:34:15] (03Merged) 10jenkins-bot: Send "0 edits" userEditCountBucket for anons [extensions/WikimediaEvents] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681156 (https://phabricator.wikimedia.org/T210106) (owner: 10Krinkle) [03:36:31] (03CR) 10Krinkle: "Confirmed: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/wmf/1.37.0-wmf.2" [extensions/WikimediaEvents] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681156 (https://phabricator.wikimedia.org/T210106) (owner: 10Krinkle) [03:46:56] 10SRE, 10Release-Engineering-Team, 10Scap, 10Sustainability (Incident Followup), 10User-brennen: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10thcipriani) [03:48:35] 10SRE, 10Release-Engineering-Team, 10Scap: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 (10thcipriani) [03:53:54] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops: Deploy multi-site plugin to gerrit1001 and gerrit2001 - https://phabricator.wikimedia.org/T217174 (10thcipriani) [03:54:09] 10SRE, 10Release-Engineering-Team, 10Scap: Remove trusty-specific 
hacks from logstash_checker.py - https://phabricator.wikimedia.org/T216380 (10thcipriani) [03:54:20] 10SRE, 10Phabricator, 10Release-Engineering-Team, 10Sustainability (Incident Followup): Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10thcipriani) [03:55:20] 10SRE, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10Documentation: Document helm chart creation - https://phabricator.wikimedia.org/T213197 (10thcipriani) [03:55:29] 10SRE, 10Machine-Learning-Team, 10ORES, 10Release Pipeline, 10Release-Engineering-Team: Execution of the deployment pipeline should be configurable via .pipeline/config.yaml - https://phabricator.wikimedia.org/T210267 (10thcipriani) [03:56:30] 10SRE, 10Keyholder, 10Release-Engineering-Team: Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10thcipriani) [03:56:50] 10SRE, 10Release-Engineering-Team, 10Scap: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10thcipriani) [03:56:52] 10SRE, 10Release Pipeline, 10Release-Engineering-Team, 10Epic, 10Services (watching): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10thcipriani) [03:59:05] 10Puppet, 10SRE, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (10thcipriani) [03:59:46] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10observability, and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10thcipriani) [04:00:17] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Jenkins: Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826 (10thcipriani) [04:01:32] 
10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Release-Engineering-Team, 10User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10thcipriani) [04:01:59] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038 (10thcipriani) [04:02:53] 10SRE, 10Release-Engineering-Team, 10Services (later), 10Sustainability (Incident Followup): Review new service 'pre-deployment to production' checklist - https://phabricator.wikimedia.org/T141897 (10thcipriani) [04:07:34] 10SRE, 10Release-Engineering-Team, 10Epic, 10Performance Issue: [EPIC] Performance testing environment - https://phabricator.wikimedia.org/T67394 (10thcipriani) [05:07:58] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:20:24] (03PS1) 10Legoktm: role::lists: Drop pre-stretch compat [puppet] - 10https://gerrit.wikimedia.org/r/681241 [05:20:26] (03PS1) 10Legoktm: exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) [05:44:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/681241 (owner: 10Legoktm) [05:53:59] (03PS1) 10Legoktm: mailman: Remove IP exemption for 2018 hackathon [puppet] - 10https://gerrit.wikimedia.org/r/681243 [05:54:01] (03PS1) 10Legoktm: mailman: Update mod_security rules for mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681244 [06:08:40] 
ACKNOWLEDGEMENT - MariaDB Replica Lag: m2 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 18488.06 seconds Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:10:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2105.codfw.wmnet with reason: REIMAGE [06:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2074.codfw.wmnet with reason: REIMAGE [06:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2105.codfw.wmnet with reason: REIMAGE [06:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2073.codfw.wmnet with reason: REIMAGE [06:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2074.codfw.wmnet with reason: REIMAGE [06:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:45] (03PS1) 10Marostegui: install_server: Reimage db2127 to Buster. [puppet] - 10https://gerrit.wikimedia.org/r/681245 (https://phabricator.wikimedia.org/T280492) [06:15:47] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2127 to Buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/681245 (https://phabricator.wikimedia.org/T280492) (owner: 10Marostegui) [06:16:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2073.codfw.wmnet with reason: REIMAGE [06:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:05] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 3024417 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [06:20:13] RECOVERY - MariaDB Replica Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:23:49] (03PS1) 10Legoktm: mailman: Clean up monitoring host IP exemption [puppet] - 10https://gerrit.wikimedia.org/r/681246 [06:23:51] (03PS1) 10Legoktm: mailman: Redirect domain root to Postorius if mailman3 enabled [puppet] - 10https://gerrit.wikimedia.org/r/681247 [06:26:11] (03CR) 10Ladsgroup: [C: 03+1] exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [06:26:52] (03CR) 10Ladsgroup: [C: 03+1] "I saw it and wanted to a make a patch but forgot 😞" [puppet] - 10https://gerrit.wikimedia.org/r/681241 (owner: 10Legoktm) [06:27:18] (03CR) 10Ladsgroup: [C: 03+1] "lol" [puppet] - 10https://gerrit.wikimedia.org/r/681243 (owner: 10Legoktm) [06:28:57] (03CR) 10Ladsgroup: [C: 03+1] "I was thinking we can just simplify it and throttle all POST requests altogether?" 
[puppet] - 10https://gerrit.wikimedia.org/r/681244 (owner: 10Legoktm) [06:29:55] RECOVERY - MariaDB Replica Lag: m2 on db1117 is OK: OK slave_sql_lag Replication lag: 15.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:30:18] (03CR) 10Ladsgroup: [C: 03+1] "I'm not sure why alert is in public network but probably there is a good reason." [puppet] - 10https://gerrit.wikimedia.org/r/681246 (owner: 10Legoktm) [06:36:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2127.codfw.wmnet with reason: REIMAGE [06:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:26] 10ops-eqiad: htmldumper1001 power suply failure - https://phabricator.wikimedia.org/T280618 (10ayounsi) p:05Triage→03High [06:37:54] ACKNOWLEDGEMENT - IPMI Sensor Status on htmldumper1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T280618 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [06:38:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2127.codfw.wmnet with reason: REIMAGE [06:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:16] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 428, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:46:25] (03PS1) 10Marostegui: install_server: Reimgae db2074 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/681252 (https://phabricator.wikimedia.org/T280492) [06:47:58] (03CR) 10Marostegui: [C: 03+2] install_server: Reimgae db2074 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/681252 (https://phabricator.wikimedia.org/T280492) (owner: 10Marostegui) [06:52:28] (03CR) 10Ladsgroup: "I was thinking we would do the main page switch once more 
than half of the mailing lists have been migrated but I don't mind either way." [puppet] - 10https://gerrit.wikimedia.org/r/681247 (owner: 10Legoktm) [06:52:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15477 and previous config saved to /var/cache/conftool/dbconfig/20210420-065257-root.json [06:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:06] (03CR) 10Ladsgroup: "The commit message looks weird with the gerrit UI wrapping :/" [puppet] - 10https://gerrit.wikimedia.org/r/681247 (owner: 10Legoktm) [06:54:56] (03PS1) 10Marostegui: db1161: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/681286 (https://phabricator.wikimedia.org/T258361) [06:55:33] (03CR) 10Marostegui: [C: 03+2] db1161: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/681286 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [07:02:35] (03CR) 10Ladsgroup: [C: 03+1] Revert "Set wgPageImagesAPIDefaultLicense to 'any' for wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681137 (owner: 10Hoo man) [07:03:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2074.codfw.wmnet with reason: REIMAGE [07:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2074.codfw.wmnet with reason: REIMAGE [07:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:30] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: force creation of /var/log/swift symlink [puppet] - 10https://gerrit.wikimedia.org/r/681010 (https://phabricator.wikimedia.org/T280257) (owner: 10Filippo Giunchedi) [07:08:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repool db1161', diff saved to 
https://phabricator.wikimedia.org/P15478 and previous config saved to /var/cache/conftool/dbconfig/20210420-070801-root.json [07:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:50] (03PS2) 10WMDE-Fisch: [beta] Enable suggested values paramter in TemplateData and VisualEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678798 (https://phabricator.wikimedia.org/T271825) [07:12:57] FYI: Going to merge some beta labs only config patches [07:13:22] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678798 (https://phabricator.wikimedia.org/T271825) (owner: 10WMDE-Fisch) [07:13:36] (03PS2) 10WMDE-Fisch: [beta] Enable changes to the descriptions in the VE transclusion dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681011 (https://phabricator.wikimedia.org/T273425) [07:14:16] (03CR) 10Muehlenhoff: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/681116 (owner: 10Muehlenhoff) [07:14:29] (03Merged) 10jenkins-bot: [beta] Enable suggested values paramter in TemplateData and VisualEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678798 (https://phabricator.wikimedia.org/T271825) (owner: 10WMDE-Fisch) [07:16:32] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681011 (https://phabricator.wikimedia.org/T273425) (owner: 10WMDE-Fisch) [07:18:46] (03Merged) 10jenkins-bot: [beta] Enable changes to the descriptions in the VE transclusion dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681011 (https://phabricator.wikimedia.org/T273425) (owner: 10WMDE-Fisch) [07:20:23] Done [07:23:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15479 and previous config saved to /var/cache/conftool/dbconfig/20210420-072305-root.json [07:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:26:22] (03PS1) 10Legoktm: mailman3: Add monitoring for active processes [puppet] - 10https://gerrit.wikimedia.org/r/681287 [07:26:24] (03PS1) 10Legoktm: lists: Port check_mailman_queue to Python [puppet] - 10https://gerrit.wikimedia.org/r/681288 [07:26:26] (03PS1) 10Legoktm: lists: Use check_mailman_queue for monitoring mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/681289 (https://phabricator.wikimedia.org/T278280) [07:32:17] (03PS1) 10Legoktm: mailman3: Send queue lengths to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) [07:33:06] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2003.codfw.wmnet with reason: REIMAGE [07:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:09] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2003.codfw.wmnet with reason: REIMAGE [07:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15480 and previous config saved to /var/cache/conftool/dbconfig/20210420-073808-root.json [07:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:17] !log BGP: prioritize directly connected peers - T280054 [07:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:25] T280054: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 [07:40:15] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10fgiunchedi) [07:43:33] (03PS1) 10Marostegui: instances.yaml: Remove db1086 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/681292 (https://phabricator.wikimedia.org/T278229) [07:44:17] (03CR) 10Marostegui: [C: 03+2] 
instances.yaml: Remove db1086 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/681292 (https://phabricator.wikimedia.org/T278229) (owner: 10Marostegui) [07:46:56] 10SRE, 10netops, 10Patch-For-Review: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10ayounsi) Pushed to eqsin and confirmed working as expected: ` 182.79.252.0/24 *[BGP/170] 00:00:56, localpref 280, from 27.111.228.122 AS path: 9498 ?, validati... [07:47:08] (03CR) 10Ayounsi: [C: 03+2] BGP: prioritize directly connected peers [homer/public] - 10https://gerrit.wikimedia.org/r/680980 (https://phabricator.wikimedia.org/T280054) (owner: 10Ayounsi) [07:51:34] 10ops-eqiad: Can't access thanos-fe1001.mgmt - https://phabricator.wikimedia.org/T280623 (10fgiunchedi) [07:55:38] (03CR) 10Marostegui: [C: 03+1] mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:57:40] (03CR) 10Marostegui: mariadb: Setup 2 new host as temporary metadata database for media backups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:58:00] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10fgiunchedi) [07:59:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1086 from dbctl T278229', diff saved to https://phabricator.wikimedia.org/P15482 and previous config saved to /var/cache/conftool/dbconfig/20210420-075949-marostegui.json [07:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:58] T278229: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 [08:01:36] (03CR) 10Gehel: [C: 04-1] wdqs: switch to raid0 for more space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) (owner: 10Ryan Kemper) 
[08:01:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10dcaro) Hi! Thanks a lot for this, I'm eager to start playing with the new machines :) @RobH: I'm not sure if it's done later (l... [08:04:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter2003.codfw.wmnet [08:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:09] 10SRE, 10Machine-Learning-Team, 10serviceops: Kubernetes packages in Debian Bullseye - https://phabricator.wikimedia.org/T280625 (10MoritzMuehlenhoff) [08:05:43] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1002.eqiad.wmnet with reason: REIMAGE [08:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10dcaro) > @RobH: I'm not sure if it's done later (let me know if so), but the hosts should have the DNS entries *.eqiad.wmnet, n... 
[08:06:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter2003.codfw.wmnet [08:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:50] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1002.eqiad.wmnet with reason: REIMAGE [08:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:14] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter2004.codfw.wmnet [08:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:14] !log reprepro updating thirdparty/ceph-octopus repo [08:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2128.codfw.wmnet with reason: REIMAGE [08:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter2004.codfw.wmnet [08:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2128.codfw.wmnet with reason: REIMAGE [08:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:26] (03CR) 10Volans: "Thanks for migrating to the new class API Ryan! I've left some comments inline. 
Feel free to ping me on IRC if you want to chat about any " (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [08:12:31] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1004.eqiad.wmnet [08:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter1004.eqiad.wmnet [08:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:17] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1003.eqiad.wmnet [08:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter1003.eqiad.wmnet [08:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:37] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10Volans) Adding @jbond that might have some insights about it. My 2 cents are that historically we've used 15 as a safe batch number that should not cause issues (see for example https://wikite... 
[08:21:20] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:21:40] (03PS3) 10David Caro: prometheus: allow using the --storage.tsdb.retention.size option [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) [08:21:42] (03CR) 10David Caro: prometheus: allow using the --storage.tsdb.retention.size option (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) (owner: 10David Caro) [08:22:45] (03PS4) 10David Caro: prometheus: allow using the --storage.tsdb.retention.size option [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) [08:25:46] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor2001 is CRITICAL: 7.001 ge 4 Effie Mouzeli known issue, server will be retired soon https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops [08:25:54] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 65 probes of 717 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:26:49] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10MoritzMuehlenhoff) Another factor to keep on mind that is that we have three puppet masters each in eqiad/codfw, so requests better balanced. pm1003 was only added during the Buster upgrade e.... 
[08:28:53] (03PS1) 10David Caro: ceph.eqiad: enable octopus repositories [puppet] - 10https://gerrit.wikimedia.org/r/681296 (https://phabricator.wikimedia.org/T274566) [08:29:13] (03CR) 10Effie Mouzeli: [C: 03+1] remove mwdebug1003 from list of debug servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681144 (https://phabricator.wikimedia.org/T267248) (owner: 10Dzahn) [08:30:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:31:20] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 7 probes of 717 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:31:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) (owner: 10David Caro) [08:31:40] 10SRE, 10netops: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10ayounsi) Fun fact: `as-path-calc-length` is not on Junos 17, while `as-path-unique-count` is present... `as-path-calc-length` would have been better to take AS path prepending into consideration. But the c... 
[08:33:12] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/681296 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [08:33:31] (03PS1) 10Ayounsi: Replace as-path-calc-length with as-path-unique-count [homer/public] - 10https://gerrit.wikimedia.org/r/681297 (https://phabricator.wikimedia.org/T280054) [08:33:37] (03CR) 10David Caro: [C: 03+2] prometheus: allow using the --storage.tsdb.retention.size option [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) (owner: 10David Caro) [08:39:09] 10SRE, 10observability, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10fgiunchedi) [08:39:16] 10SRE, 10observability, 10Graphite: Include ADD operation in memcached stats and grafana dashboard - https://phabricator.wikimedia.org/T201016 (10fgiunchedi) [08:39:25] 10SRE, 10observability, 10Graphite: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821 (10fgiunchedi) [08:39:29] 10SRE, 10observability, 10Graphite: Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719 (10fgiunchedi) [08:39:32] 10SRE, 10observability, 10Graphite: grafana access control - https://phabricator.wikimedia.org/T108546 (10fgiunchedi) [08:39:39] 10SRE, 10observability, 10Documentation, 10Graphite: document graphite failover/backfill procedures - https://phabricator.wikimedia.org/T102575 (10fgiunchedi) [08:39:48] 10SRE, 10observability, 10Graphite: unused grafana-dashboard indices on elasticsearch / logstash - https://phabricator.wikimedia.org/T174172 (10fgiunchedi) [08:40:06] (03CR) 10Ayounsi: [C: 03+2] Replace as-path-calc-length with as-path-unique-count [homer/public] - 10https://gerrit.wikimedia.org/r/681297 (https://phabricator.wikimedia.org/T280054) (owner: 10Ayounsi) [08:40:45] (03Merged) 10jenkins-bot: Replace as-path-calc-length 
with as-path-unique-count [homer/public] - 10https://gerrit.wikimedia.org/r/681297 (https://phabricator.wikimedia.org/T280054) (owner: 10Ayounsi) [08:43:15] some incoming phab spam, apologies [08:43:27] 10SRE, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10fgiunchedi) [08:43:31] 10SRE, 10Wikimedia-Logstash, 10observability: Update saved / short links with objects in ELK7 - https://phabricator.wikimedia.org/T272016 (10fgiunchedi) [08:43:33] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400 (10fgiunchedi) [08:43:35] 10SRE, 10Citoid, 10Wikimedia-Logstash, 10observability, and 2 others: Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10fgiunchedi) [08:43:45] 10SRE, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi) [08:43:47] 10SRE, 10Privacy Engineering, 10Wikimedia-Logstash, 10observability, 10Privacy: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (10fgiunchedi) [08:43:49] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) [08:44:13] 10SRE, 10Icinga, 10observability, 10serviceops: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10fgiunchedi) [08:44:15] 10SRE, 10Icinga, 10observability: implement icinga paging for non-ops teams - https://phabricator.wikimedia.org/T141038 (10fgiunchedi) [08:44:17] 10SRE, 10Icinga, 10Scap, 10observability: expose hosts in maintenance state so we can prevent scap from running on them - 
https://phabricator.wikimedia.org/T100777 (10fgiunchedi) [08:45:33] (03PS2) 10David Caro: ceph.eqiad: enable octopus repositories [puppet] - 10https://gerrit.wikimedia.org/r/681296 (https://phabricator.wikimedia.org/T274566) [08:46:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:47:07] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/681298 (https://phabricator.wikimedia.org/T279037) [08:47:21] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/681298 (https://phabricator.wikimedia.org/T279037) (owner: 10Kosta Harlan) [08:48:59] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/681298 (https://phabricator.wikimedia.org/T279037) (owner: 10Kosta Harlan) [08:49:04] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site={codfw,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [08:49:29] 10SRE, 10Machine-Learning-Team, 10serviceops: Kubernetes packages in Debian Bullseye - https://phabricator.wikimedia.org/T280625 (10akosiaris) For the "main" set of clusters, we have devised a plan to adopt upstream binaries but without relying on their repos. The Policy as well as the reasoning for that (th... 
[08:49:36] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site={codfw,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [08:49:57] dcaro: mmhh I wasn't expecting prometheus to be restarted, oh well [08:50:12] too late now anyways, should be fine [08:50:12] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [08:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:20] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site={codfw,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [08:50:38] godog: is there anything I can do to help? [08:51:11] dcaro: I don't think so but thank you anyways! 
might as well let it run now [08:51:31] PROBLEM - Prometheus prometheus2003/ext restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ext [08:51:36] I'll silence the alerts related to prometheus restarted though [08:51:49] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1003.eqiad.wmnet with reason: REIMAGE [08:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph.eqiad: enable octopus repositories [puppet] - 10https://gerrit.wikimedia.org/r/681296 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [08:54:01] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1003.eqiad.wmnet with reason: REIMAGE [08:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:55] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10fgiunchedi) Good point re: more expensive catalogs! Also +1 on empirically testing batches, the most common use case for wanting fast rollouts in my mind is the edge caches; so starting with `... [08:58:09] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [08:58:09] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [08:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [09:00:49] (03PS3) 10David Caro: ceph.eqiad: enable octopus repositories [puppet] - 10https://gerrit.wikimedia.org/r/681296 (https://phabricator.wikimedia.org/T274566) [09:01:02] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/681296 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [09:02:37] (03PS1) 10David Caro: aptrepo.ceph: filter out debugging packages [puppet] - 10https://gerrit.wikimedia.org/r/681299 [09:11:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I didn't test the expression, but LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/681299 (owner: 10David Caro) [09:11:45] (03PS1) 10Kormat: docs: Update + improve WMCS bastion handling. [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/681303 [09:13:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={etcd,pdu_sentry4} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:13:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] "thanks ill merge this" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/681303 (owner: 10Kormat) [09:13:49] RECOVERY - Prometheus prometheus2003/ext restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ext [09:16:59] (03CR) 10Jbond: [C: 03+2] sudo: add new flag purge_sudoeres_d [puppet] - 10https://gerrit.wikimedia.org/r/681026 (owner: 10Jbond) [09:18:01] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [09:18:35] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [09:20:05] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [09:20:17] (03PS1) 10Volans: sre.hosts.remove-downtime: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 [09:20:44] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [09:20:44] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [09:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:39] (03PS1) 10Gergő Tisza: Update $wgGEHomepageNewAccountVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681309 (https://phabricator.wikimedia.org/T278123) [09:22:11] (03CR) 10Volans: "As a reply to my comment on Iaaec2e58f8f7304102670715c675e6f2b4184228 I've sent this patch. Let me know what you think." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/681308 (owner: 10Volans) [09:22:16] (03PS2) 10Gergő Tisza: Update $wgGEHomepageNewAccountVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681309 (https://phabricator.wikimedia.org/T278123) [09:22:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond) [09:26:54] (03CR) 10David Caro: "LGTM, did not test it though so leaving without score until/if I get around to doing it." [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 (owner: 10Volans) [09:27:28] (03CR) 10Gehel: [C: 04-1] "Thanks volans for all the comments! Only very minor additional ones from me." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [09:28:24] (03PS1) 10David Caro: toolforge.prometheus: add possibility to parametrize the retention [puppet] - 10https://gerrit.wikimedia.org/r/681314 [09:29:35] (03CR) 10Volans: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro) [09:30:46] 10SRE, 10Language-Team, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Fuzzy) As partial solution, it is possible to use `2*$wgMaxArticleSize` as limit for the page post-expand include size? [09:32:59] 10SRE, 10Privacy Engineering, 10Wikimedia-Logstash, 10observability, 10Privacy: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (10MoritzMuehlenhoff) We'll automatically get 2FA with CAS, but for Kibana that has proven impossible since... 
[09:33:09] (03PS1) 10Arturo Borrero Gonzalez: firewall: cloud-in4: new TCP port in cloudcontrol servers for rabbitmq [homer/public] - 10https://gerrit.wikimedia.org/r/681315 (https://phabricator.wikimedia.org/T280310) [09:33:11] (03PS1) 10Arturo Borrero Gonzalez: firewall: cloud-in: drop nova-fullstack term [homer/public] - 10https://gerrit.wikimedia.org/r/681316 (https://phabricator.wikimedia.org/T272587) [09:33:20] (03CR) 10Volans: "reply to comment" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [09:34:07] (03PS2) 10Daimona Eaytoy: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) [09:34:23] (03PS2) 10Arturo Borrero Gonzalez: firewall: cloud-in4: drop cloudcontrol-novafullstack term [homer/public] - 10https://gerrit.wikimedia.org/r/681316 (https://phabricator.wikimedia.org/T272587) [09:35:23] Hello, could somebody please take a look at puppet change https://gerrit.wikimedia.org/r/c/operations/puppet/+/680337 ? I'm not very familiar with pushing there and don't really know where to ask for review. 
[09:35:48] (03CR) 10Volans: "> Patch Set 1:" [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 (owner: 10Volans) [09:37:28] (03PS1) 10Jcrespo: dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) [09:38:07] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: REIMAGE [09:38:13] (03PS2) 10Jcrespo: dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) [09:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:48] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:38:51] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: REIMAGE [09:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: REIMAGE [09:40:12] (03PS3) 10Jcrespo: dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) [09:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:22] (03PS4) 10Jcrespo: dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) [09:41:47] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 
(https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:42:00] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: REIMAGE [09:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:14] (03PS5) 10Jcrespo: dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) [09:42:35] !log volans@cumin2001 START - Cookbook sre.hosts.remove-downtime for cumin2001.codfw.wmnet,cumin1001.eqiad.wmnet [09:42:38] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cumin2001.codfw.wmnet,cumin1001.eqiad.wmnet [09:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:48] (03PS6) 10Jcrespo: dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) [09:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:07] (03PS7) 10Jcrespo: dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) [09:44:54] (03PS8) 10Jcrespo: dbbackups: Add enabled parameter backups on the cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) [09:46:48] (03CR) 10Volans: [C: 04-1] sre.hosts.remove-downtime: add new cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 (owner: 10Volans) [09:48:26] (03CR) 10Hnowlan: "> Patch Set 1:" [homer/public] - 10https://gerrit.wikimedia.org/r/681059 (https://phabricator.wikimedia.org/T280155) (owner: 10Hnowlan) [09:49:07] 10SRE, 10observability, 10serviceops: conftool unable to announce changes to icinga.wikimedia.org:9200 - 
https://phabricator.wikimedia.org/T280642 (10jijiki) [09:49:18] 10SRE, 10observability, 10serviceops: conftool unable to announce changes to icinga.wikimedia.org:9200 - https://phabricator.wikimedia.org/T280642 (10jijiki) p:05Triage→03High [09:51:32] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/29121/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:51:53] (03CR) 10Jcrespo: "See what you think about my suggestion at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/681317" [puppet] - 10https://gerrit.wikimedia.org/r/681116 (owner: 10Muehlenhoff) [09:53:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge.prometheus: add possibility to parametrize the retention [puppet] - 10https://gerrit.wikimedia.org/r/681314 (owner: 10David Caro) [09:54:33] (03CR) 10Muehlenhoff: "Looks good, two comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:55:12] 10SRE, 10observability, 10serviceops: conftool unable to announce changes to icinga.wikimedia.org:9200 - https://phabricator.wikimedia.org/T280642 (10jijiki) 05Open→03Invalid [09:56:14] (03CR) 10David Caro: [C: 03+2] toolforge.prometheus: add possibility to parametrize the retention [puppet] - 10https://gerrit.wikimedia.org/r/681314 (owner: 10David Caro) [09:56:57] 10SRE, 10observability, 10serviceops: conftool unable to announce changes to icinga.wikimedia.org:9200 - https://phabricator.wikimedia.org/T280642 (10fgiunchedi) For future reference, we're explicitly allowing 9200/tcp only from a selection of hosts (deployment, cumin, etc) (in `modules/profile/manifests/tcp... [09:58:08] (03CR) 10Awight: "Thanks for the backport!" 
[extensions/WikimediaEvents] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681156 (https://phabricator.wikimedia.org/T210106) (owner: 10Krinkle) [09:58:29] (03CR) 10Jcrespo: "Perfectly great suggestions. Thanks, Marostegui, will implement as suggested!" [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [10:00:38] (03PS2) 10Volans: sre.hosts.remove-downtime: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 [10:01:06] (03CR) 10Volans: "Addressed comments" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 (owner: 10Volans) [10:02:26] (03PS5) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [10:04:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host aphlict1001.eqiad.wmnet [10:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict1001.eqiad.wmnet [10:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:08] (03CR) 10Hnowlan: [C: 03+2] cr/firewall: allow access to new AQS hosts [homer/public] - 10https://gerrit.wikimedia.org/r/681059 (https://phabricator.wikimedia.org/T280155) (owner: 10Hnowlan) [10:07:54] (03Merged) 10jenkins-bot: cr/firewall: allow access to new AQS hosts [homer/public] - 10https://gerrit.wikimedia.org/r/681059 (https://phabricator.wikimedia.org/T280155) (owner: 10Hnowlan) [10:11:07] !log opening access to cassandra on new AQS hosts (aqs101*) to analytics-in4 filter [10:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:55] (03CR) 10Jcrespo: "> Patch Set 8:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) 
[10:12:15] (03PS1) 10Jbond: debmonitor-client.postinst: ignore systemd-sysusers [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 [10:13:36] (03CR) 10Volans: [C: 03+1] "I didn't test it but looks sane" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 (owner: 10Jbond) [10:17:15] (03CR) 10Volans: [C: 03+1] "Actually I've a comment :)" (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 (owner: 10Jbond) [10:18:07] (03PS1) 10Zabe: Add NS_PROJECT alias for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681321 (https://phabricator.wikimedia.org/T280577) [10:20:05] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: prepare DNS records for cloudgw @ eqiad [dns] - 10https://gerrit.wikimedia.org/r/681322 (https://phabricator.wikimedia.org/T270704) [10:20:14] (03PS2) 10Jbond: debmonitor-client.postinst: ignore systemd-sysusers [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 [10:20:33] !log drain ganeti5001 [10:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:20] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) 05Resolved→03Open Reopening. The bug that @jeena reported in T279100#7000270 is repro... [10:21:59] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 (owner: 10Jbond) [10:25:43] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) So, the crux of the issue is at those 2 functions below `lang=python def pool(self,... 
[10:26:32] (03CR) 10Kormat: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) (owner: 10Razzi) [10:27:30] (03CR) 10Muehlenhoff: [C: 03+1] "> Patch Set 8:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:29:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 (owner: 10Jbond) [10:31:41] kormat: <3 [10:31:57] (03CR) 10Awight: [C: 03+1] "Yup, ready for backport." [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) (owner: 10WMDE-Fisch) [10:32:07] elukey: was i accidentally helpful again? damnit. i hate when that happens [10:33:06] kormat: it was for Razzi not for me, so you can feel better [10:33:19] (03CR) 10Jcrespo: [C: 03+2] "> Patch Set 8: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/681317 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:33:22] i guess... 
[10:34:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti5001.eqsin.wmnet [10:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:40] (03CR) 10Muehlenhoff: [C: 03+1] debmonitor-client.postinst: ignore systemd-sysusers (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 (owner: 10Jbond) [10:41:12] (03PS3) 10Jbond: debmonitor-client.postinst: ignore systemd-sysusers [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 [10:42:05] (03CR) 10Jbond: [C: 03+2] debmonitor-client.postinst: ignore systemd-sysusers (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 (owner: 10Jbond) [10:42:43] (03CR) 10Volans: "question inline" (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 (owner: 10Jbond) [10:43:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5001.eqsin.wmnet [10:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:38] (03Merged) 10jenkins-bot: debmonitor-client.postinst: ignore systemd-sysusers [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681319 (owner: 10Jbond) [10:47:34] (03PS1) 10ArielGlenn: eliminate double slash in construction of api path [dumps] - 10https://gerrit.wikimedia.org/r/681326 [10:48:48] (03PS1) 10Marostegui: production-m2.sql: Remove ALTER grant from adminlinkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/681327 (https://phabricator.wikimedia.org/T279053) [10:49:22] <_joe_> !log temporary installing some python packages on deploy1002 for testing [10:49:25] (03CR) 10Marostegui: [C: 03+2] production-m2.sql: Remove ALTER grant from adminlinkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/681327 (https://phabricator.wikimedia.org/T279053) (owner: 10Marostegui) [10:49:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210420T1100). [11:00:05] aharoni and CFisch_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:26] o/ [11:01:18] o/ [11:01:26] good afternoon Europe [11:01:50] hi aharoni! [11:02:33] (03PS4) 10Amire80: Add default import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) [11:02:54] I have a config patch to deploy. I haven't done backporting in a long while: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/676930 . Is it ready, or do I have to do any more preparations? [11:03:28] I can do my backport [11:03:40] but should all be good there [11:04:22] we had a run yesterday that failed due to unrelated broken stuff, but should be fine now ^^' [11:04:46] go ahead with the config [11:06:14] I am not doing it myself. I only scheduled it. I need someone to do the actual deployment. [11:06:20] Ahh kk [11:06:21] I can [11:06:26] looks ready [11:06:27] hi, jouncebot keeps forgetting to ping me [11:07:14] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . 
[11:07:16] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) (owner: 10Amire80) [11:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:45] that config change looks incomplete to me, according to the comment in the file :/ [11:08:01] (03Merged) 10jenkins-bot: Add default import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) (owner: 10Amire80) [11:08:16] <_joe_> (I am deploying an absolutely trivial change to the tls terminators, don't worry about me doing deploys right now, I won't touch anything critical) [11:08:19] e.g. wikipedias would be able to import from enwiki, arwiki, etc. by default, but alswiki’s import sources are still only meta? [11:09:12] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) (owner: 10WMDE-Fisch) [11:09:51] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . [11:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:06] Lucas_WMDE is it a regression for als? [11:10:19] if I understand correctly, no [11:10:21] just a no-op [11:10:45] aharoni: if you're talking about the config change for import sources, it won't change behavior of wikis with aby sources defined [11:10:48] *any [11:10:54] aharoni: should be on mwdebug1001 [11:11:11] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . [11:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:21] Urbanecm: a problem with that patch? [11:11:40] CFisch_WMDE: not really, go ahead :) [11:11:49] *phew* [11:11:58] Urbanecm, Lucas_WMDE - exactly, and that's the intention. Minimal change. 
If wikis that already have something defined want to change it, they are welcome to do it. [11:12:07] alright, then that sounds okay :) [11:12:15] aharoni: okay, then everything is all right :) [11:12:49] Now how do I use that browser extension? I think I have it installed in Firefox, but IIRC, I'm supposed to see a toolbar button, but I don't see it. [11:13:04] Yeah, it should show an icon [11:13:14] hmmm [11:13:15] (03CR) 10Marostegui: [C: 03+1] mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:13:25] Try clicking "customize toolbar" [11:13:28] Maybe it's hidden [11:13:35] https://support.mozilla.org/en-US/kb/customize-firefox-controls-buttons-and-toolbars [11:13:49] Right! Enabled now. [11:13:52] Thanks. [11:13:52] (03CR) 10Jbond: [C: 03+2] systemd::timer::job: update mailing script with additional options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [11:13:56] (03CR) 10Jbond: [C: 03+2] check_cumin_aliases: ensure script exits 1 on error [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond) [11:14:01] (03CR) 10Jbond: [C: 03+2] P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond) [11:14:12] aharoni: will do [11:14:22] OK, I've tested it, and it seems to work. [11:14:41] (03PS9) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 [11:14:47] I opened Special:Import before the deployment on tw.wikipedia.org, and it said "no import sources". [11:14:57] Now I opened the same page with the debug mode enabled, and I see the import form. 
[11:15:10] Cool [11:15:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10aborrero) thanks! [11:15:56] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:676930|Add default import sources (T214139)]] (duration: 00m 58s) [11:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:05] T214139: Create a meaningful default for ImportSources - https://phabricator.wikimedia.org/T214139 [11:16:20] Done [11:16:39] Will do my backport next [11:18:48] Thanks CFisch_WMDE, Lucas_WMDE, Urbanecm! [11:19:04] (03CR) 10Ayounsi: [C: 03+1] firewall: cloud-in4: drop cloudcontrol-novafullstack term [homer/public] - 10https://gerrit.wikimedia.org/r/681316 (https://phabricator.wikimedia.org/T272587) (owner: 10Arturo Borrero Gonzalez) [11:19:09] Any time [11:19:45] Your branch is ahead of 'origin/wmf/1.37.0-wmf.1' by 1 commit. o.O [11:20:22] CFisch_WMDE: lmk if I can help you with sth [11:20:29] (03CR) 10Ayounsi: "Are those ports documented somewhere?" [homer/public] - 10https://gerrit.wikimedia.org/r/681315 (https://phabricator.wikimedia.org/T280310) (owner: 10Arturo Borrero Gonzalez) [11:20:44] Does someone has time for a config deployment? 
[11:20:57] PROBLEM - Check systemd state on db2132 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-debian-version-textfile.timer,prometheus-nic-firmware-textfile.timer,prometheus_intel_microcode.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:50] (03PS1) 10ZPapierski: Move WCQS update to Tuesday [puppet] - 10https://gerrit.wikimedia.org/r/681331 (https://phabricator.wikimedia.org/T280022) [11:21:53] (03PS1) 10Jbond: changelog: create unique version number [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681332 [11:22:05] Zabe: We're currently working on the things on the board - if we have time left we could do another [11:22:32] ok, thx [11:25:01] (03PS1) 10Phuedx: Send "0 edits" userEditCountBucket for anons [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681334 (https://phabricator.wikimedia.org/T210106) [11:25:13] CI taking ages -.- [11:30:02] Urbanecm: thanks, yeah I'm just a bit confused I guess, if can just go on with git fetch when the status says that the branch is ahead of origin [11:30:29] without me doing anything yet [11:31:05] CFisch_WMDE: if you do `git log`, you'll see there's a security patch. That's okay and expected, just do git rebase, and proceed as normally [11:31:29] if there is something that _isn't_ a security patch, it means someone did something naughty :) [11:31:35] okay thanks for the explanation, yeah I saw that, still was a bit unsure [11:31:43] no problem :) [11:32:25] (03CR) 10ArielGlenn: [C: 03+2] "tested in deployment-prep." 
[dumps] - 10https://gerrit.wikimedia.org/r/681326 (owner: 10ArielGlenn) [11:32:57] (03Merged) 10jenkins-bot: eliminate double slash in construction of api path [dumps] - 10https://gerrit.wikimedia.org/r/681326 (owner: 10ArielGlenn) [11:34:01] (03Merged) 10jenkins-bot: Add filtering for the suggested values combo box [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) (owner: 10WMDE-Fisch) [11:34:23] (03PS2) 10Zabe: Add NS_PROJECT alias for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681321 (https://phabricator.wikimedia.org/T280577) [11:38:25] !log wmde-fisch@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/VisualEditor/modules/ve-mw/ui/pages/ve.ui.MWParameterPage.js: Backport: [[gerrit:679462|Add filtering for the suggested values combo box (T271898)]] (duration: 00m 58s) [11:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:36] T271898: Add combobox to VE to display suggested values - https://phabricator.wikimedia.org/T271898 [11:38:37] PROBLEM - Check systemd state on mw2274 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:40] (03CR) 10Jbond: [C: 03+2] changelog: create unique version number [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681332 (owner: 10Jbond) [11:39:40] MatmaRex: I could do yours now [11:39:47] (03PS1) 10Hnowlan: envoy-future: update envoyproxy to 1.16.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/681335 (https://phabricator.wikimedia.org/T280317) [11:40:09] CFisch_WMDE: thanks, i'm still around! 
[11:40:27] (03CR) 10WMDE-Fisch: [C: 03+2] "Backport" [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681153 (https://phabricator.wikimedia.org/T280433) (owner: 10Bartosz Dziewoński) [11:40:58] (03PS1) 10Hnowlan: api-gateway: use envoy 1.16.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/681336 (https://phabricator.wikimedia.org/T280317) [11:41:04] (the patch can be tested on de.wp main page when you have DiscussionTools enabled) [11:42:35] Ok Urbanecm, now I've got the next level of confusion [11:42:44] yes? [11:42:49] Changes not staged for commit: [11:42:49] modified: extensions/AbuseFilter (new commits) [11:43:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] api-gateway: use envoy 1.16.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/681336 (https://phabricator.wikimedia.org/T280317) (owner: 10Hnowlan) [11:43:03] that's security patches for AF [11:43:06] (03CR) 10Ppchelko: "I remember last time there was a problem that the image tagged with 1.16 was actually 1.15. Did you solve that?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/681336 (https://phabricator.wikimedia.org/T280317) (owner: 10Hnowlan) [11:43:16] okay probably the status after the feth [11:43:20] fetch [11:43:28] `$ cd /srv/mediawiki-staging/php-1.37.0-wmf.1/extensions/AbuseFilter/; git log` will show you that [11:43:29] right, that was in AbuseFilter [11:44:04] just remember to do `git submodule update extensions/DiscussionTools`, to make sure you don't touch AF unintentionally [11:44:21] +1 [11:44:41] RECOVERY - Check systemd state on db2132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:42] (03CR) 10Elukey: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) (owner: 10Razzi) [11:46:47] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[11:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:08] !log failover ganeti master in eqsin to ganeti5001 [11:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:31] PROBLEM - Check systemd state on logstash2024 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:37] PROBLEM - Check systemd state on ganeti1022 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:41] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:47] PROBLEM - Check systemd state on ganeti2014 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:47] PROBLEM - Check systemd state on wtp1044 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:53] PROBLEM - Check systemd state on mw1374 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:03] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:10] moritzm: is this expected? 
[11:49:15] PROBLEM - Check systemd state on cuminunpriv1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:25] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:26] (03Merged) 10jenkins-bot: CommentFormatter: Add 'ext-discussiontools-section' class instead of overwriting [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681153 (https://phabricator.wikimedia.org/T280433) (owner: 10Bartosz Dziewoński) [11:49:35] PROBLEM - Check systemd state on ganeti5001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:37] PROBLEM - Check systemd state on cp2031 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:45] PROBLEM - Check systemd state on db1141 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:51] PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:58] acknevermind its debmonitor i think a ganeti host failed first [11:50:01] PROBLEM - Check systemd state on bast2002 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:13] PROBLEM - Check systemd state on wtp1030 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:15] PROBLEM - Check systemd state on mw1283 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:25] PROBLEM - Check systemd state on mw2252 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:31] PROBLEM - Check systemd state on an-conf1002 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:11] jbond42: there's a traceback in the systemd-time-mail-wrapper, [11:51:18] cmd is a required arg [11:51:32] yes i see its due to the no-error patch sending a fix shortly [11:51:52] ack [11:52:11] MatmaRex: Should be on debug1001 [11:53:00] CFisch_WMDE: thanks, looks good [11:53:17] cool deploying now [11:54:11] PROBLEM - ganeti-wconfd running on ganeti5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:54:35] (03PS1) 10Cparle: Adjust min shard size for mediainfo dumps [puppet] - 10https://gerrit.wikimedia.org/r/681341 (https://phabricator.wikimedia.org/T280624) [11:54:45] !log wmde-fisch@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/DiscussionTools/includes/CommentFormatter.php: Backport: [[gerrit:681153|CommentFormatter: Add ext-discussiontools-section class instead of overwriting (T280433)]] (duration: 00m 57s) [11:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:54] T280433: DiscussionTools makes talk page sections uncollapsible on mobile - https://phabricator.wikimedia.org/T280433 [11:55:12] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[11:55:12] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [11:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:48] I could do yours now, will probably take a bit more than 5 min but I'm fine [11:55:54] (03PS1) 10Jbond: systemd::timer: fix ignore-errors switch [puppet] - 10https://gerrit.wikimedia.org/r/681342 [11:55:54] ganeti5003 is expected [11:55:54] Zabe ^ [11:56:04] cool [11:56:39] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681321 [11:56:43] thanks CFisch_WMDE [11:56:47] Always a bit nicer to have a +1 by someone else on the config patch, but it looks fine [11:56:53] MatmaRex: You're welcome :-) [11:57:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29122/console" [puppet] - 10https://gerrit.wikimedia.org/r/681342 (owner: 10Jbond) [11:57:17] (03CR) 10Muehlenhoff: [C: 03+1] systemd::timer: fix ignore-errors switch [puppet] - 10https://gerrit.wikimedia.org/r/681342 (owner: 10Jbond) [11:57:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::timer: fix ignore-errors switch [puppet] - 10https://gerrit.wikimedia.org/r/681342 (owner: 10Jbond) [11:57:55] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681321 (https://phabricator.wikimedia.org/T280577) (owner: 10Zabe) [12:01:08] (03Merged) 10jenkins-bot: Add NS_PROJECT alias for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681321 (https://phabricator.wikimedia.org/T280577) (owner: 10Zabe) [12:01:31] PROBLEM - Check systemd state on mw2380 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:32] PROBLEM - Check systemd state on ldap-replica2004 is CRITICAL: CRITICAL 
- degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:40] Zabe: Should be on debug1001 can you test if it's working? [12:02:52] PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:52] yes [12:03:43] ping me if it's fine :-) [12:03:45] CFisch_WMDE: works the supposed way [12:03:49] nice [12:03:55] deploying now [12:04:37] !log drain ganeti5003 [12:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:19] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:681321|Add NS_PROJECT alias for azwiki (T280577)]] (duration: 00m 57s) [12:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:28] T280577: Namespace changes in Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T280577 [12:05:43] Zabe: Done! [12:05:56] CFisch_WMDE: thanks for your help :) [12:06:20] You're welcome. Thanks Urbanecm for holding hands ;-)! [12:06:22] jbond42 moritzm volans FYI a bunch of alerts of debmonitor-client.service failed [12:06:26] any time :) [12:06:44] CFisch_WMDE: i see you did a namespace change. did you run namespaceDupes.php on that wiki? [12:06:48] godog: on it, a change brke the timer just pushing out a fix now [12:06:52] Backport window is done. Wasn't there also some command for that? ;-) [12:06:56] Urbanecm: nope [12:06:58] :-D [12:07:01] can you do it, please? [12:07:02] jbond42: ack, thanks [12:07:07] otherwise, it can leave some pages inaccessible [12:07:24] CFisch_WMDE: https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes is the docs [12:07:24] Urbanecm: Guess, I don;t know how .... 
:-/ [12:07:34] * CFisch_WMDE looks [12:07:34] i was looking for the docs: https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes :-) [12:07:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] README.md: markdown syntax for code, links, small wording changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/681227 (owner: 10Dzahn) [12:08:12] Pchelolo: hi, why was https://gerrit.wikimedia.org/r/c/mediawiki/core/+/680857 cherry picked? looks like it likely caused https://phabricator.wikimedia.org/T280655 and I'd like some context on why it wasn't on deployed normally on the train [12:08:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] create_new_service/scaffold.rb: add a missing / in path to generated files [deployment-charts] - 10https://gerrit.wikimedia.org/r/681229 (owner: 10Dzahn) [12:09:08] Majavah: it's 5AM for Peter, he likely won't respond [12:09:13] (03Merged) 10jenkins-bot: README.md: markdown syntax for code, links, small wording changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/681227 (owner: 10Dzahn) [12:09:17] Majavah: yeah, that's releng asked us to try deploying large patches as a backport instead of the train to decrease the load on train deployers, and we tried [12:09:25] oh, i was wrong :) [12:09:48] ah, thanks [12:09:49] (03Merged) 10jenkins-bot: create_new_service/scaffold.rb: add a missing / in path to generated files [deployment-charts] - 10https://gerrit.wikimedia.org/r/681229 (owner: 10Dzahn) [12:10:08] the experiment has demonstrated a number of flaws in the process, so it was a one-off [12:10:15] jouncebot: current [12:10:19] jouncebot: now [12:10:19] For the next 0 hour(s) and 49 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210420T1100) [12:10:19] jouncebot: now [12:10:20] For the next 0 hour(s) and 49 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210420T1100) [12:10:21] ah [12:10:49] 
I think I found the cause for that task, so I'll try to get an hotfix for this window if that's ok [12:10:58] Majavah: if it's merged in master, sure :) [12:11:28] Urbanecm: done I ran the script and it fixed a lot of pages [12:11:34] Still theres: [12:11:39] that might be hard to get in ~50min [12:11:41] https://www.irccloud.com/pastebin/RAhXp21j/ [12:12:14] right, that's because those pages now conflict (there was a page called both VP:Dərinlik and Project:Dərinlik) [12:12:27] d'oh [12:12:29] 10SRE, 10SRE-swift-storage: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123 (10fgiunchedi) [12:12:30] to fix it, add `--add-prefix=BROKEN`, run it again, and paste the output to the task, so they can delete the broken pages [12:12:35] 10SRE, 10SRE-swift-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 (10fgiunchedi) [12:12:44] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 42411184 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:12:59] Majavah: as for the root cause of the issue - do you need help figuring this one out? [12:13:16] Pchelolo: likely https://github.com/wikimedia/mediawiki/commit/46db19ecdfd5e6df1148e21e6f1f6721654f903e#diff-f82936dc23ff0e44824529773444b4a33c29539c3cd5854bed2c8dee106b2d43R127 [12:13:19] 10SRE, 10SRE-swift-storage: Some object-replicator log lines not making it to centrallog - https://phabricator.wikimedia.org/T264998 (10fgiunchedi) [12:13:24] Urbanecm: kewl, will do thanks! [12:13:29] Done now [12:13:33] oh yes. [12:13:53] if you make a hotfix, I'll make a followup with a proper test afterwards [12:13:56] my bad.. 
[12:14:10] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 5488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:14:13] CFisch_WMDE: cool, thanks :). Can you also !log that you ran it? [12:14:25] I can change that to the correct param, but I'm not sure where the csrf token should be validated now [12:14:36] PROBLEM - Check systemd state on es2030 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:11] just there in the api module [12:15:20] Urbanecm: Ah yeah, what's the template for that again :->? [12:15:36] PROBLEM - Check systemd state on lvs3005 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:40] just do !log # task number [12:15:42] PROBLEM - Check systemd state on mw2371 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:46] (at least that's why i usually do) [12:15:54] PROBLEM - Check systemd state on mw2357 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:30] <_joe_> is the backport window done? 
[12:16:34] PROBLEM - Check systemd state on ganeti1019 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:40] <_joe_> I have depoyments (multiple) scheduled [12:16:42] PROBLEM - Check systemd state on logstash1021 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:32] CFisch_WMDE: ^^ [12:17:52] _joe_: yeah I'm done [12:18:19] !log European mid-day backport window done [12:18:21] :-) [12:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:37] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [12:21:37] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [12:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:20] (03CR) 10Awight: [C: 03+1] "Thanks!" [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681334 (https://phabricator.wikimedia.org/T210106) (owner: 10Phuedx) [12:23:00] (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.2.7 [software/homer] - 10https://gerrit.wikimedia.org/r/681350 [12:23:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti5003.eqsin.wmnet [12:23:11] Urbanecm: Pchelolo: https://gerrit.wikimedia.org/r/681349 [12:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:43] SGTM, but not comfortable pressing the fancy button [12:25:02] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . 
[12:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:17] PROBLEM - Check systemd state on thanos-be1002 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:29] PROBLEM - Check systemd state on seaborgium is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:06] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [12:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:38] (03PS1) 10Majavah: Do not mark rollbacks as bot edits [core] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681366 (https://phabricator.wikimedia.org/T280655) [12:28:39] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [12:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:57] PROBLEM - Check systemd state on mw2288 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5003.eqsin.wmnet [12:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:14] (03CR) 10jerkins-bot: [V: 04-1] CHANGELOG: add changelogs for release v0.2.7 [software/homer] - 10https://gerrit.wikimedia.org/r/681350 (owner: 10Ayounsi) [12:29:58] Urbanecm: patch was +2'd, do you want to wait until it's merged on master before backporting as https://gerrit.wikimedia.org/r/c/mediawiki/core/+/681366? [12:30:27] _joe_: okay for me to do deployment? 
[12:31:02] <_joe_> Urbanecm: yes, I'll release a lower-prio service [12:31:08] okay :) [12:31:12] thanks [12:31:22] (03CR) 10Urbanecm: [C: 03+2] Do not mark rollbacks as bot edits [core] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681366 (https://phabricator.wikimedia.org/T280655) (owner: 10Majavah) [12:31:31] Majavah: let's wait for CI then :) [12:31:35] thanks! [12:34:13] PROBLEM - Check systemd state on mw1395 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:14] (03CR) 10Muehlenhoff: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/681116 (owner: 10Muehlenhoff) [12:34:19] (03PS3) 10Muehlenhoff: Install cumin2002 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/681116 [12:36:29] PROBLEM - Check systemd state on es2029 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:41] PROBLEM - Check systemd state on mw1393 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:45] RECOVERY - Check systemd state on ganeti1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:51] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:57] RECOVERY - Check systemd state on logstash1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:17] RECOVERY - Check systemd state on wtp1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:21] RECOVERY - Check systemd state on 
ldap-replica2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:31] RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:35] RECOVERY - Check systemd state on mw2288 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:37] RECOVERY - Check systemd state on ganeti2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:39] RECOVERY - Check systemd state on mw1283 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:45] RECOVERY - Check systemd state on lvs3005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:47] RECOVERY - Check systemd state on thanos-be1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:53] RECOVERY - Check systemd state on mw2252 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:53] RECOVERY - Check systemd state on mw1395 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:55] RECOVERY - Check systemd state on mw2274 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:57] RECOVERY - Check systemd state on mw2371 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:59] RECOVERY - Check systemd state on an-conf1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:05] RECOVERY - Check systemd state on seaborgium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:07] RECOVERY - Check systemd state on db1141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:07] RECOVERY - Check systemd state on wtp1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:09] RECOVERY - Check systemd state on cp2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:15] RECOVERY - Check systemd state on mw1374 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:17] RECOVERY - Check systemd state on mw2357 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:21] RECOVERY - Check systemd state on ganeti5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:21] RECOVERY - Check systemd state on es2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:25] RECOVERY - Check systemd state on logstash2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:29] RECOVERY - Check systemd state on bast2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:29] RECOVERY - Check systemd state on cuminunpriv1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:31] 
RECOVERY - Check systemd state on ganeti1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:33] RECOVERY - Check systemd state on mw1393 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:37] RECOVERY - Check systemd state on es2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:08] (03CR) 10Muehlenhoff: [C: 03+2] Install cumin2002 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/681116 (owner: 10Muehlenhoff) [12:41:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1165 to check its tables T280492', diff saved to https://phabricator.wikimedia.org/P15483 and previous config saved to /var/cache/conftool/dbconfig/20210420-124118-marostegui.json [12:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:28] T280492: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 [12:41:49] 10SRE, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T280668 (10ops-monitoring-bot) [12:42:33] !log uploaded PHP 7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf1 to component/php72 [12:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:23] RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:35] RECOVERY - Check systemd state on mw2380 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:08] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[12:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:29] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:06] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [12:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:37] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10jbond) FYI i ran puppet fleet wide today using a batch size of 40 and there was no issue. puppet master load rose from ~1.5 -> 4.0. you can see a [[ https://grafana-rw.wikimedia.org/d/000000... [12:51:07] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [12:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:05] (03PS1) 10Jbond: debmonitor: make cfssl the default PKI issuer for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/681353 [12:52:41] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [12:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:17] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [12:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:18] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . 
[12:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:04] !log reimaging cumin2002 to bullseye T276589 [12:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:15] T276589: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 [12:58:42] Majavah: still around? [12:58:47] yes [12:58:54] cool, do you have a testcase ready? [12:59:04] I think I have [12:59:06] cool [12:59:11] patch should merge any moment [12:59:33] (03PS1) 10Ladsgroup: snapshot: Migrate cronjobs in categoriesrdf to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/681357 (https://phabricator.wikimedia.org/T273673) [13:02:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2076.codfw.wmnet with reason: REIMAGE [13:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:27] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . 
[13:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2076.codfw.wmnet with reason: REIMAGE [13:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:04] * Majavah complains loudly about core/all-extensions selenium tests taking too long [13:05:14] (03PS1) 10Elukey: role::analytics_cluster::coordinator: move mysql to /srv/sqldata [puppet] - 10https://gerrit.wikimedia.org/r/681358 (https://phabricator.wikimedia.org/T278424) [13:06:43] (03PS1) 10Elukey: Move analytics-hive to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/681359 (https://phabricator.wikimedia.org/T278424) [13:07:22] (03PS1) 10Ladsgroup: snapshot: Migrate cronjobs in dump_global_blocks to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/681360 (https://phabricator.wikimedia.org/T273673) [13:07:48] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [13:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:04] (03Merged) 10jenkins-bot: Do not mark rollbacks as bot edits [core] (wmf/1.37.0-wmf.2) - 10https://gerrit.wikimedia.org/r/681366 (https://phabricator.wikimedia.org/T280655) (owner: 10Majavah) [13:08:06] FINALLY [13:08:12] Urbanecm: ^ [13:08:16] thanks [13:08:30] ... [13:08:40] ...what will you do if i tell you we cherry-picked it to a wrong branch [13:08:55] really [13:08:57] (03PS1) 10Urbanecm: Do not mark rollbacks as bot edits [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681367 (https://phabricator.wikimedia.org/T280655) [13:09:00] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . 
[13:09:00] yeah, we're on wmf.1 [13:09:04] wmf.2 will never be deployed [13:09:06] I selected wmf.1 [13:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:16] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "already passed in wmf.2 and master" [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681367 (https://phabricator.wikimedia.org/T280655) (owner: 10Urbanecm) [13:09:24] (03PS1) 10Ladsgroup: snapshot: Migrate cronjobs in dump_machine_vision to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/681361 (https://phabricator.wikimedia.org/T273673) [13:09:26] Majavah: the commit is in wmf.2 :D [13:09:31] :// [13:09:40] I forcemerged it, it's unlikely it'll fail anyway [13:09:53] Majavah: mwdebug1001 is yours :) [13:09:57] thanks, testing [13:09:58] please test [13:09:58] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [13:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:28] Urbanecm: working I think [13:12:41] cool, syncing [13:12:42] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cumin2002.codfw.wmnet with reason: REIMAGE [13:12:43] cc Pchelolo [13:12:46] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:14] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.2.7 [software/homer] - 10https://gerrit.wikimedia.org/r/681362 [13:13:21] thank you. 
I'll write some tests for this a bit later in a day as a followup [13:13:37] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/includes/actions/RollbackAction.php: ccbfcf28a2f507ed40dcf7af748c30f581b5079f: Do not mark rollbacks as bot edits (T280655) (duration: 00m 57s) [13:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:45] T280655: Incorrect bot marking for rollbacks - https://phabricator.wikimedia.org/T280655 [13:13:45] should be live [13:13:48] ty [13:13:50] thanks Pchelolo :) [13:14:23] _joe_: fyi, I'm done. [13:14:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cumin2002.codfw.wmnet with reason: REIMAGE [13:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:21] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:08] (03CR) 10Elukey: [C: 03+2] Move analytics-hive to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/681359 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [13:16:50] (03CR) 10Ayounsi: [C: 03+1] CHANGELOG: add changelogs for release v0.2.7 [software/homer] - 10https://gerrit.wikimedia.org/r/681362 (owner: 10Volans) [13:17:11] RECOVERY - Check systemd state on kafka-test1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:18:34] (03PS17) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [13:18:36] (03PS6) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [13:18:38] (03PS1) 10Jcrespo: dbbackups: Add s3 to db2139 (mariadb source backups) 
[puppet] - 10https://gerrit.wikimedia.org/r/681364 (https://phabricator.wikimedia.org/T280492) [13:18:51] (03PS2) 10Jcrespo: dbbackups: Add s3 to db2139 (mariadb source backups) [puppet] - 10https://gerrit.wikimedia.org/r/681364 (https://phabricator.wikimedia.org/T280492) [13:19:15] (03CR) 10Effie Mouzeli: [C: 04-2] "After some digging with Αλέξανδρος, we found that if we merge this, servers only in the videoscaler cluster will not end up in /etc/dsh/gr" [puppet] - 10https://gerrit.wikimedia.org/r/679258 (https://phabricator.wikimedia.org/T279100) (owner: 10Alexandros Kosiaris) [13:19:24] (03PS3) 10Jcrespo: dbbackups: Add s3 to db2139 (mariadb source backups) [puppet] - 10https://gerrit.wikimedia.org/r/681364 (https://phabricator.wikimedia.org/T280492) [13:19:50] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [13:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:04] (03CR) 10Effie Mouzeli: [C: 04-2] "see I1d40b352582f2784501ccc371609d05e1b43b25." [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: 10Dzahn) [13:20:43] (03CR) 10Jcrespo: "I will have to add db2139:s3 to tendril and zarcillo." [puppet] - 10https://gerrit.wikimedia.org/r/681364 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [13:21:17] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [13:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:06] milimetric: yt? [13:22:13] if i deployed https://gerrit.wikimedia.org/r/c/analytics/aqs/+/679398 could you test it? [13:22:27] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[13:22:32] oops wrong chat [13:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:41] <_joe_> ottomata: we need to deploy all eventgate-* services [13:22:52] _joe_: k [13:22:57] <_joe_> I already deployed all eventstreams-* stuff [13:23:03] <_joe_> I can do those as well [13:23:09] cool, should be fine [13:23:26] thanks for heads up, lemme know if i can help [13:25:03] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [13:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:57] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.2.7 [software/homer] - 10https://gerrit.wikimedia.org/r/681362 (owner: 10Volans) [13:26:16] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' . [13:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:55] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' . [13:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:13] !log otto@deploy1002 Started deploy [analytics/aqs/deploy@ad170d4]: deploy Refactor pageviews per-article endpoint [13:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:23] !log upgrading mw1261 to PHP 7.2.34 [13:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:49] (03CR) 10Volans: debmonitor: make cfssl the default PKI issuer for debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681353 (owner: 10Jbond) [13:35:05] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:35:05] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[13:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:52] (03PS2) 10Jbond: debmonitor: make cfssl the default PKI issuer for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/681353 [13:36:28] (03CR) 10Jcrespo: [C: 03+2] "I am merging now as this seems to me like a trivial change." [puppet] - 10https://gerrit.wikimedia.org/r/681364 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [13:36:30] (03CR) 10Jbond: "fixed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681353 (owner: 10Jbond) [13:36:31] !log otto@deploy1002 Finished deploy [analytics/aqs/deploy@ad170d4]: deploy Refactor pageviews per-article endpoint (duration: 05m 17s) [13:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:44] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) if you are a service owner of any of the servers listed below please check the box if you are able to depool the server on April 27th before 10:30am CT time set to replace the faulty s... [13:38:12] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:02] !log ayounsi@deploy1002 Started deploy [homer/deploy@759f82c]: Homer release v0.2.7 [13:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:15] !log ayounsi@deploy1002 Finished deploy [homer/deploy@759f82c]: Homer release v0.2.7 (duration: 00m 13s) [13:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:27] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[13:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:38] !log upgrading mw1276 to PHP 7.2.34 [13:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:23] bblack: ok thanks [13:50:30] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) [13:56:30] (03PS1) 10Elukey: install_server: add custom recipe for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681396 (https://phabricator.wikimedia.org/T278424) [13:57:09] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10fgiunchedi) [13:58:42] (03CR) 10Elukey: "Stevie this is a little weirder/cumbersome than the other ones, so if you don't have time don't worry. Hopefully in the future we'll be ab" [puppet] - 10https://gerrit.wikimedia.org/r/681396 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [13:59:19] !log otto@deploy1002 Started deploy [analytics/refinery@fc6767a]: Regular analytics weekly train [analytics/refinery@fc6767a] [13:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:41] (03PS2) 10Elukey: install_server: add custom recipe for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681396 (https://phabricator.wikimedia.org/T278424) [14:01:10] !log jiji@cumin1001 conftool action : set/pooled=no; selector: name=mw2280.codfw.wmnet,cluster=videoscaler [14:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:39] (03PS1) 10Volans: doc: fix documentation generation [software/homer] - 10https://gerrit.wikimedia.org/r/681399 [14:04:15] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[14:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:39] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:06:39] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:41] (03CR) 10Volans: [C: 03+2] doc: fix documentation generation [software/homer] - 10https://gerrit.wikimedia.org/r/681399 (owner: 10Volans) [14:07:52] (03PS1) 10Elukey: User the 'yarn' user in the Hadoop test yarn's UI [puppet] - 10https://gerrit.wikimedia.org/r/681400 (https://phabricator.wikimedia.org/T277062) [14:08:26] elukey: is that instead of 'dr. who'? [14:08:54] (03PS2) 10ArielGlenn: Adjust min shard size for mediainfo dumps [puppet] - 10https://gerrit.wikimedia.org/r/681341 (https://phabricator.wikimedia.org/T280624) (owner: 10Cparle) [14:09:03] ottomata: yes correct, I just found out that Hue cannot see logs, it seems that 'yarn' is the suggested way [14:09:13] huh cool [14:09:15] nice [14:09:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29123/console" [puppet] - 10https://gerrit.wikimedia.org/r/681400 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [14:09:19] also it seems a little more consistent, what do you think? [14:09:21] glad not to see 'dr. who' anhmore too [14:09:24] yeah +1 for sure [14:09:29] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:09:29] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[14:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:41] super testing it :0 [14:09:42] :) [14:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] (03CR) 10Elukey: [V: 03+1 C: 03+2] User the 'yarn' user in the Hadoop test yarn's UI [puppet] - 10https://gerrit.wikimedia.org/r/681400 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [14:11:03] (03CR) 10ArielGlenn: [C: 03+2] Adjust min shard size for mediainfo dumps [puppet] - 10https://gerrit.wikimedia.org/r/681341 (https://phabricator.wikimedia.org/T280624) (owner: 10Cparle) [14:11:19] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:33] (03CR) 10Cwhite: [C: 03+2] logstash: limit apifeatureusage curator job to jobs_host [puppet] - 10https://gerrit.wikimedia.org/r/680399 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [14:14:09] !log otto@deploy1002 Finished deploy [analytics/refinery@fc6767a]: Regular analytics weekly train [analytics/refinery@fc6767a] (duration: 14m 50s) [14:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:23] <_joe_> ottomata: eventgate-logging-external has other changes right now other than envoy [14:14:31] <_joe_> should I go on and deployt it anyways? 
[14:14:34] (03Merged) 10jenkins-bot: doc: fix documentation generation [software/homer] - 10https://gerrit.wikimedia.org/r/681399 (owner: 10Volans) [14:14:54] !log otto@deploy1002 Started deploy [analytics/refinery@fc6767a]: Regular analytics weekly train - an-launcher1002 retry\ [analytics/refinery@fc6767a] [14:14:57] !log otto@deploy1002 Finished deploy [analytics/refinery@fc6767a]: Regular analytics weekly train - an-launcher1002 retry\ [analytics/refinery@fc6767a] (duration: 00m 03s) [14:15:01] <_joe_> oh intrestingly only in codfw [14:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:10] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:15:56] !log otto@deploy1002 Started deploy [analytics/refinery@fc6767a]: Regular analytics weekly train - an-launcher1002 retry\ [analytics/refinery@fc6767a] [14:16:00] !log otto@deploy1002 Finished deploy [analytics/refinery@fc6767a]: Regular analytics weekly train - an-launcher1002 retry\ [analytics/refinery@fc6767a] (duration: 00m 03s) [14:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:11] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:16:11] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[14:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:30] !log otto@deploy1002 Started deploy [analytics/refinery@fc6767a]: Regular analytics weekly train - an-launcher1002 retry [analytics/refinery@fc6767a] [14:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:33] !log otto@deploy1002 Finished deploy [analytics/refinery@fc6767a]: Regular analytics weekly train - an-launcher1002 retry [analytics/refinery@fc6767a] (duration: 00m 03s) [14:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:05] !log otto@deploy1002 Started deploy [analytics/refinery@fc6767a] (thin): Regular analytics weekly train THIN [analytics/refinery@fc6767a] [14:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:13] !log otto@deploy1002 Finished deploy [analytics/refinery@fc6767a] (thin): Regular analytics weekly train THIN [analytics/refinery@fc6767a] (duration: 00m 07s) [14:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:22] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:17:23] _joe_: oh that makes sense [14:17:32] are those networkpolicy changes? [14:17:36] <_joe_> yes [14:17:41] kafka has had new hosts in eqiad [14:17:44] but not codfw yet [14:17:47] <_joe_> ok [14:17:47] so it wasn't deployed to codfw [14:17:49] you can proceed [14:17:54] <_joe_> ok, going on then [14:18:20] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . 
[14:18:20] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:53] !log otto@deploy1002 Started deploy [analytics/refinery@fc6767a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fc6767a] [14:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:07] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:22:08] (03PS1) 10Muehlenhoff: Make cumin2002 a Cumin host [puppet] - 10https://gerrit.wikimedia.org/r/681404 (https://phabricator.wikimedia.org/T276589) [14:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:37] <_joe_> hnowlan: did you release api-gateway? [14:23:36] (03CR) 10ArielGlenn: "+1, will merge once we have a live test of MAILTO by the end of the week." [puppet] - 10https://gerrit.wikimedia.org/r/681357 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:23:48] (03CR) 10ArielGlenn: "+1, will merge once we have a live test of MAILTO by the end of the week." [puppet] - 10https://gerrit.wikimedia.org/r/681360 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:23:51] _joe_: not yet [14:23:57] (03CR) 10ArielGlenn: "+1, will merge once we have a live test of MAILTO by the end of the week." [puppet] - 10https://gerrit.wikimedia.org/r/681361 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:24:20] _joe_: if you're feeling generous ;) https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/681335 [14:24:58] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[14:24:58] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:12] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] envoy-future: update envoyproxy to 1.16.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/681335 (https://phabricator.wikimedia.org/T280317) (owner: 10Hnowlan) [14:25:49] !log otto@deploy1002 Finished deploy [analytics/refinery@fc6767a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fc6767a] (duration: 04m 56s) [14:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:46] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:27:46] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:06] !log installing exim updates from Buster point release [14:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:20] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [14:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:15] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1:" [homer/public] - 10https://gerrit.wikimedia.org/r/681315 (https://phabricator.wikimedia.org/T280310) (owner: 10Arturo Borrero Gonzalez) [14:34:08] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[14:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:44] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [14:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [14:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:46] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . [14:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:24] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [14:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:58] (03CR) 10Andrew Bogott: "I'm happy to make a doc page about these if someone can suggest where the right place is. Maybe right here in firewall.conf?" 
[homer/public] - 10https://gerrit.wikimedia.org/r/681315 (https://phabricator.wikimedia.org/T280310) (owner: 10Arturo Borrero Gonzalez) [14:44:23] (03CR) 10Andrew Bogott: [C: 03+1] firewall: cloud-in4: new TCP port in cloudcontrol servers for rabbitmq [homer/public] - 10https://gerrit.wikimedia.org/r/681315 (https://phabricator.wikimedia.org/T280310) (owner: 10Arturo Borrero Gonzalez) [14:51:54] (03PS2) 10Ottomata: test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [14:53:24] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:53:27] (03PS3) 10Ottomata: test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [14:54:43] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:58:18] PROBLEM - Disk space on otrs1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=otrs1001&var-datasource=eqiad+prometheus/ops [14:58:22] (03PS4) 10Ottomata: test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [14:58:58] !log volker-e@deploy1002 Started deploy [design/style-guide@c4d8314]: Deploy design/style-guide: c4d8314 “Components”: Fix “Buttons” active states (#460) [14:59:05] !log volker-e@deploy1002 Finished deploy [design/style-guide@c4d8314]: Deploy design/style-guide: c4d8314 “Components”: Fix “Buttons” active states (#460) 
(duration: 00m 07s) [14:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:43] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:00:30] PROBLEM - Disk space on phab2001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab2001&var-datasource=codfw+prometheus/ops [15:00:52] (03PS5) 10Ottomata: test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [15:00:55] 10SRE, 10vm-requests: eqiad: 1 VM %request for eventlog - https://phabricator.wikimedia.org/T280679 (10hnowlan) [15:01:09] 10SRE, 10vm-requests: eqiad: 1 VM %request for eventlog - https://phabricator.wikimedia.org/T280679 (10hnowlan) a:03hnowlan [15:01:27] 10SRE, 10vm-requests: eqiad: 1 VM %request for eventlog - https://phabricator.wikimedia.org/T280679 (10hnowlan) [15:02:06] PROBLEM - Disk space on phab1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1001&var-datasource=eqiad+prometheus/ops [15:02:39] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:51] (03PS1) 10Andrew Bogott: Make cloudvirt1040-46 into hypervisors [puppet] - 10https://gerrit.wikimedia.org/r/681413 (https://phabricator.wikimedia.org/T275081) [15:04:23] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudvirt1040-46 into hypervisors [puppet] - 
10https://gerrit.wikimedia.org/r/681413 (https://phabricator.wikimedia.org/T275081) (owner: 10Andrew Bogott) [15:05:25] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29124/console" [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:07:12] RECOVERY - Check systemd state on logstash2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:14] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:35] (03PS1) 10Andrew Bogott: Add host hiera for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/681415 (https://phabricator.wikimedia.org/T275081) [15:07:46] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:48] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:06] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:08:10] RECOVERY - Check systemd state on logstash2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:29] (03CR) 10Andrew Bogott: [C: 03+2] Add host hiera for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/681415 (https://phabricator.wikimedia.org/T275081) (owner: 10Andrew Bogott) [15:08:58] RECOVERY - Check systemd state on logstash2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:17] (03PS6) 
10Ottomata: test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [15:09:33] 10SRE, 10ops-codfw, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudcephmon2001-dev - https://phabricator.wikimedia.org/T279662 (10Papaul) [15:09:56] 10SRE, 10ops-codfw, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudcephmon2001-dev - https://phabricator.wikimedia.org/T279662 (10Papaul) 05Open→03Resolved complete [15:11:22] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) I am planning on moving some servers in B4 and C4 to make room on the bottom of the racking to be able to rack backup2005 and 2006 [15:13:55] jouncebot: next [15:13:55] In 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210420T1600) [15:14:25] !log hnowlan@cumin1001 START - Cookbook sre.ganeti.makevm for new host eventlog1003.eqiad.wmnet [15:14:26] trading empty puppet request window for mediawiki-config request window [15:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:41] (03PS3) 10Razzi: clouddb: enable alerting for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) [15:15:58] (03CR) 10jerkins-bot: [V: 04-1] clouddb: enable alerting for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [15:18:07] mutante: I'm going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681144 [15:18:12] (as discussed in PM) [15:18:19] (03CR) 10Urbanecm: [C: 03+2] remove mwdebug1003 from list of debug servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681144 (https://phabricator.wikimedia.org/T267248) (owner: 10Dzahn) [15:18:42]
Urbanecm: thank you very much, that is what powers what is shown on noc.wikimedia.org, and then I found out the WikimediaDebug extension now gets its info from noc.wm [15:18:56] cool :). [15:19:00] in the past this thing would have needed a change in WikimediaDebug repo [15:19:07] sounds like an improvement :) [15:19:10] but now just this.. yes [15:19:13] (03Merged) 10jenkins-bot: remove mwdebug1003 from list of debug servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681144 (https://phabricator.wikimedia.org/T267248) (owner: 10Dzahn) [15:19:38] RECOVERY - Disk space on otrs1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=otrs1001&var-datasource=eqiad+prometheus/ops [15:20:09] a) trafficserver change b) mw config change c) WikimediaDebug config change .. but c) is gone [15:20:46] !log urbanecm@deploy1002 Synchronized debug.json: dc6647b9c674429c0811116e0caca7639b766e77: remove mwdebug1003 from list of debug servers (T267248) (duration: 00m 57s) [15:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:57] T267248: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248 [15:21:59] !log urbanecm@deploy1002 Synchronized docroot/noc/conf/debug.json: dc6647b9c674429c0811116e0caca7639b766e77: remove mwdebug1003 from list of debug servers (T267248) (duration: 00m 58s) [15:22:06] mutante: should be live [15:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:12] RECOVERY - Disk space on phab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1001&var-datasource=eqiad+prometheus/ops [15:22:28] and https://noc.wikimedia.org/conf/debug.json sounds to indeed be updated [15:22:41] 10SRE: Integrate Buster 10.9 point update - 
https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [15:23:41] Urbanecm: I did the full test: install WikimediaDebug extension in Chrome from scratch, checked the host is gone from the list [15:23:44] thanks [15:24:08] yeah, it works on an existing install too [15:24:12] :) [15:24:49] (03PS4) 10Razzi: clouddb: enable alerting for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) [15:26:04] (03CR) 10jerkins-bot: [V: 04-1] clouddb: enable alerting for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [15:26:16] PROBLEM - ensure kvm processes are running on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:27:22] (03PS5) 10Razzi: clouddb: enable alerting for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) [15:29:02] (03PS7) 10Ottomata: test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [15:29:23] (03PS8) 10Ottomata: test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [15:29:48] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29125/console" [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [15:30:35] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:30:41] (03PS9) 10Ottomata: test/refine_sanitized - add a event_sanitized_main job [puppet] - 
10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [15:31:31] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29127/console" [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:32:04] (03CR) 10Ottomata: [V: 03+1 C: 03+2] test/refine_sanitized - add a event_sanitized_main job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:33:00] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott setup in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:37:42] (03PS1) 10Joal: Update hadoop capacity scheduler [puppet] - 10https://gerrit.wikimedia.org/r/681417 [15:37:49] here it is elukey --^ [15:37:53] sorry for the delay [15:38:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host eventlog1003.eqiad.wmnet [15:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:08] (03CR) 10Elukey: [C: 03+2] Update hadoop capacity scheduler [puppet] - 10https://gerrit.wikimedia.org/r/681417 (owner: 10Joal) [15:39:14] RECOVERY - Disk space on phab2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab2001&var-datasource=codfw+prometheus/ops [15:40:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Andrew) They should definitely be .eqiad.wmnet, these are meant to resemble existing osd nodes. There might be some broken copy...
[15:42:10] (03PS1) 10Razzi: alerts: add victorops paging for hadoop master and kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/681420 (https://phabricator.wikimedia.org/T273064) [15:42:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Andrew) >>! In T274945#7020862, @Andrew wrote: > I think that our very first initial ceph cluster was in .wikimedia.org and tha... [15:42:32] (03PS1) 10Elukey: Revert "Move analytics-hive to an-coord1002" [dns] - 10https://gerrit.wikimedia.org/r/681368 [15:45:08] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [15:45:13] (03CR) 10Ottomata: "hm, this might result in a" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681420 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [15:51:17] (03PS2) 10Ryan Kemper: wdqs: switch to raid0 for more space [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) [15:51:28] (03CR) 10Elukey: [C: 03+2] Revert "Move analytics-hive to an-coord1002" [dns] - 10https://gerrit.wikimedia.org/r/681368 (owner: 10Elukey) [15:53:33] (03PS1) 10Hnowlan: install_server: add entry for eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/681425 (https://phabricator.wikimedia.org/T280679) [15:54:58] (03PS1) 10Hnowlan: reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 [15:56:14] (03CR) 10Elukey: [C: 03+1] "mac address looks good (checked on ganeti1009). 
You can also add the partman recipe in this change if you want, otherwise feel free to mer" [puppet] - 10https://gerrit.wikimedia.org/r/681425 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [15:56:22] (03PS1) 10Volans: setup.py: revert tqdm upper limit constraint [software/cumin] - 10https://gerrit.wikimedia.org/r/681427 [15:56:33] (03PS1) 10Giuseppe Lavagetto: helmfile: create module [puppet] - 10https://gerrit.wikimedia.org/r/681428 [15:57:04] (03PS3) 10Ryan Kemper: wdqs: switch to raid0 for more space [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) [15:58:48] (03PS4) 10Ryan Kemper: wdqs: switch to raid0 for more space [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) [15:59:19] (03CR) 10jerkins-bot: [V: 04-1] reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [15:59:21] (03PS1) 10Urbanecm: [refactor] Move wasPosted to MentorStore [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681429 [15:59:24] (03PS1) 10Urbanecm: MentorStore: Set wasPosted to true in command line mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681430 (https://phabricator.wikimedia.org/T275773) [15:59:27] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) (owner: 10Ryan Kemper) [15:59:31] (03CR) 10Volans: reboot-single: allow specification of ticket and reason (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:00:04] jbond42 and cdanis: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210420T1600). 
[16:01:22] (03PS5) 10Ryan Kemper: wdqs: switch to raid0 for more space [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) [16:03:27] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: switch to raid0 for more space [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) (owner: 10Ryan Kemper) [16:03:31] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) (owner: 10Ryan Kemper) [16:03:52] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: switch to raid0 for more space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681223 (https://phabricator.wikimedia.org/T280382) (owner: 10Ryan Kemper) [16:04:16] (03CR) 10Volans: debmonitor: make cfssl the default PKI issuer for debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681353 (owner: 10Jbond) [16:06:23] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Volans) [16:07:16] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Volans) @ayounsi do we have already a plan for how to manage the swap in Netbox? Should we discuss it? 
[16:10:25] (03CR) 10Ayounsi: [C: 03+1] firewall: cloud-in4: new TCP port in cloudcontrol servers for rabbitmq [homer/public] - 10https://gerrit.wikimedia.org/r/681315 (https://phabricator.wikimedia.org/T280310) (owner: 10Arturo Borrero Gonzalez) [16:11:16] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [16:11:40] (03PS2) 10Hnowlan: reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 [16:12:02] (03CR) 10Hnowlan: reboot-single: allow specification of ticket and reason (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:15:45] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:15:46] !log updating core routers config with https://gerrit.wikimedia.org/r/c/operations/homer/public/+/681315 [16:15:49] (03CR) 10Andrew Bogott: [C: 03+2] firewall: cloud-in4: new TCP port in cloudcontrol servers for rabbitmq [homer/public] - 10https://gerrit.wikimedia.org/r/681315 (https://phabricator.wikimedia.org/T280310) (owner: 10Arturo Borrero Gonzalez) [16:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:30] (03CR) 10Volans: [C: 03+1] "LGTM, generic question inline." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:16:49] (03CR) 10jerkins-bot: [V: 04-1] reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:17:28] (03PS3) 10Hnowlan: reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 [16:20:08] (03PS2) 10Giuseppe Lavagetto: helmfile: create module [puppet] - 10https://gerrit.wikimedia.org/r/681428 [16:20:10] (03PS1) 10Giuseppe Lavagetto: helmfile: install a simple deployment shell [puppet] - 10https://gerrit.wikimedia.org/r/681432 [16:20:12] (03PS4) 10Jcrespo: mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) [16:20:53] (03CR) 10jerkins-bot: [V: 04-1] helmfile: install a simple deployment shell [puppet] - 10https://gerrit.wikimedia.org/r/681432 (owner: 10Giuseppe Lavagetto) [16:21:46] (03CR) 10jerkins-bot: [V: 04-1] reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:22:31] (03PS3) 10Jbond: debmonitor: make cfssl the default PKI issuer for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/681353 [16:22:37] (03CR) 10Elukey: [C: 03+1] "LGTM! Just to be sure, run pcc with another clouddb node to see that we have a no-op (paranoid check)" [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [16:22:43] (03CR) 10Jbond: debmonitor: make cfssl the default PKI issuer for debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681353 (owner: 10Jbond) [16:23:51] (03CR) 10Volans: [C: 03+1] "LGTM, I don't have the full picture of the consequences of changing this hiera and how to roll it out transparently." 
[puppet] - 10https://gerrit.wikimedia.org/r/681353 (owner: 10Jbond) [16:25:09] (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/681353 (owner: 10Jbond) [16:26:08] (03PS3) 10Arturo Borrero Gonzalez: firewall: cloud-in4: drop cloudcontrol-novafullstack term [homer/public] - 10https://gerrit.wikimedia.org/r/681316 (https://phabricator.wikimedia.org/T272587) [16:26:13] (03PS4) 10Hnowlan: reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 [16:26:31] (03CR) 10Hnowlan: reboot-single: allow specification of ticket and reason (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:27:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] firewall: cloud-in4: drop cloudcontrol-novafullstack term [homer/public] - 10https://gerrit.wikimedia.org/r/681316 (https://phabricator.wikimedia.org/T272587) (owner: 10Arturo Borrero Gonzalez) [16:27:08] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29128/console" [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [16:28:14] (03CR) 10Razzi: [V: 03+1] "> Patch Set 5: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [16:28:17] (03CR) 10Razzi: [V: 03+1 C: 03+2] clouddb: enable alerting for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [16:28:43] (03CR) 10Razzi: [V: 03+1 C: 03+2] clouddb: enable alerting for clouddb1021 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [16:31:06] 10SRE, 10serviceops, 10Release-Engineering-Team (Radar): Hundreds of tags for `wikimedia/mediawiki-core` image - https://phabricator.wikimedia.org/T242775 (10thcipriani) 
[16:32:04] !log merging change to core route firewall https://gerrit.wikimedia.org/r/c/operations/homer/public/+/681316 (T272587) [16:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:12] T272587: cloud: current nova-fullstack mechanism requires cloudcontrol nodes to access individual VMs - https://phabricator.wikimedia.org/T272587 [16:33:16] (03CR) 10Volans: "Couple of comments inline (resurfacing an older one too), I skipped the tests." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [16:33:35] 10SRE, 10Icinga, 10observability: implement icinga paging for non-ops teams - https://phabricator.wikimedia.org/T141038 (10lmata) do we want to keep the same scope for ICINGA? or consider our other paging tools? [16:34:02] (03PS2) 10Hnowlan: install_server: add entry for eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/681425 (https://phabricator.wikimedia.org/T280679) [16:38:04] 10SRE, 10Release Pipeline, 10serviceops, 10Release-Engineering-Team (Radar), and 2 others: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10thcipriani) [16:38:53] (03CR) 10Volans: [C: 03+1] "LGTM, if possible try to test it, at least with --dry-run" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:39:00] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10ayounsi) Because it's part of a VC, the easiest is to swap serial# (and other attributes like procurement task). 
[16:39:23] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash7-codfw,logstash7-eqiad} instance=kafkamon1002 job=burrow partition={0,3,5} prometheus=ops site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource [16:39:23] er=logging-eqiad&var-topic=All&var-consumer_group=All [16:41:43] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:44:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Andrew) [16:45:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Andrew) 05Resolved→03Open These need to be re-imaged with internal IPs and names in .eqiad.wmnet. Sorry for the confusion! [16:46:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) Reopening (already synced with Andrew via irc). In the future, please reopen tasks that need action, otherwise its easy t...
[16:49:10] (03PS18) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [16:49:12] (03PS7) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [16:49:14] (03PS1) 10Jcrespo: dbbackups: Move backups of s3 on codfw from db2098 to db2139 [puppet] - 10https://gerrit.wikimedia.org/r/681439 (https://phabricator.wikimedia.org/T280492) [16:49:37] (03PS2) 10Jcrespo: dbbackups: Move backups of s3 on codfw from db2098 to db2139 [puppet] - 10https://gerrit.wikimedia.org/r/681439 (https://phabricator.wikimedia.org/T280492) [16:50:45] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:51:11] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 4.916e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:56:16] (03CR) 10Muehlenhoff: [C: 03+1] setup.py: revert tqdm upper limit constraint [software/cumin] - 10https://gerrit.wikimedia.org/r/681427 (owner: 10Volans) [17:00:04] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210420T1700) [17:02:13] (03PS2) 10Legoktm: lists: Remove IP exemption for 2018 hackathon [puppet] - 10https://gerrit.wikimedia.org/r/681243 [17:02:15] (03PS2) 10Legoktm: lists: Clean up monitoring host IP exemption [puppet] - 10https://gerrit.wikimedia.org/r/681246 [17:02:18] (03PS2) 10Legoktm: mailman3: Add monitoring for active processes [puppet] - 10https://gerrit.wikimedia.org/r/681287 [17:02:20] (03PS2) 10Legoktm: lists: Port check_mailman_queue to Python [puppet] - 10https://gerrit.wikimedia.org/r/681288 [17:02:21] (03PS2) 10Legoktm: lists: Use check_mailman_queue for monitoring mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/681289 (https://phabricator.wikimedia.org/T278280) [17:02:24] (03PS2) 10Legoktm: mailman3: Send queue lengths to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) [17:02:25] (03PS2) 10Legoktm: lists: Update mod_security rules for mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681244 [17:02:28] (03PS2) 10Legoktm: exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) [17:02:30] (03PS2) 10Legoktm: lists: Redirect domain root to Postorius if mailman3 enabled [puppet] - 10https://gerrit.wikimedia.org/r/681247 [17:02:50] (03PS2) 10Legoktm: role::lists: Drop pre-stretch compat [puppet] - 10https://gerrit.wikimedia.org/r/681241 [17:02:52] (03PS3) 10Legoktm: lists: Remove IP exemption for 2018 hackathon [puppet] - 10https://gerrit.wikimedia.org/r/681243 [17:02:54] (03PS3) 10Legoktm: lists: Clean up monitoring host IP exemption [puppet] - 10https://gerrit.wikimedia.org/r/681246 [17:02:56] (03PS3) 10Legoktm: mailman3: Add monitoring for active processes [puppet] - 10https://gerrit.wikimedia.org/r/681287 [17:02:58] (03PS3) 10Legoktm: lists: Port check_mailman_queue to Python [puppet] - 10https://gerrit.wikimedia.org/r/681288 
[17:03:00] (03PS3) 10Legoktm: lists: Use check_mailman_queue for monitoring mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/681289 (https://phabricator.wikimedia.org/T278280) [17:03:02] (03PS3) 10Legoktm: mailman3: Send queue lengths to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) [17:03:04] (03PS3) 10Legoktm: lists: Update mod_security rules for mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681244 [17:03:06] (03PS3) 10Legoktm: exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) [17:03:08] (03PS3) 10Legoktm: lists: Redirect domain root to Postorius if mailman3 enabled [puppet] - 10https://gerrit.wikimedia.org/r/681247 [17:04:57] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29129/console" [puppet] - 10https://gerrit.wikimedia.org/r/681241 (owner: 10Legoktm) [17:11:47] (03CR) 10Legoktm: [V: 03+1 C: 03+2] role::lists: Drop pre-stretch compat [puppet] - 10https://gerrit.wikimedia.org/r/681241 (owner: 10Legoktm) [17:16:38] (03CR) 10Legoktm: [C: 03+2] lists: Remove IP exemption for 2018 hackathon [puppet] - 10https://gerrit.wikimedia.org/r/681243 (owner: 10Legoktm) [17:16:49] !log Adding a MPC7E to cr1-codfw [17:16:57] (03CR) 10Legoktm: [C: 03+2] lists: Clean up monitoring host IP exemption [puppet] - 10https://gerrit.wikimedia.org/r/681246 (owner: 10Legoktm) [17:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:34] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Move backups of s3 on codfw from db2098 to db2139 [puppet] - 10https://gerrit.wikimedia.org/r/681439 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [17:17:39] (03PS1) 10Jbond: C:package_builder: Add Script for building debian packages from git [puppet] - 10https://gerrit.wikimedia.org/r/681445 [17:18:32] (03CR) 
10jerkins-bot: [V: 04-1] C:package_builder: Add Script for building debian packages from git [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [17:18:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:20:37] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) install MPC7E-MRATE FPC into cr[12]-codfw - https://phabricator.wikimedia.org/T277341 (10Papaul) ` Slot 1 information: State Present Temperature 26 degrees C / 78 degrees F Total CPU DRAM... [17:21:13] (03PS1) 10Legoktm: lists: Fix apache config with SecRule comments [puppet] - 10https://gerrit.wikimedia.org/r/681446 [17:21:26] (03PS8) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [17:21:33] (03CR) 10Legoktm: [V: 03+2 C: 03+2] lists: Fix apache config with SecRule comments [puppet] - 10https://gerrit.wikimedia.org/r/681446 (owner: 10Legoktm) [17:21:49] (03CR) 10Ryan Kemper: "Heading to lunch now so pushing what I've got currently. Still have several more comments to get to." 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [17:24:32] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29132/console" [puppet] - 10https://gerrit.wikimedia.org/r/681287 (owner: 10Legoktm) [17:26:04] !log boot cr1-codfw:fpc1 - T277341 [17:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:14] T277341: (Need By: TBD) install MPC7E-MRATE FPC into cr[12]-codfw - https://phabricator.wikimedia.org/T277341 [17:28:04] (03CR) 10Legoktm: [V: 03+1 C: 03+2] mailman3: Add monitoring for active processes [puppet] - 10https://gerrit.wikimedia.org/r/681287 (owner: 10Legoktm) [17:29:54] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [17:31:57] (03PS2) 10Jbond: C:package_builder: Add Script for building debian packages from git [puppet] - 10https://gerrit.wikimedia.org/r/681445 [17:32:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29133/console" [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [17:34:44] (03CR) 10Jbond: [V: 03+1] C:package_builder: Add Script for building debian packages from git (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [17:36:01] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) install MPC7E-MRATE FPC into cr[12]-codfw - https://phabricator.wikimedia.org/T277341 (10Papaul) 05Open→03Resolved Both MPC7E are in place [17:36:11] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2139 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) [17:36:13] (03PS1) 10Jcrespo: mariadb: Remove s3 from db2098 [puppet] -
10https://gerrit.wikimedia.org/r/681448 (https://phabricator.wikimedia.org/T280492) [17:36:29] (03PS2) 10Jcrespo: mariadb: Reenable notifications for db2139 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) [17:36:33] (03PS2) 10Jcrespo: mariadb: Remove s3 from db2098 [puppet] - 10https://gerrit.wikimedia.org/r/681448 (https://phabricator.wikimedia.org/T280492) [17:37:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [17:38:40] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 145, down: 48, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:44:13] (03CR) 10Legoktm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/681244 (owner: 10Legoktm) [17:44:28] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 146, down: 48, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:44:43] (03CR) 10Legoktm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/681247 (owner: 10Legoktm) [17:48:36] (03PS3) 10Jbond: C:package_builder: Add Script for building debian packages from git [puppet] - 10https://gerrit.wikimedia.org/r/681445 [17:52:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:54:13] (03CR) 10Dzahn: [C: 03+1] "Haven't really tested this myself but it does seem to do the same thing and I am totally fine with replacing this. 
The motivation to do th" [puppet] - 10https://gerrit.wikimedia.org/r/681288 (owner: 10Legoktm) [17:54:59] (03CR) 10Ladsgroup: lists: Port check_mailman_queue to Python (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681288 (owner: 10Legoktm) [17:55:25] (03CR) 10Dzahn: [C: 03+1] "plus the improvements you are describing of course!" [puppet] - 10https://gerrit.wikimedia.org/r/681288 (owner: 10Legoktm) [17:56:19] (03CR) 10Ladsgroup: lists: Port check_mailman_queue to Python (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681288 (owner: 10Legoktm) [17:58:14] (03CR) 10Legoktm: lists: Port check_mailman_queue to Python (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681288 (owner: 10Legoktm) [17:59:43] Amir1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/681289/ [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210420T1800). Please do the needful. [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:27] (03CR) 10Dzahn: [C: 03+1] lists: Port check_mailman_queue to Python (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681288 (owner: 10Legoktm) [18:02:19] PROBLEM - MariaDB Replica Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 805.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:02:51] (03CR) 10Urbanecm: [C: 03+2] [refactor] Move wasPosted to MentorStore [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681429 (owner: 10Urbanecm) [18:02:54] (03CR) 10Urbanecm: [C: 03+2] MentorStore: Set wasPosted to true in command line mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681430 (https://phabricator.wikimedia.org/T275773) (owner: 10Urbanecm) [18:04:25] (03CR) 10Legoktm: [C: 03+2] lists: Port check_mailman_queue to Python [puppet] - 10https://gerrit.wikimedia.org/r/681288 (owner: 10Legoktm) [18:04:37] (03PS4) 10Legoktm: lists: Port check_mailman_queue to Python [puppet] - 10https://gerrit.wikimedia.org/r/681288 [18:04:39] (03CR) 10Legoktm: [V: 03+2 C: 03+2] lists: Port check_mailman_queue to Python [puppet] - 10https://gerrit.wikimedia.org/r/681288 (owner: 10Legoktm) [18:05:14] jynus: is the replag known? 
(sorryh for the ping, I just see you working on codfw dbs) [18:05:51] apergos: also see https://sal.toolforge.org/log/WtJh73gBa_6PSCT99gg2, the host was apparently recently reimaged [18:06:29] oh huh, gtk [18:06:33] (03PS4) 10Legoktm: lists: Use check_mailman_queue for monitoring mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/681289 (https://phabricator.wikimedia.org/T278280) [18:07:41] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29134/console" [puppet] - 10https://gerrit.wikimedia.org/r/681289 (https://phabricator.wikimedia.org/T278280) (owner: 10Legoktm) [18:10:57] (03PS5) 10Legoktm: lists: Use check_mailman_queue for monitoring mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/681289 (https://phabricator.wikimedia.org/T278280) [18:10:59] (03PS1) 10Legoktm: icinga: Fix check_mailman_queue success message [puppet] - 10https://gerrit.wikimedia.org/r/681453 [18:11:56] (03CR) 10Legoktm: [V: 03+2 C: 03+2] icinga: Fix check_mailman_queue success message [puppet] - 10https://gerrit.wikimedia.org/r/681453 (owner: 10Legoktm) [18:14:55] (03CR) 10Legoktm: [C: 03+2] lists: Use check_mailman_queue for monitoring mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/681289 (https://phabricator.wikimedia.org/T278280) (owner: 10Legoktm) [18:18:48] (03Merged) 10jenkins-bot: [refactor] Move wasPosted to MentorStore [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681429 (owner: 10Urbanecm) [18:18:51] (03Merged) 10jenkins-bot: MentorStore: Set wasPosted to true in command line mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681430 (https://phabricator.wikimedia.org/T275773) (owner: 10Urbanecm) [18:25:10] (03PS4) 10Legoktm: mailman3: Send queue lengths to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) [18:25:35] (03CR) 10jerkins-bot: [V: 04-1] 
mailman3: Send queue lengths to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) (owner: 10Legoktm) [18:27:05] (03PS5) 10Legoktm: mailman3: Send queue lengths to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) [18:28:24] (03CR) 10Legoktm: mailman3: Send queue lengths to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) (owner: 10Legoktm) [18:29:24] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/GrowthExperiments/: 4d1969d: 1fbb8e9: MentorStore: Set wasPosted to true in command line mode (T275773) (duration: 00m 59s) [18:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:35] T275773: Move mentor/mentee relationship to a separate database table to make it possible to run more queries on it - https://phabricator.wikimedia.org/T275773 [18:33:14] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1016.wikimedia.org [18:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:37] RECOVERY - MariaDB Replica Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:34:16] !log mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=idwiki # T279853 [18:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:24] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [18:42:08] (03PS1) 10Clare Ming: Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) [18:42:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 
TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) a:05Jclark-ctr→03Andrew @Andrew, The info for networking is as follows: Networking/Subnet/VLAN/IP: 2 x 10G ports pe... [18:45:06] (03CR) 10jerkins-bot: [V: 04-1] Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) (owner: 10Clare Ming) [18:45:18] (03CR) 10Cwhite: [C: 03+1] mailman3: Send queue lengths to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) (owner: 10Legoktm) [18:50:44] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29135/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) (owner: 10Legoktm) [18:51:40] (03CR) 10Legoktm: [V: 03+1 C: 03+2] mailman3: Send queue lengths to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/681290 (https://phabricator.wikimedia.org/T278280) (owner: 10Legoktm) [18:55:11] (03PS2) 10Clare Ming: Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) [18:56:27] (03PS3) 10Clare Ming: Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) [18:58:05] (03CR) 10jerkins-bot: [V: 04-1] Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) (owner: 10Clare Ming) [18:58:42] (03PS4) 10Clare Ming: Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) [19:00:19] (03CR) 10jerkins-bot: [V: 04-1] Update wgVectorLanguageInHeader variable [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) (owner: 10Clare Ming) [19:05:44] PROBLEM - Long running screen/tmux on aqs1011 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 327, 1733542s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [19:09:34] (03PS5) 10Clare Ming: Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) [19:10:54] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Legoktm) [19:11:42] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Setup monitoring for mailman3 - https://phabricator.wikimedia.org/T278280 (10Legoktm) 05Open→03Resolved a:03Legoktm I think we're mostly good here. https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1 has the lists-next queue sizes but eventuall... [19:15:50] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:28:20] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudcephosd1016.wikimedia.org [19:28:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cloudcephosd1016.wikimedia.org` - cl... 
[19:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:48] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper) [19:39:25] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper) @jbond: If this is a task to work on in the foreseeable future (because "medium" p... [19:42:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Andrew) [19:43:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Andrew) I have updated the networking request yet again -- with luck I got it right this time. For reference it's probably wort... [19:50:53] 10SRE, 10Icinga, 10observability: implement paging for non-ops teams - https://phabricator.wikimedia.org/T141038 (10Dzahn) [19:52:56] 10SRE, 10Icinga, 10observability: implement paging for non-ops teams - https://phabricator.wikimedia.org/T141038 (10Dzahn) removed the word "icinga" from the ticket title. I think it is just about "paging for subgroups" / "paging for teams outside SRE" but doesn't really matter if it's Icinga or not. That... 
[19:53:07] 10SRE, 10Wikimedia-SVG-rendering: keep https://noc.wikimedia.org/conf/fc-list up-to-date - https://phabricator.wikimedia.org/T280718 (10Peachey88) [19:55:10] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: keep https://noc.wikimedia.org/conf/fc-list up-to-date - https://phabricator.wikimedia.org/T280718 (10Dzahn) [19:56:19] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:50] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:54] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10Legoktm) This is the last hard blocker we have for mailman3 at this point - is there a guide we can follow or an example we can copy/paste from? [20:03:48] !log robh@cumin1001 START - Cookbook sre.dns.netbox [20:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:39] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:03] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=bnwiki # T279853 [20:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:12] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [20:14:35] (03PS1) 10Alexandros Kosiaris: staging-codfw: Enable masquarade_all [puppet] - 10https://gerrit.wikimedia.org/r/681470 (https://phabricator.wikimedia.org/T238909) [20:15:30] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=euwiki # T279853 [20:15:39] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:00] (03PS1) 10Alexandros Kosiaris: Add kubernetes service IP ranges to prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/681472 (https://phabricator.wikimedia.org/T238909) [20:16:39] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd[1017-1019].wikimedia.org [20:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:46] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=frwiktionary # T279853 [20:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:33] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=hewiki # T279853 [20:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:43] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [20:21:00] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=hrwiki # T279853 [20:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:37] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=huwiki # T279853 [20:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:08] (03PS1) 10Alexandros Kosiaris: staging-codfw: Advertise service cluster IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/681473 (https://phabricator.wikimedia.org/T238909) [20:27:43] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=hywiki # T279853 [20:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:51] 
T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [20:29:54] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=rowiki # T279853 [20:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:54] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=srwiki # T279853 [20:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:43] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=svwiki # T279853 [20:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:52] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [20:34:33] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=tewiki # T279853 [20:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:58] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd[1017-1019].wikimedia.org [20:36:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cloudcephosd[1017-1019].wikimedia.or... 
[20:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:14] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=ukwiki # T279853 [20:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:50] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1020.wikimedia.org [20:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:22] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=viwiki # T279853 [20:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:31] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [20:43:48] (03PS1) 10RobH: updating cloudcephosd fqdn [puppet] - 10https://gerrit.wikimedia.org/r/681476 (https://phabricator.wikimedia.org/T274945) [20:44:16] (03CR) 10RobH: [C: 03+2] updating cloudcephosd fqdn [puppet] - 10https://gerrit.wikimedia.org/r/681476 (https://phabricator.wikimedia.org/T274945) (owner: 10RobH) [20:48:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd1020.wikimedia.org [20:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cloudcephosd10... 
[20:52:57] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=ruwiki # T279853 [20:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:08] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [20:54:22] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 84 probes of 634 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:54:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clo... [20:56:17] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Legoktm) > The script to migrate mailing lists will first send an email to that mailing list saying it's going to be disabled, then disable the mailing list, migrate it and then... 
[20:58:12] (03CR) 10Ryan Kemper: [C: 03+2] Move WCQS update to Tuesday [puppet] - 10https://gerrit.wikimedia.org/r/681331 (https://phabricator.wikimedia.org/T280022) (owner: 10ZPapierski) [20:59:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 51 probes of 634 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:04:54] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:06:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) All hosts have had the decom cookbook run for them, and then set back to planned. Network port and... [21:11:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1016.eqiad.wmnet'] ` Of which those... [21:28:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clo... 
[21:44:06] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: REIMAGE [21:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:12] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: REIMAGE [21:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson @Cmjohnson passing over partial please return after finished. name rack position cable id port mw1414 A3 27 1826 1... [21:55:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1016.eqiad.wmnet'] ` and were **ALL*... 
[21:56:10] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10BBlack) [22:10:30] !log robh@cumin1001 START - Cookbook sre.dns.netbox [22:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:56] (03PS4) 10Dzahn: conftool: add comments about 2 dedicated videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) [22:14:21] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:20] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:16:18] (03PS5) 10Dzahn: conftool: add comments about 2 dedicated videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) [22:20:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) Summary of work so far: All hosts have their interfaces removed and added back in the proper vlan.... [22:22:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) a:05Andrew→03ayounsi ` Configuration diff for asw2-d-eqiad.mgmt.eqiad.wmnet: [edit interfaces i... [22:25:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clo... 
[22:30:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:30:44] (03CR) 10Dzahn: "amended/recycled to simply add the 2 missing comments and that's it" [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: 10Dzahn) [22:31:47] (03PS3) 10Razzi: netboot: update {flerovium,furud}.eqiad.wmnet to buster [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) [22:32:56] (03CR) 10Dzahn: [C: 03+2] "merging, though is removing a redirect that exists in production really aligning it closer with production?" [puppet] - 10https://gerrit.wikimedia.org/r/681142 (owner: 10Krinkle) [22:33:23] (03PS1) 10RobH: updating with new ssd sku [software] - 10https://gerrit.wikimedia.org/r/681490 [22:33:43] (03CR) 10RobH: [C: 03+2] updating with new ssd sku [software] - 10https://gerrit.wikimedia.org/r/681490 (owner: 10RobH) [22:34:23] (03Merged) 10jenkins-bot: updating with new ssd sku [software] - 10https://gerrit.wikimedia.org/r/681490 (owner: 10RobH) [22:36:58] (03PS4) 10Razzi: netboot: update {flerovium,furud}.eqiad.wmnet to buster [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) [22:39:19] 10SRE, 10Wikimedia-Mailing-lists: Enable CAPTCHA on mailman instances - https://phabricator.wikimedia.org/T194558 (10Legoktm) p:05High→03Low CAPTCHAs are not accessible and either require proprietary services or can be beaten by OCR software. I would rather invest time in adjusting rate limits and other an... 
[22:43:03] (03CR) 10Razzi: [C: 03+2] netboot: update {flerovium,furud}.eqiad.wmnet to buster [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) (owner: 10Razzi) [22:44:27] (03PS5) 10Razzi: netboot: update flerovium.eqiad.wmnet, furud.codfw.wmnet to install buster [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) [22:47:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1017.eqiad.wmnet'] ` Of which those... [22:51:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clo... [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I ♥ Unicode. All rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210420T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:03:38] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flerovium.eqiad.wmnet with reason: REIMAGE [23:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1017.eqiad.wmnet'] ` Of which those... 
[23:05:39] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flerovium.eqiad.wmnet with reason: REIMAGE [23:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:47] (03PS1) 10Urbanecm: cawiki: Enable Growth team features for cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681492 [23:08:53] (03CR) 10jerkins-bot: [V: 04-1] cawiki: Enable Growth team features for cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681492 (owner: 10Urbanecm) [23:09:39] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on furud.codfw.wmnet with reason: REIMAGE [23:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:41] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on furud.codfw.wmnet with reason: REIMAGE [23:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:55] (03PS2) 10Urbanecm: cawiki: Enable Growth team features for cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681492 [23:12:57] (03PS1) 10Urbanecm: elwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681493 (https://phabricator.wikimedia.org/T280172) [23:13:29] (03PS3) 10Urbanecm: cawiki: Enable Growth team features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681492 (https://phabricator.wikimedia.org/T280673) [23:13:53] (03PS2) 10Urbanecm: elwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681493 (https://phabricator.wikimedia.org/T280172) [23:13:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) >>! In T274945#7022569, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['cloudc... 
[23:19:59] (03PS1) 10Urbanecm: urwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681494 (https://phabricator.wikimedia.org/T280067) [23:21:05] (03PS3) 10Urbanecm: elwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681493 (https://phabricator.wikimedia.org/T280172) [23:21:22] (03CR) 10jerkins-bot: [V: 04-1] urwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681494 (https://phabricator.wikimedia.org/T280067) (owner: 10Urbanecm) [23:21:40] (03PS4) 10Urbanecm: cawiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681492 (https://phabricator.wikimedia.org/T280673) [23:22:04] (03PS4) 10Urbanecm: elwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681493 (https://phabricator.wikimedia.org/T280172) [23:22:12] (03PS2) 10Urbanecm: urwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681494 (https://phabricator.wikimedia.org/T280067) [23:23:09] (03PS3) 10Urbanecm: urwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681494 (https://phabricator.wikimedia.org/T280067) [23:23:23] (03CR) 10Urbanecm: [C: 03+2] cawiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681492 (https://phabricator.wikimedia.org/T280673) (owner: 10Urbanecm) [23:24:22] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=cawiki GrowthExperiments # T280673 [23:24:26] (03Merged) 10jenkins-bot: cawiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681492 (https://phabricator.wikimedia.org/T280673) (owner: 10Urbanecm) [23:24:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:30] T280673: Deploy Growth features on Catalan Wikipedia - https://phabricator.wikimedia.org/T280673 [23:27:11] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 425d77b73f48b3e16a5aa2c0086f292d370cd17e: cawiki: Enable Growth team features in stealth mode (T280673; 1/3) (duration: 00m 57s) [23:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:32] !log urbanecm@deploy1002 Synchronized wmf-config/config/cawiki.yaml: 425d77b73f48b3e16a5aa2c0086f292d370cd17e: cawiki: Enable Growth team features in stealth mode (T280673; 2/3) (duration: 00m 57s) [23:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:51] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist growthexperiments sql.php --cluster=extension1 /srv/mediawiki/php-1.37.0-wmf.1/extensions/GrowthExperiments/maintenance/schemas/mysql/growthexperiments_mentee_data.sql # T279587 [23:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:59] T279587: Create database table to cache data about mentees - https://phabricator.wikimedia.org/T279587 [23:30:17] (03CR) 10Urbanecm: [C: 03+2] elwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681493 (https://phabricator.wikimedia.org/T280172) (owner: 10Urbanecm) [23:31:07] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 425d77b73f48b3e16a5aa2c0086f292d370cd17e: cawiki: Enable Growth team features in stealth mode (T280673; 3/3) (duration: 00m 57s) [23:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:19] T280673: Deploy Growth features on Catalan Wikipedia - https://phabricator.wikimedia.org/T280673 [23:31:53] (03Merged) 10jenkins-bot: elwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681493 
(https://phabricator.wikimedia.org/T280172) (owner: 10Urbanecm) [23:32:09] (03CR) 10Nray: [C: 03+1] "Lgtm. The next step is for you to sign up on the deployment calendar for Wednesday: https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) (owner: 10Clare Ming) [23:32:37] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=elwiki GrowthExperiments # T280172 [23:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:45] T280172: Deploy Growth features on Greek Wikipedia - https://phabricator.wikimedia.org/T280172 [23:34:47] !log [urbanecm@mwmaint1002 ~]$ mwscript deleteEqualMessages.php --wiki=hrwiki --delete [23:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 314367bca6e924136704911b55fd3e2c929fa704: elwiki: Enable Growth team features in stealth mode (T280172; 1/3) (duration: 00m 57s) [23:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:50] !log urbanecm@deploy1002 Synchronized wmf-config/config/elwiki.yaml: 314367bca6e924136704911b55fd3e2c929fa704: elwiki: Enable Growth team features in stealth mode (T280172; 2/3) (duration: 00m 57s) [23:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:06] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 314367bca6e924136704911b55fd3e2c929fa704: elwiki: Enable Growth team features in stealth mode (T280172; 3/3) (duration: 00m 56s) [23:38:13] (03CR) 10Urbanecm: [C: 03+2] urwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681494 (https://phabricator.wikimedia.org/T280067) (owner: 10Urbanecm) [23:38:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:15] T280172: Deploy Growth features on Greek Wikipedia - https://phabricator.wikimedia.org/T280172 [23:38:19] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=urwiki GrowthExperiments # T280067 [23:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:27] T280067: Deploy Growth features on Urdu Wikipedia - https://phabricator.wikimedia.org/T280067 [23:39:17] (03Merged) 10jenkins-bot: urwiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681494 (https://phabricator.wikimedia.org/T280067) (owner: 10Urbanecm) [23:39:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) The priority for this has been lowered a bit on our side, no need to rush on your side. [23:40:14] (03PS10) 10Mstyles: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [23:42:15] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 73544ccb40d9687b54c039aceb05cd033901d86f: urwiki: Enable Growth team features in stealth mode (T280067) (duration: 00m 58s) [23:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:26] !log urbanecm@deploy1002 Synchronized wmf-config/config/urwiki.yaml: 73544ccb40d9687b54c039aceb05cd033901d86f: urwiki: Enable Growth team features in stealth mode (T280067) (duration: 00m 57s) [23:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:35] T280067: Deploy Growth features on Urdu Wikipedia - https://phabricator.wikimedia.org/T280067 [23:46:02] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 73544ccb40d9687b54c039aceb05cd033901d86f: urwiki: Enable Growth team 
features in stealth mode (T280067) (duration: 00m 57s) [23:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log