[00:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:26] (CR) Dduvall: releases: Parameterize profile::ci::kubernetes_config owner/group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/661224 (https://phabricator.wikimedia.org/T273681) (owner: Dduvall)
[00:01:37] marxarelli: so 'releasers-mediawiki' isn't the right owner when it's for MW?
[00:01:55] not in this case
[00:02:00] ok, ack
[00:02:12] I understand, owning config vs owning actual release files
[00:02:55] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/27829/" [puppet] - https://gerrit.wikimedia.org/r/661224 (https://phabricator.wikimedia.org/T273681) (owner: Dduvall)
[00:03:28] since no one proposed any patches, I'm going to use the backport window to deploy some logo (non-)changes
[00:03:41] right. restricting access to the config isn't that critical but i didn't want to add contint-admins to those hosts
[00:03:59] I like that part, yep
[00:04:20] noop on contint1001
[00:04:24] thanks for the review/fixes/merge!
:)
[00:04:33] ci-staging.config]/group: group changed 'root' to 'contint-roots'
[00:04:44] (PS2) Legoktm: logos: Update nlwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660973
[00:04:46] (PS2) Legoktm: logos: Update eswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660974
[00:04:47] (PS2) Legoktm: logos: Update ptwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660975
[00:04:49] (PS2) Legoktm: logos: Update ruwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660976
[00:04:51] (PS2) Legoktm: logos: Update svwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660977
[00:04:53] (PS2) Legoktm: logos: Remove TODO for pngout [mediawiki-config] - https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380)
[00:04:56] (PS2) Legoktm: logos: Redo how variants work [mediawiki-config] - https://gerrit.wikimedia.org/r/661052 (https://phabricator.wikimedia.org/T98640)
[00:04:57] (PS3) Legoktm: logos: Update zhwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660979
[00:05:17] marxarelli: everything seems good now on releases1002 :)
[00:05:26] yay \o/
[00:05:34] (CR) Legoktm: [C: +2] logos: Update nlwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660973 (owner: Legoktm)
[00:06:07] SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (Dzahn) after this last merge everything seems good...
[00:06:10] (CR) Legoktm: [C: +2] logos: Update eswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660974 (owner: Legoktm)
[00:06:31] (CR) Legoktm: [C: +2] logos: Update ptwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660975 (owner: Legoktm)
[00:06:35] (Merged) jenkins-bot: logos: Update nlwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660973 (owner: Legoktm)
[00:06:51] marxarelli: I just claimed that ticket is resolved then. of course reopen if there is something else, about to go off for now
[00:07:03] SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (Dzahn) Open→Resolved a:Dzahn
[00:07:04] thanks again
[00:07:11] (Merged) jenkins-bot: logos: Update eswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660974 (owner: Legoktm)
[00:07:14] np!
cu
[00:07:19] (CR) Legoktm: [C: +2] logos: Update ruwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660976 (owner: Legoktm)
[00:07:36] (CR) Legoktm: [C: +2] logos: Update svwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660977 (owner: Legoktm)
[00:07:43] (CR) Legoktm: [C: +2] logos: Remove TODO for pngout [mediawiki-config] - https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) (owner: Legoktm)
[00:08:14] (CR) Legoktm: [C: +2] logos: Redo how variants work [mediawiki-config] - https://gerrit.wikimedia.org/r/661052 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[00:08:16] (Merged) jenkins-bot: logos: Update ptwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660975 (owner: Legoktm)
[00:08:42] (CR) Legoktm: [C: +2] logos: Update zhwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660979 (owner: Legoktm)
[00:08:52] (Merged) jenkins-bot: logos: Update ruwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660976 (owner: Legoktm)
[00:09:00] (Merged) jenkins-bot: logos: Update svwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660977 (owner: Legoktm)
[00:09:07] (Merged) jenkins-bot: logos: Remove TODO for pngout [mediawiki-config] - https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) (owner: Legoktm)
[00:09:16] (Merged) jenkins-bot: logos: Redo how variants work [mediawiki-config] - https://gerrit.wikimedia.org/r/661052 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[00:09:31] (Merged) jenkins-bot: logos: Update zhwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660979 (owner: Legoktm)
[00:09:34] ok
[00:10:17] Minor merge party.
:-)
[00:10:26] !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[00:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:16] !log legoktm@deploy1001 Synchronized static/images/project-logos/: Update and recompress logos for nlwiki, eswiki, ptwiki, ruwiki, svwiki, zhwiki (1/2) (duration: 01m 10s)
[00:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:32] James_F: the major merge party is scheduled for friday EOD ;)
[00:12:57] be there or be square
[00:12:57] we still have hundreds of logos to go through!
[00:13:08] wee!
[00:13:09] marxarelli: Oh gods. :-)
[00:13:42] !log legoktm@deploy1001 Synchronized logos/: Update and recompress logos for nlwiki, eswiki, ptwiki, ruwiki, svwiki, zhwiki (2/2) (duration: 01m 05s)
[00:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:21] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2021-01-26 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:14:45] cool I'm done
[00:16:19] !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[00:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:05] MatmaRex would you mind looking at https://github.com/MatmaRex/patchdemo/pull/223 if you have a second?
[00:28:23] (PS1) Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[00:29:09] DannyS712: eeeh.
okay, whatever
[00:29:44] thanks
[00:30:00] i don't really like having weird configuration there, but i guess we already have some
[00:30:22] DannyS712: deployed now
[00:42:19] (PS2) Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[00:48:58] why is Friday now "No deploys all day!" ? I thought that used to be just Saturday and Sunday
[00:50:21] oh, should probably ask in -releng
[00:57:30] DannyS712, because no one wants to fix the site on Saturday or Sunday because of a broken deploy Friday
[01:19:49] (Abandoned) Krinkle: objectcache: return false during more error cases in RedisBagOStuff::*Multi() methods [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658945 (owner: Ahmon Dancy)
[01:35:47] (PS5) Krinkle: apache: Stop aliasing zero.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester)
[01:36:24] (CR) Ottomata: [C: +1] presto: require partitions predicate [puppet] - https://gerrit.wikimedia.org/r/661209 (https://phabricator.wikimedia.org/T273004) (owner: Razzi)
[01:42:15] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:13] (PS6) Krinkle: apache: Replace zero.wikipedia.org vhost alias with redirect [puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester)
[01:44:17] (CR) Krinkle: "done" [puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester)
[04:22:00] (PS1) Andrew Bogott: Detect hosts under .wikimedia.cloud as 'labs' VMs.
[software/puppet-compiler] - https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706)
[04:25:27] (PS2) Andrew Bogott: Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706)
[04:31:05] (CR) jerkins-bot: [V: -1] Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: Andrew Bogott)
[05:07:47] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[05:41:44] (CR) QChris: "Wooohooo! \o/" [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/617206 (owner: QChris)
[06:11:53] (PS1) Marostegui: instances.yaml: Add db1174. [puppet] - https://gerrit.wikimedia.org/r/661283 (https://phabricator.wikimedia.org/T258361)
[06:13:32] (Abandoned) Marostegui: mariadb: Productionize db1174 [puppet] - https://gerrit.wikimedia.org/r/661066 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:14:21] (CR) Marostegui: [C: +2] instances.yaml: Add db1174. [puppet] - https://gerrit.wikimedia.org/r/661283 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:14:51] SRE, Research, SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (AikoChou) Hi @CDanis, My wikitech username: AikoChou Preferred shell username: aikochou SSh public key: https://phabricator.wikimedia.org/P14137 I ha...
[06:29:11] (CR) Marostegui: "Small comment: I would add that this require reloading haproxy on the given proxy."
[puppet] - https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[06:35:45] (PS1) Marostegui: db1174: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/661286
[06:36:30] (CR) Marostegui: [C: +2] db1174: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/661286 (owner: Marostegui)
[06:38:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1174 with minimal weight for the first time in s7', diff saved to https://phabricator.wikimedia.org/P14138 and previous config saved to /var/cache/conftool/dbconfig/20210203-063812-marostegui.json
[06:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 - will be decommissioned', diff saved to https://phabricator.wikimedia.org/P14139 and previous config saved to /var/cache/conftool/dbconfig/20210203-064137-marostegui.json
[06:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:27] (PS1) Marostegui: install_server: Reimage db1173 as stretch [puppet] - https://gerrit.wikimedia.org/r/661288 (https://phabricator.wikimedia.org/T258361)
[06:52:40] (CR) Marostegui: [C: +2] install_server: Reimage db1173 as stretch [puppet] - https://gerrit.wikimedia.org/r/661288 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:54:25] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1173.eqiad.wmnet'] ` The log ca...
[07:06:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: REIMAGE
[07:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: REIMAGE
[07:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some more weight to db1174', diff saved to https://phabricator.wikimedia.org/P14141 and previous config saved to /var/cache/conftool/dbconfig/20210203-071310-marostegui.json
[07:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:21] (PS1) Marostegui: install_server: Do not reimage db1174. [puppet] - https://gerrit.wikimedia.org/r/661291
[07:15:15] (CR) Marostegui: [C: +2] install_server: Do not reimage db1174. [puppet] - https://gerrit.wikimedia.org/r/661291 (owner: Marostegui)
[07:15:32] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1173.eqiad.wmnet'] ` and were **ALL** successful.
[07:17:28] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:17:43] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:21:39] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (ArielGlenn) We have many wiki dump runs completed without problems.
So please do go ahead with buster on these new servers. Thanks!
[07:22:16] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (ArielGlenn)
[07:37:23] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.95 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[07:37:23] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:52] SRE, Analytics, Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (elukey) To keep archives happy - I had to revert the patch since some maven build jobs issue HTTP PUT to the /repository path, meanwhile my assumption was that...
[07:46:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14142 and previous config saved to /var/cache/conftool/dbconfig/20210203-074651-root.json
[07:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 to clone db1173 T258361', diff saved to https://phabricator.wikimedia.org/P14143 and previous config saved to /var/cache/conftool/dbconfig/20210203-074749-marostegui.json
[07:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:53] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:49:06] !log Stop mysql on db1093 to clone db1173 T258361
[07:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:32] (CR) Ayounsi: [C: +2] Alert manager, fix DCops email [puppet] - https://gerrit.wikimedia.org/r/661178 (owner: Ayounsi)
[07:53:40] (PS1) Marostegui: mariadb: Productionize db1173 [puppet] - https://gerrit.wikimedia.org/r/661333 (https://phabricator.wikimedia.org/T258361)
[07:54:16] (CR) Marostegui: [C: +2] mariadb: Productionize db1173 [puppet] - https://gerrit.wikimedia.org/r/661333 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[08:01:08] (CR) Elukey: [C: +1] burrow/check_kafka_consumer_lag.py: Port to Python 3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:01:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 8%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14145 and previous config saved to /var/cache/conftool/dbconfig/20210203-080154-root.json
[08:01:57] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:11] (CR) Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:02:22] (CR) Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:06:25] (PS6) Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918)
[08:09:26] (CR) Elukey: "@Arzhel: I picked 6460[6,7] and updated https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS, let me know if it is ok :)" [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:11:14] (PS1) Muehlenhoff: Remove access for Bernd Sitzmann [puppet] - https://gerrit.wikimedia.org/r/661334
[08:14:24] (CR) Ayounsi: [C: +1] "It is ok."
[homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:16:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14146 and previous config saved to /var/cache/conftool/dbconfig/20210203-081658-root.json
[08:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:14] (CR) Muehlenhoff: [C: +2] Remove access for Bernd Sitzmann [puppet] - https://gerrit.wikimedia.org/r/661334 (owner: Muehlenhoff)
[08:28:32] SRE, Datasets-General-or-Unknown, netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (Peachey88)
[08:29:02] (CR) Muehlenhoff: [C: -1] "This removes Željko from all access groups for production, which means we should disable production SSH access entirely. @Željko: You're s" [puppet] - https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: Thcipriani)
[08:32:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 13%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14147 and previous config saved to /var/cache/conftool/dbconfig/20210203-083201-root.json
[08:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:55] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658415 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:33:02] SRE, Datasets-General-or-Unknown, netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (ArielGlenn) Arzhel pointed out that the dumpsdata1003 host has a 10G NIC, so we can swap 1001 and 1003 roles when the current dumps run completes. This would mean dumpsdata1003 wo...
[08:33:19] SRE, Datasets-General-or-Unknown, Dumps-Generation, netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (ArielGlenn)
[08:37:04] ops-eqiad: Interface errors on asw2-b-eqiad:ge-8/0/6 (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (ayounsi) p:Triage→Medium
[08:38:42] (CR) Muehlenhoff: "The new timer is fine, but we need to absent the old cron, otherwise it'll be kept in crontab and we'll effectively run it twice?" [puppet] - https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[08:38:46] (Access port utilisation over 80% for 1h) firing: Access port utilisation over 80% for 1h - https://alerts.wikimedia.org
[08:39:47] ops-eqiad, User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6 (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (ArielGlenn)
[08:40:14] ops-eqiad, User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6 (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (ArielGlenn) We can schedule this for when the current dump run is complete and before the next one starts, so likely 17th-18th-19th Feb.
[08:40:38] (CR) Ladsgroup: [C: -1] "Yes it needs to be absented first." [puppet] - https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[08:43:46] (Access port utilisation over 80% for 1h) resolved: Access port utilisation over 80% for 1h - https://alerts.wikimedia.org
[08:45:06] (PS1) Majavah: Add missing isset() check to ApiEchoUnreadNotificationPages [extensions/Echo] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479)
[08:45:36] elukey: is there any transfer going on like last week?
see the access port utilization email
[08:45:54] ah no, seems dumpsdata1001
[08:46:22] volans: always blaming Analytics
[08:46:24] volans: it's a bug
[08:46:37] yeah saw now T273713
[08:46:38] T273713: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713
[08:46:38] (CR) Muehlenhoff: [C: +2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/661131 (owner: Muehlenhoff)
[08:46:39] volans: I acked the alert, but it sent the alert instead
[08:46:47] thx
[08:47:03] btw I don't blame analytics... I just blame Luca ;)
[08:47:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 15%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14148 and previous config saved to /var/cache/conftool/dbconfig/20210203-084705-root.json
[08:47:08] XioNoX: just to double check - Joseph is about to start copying some data, is it ok to proceed?
[08:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:11] (CR) Thiemo Kreuz (WMDE): [C: +1] "Thanks! This should be the smallest possible backport.
I think there will be additional logging in the master branch, but this doesn't nee" [extensions/Echo] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479) (owner: Majavah)
[08:47:58] elukey: yeah, no pb
[08:48:04] perfect thanks
[08:58:16] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[08:58:33] hi
[08:58:40] <_joe_> uh
[08:58:55] o/
[08:58:58] o/
[08:59:03] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[08:59:11] <_joe_> eqsin has 90% availability
[08:59:20] <_joe_> and yeah, that seems to correspond
[08:59:34] (acked the alert)
[08:59:39] yo?
[08:59:54] what does that mean?
[09:00:00] ?
[09:00:01] <_joe_> XioNoX: I don't think that's the case, but can you check if eqsin has connectivity issues?
[09:00:04] marostegui and akosiaris: It is that lovely time of the day again! You are hereby commanded to deploy m2 database master restart. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T0900).
[09:00:15] <_joe_> we have issues with eqsin's upload AFAICT
[09:00:19] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:00:24] ^ I am going to hold that until we are clear on what's going on
[09:00:34] cp5005 loadavg seems normal 5.78, 7.50, 7.64
[09:00:43] <_joe_> volans: check varnish
[09:00:55] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:00:56] it is the upload one, isn't it?
[09:01:01] <_joe_> Feb 3 09:00:50 lvs5002 pybal[17977]: [uploadlb_443 ProxyFetch] WARN: cp5006.eqsin.wmnet (enabled/partially up/not pooled): Fetch failed (https://healthcheck.wikimedia.org/varnish-fe), 5.002 s
[09:01:02] calendar says there is zayo maintenance right now fwiw
[09:01:09] <_joe_> a lot of that stuff
[09:01:32] _joe_: network looks good so far, still looking
[09:01:33] <_joe_> yeah the upload varnishes seem to go down and back up consistently
[09:01:42] <_joe_> ok, can we depool eqsin for upload?
[09:01:48] <_joe_> volans: can you prepare the patch?
[09:01:52] legoktm: no Zayo in eqsin (or anywhere anymore)
[09:01:59] +1 unless someone has a better suggestion
[09:02:00] _joe_: sure
[09:02:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 20%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14149 and previous config saved to /var/cache/conftool/dbconfig/20210203-090208-root.json
[09:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:22] * legoktm nods
[09:02:33] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:02:49] <_joe_> https://grafana.wikimedia.org/d/000000304/prometheus-varnish-dc-stats?viewPanel=18&orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=cache_upload&var-layer=frontend
[09:02:54] <_joe_> looks like a peak in requests
[09:03:05] <_joe_> on upload
[09:03:07] huge spike there
[09:03:16] <_joe_> gonna take a look at the 5xx data
[09:03:37] could a depool make things worse?
[09:03:43] (PS1) Volans: depool eqsin, availability issue [dns] - https://gerrit.wikimedia.org/r/661338
[09:03:47] ^^^ if needed
[09:05:33] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5001.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:05:59] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:06:25] SRE, Traffic: HTTP 502 Error when trying to create new page (500k characters) on Romanian Wikisource - https://phabricator.wikimedia.org/T273623 (Aklapper)
[09:08:15] I'm here too if needed
[09:17:02] PROBLEM - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1961 bytes in 3.938 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14150 and previous config saved to /var/cache/conftool/dbconfig/20210203-091712-root.json
[09:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:15] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5002 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:17:53] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:20:08] <_joe_> !log restarting varnish-frontend on cp5001
[09:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:13] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers
cp5004.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:23:19] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:25:14] RECOVERY - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 1038 bytes in 0.964 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:25:25] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5001 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:26:44] !log rolling restart varnish-fe on cp5004-5006
[09:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:04] !log depool cp5006
[09:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:39] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.451 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:32:02] (CR) Hashar: [C: +2] Add missing isset() check to ApiEchoUnreadNotificationPages [extensions/Echo] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479) (owner: Majavah)
[09:32:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 30%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14151 and previous config saved to /var/cache/conftool/dbconfig/20210203-093215-root.json
[09:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:09] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5004 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:35:40] !log marostegui@cumin1001 dbctl commit
(dc=all): 'db1093 (re)pooling @ 5%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14152 and previous config saved to /var/cache/conftool/dbconfig/20210203-093540-root.json
[09:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:22] hashar: I rebased that other Echo patch
[09:38:24] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[09:38:32] Majavah: greaaaat
[09:38:49] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:38:59] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:39:59] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:01] Majavah: I will update the cluster with the other change
[09:40:14] I don't think we had any way to reproduce though
[09:40:19] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:19] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:43] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:43] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:53] I don't think we know how to reproduce any of the current blockers [09:40:53] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1173 is now replicating. [09:41:01] also what's happening with cp5006? [09:41:36] Majavah: we're working on it [09:41:44] currently depooled [09:41:47] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:41:55] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5005 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.484 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:42:05] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [09:42:48] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [09:43:01] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.484 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:43:09] PROBLEM - Number of messages locally queued by purged for processing on cp5006 is CRITICAL: cluster=cache_upload instance=cp5006 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [09:43:16] Majavah: the other patch made it simply a warning and also added the foreign wiki to the error message ( https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/661046/1/includes/api/ApiEchoUnreadNotificationPages.php ) . 
I am going to update your rebase ;) [09:43:40] please do, thanks [09:44:01] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 409 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:44:19] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:44:19] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.458 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:44:43] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.451 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:44:43] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:45:19] RECOVERY - Number of messages locally queued by purged for processing on cp5006 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [09:45:47] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:46:07] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:46:52] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [09:47:03] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.446 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:47:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 40%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14153 and previous config saved to /var/cache/conftool/dbconfig/20210203-094719-root.json [09:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:12] !log disable DE-CIX codfw peering sessions [09:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 10%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14154 and previous config saved to /var/cache/conftool/dbconfig/20210203-095043-root.json [09:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:41] (03Merged) 10jenkins-bot: Add missing isset() check to ApiEchoUnreadNotificationPages [extensions/Echo] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479) (owner: 10Majavah) 
[09:57:02] hashar: backport merged ^ [09:57:53] !log m2 master restart - T272964 [09:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:58] T272964: Restart m2 database master (db1107) - https://phabricator.wikimedia.org/T272964 [09:59:18] done, checking things [09:59:25] proxy didn't complain? [09:59:31] debmonitor is all fine [09:59:42] otrs is fine [09:59:49] jynus: it was fast [10:00:46] is there something else (service) I can check? [10:00:53] xhgui looks fine too [10:01:08] Majavah: yeah updating [10:01:15] jynus: I think we are good! thank you though :* [10:02:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14155 and previous config saved to /var/cache/conftool/dbconfig/20210203-100222-root.json [10:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:01] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/Echo/includes/api/ApiEchoUnreadNotificationPages.php: Add missing isset() check to ApiEchoUnreadNotificationPages - T273479 (duration: 01m 14s) [10:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:05] T273479: ApiEchoUnreadNotificationPages.php PHP Notice: Undefined index: query - https://phabricator.wikimedia.org/T273479 [10:04:09] 10:03:59 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw2295.codfw.wmnet returned [255]: Host key verification failed. [10:04:10] bah [10:04:45] (03CR) 10Hashar: "Synced on the cluster." [extensions/Echo] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479) (owner: 10Majavah) [10:05:01] !log depooling and restarting blazegraph on wdqs1007 [10:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:08] so now we just wait and see if the errors stop? 
[10:05:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 25%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14156 and previous config saved to /var/cache/conftool/dbconfig/20210203-100547-root.json [10:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:14] hashar: for the last blocker, I'll try to make a patch that will log the memcached key that is trying to be set if it fails in addition of the current message [10:07:30] (03CR) 10JMeybohm: [C: 03+1] "Apart from the notes Alex had, this looks good to me!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto) [10:07:40] that should hopefully tell where the error is [10:09:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:09:19] XioNoX: ^^^ [10:09:35] volans: expected [10:09:36] thx [10:09:46] just checking given the ongoing stuff, thx ;) [10:10:36] 10SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (10hashar) [10:13:55] hashar, last person to touch it according to phab seems to be legoktm, you may want to add him to the ticket [10:14:03] XioNoX: all problems are expected if you're pessimistic enough [10:14:39] hmm [10:15:53] hashar: weird, I don't know why puppet is disabled. Let me re-enable it [10:16:21] legoktm: no idea :] but you shouldl sleep() really! 
[10:16:21] :D [10:16:37] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:16:40] I guess the ssh host key is not collected and thus does not land on the deploy machine [10:16:43] !log re-enabled puppet on mw2295 (T273726) [10:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:48] T273726: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 [10:17:01] but other people have deployed with no issues since I reimaged it last week? [10:17:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 60%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14157 and previous config saved to /var/cache/conftool/dbconfig/20210203-101726-root.json [10:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:41] legoktm, indeed, so it is probably not that [10:18:00] I just saw the data that it was long ago [10:18:02] *date [10:18:23] yeah it looks like puppet was never enabled after the reimage [10:19:01] hashar: want to try again? [10:19:34] keep an eye on it in case there is something weird there- hw issues or something :-) [10:20:32] legoktm: checking [10:20:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 50%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14158 and previous config saved to /var/cache/conftool/dbconfig/20210203-102050-root.json [10:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:23] legoktm: I think we need a puppet run on deploy1001 to get the /etc/ssh/ssh_known_hosts , I guess that will self solve eventually. Thx! 
[10:22:29] one moment [10:23:19] yep, there it is [10:23:24] +mw2295.codfw.wmnet,mw2295,10.192.0.165,2620:0:860:101:10:192:0:165 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBOdJ3WhOvOB/k01lnZynn+0rpV+cWylCU3quVOezb69ZiDLvZnDJUhFVEDVQ7g1yjX0EV8stu8MKUMX9uR9Y3vY= [10:23:33] so how did previous deploys work?? [10:23:46] I have no idea ;] [10:24:18] maybe the host was not in the dsh group [10:24:40] it was, we have icinga alerting for that [10:25:14] well who knows so :/ [10:25:24] seems to be fixed anyway, I am closing that task. Thank you legoktm ! [10:25:35] hashar: er, please leave it open [10:25:37] 10SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (10hashar) 05Open→03Resolved a:03Legoktm SSH host keys are collected by puppet on the hosts and writen to /etc/ssh/ssh_known_hosts and since puppet was disabled the key was not collected. Th... [10:26:27] 10SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (10Legoktm) puppet hadn't run since the reimaging, which is problem. I would've logged into it after the reimaging to run `scap pull` before repooling it, but it's possible I didn't read the MOTD p... [10:26:35] 10SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (10Legoktm) 05Resolved→03Open [10:26:50] hashar: I'll look into this properly tomorrow [10:27:09] I feel like something else went wrong here, puppet shouldn't have been disabled for a whole week with no one noticing, etc. 
[10:27:23] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: limit rsync service memory [puppet] - 10https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [10:28:07] !log rolling restart of varnish-fe on cp5002 and cp5003 [10:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:34] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: limit rsync to 10% memory in codfw [puppet] - 10https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [10:31:41] legoktm: yeah good luck :\ Thx for the quick fix! [10:32:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14159 and previous config saved to /var/cache/conftool/dbconfig/20210203-103229-root.json [10:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:12] hashar: you think we can just wrap the memcached operation in a try-catch and add a custom message to exceptions with the memcached key? 
or does it need something better [10:33:37] (03PS1) 10Filippo Giunchedi: role: default swift rsync memory limit [puppet] - 10https://gerrit.wikimedia.org/r/661343 (https://phabricator.wikimedia.org/T221904) [10:35:26] (03CR) 10JMeybohm: [C: 03+1] role::kubernetes::worker: add empty stanza for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661072 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [10:35:39] (03CR) 10JMeybohm: [C: 03+1] Add conftool data for eventstreams-internal (new VIP) [puppet] - 10https://gerrit.wikimedia.org/r/661067 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [10:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 75%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14160 and previous config saved to /var/cache/conftool/dbconfig/20210203-103554-root.json [10:35:58] (03CR) 10Filippo Giunchedi: [C: 03+2] role: default swift rsync memory limit [puppet] - 10https://gerrit.wikimedia.org/r/661343 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [10:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:11] (03CR) 10JMeybohm: [C: 04-1] Add eventstreams-internal to service_catalog (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [10:38:54] 10SRE, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10fgiunchedi) [10:39:45] (03PS1) 10Kormat: integration_env: Add 'dbvers' command [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661344 [10:40:24] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 95, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:40:30] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission 
db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo) [10:40:49] (03PS1) 10Jcrespo: install_server: Decommission db1095, substitute with db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) [10:41:04] (03PS2) 10Jcrespo: install_server: Decommission db1095, substitute with db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) [10:42:04] (03CR) 10Jcrespo: [C: 04-1] "Blocked on db1171 being 100% ready." [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) (owner: 10Jcrespo) [10:42:56] (03CR) 10Giuseppe Lavagetto: Add the 'uid' template helper (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) (owner: 10Giuseppe Lavagetto) [10:43:16] !log elukey@deploy1001 Started deploy [analytics/refinery@8b8f0cf]: Weekly deployment [10:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:25] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10Gehel) Removing Search Platform, the remaining work is under the control of DC-Ops. 
[10:43:38] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo) [10:43:45] (03CR) 10Kormat: [C: 03+2] integration_env: Add 'dbvers' command [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661344 (owner: 10Kormat) [10:46:13] (03Merged) 10jenkins-bot: integration_env: Add 'dbvers' command [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661344 (owner: 10Kormat) [10:47:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 85%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14161 and previous config saved to /var/cache/conftool/dbconfig/20210203-104733-root.json [10:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:10] (03PS1) 10Kormat: integration_env: Use better names for paths. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661346 [10:50:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 100%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14162 and previous config saved to /var/cache/conftool/dbconfig/20210203-105057-root.json [10:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:50] (03CR) 10Kormat: [C: 03+2] integration_env: Use better names for paths. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661346 (owner: 10Kormat) [10:53:27] (03PS2) 10Elukey: Add eventstreams-internal to service_catalog [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) [10:54:06] (03CR) 10Elukey: Add eventstreams-internal to service_catalog (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [10:54:22] (03PS1) 10David Caro: wmcs.backups: Use the wmcs-backup script for vms [puppet] - 10https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692) [10:54:22] !log elukey@deploy1001 Finished deploy [analytics/refinery@8b8f0cf]: Weekly deployment (duration: 11m 06s) [10:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:29] (03Merged) 10jenkins-bot: integration_env: Use better names for paths. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661346 (owner: 10Kormat) [10:54:56] (03PS1) 10Kormat: integration_env: Make deployment more configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661349 [10:55:51] (03PS2) 10Kormat: integration_env: Make deployment more configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661349 (https://phabricator.wikimedia.org/T265266) [10:55:54] (03CR) 10jerkins-bot: [V: 04-1] wmcs.backups: Use the wmcs-backup script for vms [puppet] - 10https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692) (owner: 10David Caro) [10:58:50] (03CR) 10Kormat: [C: 03+2] integration_env: Make deployment more configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661349 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [11:00:02] (03CR) 10Marostegui: "Reminder (cause I tend to forget it): remove from tendril and zarcillo" [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) (owner: 10Jcrespo) [11:00:22] 
PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:01:20] (03Merged) 10jenkins-bot: integration_env: Make deployment more configurable [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661349 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [11:02:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14163 and previous config saved to /var/cache/conftool/dbconfig/20210203-110236-root.json [11:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:44] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:05:09] (03PS2) 10David Caro: wmcs.backups: Use the wmcs-backup script for vms [puppet] - 10https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692) [11:06:28] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Legoktm) I generated some [[http://www.brendangregg.c... 
[11:09:42] (03PS2) 10Giuseppe Lavagetto: Add the 'uid' template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) [11:09:44] (03PS3) 10Giuseppe Lavagetto: Remove the build image functionality [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660852 [11:09:46] (03PS2) 10Giuseppe Lavagetto: Allow running tests on an image once it's built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) [11:15:40] (03CR) 10Jbond: [C: 03+2] stdlib: update to 6.6.0 [puppet] - 10https://gerrit.wikimedia.org/r/661118 (owner: 10Jbond) [11:16:01] (03PS1) 10Legoktm: logos: Update nowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661350 [11:16:03] (03PS1) 10Legoktm: logos: Update cawiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661351 [11:16:05] (03PS1) 10Legoktm: logos: Update fiwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661352 [11:16:07] (03PS1) 10Legoktm: logos: Update ukwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661353 [11:16:09] (03PS1) 10Legoktm: logos: Update cswiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661354 [11:16:11] (03PS1) 10Legoktm: logos: Update huwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661355 [11:16:13] (03PS1) 10Legoktm: logos: Update trwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661356 [11:17:44] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [11:20:22] !log update puppetlabs-stdlib to v6.6.0 [11:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:50] (03Abandoned) 10Volans: depool eqsin, availability issue [dns] - 10https://gerrit.wikimedia.org/r/661338 (owner: 10Volans) [11:21:42] 10SRE, 10Traffic: Investigate unusual media traffic pattern - https://phabricator.wikimedia.org/T273741 (10Joe) [11:21:52] 10SRE, 10Traffic: Investigate unusual media traffic pattern - https://phabricator.wikimedia.org/T273741 (10Joe) p:05Triage→03Medium [11:22:47] (03CR) 10Jbond: [C: 03+1] profile::redis::multidc: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659392 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [11:25:55] ACKNOWLEDGEMENT - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Hnowlan New buster master, not in use https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:55] ACKNOWLEDGEMENT - Maps HTTPS on maps1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.006 second response time Hnowlan New buster master, not in use https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:25:55] ACKNOWLEDGEMENT - cassandra CQL 10.64.32.8:9042 on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 9042: Connection refused Hnowlan New buster master, not in use https://phabricator.wikimedia.org/T93886 [11:25:55] ACKNOWLEDGEMENT - cassandra service on maps1009 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running Hnowlan New buster master, not in use https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:25:55] ACKNOWLEDGEMENT - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused Hnowlan New buster master, not in use 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:27:45] (03CR) 10Jbond: "LGTM but wonder if we can just drop this script, adding moritz" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:29:00] ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE ayounsi DE-CIX maintenance https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:35:07] (03PS5) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 [11:36:40] (03CR) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [11:38:58] (03CR) 10David Caro: [C: 03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [11:39:43] (03CR) 10Klausman: [C: 03+1] Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [11:40:47] (03CR) 10David Caro: [C: 03+2] last-puppet-run: don't crash if puppet has not run yet [puppet] - 10https://gerrit.wikimedia.org/r/641207 (owner: 10David Caro) [11:46:30] (03CR) 10Giuseppe Lavagetto: Allow running tests on an image once it's built (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto) [11:46:47] (03CR) 10Muehlenhoff: ldap/ldaplist.py: Port for Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:55:06] 10SRE, 10Traffic: Investigate unusual media traffic pattern for 
AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Aklapper) [11:55:40] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [11:56:00] (03PS3) 10Giuseppe Lavagetto: mediawiki::prod_sites: move to dumb templates [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) [11:58:18] PROBLEM - SSH on mw2249.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:59:19] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27834/console" [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [11:59:32] 10Puppet, 10SRE, 10User-jbond: Identify and upstream usefull fuinctions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [11:59:41] 10Puppet, 10SRE, 10User-jbond: Identify and upstream usefull fuinctions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) p:05Triage→03Medium [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:01:56] 10Puppet, 10SRE, 10User-jbond: Identify and upstream usefull fuinctions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) A new functions (`stdlib::ensure`) has now [[ https://github.com/puppetlabs/puppetlabs-stdlib/pull/1150 | been added ]] to sddlib as such we can replace our `ensure_{servi... 
[12:03:44] 10Puppet, 10SRE, 10User-jbond: Identify and upstream usefull fuinctions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) Stdlib now has `Stdlib::HTTPStatus` [[ https://github.com/puppetlabs/puppetlabs-stdlib/pull/1132 | types ]] as such we should update our code base to use them instad of `W... [12:04:18] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::prod_sites: move to dumb templates [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [12:12:00] (03CR) 10JMeybohm: [C: 03+1] Add eventstreams-internal to service_catalog [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [12:12:04] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.533 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [12:19:02] !log installing openldap security updates on LDAP replicas [12:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:36] (03CR) 10Jbond: "lg, minor nit inline, CI error seems related to https://phabricator.wikimedia.org/T272985" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott) [12:22:32] !log disable puppet fleet wide to reboot puppetmaster,puppetdb [12:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:46] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 59, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:25:22] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb1002.eqiad.wmnet [12:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:33] !log 
jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb2002.codfw.wmnet [12:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backups: Use the wmcs-backup script for vms [puppet] - 10https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692) (owner: 10David Caro) [12:28:03] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetdb1002.eqiad.wmnet [12:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:22] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1001.eqiad.wmnet [12:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:29] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1002.eqiad.wmnet [12:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:36] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1003.eqiad.wmnet [12:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:37] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1003.eqiad.wmnet [12:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:09] (03PS1) 10Marostegui: install_server: Do not reimage db1173 [puppet] - 10https://gerrit.wikimedia.org/r/661359 [12:34:17] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1002.eqiad.wmnet [12:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:21] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2003.codfw.wmnet [12:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:32] !log jbond@cumin1001 START - 
Cookbook sre.hosts.reboot-single for host puppetmaster2002.codfw.wmnet [12:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2002.codfw.wmnet [12:35:01] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1173 [puppet] - 10https://gerrit.wikimedia.org/r/661359 (owner: 10Marostegui) [12:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:05] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster1001.eqiad.wmnet [12:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:21] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2001.codfw.wmnet [12:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2003.codfw.wmnet [12:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2002.codfw.wmnet [12:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:31] 10Puppet, 10SRE, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10Aklapper) [12:43:37] (03CR) 10Muehlenhoff: ldap/ldaplist.py: Port for Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [12:46:10] (03PS1) 10Jbond: tests: fix dependencies for tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661362 [12:46:37] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster2001.codfw.wmnet [12:46:40] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:04] in that case I'll enable puppet fleet wide post reboot the master and db servers [12:49:10] (03CR) 10Jbond: [C: 03+2] tests: fix dependencies for tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661362 (owner: 10Jbond) [12:49:36] (03PS9) 10Jbond: nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) [12:49:47] (03PS3) 10Jbond: Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott) [12:51:51] (03CR) 10Jbond: "Ready for review" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [12:52:05] (03CR) 10Jbond: "CI tests fixed" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott) [12:53:17] (03PS1) 10Giuseppe Lavagetto: httpbb-tests: Fix body assertion on ombudswiki [puppet] - 10https://gerrit.wikimedia.org/r/661363 [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1300) [13:02:51] (03CR) 10David Caro: [C: 03+2] wmcs.backups: Use the wmcs-backup script for vms [puppet] - 10https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692) (owner: 10David Caro) [13:03:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb-tests: Fix body assertion on ombudswiki [puppet] - 10https://gerrit.wikimedia.org/r/661363 (owner: 10Giuseppe Lavagetto) [13:06:29] 10Puppet, 10SRE, 10User-jbond: Replace crons in puppet with systemd timer - https://phabricator.wikimedia.org/T273753 (10Ladsgroup) [13:06:54] jbond42: hey, I split the OKR ticket to the systemd timer so it'd be a little cleaner ^ [13:08:25] Amir1: mutante: already
beat you to it :) https://phabricator.wikimedia.org/T273673 [13:08:38] 🤦 sorry [13:08:45] :) np [13:09:05] 10Puppet, 10SRE, 10User-jbond: Replace crons in puppet with systemd timer - https://phabricator.wikimedia.org/T273753 (10Ladsgroup) [13:09:07] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Ladsgroup) [13:10:16] (03CR) 10Muehlenhoff: "The old cron needs to be absented." [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [13:10:36] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Gilles) Might be worth looking at the full unprocessed request headers? Do you have an example? [13:12:31] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader1001.wikimedia.org [13:12:33] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader2001.wikimedia.org [13:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1001.wikimedia.org [13:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:02] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 69541976 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:16:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2001.wikimedia.org [13:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:33] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host
xhgui2001.codfw.wmnet [13:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host xhgui1001.eqiad.wmnet [13:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:02] (03PS1) 10Jbond: trafficserver: migrate from wmflib::HttpStatus to Stdlib::HttpStatus [puppet] - 10https://gerrit.wikimedia.org/r/661365 [13:19:34] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 333976 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:20:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27835/console" [puppet] - 10https://gerrit.wikimedia.org/r/661365 (owner: 10Jbond) [13:20:23] (03CR) 10Jbond: [V: 03+1 C: 03+2] trafficserver: migrate from wmflib::HttpStatus to Stdlib::HttpStatus [puppet] - 10https://gerrit.wikimedia.org/r/661365 (owner: 10Jbond) [13:27:46] (03PS1) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 [13:28:14] (03CR) 10Jbond: "PCC running (https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27836)" [puppet] - 10https://gerrit.wikimedia.org/r/661367 (owner: 10Jbond) [13:29:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui2001.codfw.wmnet [13:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T266483', diff saved to https://phabricator.wikimedia.org/P14164 and previous config saved to /var/cache/conftool/dbconfig/20210203-132938-marostegui.json [13:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:42] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 
[13:30:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui1001.eqiad.wmnet [13:30:24] !log Stop mysql on db1120 to enable report_host T266483 [13:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:39] (03CR) 10jerkins-bot: [V: 04-1] wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (owner: 10Jbond) [13:30:57] (03PS7) 10David Caro: puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 [13:31:06] PROBLEM - Check systemd state on xhgui2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:14] PROBLEM - Check systemd state on xhgui1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14165 and previous config saved to /var/cache/conftool/dbconfig/20210203-133350-root.json [13:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:00] (03PS1) 10Jbond: wmflib: drop ensure_directory in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661368 [13:36:25] (03CR) 10Jbond: "PCC running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27837" [puppet] - 10https://gerrit.wikimedia.org/r/661368 (owner: 10Jbond) [13:38:08] (03PS1) 10Filippo Giunchedi: hieradata: use /monitoring/frontend for Swift's internal svc health checks [puppet] - 10https://gerrit.wikimedia.org/r/661369 (https://phabricator.wikimedia.org/T273453) [13:40:54] (03CR) 10jerkins-bot: [V: 04-1] puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [13:41:10] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: apply interface::rps to i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661054 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [13:41:16] (03PS2) 10Filippo Giunchedi: swift: apply interface::rps to i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661054 (https://phabricator.wikimedia.org/T271415) [13:46:28] (03CR) 10Volans: [C: 03+1] "I didn't test it but LGTM, it should work." 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [13:47:50] 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 9 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (10Marostegui) [13:48:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14166 and previous config saved to /var/cache/conftool/dbconfig/20210203-134854-root.json [13:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:02] 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (10Trizek-WMF) Wow, this will happen soon! If I read it correctly, it will disturb ContentTranslation, Flow, Echo (all Echo notifications at en.wp and all X-wiki notificati... [13:55:04] (03PS8) 10David Caro: puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 [13:58:43] !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837 [13:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:48] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [13:59:25] 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (10Marostegui) >>! In T273758#6800123, @Trizek-WMF wrote: > Wow, this will happen soon! It will happen in 14 days. > > If I read it correctly, it will disturb ContentTran... [14:00:04] hashar and dancy: May I have your attention please! Mediawiki train - European+American Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1400) [14:03:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 20%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14167 and previous config saved to /var/cache/conftool/dbconfig/20210203-140357-root.json [14:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:30] (03PS2) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 [14:07:32] (03PS1) 10Jbond: zuul::server: Add types [puppet] - 10https://gerrit.wikimedia.org/r/661371 [14:19:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14168 and previous config saved to /var/cache/conftool/dbconfig/20210203-141901-root.json [14:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:55] !log test memory limits on swift-object-replicator on ms-be2050 - T221904 [14:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:00] T221904: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 [14:20:27] !log installing openldap security updates on serpens/seaborgium [14:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:51] (03PS1) 10Jbond: wmflib: drop ensure_link in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743) [14:24:58] (03PS2) 10Jbond: wmflib: drop ensure_directory in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661368 (https://phabricator.wikimedia.org/T273743) [14:25:07] (03PS3) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) 
[14:25:30] (03CR) 10Jbond: "PCC running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27838" [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [14:26:06] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 247223448 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:28:26] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 417912 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:28:44] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661373 (https://phabricator.wikimedia.org/T273554) [14:28:46] RECOVERY - Long running screen/tmux on snapshot1009 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:28:46] RECOVERY - Long running screen/tmux on snapshot1010 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:31:46] (03CR) 10Jbond: [C: 03+2] nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [14:32:18] (03CR) 10JMeybohm: [C: 03+1] Allow running tests on an image once it's built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto) [14:32:35] (03Merged) 10jenkins-bot: nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [14:33:22] (03CR) 10JMeybohm: [C: 03+1] Add the 'uid' template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) (owner: 10Giuseppe Lavagetto) [14:34:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14169 and previous config saved to /var/cache/conftool/dbconfig/20210203-143404-root.json [14:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:48] (03PS4) 10Jbond: Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott) [14:36:30] (03CR) 10Jbond: "Gonna merge this as im doing a release" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott) [14:36:35] (03CR) 10Jbond: [C: 03+2] Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott) [14:37:27] (03Merged) 10jenkins-bot: Detect hosts under .wikimedia.cloud as 'labs' VMs. 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott) [14:38:30] !log akosiaris@cumin1001 conftool action : set/pooled=True; selector: dnsdisc=similar-users [14:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:06] !log akosiaris@cumin1001 conftool action : set/pooled=True; selector: dnsdisc=linkrecommendation [14:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:47] (03PS1) 10Alexandros Kosiaris: disc_desired_state: Add linkrecommendation/similar-users [puppet] - 10https://gerrit.wikimedia.org/r/661375 [14:41:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] disc_desired_state: Add linkrecommendation/similar-users [puppet] - 10https://gerrit.wikimedia.org/r/661375 (owner: 10Alexandros Kosiaris) [14:43:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/661375 (owner: 10Alexandros Kosiaris) [14:45:28] (03CR) 10Jbond: "Need another PCC run on this to take account of the zuul::sever ps" [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [14:45:49] (03CR) 10Vgutierrez: [C: 03+1] "proposed health check endpoint behaves as expected and LVS config seems sane" [puppet] - 10https://gerrit.wikimedia.org/r/661369 (https://phabricator.wikimedia.org/T273453) (owner: 10Filippo Giunchedi) [14:46:46] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe) [14:49:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14170 and previous config saved to /var/cache/conftool/dbconfig/20210203-144908-root.json [14:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:44] PROBLEM - Check systemd state 
on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:14] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:08] (03PS16) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [15:04:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14171 and previous config saved to /var/cache/conftool/dbconfig/20210203-150411-root.json [15:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:18] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:31] (03PS3) 10Jbond: (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [15:14:29] (03CR) 10jerkins-bot: [V: 04-1] (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [15:17:10] (03PS1) 10DCausse: [cirrus] rename ores_articletopics -> weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661383 (https://phabricator.wikimedia.org/T273508) [15:17:12] (03PS1) 10DCausse: [cirrus] drop deprecated ores_articletopics config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661384 (https://phabricator.wikimedia.org/T273508) [15:18:29] !log installing ca-certificates update for buster (reverting the Symantec CA 
blacklist, related to GeoTrust CA) [15:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:57] (03PS1) 10Elukey: Add eventstreams-internal VIP DNS config [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160) [15:24:06] (03CR) 10Zfilipin: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani) [15:25:20] (03PS2) 10Elukey: Add eventstreams-internal VIP DNS config [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160) [15:26:24] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [15:32:24] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) 05Resolved→03Open a:05ayounsi→03Volans Re-opening as we're aiming to implement it this quarter. [15:32:42] !log disabling puppet on install1003 for a quick test for T221388 [15:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:59] T221388: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 [15:34:01] (03PS1) 10Razzi: hadoop: Add hiera setting to symlink hadoop logs to /var/log/hadoop [puppet] - 10https://gerrit.wikimedia.org/r/661391 (https://phabricator.wikimedia.org/T265126) [15:34:23] (03CR) 10JMeybohm: [C: 03+1] "LGTM, but merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/661067/ first" [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [15:37:18] (03CR) 10David Caro: [C: 03+2] puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [15:37:35] (03CR) 10Elukey: [C: 03+2] Add conftool data for eventstreams-internal (new VIP) [puppet] - 10https://gerrit.wikimedia.org/r/661067 (https://phabricator.wikimedia.org/T269160) 
(owner: 10Elukey) [15:40:54] (03PS3) 10Elukey: Add eventstreams-internal VIP DNS config [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160) [15:41:05] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10klausman) All machines are now base installed (puppet-runs done with `insetup`). [15:41:18] 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10klausman) All machines are now base installed (puppet-runs done with `insetup`). [15:41:20] (03CR) 10Jbond: [C: 03+2] wmflib: drop ensure_directory in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661368 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [15:42:12] (03PS4) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) [15:42:36] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:49] (03PS5) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) [15:42:57] (03PS2) 10Jbond: zuul::server: Add types [puppet] - 10https://gerrit.wikimedia.org/r/661371 [15:43:13] (03PS6) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) [15:43:48] (03CR) 10Jbond: "new pcc (running) https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27844/" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [15:44:54] PROBLEM - configured eth on sretest1001 is CRITICAL: ens2f1 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:45:03] (03CR) 10Jbond: "PCC (running) https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27844/console" [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [15:45:23] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [15:46:35] !log one-off installing imposm3 on maps1009 [15:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:49] (03Merged) 10jenkins-bot: puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [15:48:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet [15:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:20] !log draining ganeti4001 for eventual reboot [15:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:23] (03PS1) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/661396 (https://phabricator.wikimedia.org/T79922) [15:50:02] (03PS17) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [15:50:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet [15:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:51] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host labweb1001.wikimedia.org [15:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:00] (03PS18) 10Jcrespo: Bacula: Start using new storage/pools for es database content backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 
(https://phabricator.wikimedia.org/T79922) [15:54:35] RECOVERY - Check systemd state on doc2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti4001.ulsfo.wmnet [15:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:06] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host labweb1001.wikimedia.org [15:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:12] (03CR) 10Jcrespo: "I have split the starting-to-become-quite-large commit into 2 smaller ones. I would like to start deploying to check I am not breaking anything," [puppet] - 10https://gerrit.wikimedia.org/r/661396 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [15:59:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4001.ulsfo.wmnet [15:59:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host labweb1002.wikimedia.org [15:59:31] (03CR) 10Cwhite: stdlib: update to 6.6.0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661118 (owner: 10Jbond) [15:59:47] jmm@cumin2001: Failed to log message to wiki. Somebody should check the error logs.
[15:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] !log draining ganeti4003 for eventual reboot [16:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:01] PROBLEM - dhclient process on sretest1001 is CRITICAL: PROCS CRITICAL: 3 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [16:03:03] (03PS1) 10Jbond: stdlib: fix metadata version [puppet] - 10https://gerrit.wikimedia.org/r/661402 [16:03:48] (03CR) 10Jbond: stdlib: update to 6.6.0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661118 (owner: 10Jbond) [16:04:00] (03CR) 10Jbond: [C: 03+2] stdlib: fix metadata version [puppet] - 10https://gerrit.wikimedia.org/r/661402 (owner: 10Jbond) [16:05:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host labweb1002.wikimedia.org [16:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:10] (03CR) 10Elukey: [C: 03+2] Add eventstreams-internal VIP DNS config [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [16:06:02] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/661396 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [16:06:53] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti4003.ulsfo.wmnet [16:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:36] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams-internal [16:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:17] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful 
functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [16:11:52] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) Wmflib::Php_version is probably a bit specific to go to stdlib but we should move it to the php module. [16:12:06] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [16:12:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4003.ulsfo.wmnet [16:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:14] !log enabled puppet on install1003 after the test T221388 [16:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:19] T221388: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 [16:13:50] !log failover ganeti master in ulsfo to ganeti4003 [16:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:12] !log draining ganeti4002 for eventual reboot [16:16:21] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) I tested the config with: ` host sretest1001 { host-identifier option agent.circuit-id "ge-3/0/15.0:private1-d-eqiad"; fixed-address sretest1001.eqiad.wmnet; } ` And it seemed to work as expected. I need to p... 
[16:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:57] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti4002.ulsfo.wmnet [16:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host miscweb1002.eqiad.wmnet [16:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb1002.eqiad.wmnet [16:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:14] (03PS1) 10Jbond: stdlib: fix refrences and changelog [puppet] - 10https://gerrit.wikimedia.org/r/661406 [16:22:40] (03CR) 10Jbond: [C: 03+2] stdlib: fix refrences and changelog [puppet] - 10https://gerrit.wikimedia.org/r/661406 (owner: 10Jbond) [16:22:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host planet2002.codfw.wmnet [16:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:54] (03CR) 10CRusnov: "Sounds good, marking this and ldapsupportlib.py willnotport. 
I'll leave these branches alive though just in case since the work is already" [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:24:07] (03CR) 10CRusnov: [C: 04-1] ldap/ldaplist.py: Port for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:24:20] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH) [16:24:33] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH) [16:24:53] (03CR) 10CRusnov: [C: 04-1] "see comments in https://gerrit.wikimedia.org/r/c/operations/puppet/+/658427" [puppet] - 10https://gerrit.wikimedia.org/r/658360 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:24:59] (03PS1) 10Elukey: Remove dns-disc config for eventstreams-internal [dns] - 10https://gerrit.wikimedia.org/r/661407 [16:26:21] (03CR) 10Elukey: [C: 03+2] Remove dns-disc config for eventstreams-internal [dns] - 10https://gerrit.wikimedia.org/r/661407 (owner: 10Elukey) [16:26:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4002.ulsfo.wmnet [16:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet2002.codfw.wmnet [16:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:54] (03CR) 10Jbond: "another pcc https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27848/" [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [16:28:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host planet1002.eqiad.wmnet [16:28:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet [16:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:30] (03PS3) 10Elukey: Add eventstreams-internal to service_catalog [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) [16:32:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet1002.eqiad.wmnet [16:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:15] RECOVERY - dhclient process on sretest1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [16:33:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host peek2001.codfw.wmnet [16:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:32] (03CR) 10Elukey: [C: 03+2] Add eventstreams-internal to service_catalog [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [16:34:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet [16:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host people2001.codfw.wmnet [16:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:24] (03PS1) 10Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in codfw [puppet] - 10https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) [16:35:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host peek2001.codfw.wmnet [16:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:49] (03PS2) 
10Elukey: role::kubernetes::worker: add empty stanza for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661072 (https://phabricator.wikimedia.org/T269160) [16:36:32] (03CR) 10Elukey: [C: 03+2] role::kubernetes::worker: add empty stanza for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661072 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [16:36:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2001.codfw.wmnet [16:36:57] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10georginaburnett-wmde) [16:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:01] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf] (nvironment): imposm Deploy Tilerator build for buster machines [16:37:03] !log mbsantos@deploy1001 deploy aborted: imposm Deploy Tilerator build for buster machines (duration: 00m 03s) [16:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:20] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf] (imposm): Deploy Tilerator build for buster machines [16:37:23] !log mbsantos@deploy1001 deploy aborted: Deploy Tilerator build for buster machines (duration: 00m 03s) [16:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:48] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [16:42:05] (03PS1) 10Muehlenhoff: Update account meta data for mraish [puppet] - 10https://gerrit.wikimedia.org/r/661410 [16:44:26] (03CR) 10Muehlenhoff: [C: 03+2] Update account meta data for mraish [puppet] - 
10https://gerrit.wikimedia.org/r/661410 (owner: 10Muehlenhoff) [16:44:33] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf] (beta): (no justification provided) [16:44:34] !log mbsantos@deploy1001 deploy aborted: (no justification provided) (duration: 00m 01s) [16:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:38] (03CR) 10Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [16:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:51] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf] (imposm): (no justification provided) [16:44:51] !log mbsantos@deploy1001 deploy aborted: (no justification provided) (duration: 00m 00s) [16:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:50] (03CR) 10Muehlenhoff: "Ack, then I'll update the patch to move you to the cn=wmf LDAP group instead." [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani) [16:52:59] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) Adding the `circuit-id prefix host-name` setting and removing the `remote-id` that we're not gonna use, the circuit ID includes the switch hostname too, so becoming `asw2-d-eqiad:ge-3/0/15.0:private1-d-eqiad`. That shou... 
[16:53:06] (03PS2) 10Muehlenhoff: Offboard zfilipin from Release Engineering [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani) [16:53:55] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) Next step is https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_ba... [16:55:54] (03PS3) 10Muehlenhoff: Offboard zfilipin from Release Engineering [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani) [16:56:52] (03PS1) 10Jbond: nutcracker: drop use of to_milliseconds function [puppet] - 10https://gerrit.wikimedia.org/r/661414 (https://phabricator.wikimedia.org/T273743) [16:56:54] (03PS1) 10Jbond: wmflib: drop to_seconds and to_milliseconds [puppet] - 10https://gerrit.wikimedia.org/r/661415 (https://phabricator.wikimedia.org/T273743) [16:58:02] (03PS5) 10Jbond: (WIP) ssl: new ssl module intialy planned to replace ssl_ciphersuite() [puppet] - 10https://gerrit.wikimedia.org/r/640480 (https://phabricator.wikimedia.org/T273743) [17:00:45] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.046 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:01:15] !log elukey@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams-internal [17:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:53] PROBLEM - dhclient process on sretest1001 is CRITICAL: PROCS CRITICAL: 1 process with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:15:52] (03CR) 10Zfilipin: [C: 03+1] Offboard zfilipin from Release Engineering 
[puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani) [17:37:58] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [17:38:36] (03PS1) 10Urbanecm: [WIP] Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) [17:38:38] (03PS1) 10Hnowlan: conftool: restore maps1009 to kartotherian pool [puppet] - 10https://gerrit.wikimedia.org/r/661420 [17:41:19] (03CR) 10Urbanecm: [C: 04-2] "DNM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) (owner: 10Urbanecm) [17:41:21] (03CR) 10Hashar: "I have deleted the couple comments that mentioned an unrelated ppc run (that was a good excuse for me to try deleting a comment)." [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [17:44:10] * Urbanecm stagging at mwdebug1003 [17:45:51] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10WMDE-leszek) [17:46:00] * Urbanecm done [17:48:54] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10WMDE-leszek) As an Engineering Manager at WMDE, I approve this request and confirm Georgina's affiliation with WDME. Tagging #wmf-legal as well to ensure the requi... [17:50:21] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10WMDE-leszek) @Aklapper this message above from Herald seems like something relatively new? Please advise if we should reach out to WMF Legal for NDA and related to... [17:53:51] (03CR) 10David Caro: "@Volans anything else needed for this?" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [17:58:28] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:02:25] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661426 [18:04:12] RECOVERY - SSH on mw2249.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:06:12] (03PS3) 10Jbond: zuul::server: Add types [puppet] - 10https://gerrit.wikimedia.org/r/661371 [18:06:25] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [18:07:38] (03CR) 10Jbond: "> Class[Profile::Zuul::Server]:" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [18:10:06] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661426 (owner: 10PipelineBot) [18:11:32] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661426 (owner: 10PipelineBot) [18:13:50] !log dduvall@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [18:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:34] (03CR) 10Dzahn: [C: 03+1] "lgtm... 
and just a comment outside the scope of this, I wonder if a type for email addresses would make sense" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [18:16:14] (03CR) 10Volans: [C: 04-1] "> Patch Set 6:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [18:18:23] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) The config for the above test used: ` circuit-id { prefix { host-name; } } ` [18:22:19] (03CR) 10Joal: [C: 03+1] "Thanks @razzi" [puppet] - 10https://gerrit.wikimedia.org/r/661209 (https://phabricator.wikimedia.org/T273004) (owner: 10Razzi) [18:23:48] !log dduvall@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [18:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:26:24] !log dduvall@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [18:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:33] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:30:41] volans: have you seen this one before? sudo ipmi-chassis --get-chassis-status -> "ipmi_cmd_get_chassis_status: bad completion code" ? 
[18:30:54] that's not the one we list as the "typical" error [18:32:36] !log razzi@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test cluster: Reboot kafka nodes - razzi@cumin1001 [18:32:36] the "diff" config command is also unusual, not empty and not a diff but instead "Unable to get Number of Users". this is broken in a new way [18:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:18] yea, I'm making a ticket because I also cant ssh to mgmt now. we'll see there [18:38:37] 10ops-codfw, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn) [18:38:59] 10ops-codfw, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn) p:05Triage→03Medium [18:39:23] (03PS3) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) [18:39:31] (03Abandoned) 10Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [18:39:39] 10ops-codfw, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn) [18:39:42] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [18:41:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:41:10] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:41:37] (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [18:43:21] (03CR) 10Dzahn: [C: 03+1] "Alright, I see. After seeing your recent ticket about upstreamf wmflib types to stdlib I guess it should be done upstream right away then." [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [18:44:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2281.codfw.wmnet with reason: REIMAGE [18:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2282.codfw.wmnet with reason: REIMAGE [18:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:47] (03CR) 10Dzahn: [V: 03+1 C: 03+2] profile::redis::multidc: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659392 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:44:56] (03CR) 10CDanis: swift: limit rsync and swift-object-replicator memory to 5% in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [18:45:13] 10SRE, 10ops-eqiad, 10User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6 (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (10Cmjohnson) I am going to be out that week...can you try and coordinate with @Jclark-ctr [18:45:55] 10SRE, 10ops-eqiad, 10Analytics-Radar: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10Cmjohnson) 05Open→03Resolved 
done [18:45:59] 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10Cmjohnson) ok [18:46:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2281.codfw.wmnet with reason: REIMAGE [18:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:43] (03CR) 10Dzahn: "noop on mc2026" [puppet] - 10https://gerrit.wikimedia.org/r/659392 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:47:37] (03PS3) 10Bstorm: Revert "dumps: fail over dumps web" [dns] - 10https://gerrit.wikimedia.org/r/660798 [18:47:40] (03CR) 10Andrew Bogott: [C: 03+1] labs_bootstrapvz: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/660953 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:48:13] (03CR) 10Dzahn: "sure, no problem. Just in the cases where it's only <= 4 servers it seemed quicker to manually delete crons than doing a second change to " [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [18:48:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2282.codfw.wmnet with reason: REIMAGE [18:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:37] (03PS4) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) [18:49:13] (03CR) 10Bstorm: [C: 03+2] Revert "dumps: fail over dumps web" [dns] - 10https://gerrit.wikimedia.org/r/660798 (owner: 10Bstorm) [18:50:39] (03CR) 10Bstorm: [C: 03+2] Revert "dumps-dist: fail over labstore1006 to 1007" [puppet] - 10https://gerrit.wikimedia.org/r/660799 (owner: 10Bstorm) [18:50:41] (03Abandoned) 10Andrew Bogott: acme-chief designate-sync.py: set ttl to 0 for txt records [puppet] - 10https://gerrit.wikimedia.org/r/655476 (owner: 10Andrew Bogott) 
[18:50:55] (03Abandoned) 10Andrew Bogott: Cloud instances: add duplicate hiera settings for profile::base::labs:: settings [puppet] - 10https://gerrit.wikimedia.org/r/661171 (owner: 10Andrew Bogott) [18:51:36] (03PS5) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) [18:52:21] (03PS1) 10Urbanecm: kowiki: Fix wgGEHelpPanelHelpDeskTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661435 (https://phabricator.wikimedia.org/T273799) [18:53:17] (03PS6) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) [18:57:28] (03CR) 10Dzahn: [C: 03+2] labs_bootstrapvz: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/660953 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:58:21] (03CR) 10Gergő Tisza: [C: 03+1] kowiki: Fix wgGEHelpPanelHelpDeskTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661435 (https://phabricator.wikimedia.org/T273799) (owner: 10Urbanecm) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:04] hashar and dancy: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1900). 
[19:00:18] I'll sync the patch tg.r just +1'ed [19:00:29] (03CR) 10Urbanecm: [C: 03+2] kowiki: Fix wgGEHelpPanelHelpDeskTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661435 (https://phabricator.wikimedia.org/T273799) (owner: 10Urbanecm) [19:01:28] (03Merged) 10jenkins-bot: kowiki: Fix wgGEHelpPanelHelpDeskTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661435 (https://phabricator.wikimedia.org/T273799) (owner: 10Urbanecm) [19:02:02] (03PS3) 10Dzahn: installserver::proxy: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) [19:06:54] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 56351f0434be36f4a639f98986d7785dd4d0b14d: kowiki: Fix wgGEHelpPanelHelpDeskTitle (T273799) (duration: 01m 10s) [19:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:58] T273799: The help panel at kowiki does not load - https://phabricator.wikimedia.org/T273799 [19:07:00] * Urbanecm done [19:07:45] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:08:11] (03CR) 10Ryan Kemper: relforge: service impl of relforge100[3,4] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [19:09:23] (03CR) 10Ladsgroup: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [19:10:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1334.eqiad.wmnet with reason: REIMAGE [19:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:11] !log dzahn@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1334.eqiad.wmnet with reason: REIMAGE [19:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:41] (03CR) 10Dzahn: "These are good points, alright, amending to all." [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [19:14:09] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2282.codfw.wmnet'] ` an... [19:17:09] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2282.codfw.wmnet [19:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2281.codfw.wmnet'] ` an... 
[19:18:24] (03PS2) 10Dzahn: logging::mediawiki::udp2log: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) [19:21:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2281.codfw.wmnet [19:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:23] I'm going to hack on mwdebug1003 for a bit [19:31:01] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2282.codfw.wmnet [19:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:53] (03CR) 10Gehel: [C: 04-1] "minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [19:34:21] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2281.codfw.wmnet [19:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:04] PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:08] (03CR) 10Hashar: [C: 04-1] "> > Class[Zuul::Server]: <----------- class is here" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [19:45:19] ;(done) [19:46:17] (03CR) 10Gehel: [C: 04-2] "I am now convinced that we should not fix those tests, but fix the production code instead. It seems like a bad practice to do IO in a __s" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [19:50:07] 10SRE, 10MediaWiki-Debug-Logger, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10thcipriani) One helpful step might be to have a log of what hostname... 
[19:51:39] (03PS1) 10Urbanecm: Set wgGEHelpPanelAskMentor to true for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661448 (https://phabricator.wikimedia.org/T272753) [19:54:00] (03PS2) 10Dzahn: debmonitor::client: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) [19:57:25] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1334.eqiad.wmnet'] ` an... [20:00:04] hashar and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T2000). [20:01:33] ^^ train is blocked [20:01:54] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1334.eqiad.wmnet [20:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:46] due to https://phabricator.wikimedia.org/T273242 [20:03:12] why is no-one reviewing my patch :-( [20:03:40] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1334.eqiad.wmnet [20:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:55] hashar are you planning to backport that Echo logging followup? [20:13:10] Majavah: your patch was mentioned in the platform engineering element/slack channel and so at least people know. I'm not sure who would pick it up during US work hours, but we'll see. [20:15:27] Majavah whats the patch? I'm bored [20:22:09] DannyS712: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/661357 [20:23:11] oh, that one - I saw it and didn't understand how FeaturedFeeds worked. 
Does it do anything other than switch the parser options from user-based to anon? [20:24:54] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:26] it stops caching an user object, using anon parser is a byproduct of lego.ktm's comment [20:26:28] still pinged :p [20:26:41] :D [20:26:59] do I need to start doing something like le.gok.tm [20:27:10] still pinged [20:27:48] Cool though, that should be better overall too [20:34:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:33] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr I am assigning this to @Jclark-ctr [20:37:39] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Cmjohnson) @herron I am not sure yet, it's not in the rack. I need to see where it is and I'll get back to you this week w/a plan. [20:38:40] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10Cmjohnson) a:05Cmjohnson→03RobH Rob, these are ready for you with the temp password. [20:44:47] (03CR) 10Thcipriani: [C: 03+1] "Thanks Moritz!" 
[puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani) [20:45:57] (03PS2) 10Bstorm: wikireplicas-proxy: add commented examples of depoolings for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476) [20:47:17] (03PS3) 10Bstorm: wikireplicas-proxy: add commented examples of depoolings for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476) [20:48:25] (03CR) 10Bstorm: [C: 03+2] wikireplicas-proxy: add commented examples of depoolings for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [20:50:37] (03CR) 10Bstorm: [C: 03+2] wikireplicas: deploy a cloud-based query sampler for the replicas [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [20:57:18] (03PS1) 10Bstorm: wikireplicas: add a role for consistency on the querysampler service [puppet] - 10https://gerrit.wikimedia.org/r/661464 (https://phabricator.wikimedia.org/T272723) [21:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T2100). Please do the needful. 
[21:01:12] (03CR) 10Bstorm: [C: 03+2] wikireplicas: add a role for consistency on the querysampler service [puppet] - 10https://gerrit.wikimedia.org/r/661464 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [21:03:06] RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2021-02-03 10:25:36 (1449 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [21:05:41] !log razzi@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test cluster: Reboot kafka nodes - razzi@cumin1001 [21:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:08] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10Cmjohnson) 05Open→03Resolved removed from rack [21:06:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417 (10Cmjohnson) 05Open→03Resolved removed from rack [21:06:32] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [21:06:43] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10Cmjohnson) [21:06:54] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10Cmjohnson) netbox updated and removed from rack [21:07:10] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10Cmjohnson) 05Open→03Resolved [21:07:31] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware: decommission helium.eqiad.wmnet and helium-array - 
https://phabricator.wikimedia.org/T273049 (10Cmjohnson) [21:07:45] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (10Cmjohnson) Both have been removed from rack and netbox updated [21:07:51] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (10Cmjohnson) 05Open→03Resolved [21:08:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frdb1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T271739 (10Cmjohnson) [21:08:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frdb1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T271739 (10Cmjohnson) 05Open→03Resolved netbox updated and removed from rack [21:08:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436 (10Cmjohnson) [21:09:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436 (10Cmjohnson) 05Open→03Resolved netbox updated and removed from rack [21:10:09] 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10Cmjohnson) @ayounsi Can we do this Friday, 5 Feb 1500UTC? 
[21:10:49] 10SRE, 10ops-eqiad, 10User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6 (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr Assigning this to @Jclark-ctr [21:22:22] (03PS1) 10Bstorm: wikireplicas: remove error from the profile for query sampler [puppet] - 10https://gerrit.wikimedia.org/r/661472 [21:24:42] (03CR) 10Bstorm: [C: 03+2] wikireplicas: remove error from the profile for query sampler [puppet] - 10https://gerrit.wikimedia.org/r/661472 (owner: 10Bstorm) [21:33:27] !log rebooting Netbox cluster [21:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:28] (03PS1) 10Bstorm: wikireplicas: fix one more typo in the new querysampler service [puppet] - 10https://gerrit.wikimedia.org/r/661478 [21:34:59] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single for host netboxdb2001.codfw.wmnet [21:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:50] (03CR) 10Bstorm: [C: 03+2] wikireplicas: fix one more typo in the new querysampler service [puppet] - 10https://gerrit.wikimedia.org/r/661478 (owner: 10Bstorm) [21:38:59] 10SRE, 10LDAP-Access-Requests: Access to Product Superset for Rmurthy - https://phabricator.wikimedia.org/T273813 (10Peachey88) [21:39:40] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2001.codfw.wmnet [21:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:16] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Cmjohnson) @Jgreen Do you have an IP identified for these? 
[21:40:17] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single for host netbox2001.wikimedia.org [21:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:54] (03PS1) 10Bstorm: wikireplicas: query sampler subscribe to correct config file. [puppet] - 10https://gerrit.wikimedia.org/r/661485 [21:44:04] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox2001.wikimedia.org [21:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:08] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single for host netboxdb1001.eqiad.wmnet [21:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:46] PROBLEM - Check systemd state on kafka-test1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:20] (03PS1) 10Wolfgang Kandek: Adding calculator-service to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/661491 (https://phabricator.wikimedia.org/T273151) [21:50:24] (03CR) 10Bstorm: [C: 03+2] wikireplicas: query sampler subscribe to correct config file. [puppet] - 10https://gerrit.wikimedia.org/r/661485 (owner: 10Bstorm) [21:51:02] (03CR) 10jerkins-bot: [V: 04-1] Adding calculator-service to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/661491 (https://phabricator.wikimedia.org/T273151) (owner: 10Wolfgang Kandek) [21:53:05] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1001.eqiad.wmnet [21:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:32] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single for host netbox1001.wikimedia.org [21:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:28] (03CR) 10Bstorm: "A minor suggestion and one nit. 
Let me know what you think. Otherwise lgtm" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 (owner: 10David Caro) [22:06:30] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox1001.wikimedia.org [22:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:18] PROBLEM - Check systemd state on kafka-test1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:50] (03CR) 10Holger Knust: [C: 03+2] labs: Remove redundant apiportal config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin) [22:28:10] (03Merged) 10jenkins-bot: labs: Remove redundant apiportal config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin) [22:29:28] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Jclark-ctr) [22:32:08] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10RobH) a:05RobH→03herron @herron: before I image these, they have an odd hostname of logstash-be103[345], where the codfw logstash ordered in Q2 just have normal logstash2* ho... 
[22:33:22] (03PS2) 10Urbanecm: [WIP] Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) [22:42:53] (03PS1) 10Urbanecm: bnwiki: wgGEHelpPanelLinks: Remove text in brackets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661522 (https://phabricator.wikimedia.org/T266020) [22:49:53] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Jclark-ctr) @Cmjohnson netbox is updated racked in C6 U40. port 39 [22:50:17] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Jclark-ctr) [22:53:42] !log razzi@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka test cluster: Reboot kafka nodes - razzi@cumin1001 [22:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:40] (03CR) 10Ladsgroup: [C: 03+1] installserver::proxy: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:12:26] RECOVERY - Check systemd state on kafka-test1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:07] (03CR) 10Ladsgroup: logging::mediawiki::udp2log: replace cron with timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:14:08] RECOVERY - Check systemd state on kafka-test1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:36] RECOVERY - Check systemd state on kafka-test1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:00] (03CR) 10Dzahn: logging::mediawiki::udp2log: replace cron with 
timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:24:52] (03PS1) 10Razzi: site: add clouddb1021.eqiad.wmnet to insetup [puppet] - 10https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211) [23:28:02] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:29:26] (03PS1) 10Razzi: wikireplicas: Add configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) [23:29:33] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:31:05] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:31:37] jouncebot: next [23:31:37] In 0 hour(s) and 28 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T0000) [23:34:05] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[23:34:35] (03PS2) 10Razzi: site: add clouddb1021.eqiad.wmnet to insetup [puppet] - 10https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211) [23:35:46] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10Jclark-ctr) service pack tool is only available for in warranty devices {F34091626}. Have reached out to Chris /papaul for... [23:38:34] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10Dzahn) How about the unchecked boxes like wiping and updating netbox? I still see it here: https://netbox.wikimedia.org/dcim/devices/1444/ [23:39:48] (03PS2) 10Razzi: wikireplicas: Add configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) [23:40:22] (03PS3) 10Dzahn: logging::mediawiki::udp2log: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) [23:43:39] (03CR) 10Razzi: "Once we have this, we should be able to reimage labsdb1012 and rename it to clouddb1021 any time in the next few weeks, the sooner the bet" [puppet] - 10https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [23:46:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2280.codfw.wmnet with reason: REIMAGE [23:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2279.codfw.wmnet with reason: REIMAGE [23:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2280.codfw.wmnet with reason: REIMAGE [23:48:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:47] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27852/install2003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:50:20] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1009 is CRITICAL: 2.768e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [23:50:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2279.codfw.wmnet with reason: REIMAGE [23:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:00] !log installservers: replacing squid proxy logrotate cron with systemd timer [23:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:51] (03CR) 10Dzahn: "how to test after merge:" [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:56:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1318.eqiad.wmnet with reason: REIMAGE [23:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1318.eqiad.wmnet with reason: REIMAGE [23:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log