[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T0000).
[00:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:00:26] <wikibugs_>	 (CR) Dduvall: releases: Parameterize profile::ci::kubernetes_config owner/group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/661224 (https://phabricator.wikimedia.org/T273681) (owner: Dduvall)
[00:01:37] <mutante>	 marxarelli: so 'releasers-mediawiki' isn't the right owner when it's for MW?
[00:01:55] <marxarelli>	 not in this case
[00:02:00] <mutante>	 ok, ack
[00:02:12] <mutante>	 I understand, owning config vs owning actual release files
[00:02:55] <wikibugs_>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/27829/" [puppet] - https://gerrit.wikimedia.org/r/661224 (https://phabricator.wikimedia.org/T273681) (owner: Dduvall)
[00:03:28] <legoktm>	 since no one proposed any patches, I'm going to use the backport window to deploy some logo (non-)changes
[00:03:41] <marxarelli>	 right. restricting access to the config isn't that critical but i didn't want to add contint-admins to those hosts
[00:03:59] <mutante>	 I like that part, yep
[00:04:20] <mutante>	 noop on contint1001
[00:04:24] <marxarelli>	 thanks for the review/fixes/merge! :)
[00:04:33] <mutante>	 ci-staging.config]/group: group changed 'root' to 'contint-roots'
[00:04:44] <wikibugs_>	 (PS2) Legoktm: logos: Update nlwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660973
[00:04:46] <wikibugs_>	 (PS2) Legoktm: logos: Update eswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660974
[00:04:47] <wikibugs_>	 (PS2) Legoktm: logos: Update ptwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660975
[00:04:49] <wikibugs_>	 (PS2) Legoktm: logos: Update ruwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660976
[00:04:51] <wikibugs_>	 (PS2) Legoktm: logos: Update svwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660977
[00:04:53] <wikibugs_>	 (PS2) Legoktm: logos: Remove TODO for pngout [mediawiki-config] - https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380)
[00:04:56] <wikibugs_>	 (PS2) Legoktm: logos: Redo how variants work [mediawiki-config] - https://gerrit.wikimedia.org/r/661052 (https://phabricator.wikimedia.org/T98640)
[00:04:57] <wikibugs_>	 (PS3) Legoktm: logos: Update zhwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660979
[00:05:17] <mutante>	 marxarelli: everything seems good now on releases1002 :)
[00:05:26] <marxarelli>	 yay \o/
[00:05:34] <wikibugs_>	 (CR) Legoktm: [C: +2] logos: Update nlwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660973 (owner: Legoktm)
[00:06:07] <wikibugs_>	 SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (Dzahn) after this last merge everything seems good...
[00:06:10] <wikibugs_>	 (CR) Legoktm: [C: +2] logos: Update eswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660974 (owner: Legoktm)
[00:06:31] <wikibugs_>	 (CR) Legoktm: [C: +2] logos: Update ptwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660975 (owner: Legoktm)
[00:06:35] <wikibugs_>	 (Merged) jenkins-bot: logos: Update nlwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660973 (owner: Legoktm)
[00:06:51] <mutante>	 marxarelli: I just claimed that ticket is resolved then. of course reopen if there is something else, about to go off for now
[00:07:03] <wikibugs_>	 SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (Dzahn) Open→Resolved a:Dzahn
[00:07:04] <marxarelli>	 thanks again
[00:07:11] <wikibugs_>	 (Merged) jenkins-bot: logos: Update eswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660974 (owner: Legoktm)
[00:07:14] <mutante>	 np! cu
[00:07:19] <wikibugs_>	 (CR) Legoktm: [C: +2] logos: Update ruwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660976 (owner: Legoktm)
[00:07:36] <wikibugs_>	 (CR) Legoktm: [C: +2] logos: Update svwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660977 (owner: Legoktm)
[00:07:43] <wikibugs_>	 (CR) Legoktm: [C: +2] logos: Remove TODO for pngout [mediawiki-config] - https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) (owner: Legoktm)
[00:08:14] <wikibugs_>	 (CR) Legoktm: [C: +2] logos: Redo how variants work [mediawiki-config] - https://gerrit.wikimedia.org/r/661052 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[00:08:16] <wikibugs_>	 (Merged) jenkins-bot: logos: Update ptwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660975 (owner: Legoktm)
[00:08:42] <wikibugs_>	 (CR) Legoktm: [C: +2] logos: Update zhwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660979 (owner: Legoktm)
[00:08:52] <wikibugs_>	 (Merged) jenkins-bot: logos: Update ruwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660976 (owner: Legoktm)
[00:09:00] <wikibugs_>	 (Merged) jenkins-bot: logos: Update svwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660977 (owner: Legoktm)
[00:09:07] <wikibugs_>	 (Merged) jenkins-bot: logos: Remove TODO for pngout [mediawiki-config] - https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) (owner: Legoktm)
[00:09:16] <wikibugs_>	 (Merged) jenkins-bot: logos: Redo how variants work [mediawiki-config] - https://gerrit.wikimedia.org/r/661052 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[00:09:31] <wikibugs_>	 (Merged) jenkins-bot: logos: Update zhwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/660979 (owner: Legoktm)
[00:09:34] <legoktm>	 ok
[00:10:17] <James_F>	 Minor merge party. :-)
[00:10:26] <logmsgbot>	 !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[00:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:16] <logmsgbot>	 !log legoktm@deploy1001 Synchronized static/images/project-logos/: Update and recompress logos for nlwiki, eswiki, ptwiki, ruwiki, svwiki, zhwiki (1/2) (duration: 01m 10s)
[00:12:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:32] <marxarelli>	 James_F: the major merge party is scheduled for friday EOD ;)
[00:12:57] <marxarelli>	 be there or be square
[00:12:57] <legoktm>	 we still have hundreds of logos to go through!
[00:13:08] <marxarelli>	 wee!
[00:13:09] <James_F>	 marxarelli: Oh gods. :-)
[00:13:42] <logmsgbot>	 !log legoktm@deploy1001 Synchronized logos/: Update and recompress logos for nlwiki, eswiki, ptwiki, ruwiki, svwiki, zhwiki (2/2) (duration: 01m 05s)
[00:13:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:21] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2021-01-26 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:14:45] <legoktm>	 cool I'm done 
[00:16:19] <logmsgbot>	 !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[00:16:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:05] <DannyS712>	 MatmaRex would you mind looking at https://github.com/MatmaRex/patchdemo/pull/223 if you have a second?
[00:28:23] <wikibugs_>	 (PS1) Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[00:29:09] <MatmaRex>	 DannyS712: eeeh. okay, whatever
[00:29:44] <DannyS712>	 thanks
[00:30:00] <MatmaRex>	 i don't really like having weird configuration there, but i guess we already have some
[00:30:22] <MatmaRex>	 DannyS712: deployed now
[00:42:19] <wikibugs_>	 (PS2) Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[00:48:58] <DannyS712>	 why is Friday now "No deploys all day!" ? I thought that used to be just Saturday and Sunday
[00:50:21] <DannyS712>	 oh, should probably ask in -releng
[00:57:30] <AntiComposite>	 DannyS712, because no one wants to fix the site on Saturday or Sunday because of a broken deploy Friday
[01:19:49] <wikibugs_>	 (Abandoned) Krinkle: objectcache: return false during more error cases in RedisBagOStuff::*Multi() methods [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658945 (owner: Ahmon Dancy)
[01:35:47] <wikibugs_>	 (PS5) Krinkle: apache: Stop aliasing zero.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester)
[01:36:24] <wikibugs_>	 (CR) Ottomata: [C: +1] presto: require partitions predicate [puppet] - https://gerrit.wikimedia.org/r/661209 (https://phabricator.wikimedia.org/T273004) (owner: Razzi)
[01:42:15] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:13] <wikibugs_>	 (PS6) Krinkle: apache: Replace zero.wikipedia.org vhost alias with redirect [puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester)
[01:44:17] <wikibugs_>	 (CR) Krinkle: "done" [puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester)
[04:22:00] <wikibugs_>	 (PS1) Andrew Bogott: Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706)
[04:25:27] <wikibugs_>	 (PS2) Andrew Bogott: Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706)
[04:31:05] <wikibugs_>	 (CR) jenkins-bot: [V: -1] Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: Andrew Bogott)
[05:07:47] <icinga-wm>	 PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[05:41:44] <wikibugs>	 (CR) QChris: "Wooohooo! \o/" [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/617206 (owner: QChris)
[06:11:53] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1174. [puppet] - https://gerrit.wikimedia.org/r/661283 (https://phabricator.wikimedia.org/T258361)
[06:13:32] <wikibugs>	 (Abandoned) Marostegui: mariadb: Productionize db1174 [puppet] - https://gerrit.wikimedia.org/r/661066 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:14:21] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1174. [puppet] - https://gerrit.wikimedia.org/r/661283 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:14:51] <wikibugs>	 SRE, Research, SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (AikoChou) Hi @CDanis, My wikitech username: AikoChou Preferred shell username: aikochou SSh public key: https://phabricator.wikimedia.org/P14137 I ha...
[06:29:11] <wikibugs>	 (CR) Marostegui: "Small comment: I would add that this requires reloading haproxy on the given proxy." [puppet] - https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[06:35:45] <wikibugs>	 (PS1) Marostegui: db1174: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/661286
[06:36:30] <wikibugs>	 (CR) Marostegui: [C: +2] db1174: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/661286 (owner: Marostegui)
[06:38:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1174 with minimal weight for the first time in s7', diff saved to https://phabricator.wikimedia.org/P14138 and previous config saved to /var/cache/conftool/dbconfig/20210203-063812-marostegui.json
[06:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 - will be decommissioned', diff saved to https://phabricator.wikimedia.org/P14139 and previous config saved to /var/cache/conftool/dbconfig/20210203-064137-marostegui.json
[06:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:27] <wikibugs>	 (PS1) Marostegui: install_server: Reimage db1173 as stretch [puppet] - https://gerrit.wikimedia.org/r/661288 (https://phabricator.wikimedia.org/T258361)
[06:52:40] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Reimage db1173 as stretch [puppet] - https://gerrit.wikimedia.org/r/661288 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:54:25] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1173.eqiad.wmnet'] ` The log ca...
[07:06:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: REIMAGE
[07:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: REIMAGE
[07:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some more weight to db1174', diff saved to https://phabricator.wikimedia.org/P14141 and previous config saved to /var/cache/conftool/dbconfig/20210203-071310-marostegui.json
[07:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:21] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1174. [puppet] - https://gerrit.wikimedia.org/r/661291
[07:15:15] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1174. [puppet] - https://gerrit.wikimedia.org/r/661291 (owner: Marostegui)
[07:15:32] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1173.eqiad.wmnet'] `  and were **ALL** successful.
[07:17:28] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:17:43] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:21:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (ArielGlenn) We have many wiki dump runs completed without problems. So please do go ahead with buster on these new servers. Thanks!
[07:22:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (ArielGlenn)
[07:37:23] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.95 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[07:37:23] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:52] <wikibugs>	 SRE, Analytics, Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (elukey) To keep archives happy - I had to revert the patch since some maven build jobs issue HTTP PUT to the /repository path, meanwhile my assumption was that...
[07:46:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14142 and previous config saved to /var/cache/conftool/dbconfig/20210203-074651-root.json
[07:46:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 to clone db1173 T258361', diff saved to https://phabricator.wikimedia.org/P14143 and previous config saved to /var/cache/conftool/dbconfig/20210203-074749-marostegui.json
[07:47:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:53] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:49:06] <marostegui>	 !log Stop mysql on db1093 to clone db1173 T258361
[07:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:32] <wikibugs>	 (CR) Ayounsi: [C: +2] Alert manager, fix DCops email [puppet] - https://gerrit.wikimedia.org/r/661178 (owner: Ayounsi)
[07:53:40] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1173 [puppet] - https://gerrit.wikimedia.org/r/661333 (https://phabricator.wikimedia.org/T258361)
[07:54:16] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1173 [puppet] - https://gerrit.wikimedia.org/r/661333 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[08:01:08] <wikibugs>	 (CR) Elukey: [C: +1] burrow/check_kafka_consumer_lag.py: Port to Python 3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:01:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 8%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14145 and previous config saved to /var/cache/conftool/dbconfig/20210203-080154-root.json
[08:01:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:11] <wikibugs>	 (CR) Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:02:22] <wikibugs>	 (CR) Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:06:25] <wikibugs>	 (PS6) Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918)
[08:09:26] <wikibugs>	 (CR) Elukey: "@Arzhel: I picked 6460[6,7] and updated https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS, let me know if it is ok :)" [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:11:14] <wikibugs>	 (PS1) Muehlenhoff: Remove access for Bernd Sitzmann [puppet] - https://gerrit.wikimedia.org/r/661334
[08:14:24] <wikibugs>	 (CR) Ayounsi: [C: +1] "It is ok." [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:16:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14146 and previous config saved to /var/cache/conftool/dbconfig/20210203-081658-root.json
[08:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:14] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for Bernd Sitzmann [puppet] - https://gerrit.wikimedia.org/r/661334 (owner: Muehlenhoff)
[08:28:32] <wikibugs>	 SRE, Datasets-General-or-Unknown, netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (Peachey88)
[08:29:02] <wikibugs>	 (CR) Muehlenhoff: [C: -1] "This removes Željko from all access groups for production, which means we should disable production SSH access entirely. @Željko: You're s" [puppet] - https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: Thcipriani)
[08:32:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 13%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14147 and previous config saved to /var/cache/conftool/dbconfig/20210203-083201-root.json
[08:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:55] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658415 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:33:02] <wikibugs>	 SRE, Datasets-General-or-Unknown, netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (ArielGlenn) Arzhel pointed out that the dumpsdata1003 host has a 10G NIC, so we can swap 1001 and 1003 roles when the current dumps run completes. This would mean dumpsdata1003 wo...
[08:33:19] <wikibugs>	 SRE, Datasets-General-or-Unknown, Dumps-Generation, netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (ArielGlenn)
[08:37:04] <wikibugs>	 ops-eqiad: Interface errors on asw2-b-eqiad:ge-8/0/6  (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (ayounsi) p:Triage→Medium
[08:38:42] <wikibugs>	 (CR) Muehlenhoff: "The new timer is fine, but we need to absent the old cron, otherwise it'll be kept in crontab and we'll effectively run it twice?" [puppet] - https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[08:38:46] <jinxer-wm>	 (Access port utilisation over 80% for 1h) firing: Access port utilisation over 80% for 1h - https://alerts.wikimedia.org
[08:39:47] <wikibugs>	 ops-eqiad, User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6  (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (ArielGlenn)
[08:40:14] <wikibugs>	 ops-eqiad, User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6  (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (ArielGlenn) We can schedule this for when the current dump run is complete and before the next one starts, so likely 17th-18th-19th Feb.
[08:40:38] <wikibugs>	 (CR) Ladsgroup: [C: -1] "Yes it needs to be absented first." [puppet] - https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[08:43:46] <jinxer-wm>	 (Access port utilisation over 80% for 1h) resolved: Access port utilisation over 80% for 1h - https://alerts.wikimedia.org
[08:45:06] <wikibugs>	 (PS1) Majavah: Add missing isset() check to ApiEchoUnreadNotificationPages [extensions/Echo] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479)
[08:45:36] <volans>	 elukey: is there any transfer going on like last week? see the access port utilization email
[08:45:54] <volans>	 ah no, seems dumpsdata1001
[08:46:22] <elukey>	 volans: always blaming Analytics
[08:46:24] <XioNoX>	 volans: it's a bug
[08:46:37] <volans>	 yeah saw now T273713
[08:46:38] <stashbot>	 T273713: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713
[08:46:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/661131 (owner: Muehlenhoff)
[08:46:39] <XioNoX>	 volans: I acked the alert, but it sent the alert instead
[08:46:47] <volans>	 thx
[08:47:03] <volans>	 btw I don't blame analytics... I just blame Luca ;)
[08:47:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 15%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14148 and previous config saved to /var/cache/conftool/dbconfig/20210203-084705-root.json
[08:47:08] <elukey>	 XioNoX: just to double check - Joseph is about to start copying some data, is it ok to proceed?
[08:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:11] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] "Thanks! This should be the smallest possible backport. I think there will be additional logging in the master branch, but this doesn't nee" [extensions/Echo] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479) (owner: Majavah)
[08:47:58] <XioNoX>	 elukey: yeah, no pb
[08:48:04] <elukey>	 perfect thanks
[08:58:16] <icinga-wm>	 PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[08:58:33] <legoktm>	 hi
[08:58:40] <_joe_>	 uh
[08:58:55] <jayme>	 o/
[08:58:58] <volans>	 o/
[08:59:03] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[08:59:11] <_joe_>	 eqsin has 90% availability
[08:59:20] <_joe_>	 and yeah, that seems to correspond
[08:59:34] <legoktm>	 (acked the alert)
[08:59:39] <XioNoX>	 yo?
[08:59:54] <XioNoX>	 what does that mean?
[09:00:00] <apergos>	 ?
[09:00:01] <_joe_>	 XioNoX: I don't think that's the case, but can you check if eqsin has connectivity issues?
[09:00:04] <jouncebot>	 marostegui and akosiaris: It is that lovely time of the day again! You are hereby commanded to deploy m2 database master restart. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T0900).
[09:00:15] <_joe_>	 we have issues with eqsin's upload AFAICT
[09:00:19] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:00:24] <marostegui>	 ^ I am going to hold that until we are clear on what's going on
[09:00:34] <volans>	 cp5005 loadavg seems normal  5.78, 7.50, 7.64
[09:00:43] <_joe_>	 volans: check varnish
[09:00:55] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:00:56] <jynus>	 it is the upload one, isn't it?
[09:01:01] <_joe_>	 Feb  3 09:00:50 lvs5002 pybal[17977]: [uploadlb_443 ProxyFetch] WARN: cp5006.eqsin.wmnet (enabled/partially up/not pooled): Fetch failed (https://healthcheck.wikimedia.org/varnish-fe), 5.002 s
[09:01:02] <legoktm>	 calendar says there is zayo maintenance right now fwiw
[09:01:09] <_joe_>	 a lot of that stuff
[09:01:32] <XioNoX>	 _joe_: network looks good so far, still looking
[09:01:33] <_joe_>	 yeah the upload varnishes seem to go down and back up consistently
[09:01:42] <_joe_>	 ok, can we depool eqsin for upload?
[09:01:48] <_joe_>	 volans: can you prepare the patch?
[09:01:52] <XioNoX>	 legoktm: no Zayo in eqsin (or anywhere anymore)
[09:01:59] <jynus>	 +1 unless someone has a better suggestion
[09:02:00] <volans>	 _joe_: sure
[09:02:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 20%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14149 and previous config saved to /var/cache/conftool/dbconfig/20210203-090208-root.json
[09:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:22] * legoktm nods
[09:02:33] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:02:49] <_joe_>	 https://grafana.wikimedia.org/d/000000304/prometheus-varnish-dc-stats?viewPanel=18&orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=cache_upload&var-layer=frontend
[09:02:54] <_joe_>	 looks like a peak in requests
[09:03:05] <_joe_>	 on upload
[09:03:07] <vgutierrez>	 huge spike there
[09:03:16] <_joe_>	 gonna take a look at the 5xx data
[09:03:37] <jynus>	 could a depool make things worse?
[09:03:43] <wikibugs>	 (PS1) Volans: depool eqsin, availability issue [dns] - https://gerrit.wikimedia.org/r/661338
[09:03:47] <volans>	 ^^^ if needed
[09:05:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5001.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:05:59] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:06:25] <wikibugs>	 SRE, Traffic: HTTP 502 Error when trying to create new page (500k characters) on Romanian Wikisource - https://phabricator.wikimedia.org/T273623 (Aklapper)
[09:08:15] <godog>	 I'm here too if needed
[09:17:02] <icinga-wm>	 PROBLEM - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1961 bytes in 3.938 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:17:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14150 and previous config saved to /var/cache/conftool/dbconfig/20210203-091712-root.json
[09:17:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:15] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5002 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:17:53] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:20:08] <_joe_>	 !log restarting varnish-frontend on cp5001
[09:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5004.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:23:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:25:14] <icinga-wm>	 RECOVERY - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 1038 bytes in 0.964 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:25:25] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5001 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:26:44] <vgutierrez>	 !log rolling restart varnish-fe on cp5004-5006
[09:26:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:04] <vgutierrez>	 !log depool cp5006
[09:30:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:39] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.451 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:32:02] <wikibugs>	 (CR) Hashar: [C: +2] Add missing isset() check to ApiEchoUnreadNotificationPages [extensions/Echo] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479) (owner: Majavah)
[09:32:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 30%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14151 and previous config saved to /var/cache/conftool/dbconfig/20210203-093215-root.json
[09:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:09] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5004 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:35:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 5%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14152 and previous config saved to /var/cache/conftool/dbconfig/20210203-093540-root.json
[09:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:22] <Majavah>	 hashar: I rebased that other Echo patch
[09:38:24] <icinga-wm>	 PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[09:38:32] <hashar>	 Majavah: greaaaat
[09:38:49] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:38:59] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:39:59] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:01] <hashar>	 Majavah: I will update the cluster with the other change
[09:40:14] <hashar>	 I don't think we had any way to reproduce though
[09:40:19] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:19] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:43] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:43] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:40:53] <Majavah>	 I don't think we know how to reproduce any of the current blockers
[09:40:53] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1173 is now replicating.
[09:41:01] <Majavah>	 also what's happening with cp5006?
[09:41:36] <volans>	 Majavah: we're working on it
[09:41:44] <volans>	 currently depooled
[09:41:47] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:41:55] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5005 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.484 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:42:05] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[09:42:48] <icinga-wm>	 RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[09:43:01] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.484 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:43:09] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp5006 is CRITICAL: cluster=cache_upload instance=cp5006 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006
[09:43:16] <hashar>	 Majavah: the other patch made it simply a warning and also added the foreign wiki to the error message ( https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/661046/1/includes/api/ApiEchoUnreadNotificationPages.php )  . I am going to update your rebase ;)
[09:43:40] <Majavah>	 please do, thanks
[09:44:01] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 409 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:44:19] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:44:19] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.458 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:44:43] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.451 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:44:43] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:45:19] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp5006 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006
[09:45:47] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 80 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:46:07] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:46:52] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[09:47:03] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.446 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:47:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 40%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14153 and previous config saved to /var/cache/conftool/dbconfig/20210203-094719-root.json
[09:47:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:12] <XioNoX>	 !log disable DE-CIX codfw peering sessions
[09:50:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 10%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14154 and previous config saved to /var/cache/conftool/dbconfig/20210203-095043-root.json
[09:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:41] <wikibugs>	 (Merged) jenkins-bot: Add missing isset() check to ApiEchoUnreadNotificationPages [extensions/Echo] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479) (owner: Majavah)
[09:57:02] <Majavah>	 hashar: backport merged ^
[09:57:53] <marostegui>	 !log m2 master restart - T272964
[09:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:58] <stashbot>	 T272964: Restart m2 database master (db1107) - https://phabricator.wikimedia.org/T272964
[09:59:18] <marostegui>	 done, checking things
[09:59:25] <jynus>	 proxy didn't complain?
[09:59:31] <moritzm>	 debmonitor is all fine
[09:59:42] <marostegui>	 otrs is fine
[09:59:49] <marostegui>	 jynus: it was fast
[10:00:46] <jynus>	 is there something else (service) I can check?
[10:00:53] <marostegui>	 xhgui looks fine too
[10:01:08] <hashar>	 Majavah: yeah updating
[10:01:15] <marostegui>	 jynus: I think we are good! thank you though :*
[10:02:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14155 and previous config saved to /var/cache/conftool/dbconfig/20210203-100222-root.json
[10:02:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:01] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/Echo/includes/api/ApiEchoUnreadNotificationPages.php: Add missing isset() check to ApiEchoUnreadNotificationPages - T273479 (duration: 01m 14s)
[10:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:05] <stashbot>	 T273479: ApiEchoUnreadNotificationPages.php PHP Notice: Undefined index: query - https://phabricator.wikimedia.org/T273479
[10:04:09] <hashar>	 10:03:59 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw2295.codfw.wmnet returned [255]: Host key verification failed.
[10:04:10] <hashar>	 bah
[10:04:45] <wikibugs>	 (CR) Hashar: "Synced on the cluster." [extensions/Echo] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/661258 (https://phabricator.wikimedia.org/T273479) (owner: Majavah)
[10:05:01] <gehel>	 !log depooling and restarting blazegraph on wdqs1007
[10:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:08] <Majavah>	 so now we just wait and see if the errors stop?
[10:05:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 25%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14156 and previous config saved to /var/cache/conftool/dbconfig/20210203-100547-root.json
[10:05:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:14] <Majavah>	 hashar: for the last blocker, I'll try to make a patch that will log the memcached key that is being set if it fails, in addition to the current message
[10:07:30] <wikibugs>	 (CR) JMeybohm: [C: +1] "Apart from the notes Alex had, this looks good to me!" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: Giuseppe Lavagetto)
[10:07:40] <Majavah>	 that should hopefully tell where the error is
[10:09:07] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:09:19] <volans>	 XioNoX: ^^^
[10:09:35] <XioNoX>	 volans: expected
[10:09:36] <XioNoX>	 thx
[10:09:46] <volans>	 just checking given the ongoing stuff, thx ;)
[10:10:36] <wikibugs>	 SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (hashar)
[10:13:55] <jynus>	 hashar, last person to touch it according to phab seems to be legoktm, you may want to add him to the ticket
[10:14:03] <kormat>	 XioNoX: all problems are expected if you're pessimistic enough
[10:14:39] <legoktm>	 hmm
[10:15:53] <legoktm>	 hashar: weird, I don't know why puppet is disabled. Let me re-enable it
[10:16:21] <hashar>	 legoktm: no idea :]  but you should sleep() really!
[10:16:21] <hashar>	 :D
[10:16:37] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:16:40] <hashar>	 I guess the ssh host key is not collected and thus does not land on the deploy machine
[10:16:43] <legoktm>	 !log re-enabled puppet on mw2295 (T273726)
[10:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:48] <stashbot>	 T273726: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726
[10:17:01] <legoktm>	 but other people have deployed with no issues since I reimaged it last week?
[10:17:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 60%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14157 and previous config saved to /var/cache/conftool/dbconfig/20210203-101726-root.json
[10:17:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:41] <jynus>	 legoktm, indeed, so it is probably not that
[10:18:00] <jynus>	 I just saw the data that it was long ago
[10:18:02] <jynus>	 *date
[10:18:23] <legoktm>	 yeah it looks like puppet was never enabled after the reimage
[10:19:01] <legoktm>	 hashar: want to try again?
[10:19:34] <jynus>	 keep an eye on it in case there is something weird there- hw issues or something :-)
[10:20:32] <hashar>	 legoktm: checking
[10:20:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 50%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14158 and previous config saved to /var/cache/conftool/dbconfig/20210203-102050-root.json
[10:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:23] <hashar>	 legoktm: I think we need a puppet run on deploy1001 to get the /etc/ssh/ssh_known_hosts, I guess that will solve itself eventually. Thx!
[10:22:29] <legoktm>	 one moment
[10:23:19] <legoktm>	 yep, there it is
[10:23:24] <legoktm>	 +mw2295.codfw.wmnet,mw2295,10.192.0.165,2620:0:860:101:10:192:0:165 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBOdJ3WhOvOB/k01lnZynn+0rpV+cWylCU3quVOezb69ZiDLvZnDJUhFVEDVQ7g1yjX0EV8stu8MKUMX9uR9Y3vY=
[10:23:33] <legoktm>	 so how did previous deploys work??
[10:23:46] <hashar>	 I have no idea ;]
[10:24:18] <hashar>	 maybe the host was not in the dsh group
[10:24:40] <legoktm>	 it was, we have icinga alerting for that
[10:25:14] <hashar>	 well who knows so :/
[10:25:24] <hashar>	 seems to be fixed anyway, I am closing that task. Thank you legoktm !
[10:25:35] <legoktm>	 hashar: er, please leave it open
[10:25:37] <wikibugs>	 SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (hashar) Open→Resolved a:Legoktm SSH host keys are collected by puppet on the hosts and written to /etc/ssh/ssh_known_hosts and since puppet was disabled the key was not collected.  Th...
[10:26:27] <wikibugs>	 SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (Legoktm) puppet hadn't run since the reimaging, which is a problem. I would've logged into it after the reimaging to run `scap pull` before repooling it, but it's possible I didn't read the MOTD p...
[10:26:35] <wikibugs>	 SRE: mw2295.codfw.wmnet returned [255]: Host key verification failed. - https://phabricator.wikimedia.org/T273726 (Legoktm) Resolved→Open
[10:26:50] <legoktm>	 hashar: I'll look into this properly tomorrow
[10:27:09] <legoktm>	 I feel like something else went wrong here, puppet shouldn't have been disabled for a whole week with no one noticing, etc.
[10:27:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] swift: limit rsync service memory [puppet] - https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) (owner: Filippo Giunchedi)
[10:28:07] <vgutierrez>	 !log rolling restart of varnish-fe on cp5002 and cp5003
[10:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] swift: limit rsync to 10% memory in codfw [puppet] - https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) (owner: Filippo Giunchedi)
[10:31:41] <hashar>	 legoktm: yeah good luck :\  Thx for the quick fix!
[10:32:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14159 and previous config saved to /var/cache/conftool/dbconfig/20210203-103229-root.json
[10:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:12] <Majavah>	 hashar: you think we can just wrap the memcached operation in a try-catch and add a custom message to exceptions with the memcached key? or does it need something better
[10:33:37] <wikibugs>	 (PS1) Filippo Giunchedi: role: default swift rsync memory limit [puppet] - https://gerrit.wikimedia.org/r/661343 (https://phabricator.wikimedia.org/T221904)
[10:35:26] <wikibugs>	 (CR) JMeybohm: [C: +1] role::kubernetes::worker: add empty stanza for eventstreams-internal [puppet] - https://gerrit.wikimedia.org/r/661072 (https://phabricator.wikimedia.org/T269160) (owner: Elukey)
[10:35:39] <wikibugs>	 (CR) JMeybohm: [C: +1] Add conftool data for eventstreams-internal (new VIP) [puppet] - https://gerrit.wikimedia.org/r/661067 (https://phabricator.wikimedia.org/T269160) (owner: Elukey)
[10:35:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 75%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14160 and previous config saved to /var/cache/conftool/dbconfig/20210203-103554-root.json
[10:35:58] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] role: default swift rsync memory limit [puppet] - https://gerrit.wikimedia.org/r/661343 (https://phabricator.wikimedia.org/T221904) (owner: Filippo Giunchedi)
[10:35:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:11] <wikibugs>	 (CR) JMeybohm: [C: -1] Add eventstreams-internal to service_catalog (2 comments) [puppet] - https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: Elukey)
[10:38:54] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review, User-fgiunchedi: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (fgiunchedi)
[10:39:45] <wikibugs>	 (PS1) Kormat: integration_env: Add 'dbvers' command [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661344
[10:40:24] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 95, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:40:30] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (jcrespo)
[10:40:49] <wikibugs>	 (PS1) Jcrespo: install_server: Decommission db1095, substitute with db1171 [puppet] - https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732)
[10:41:04] <wikibugs>	 (PS2) Jcrespo: install_server: Decommission db1095, substitute with db1171 [puppet] - https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732)
[10:42:04] <wikibugs>	 (CR) Jcrespo: [C: -1] "Blocked on db1171 being 100% ready." [puppet] - https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) (owner: Jcrespo)
[10:42:56] <wikibugs>	 (CR) Giuseppe Lavagetto: Add the 'uid' template helper (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) (owner: Giuseppe Lavagetto)
[10:43:16] <logmsgbot>	 !log elukey@deploy1001 Started deploy [analytics/refinery@8b8f0cf]: Weekly deployment
[10:43:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:25] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (Gehel) Removing Search Platform, the remaining work is under the control of DC-Ops.
[10:43:38] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (jcrespo)
[10:43:45] <wikibugs>	 (CR) Kormat: [C: +2] integration_env: Add 'dbvers' command [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661344 (owner: Kormat)
[10:46:13] <wikibugs>	 (Merged) jenkins-bot: integration_env: Add 'dbvers' command [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661344 (owner: Kormat)
[10:47:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 85%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14161 and previous config saved to /var/cache/conftool/dbconfig/20210203-104733-root.json
[10:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:10] <wikibugs>	 (PS1) Kormat: integration_env: Use better names for paths. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661346
[10:50:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 100%: Slowly pooling db1093 after cloning db1173', diff saved to https://phabricator.wikimedia.org/P14162 and previous config saved to /var/cache/conftool/dbconfig/20210203-105057-root.json
[10:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:50] <wikibugs>	 (CR) Kormat: [C: +2] integration_env: Use better names for paths. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661346 (owner: Kormat)
[10:53:27] <wikibugs>	 (PS2) Elukey: Add eventstreams-internal to service_catalog [puppet] - https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160)
[10:54:06] <wikibugs>	 (CR) Elukey: Add eventstreams-internal to service_catalog (2 comments) [puppet] - https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: Elukey)
[10:54:22] <wikibugs>	 (PS1) David Caro: wmcs.backups: Use the wmcs-backup script for vms [puppet] - https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692)
[10:54:22] <logmsgbot>	 !log elukey@deploy1001 Finished deploy [analytics/refinery@8b8f0cf]: Weekly deployment (duration: 11m 06s)
[10:54:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:29] <wikibugs>	 (Merged) jenkins-bot: integration_env: Use better names for paths. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661346 (owner: Kormat)
[10:54:56] <wikibugs>	 (PS1) Kormat: integration_env: Make deployment more configurable [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661349
[10:55:51] <wikibugs>	 (PS2) Kormat: integration_env: Make deployment more configurable [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661349 (https://phabricator.wikimedia.org/T265266)
[10:55:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs.backups: Use the wmcs-backup script for vms [puppet] - https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692) (owner: David Caro)
[10:58:50] <wikibugs>	 (CR) Kormat: [C: +2] integration_env: Make deployment more configurable [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661349 (https://phabricator.wikimedia.org/T265266) (owner: Kormat)
[11:00:02] <wikibugs>	 (CR) Marostegui: "Reminder (cause I tend to forget it): remove from tendril and zarcillo" [puppet] - https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) (owner: Jcrespo)
[11:00:22] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:01:20] <wikibugs>	 (Merged) jenkins-bot: integration_env: Make deployment more configurable [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/661349 (https://phabricator.wikimedia.org/T265266) (owner: Kormat)
[11:02:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Slowly pool db1174 into s7', diff saved to https://phabricator.wikimedia.org/P14163 and previous config saved to /var/cache/conftool/dbconfig/20210203-110236-root.json
[11:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:44] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:05:09] <wikibugs>	 (PS2) David Caro: wmcs.backups: Use the wmcs-backup script for vms [puppet] - https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692)
[11:06:28] <wikibugs>	 SRE, serviceops, Performance-Team (Radar), Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (Legoktm) I generated some [[http://www.brendangregg.c...
[11:09:42] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add the 'uid' template helper [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967)
[11:09:44] <wikibugs>	 (PS3) Giuseppe Lavagetto: Remove the build image functionality [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/660852
[11:09:46] <wikibugs>	 (PS2) Giuseppe Lavagetto: Allow running tests on an image once it's built [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427)
[11:15:40] <wikibugs>	 (CR) Jbond: [C: +2] stdlib: update to 6.6.0 [puppet] - https://gerrit.wikimedia.org/r/661118 (owner: Jbond)
[11:16:01] <wikibugs>	 (PS1) Legoktm: logos: Update nowiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661350
[11:16:03] <wikibugs>	 (PS1) Legoktm: logos: Update cawiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661351
[11:16:05] <wikibugs>	 (PS1) Legoktm: logos: Update fiwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661352
[11:16:07] <wikibugs>	 (PS1) Legoktm: logos: Update ukwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661353
[11:16:09] <wikibugs>	 (PS1) Legoktm: logos: Update cswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661354
[11:16:11] <wikibugs>	 (PS1) Legoktm: logos: Update huwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661355
[11:16:13] <wikibugs>	 (PS1) Legoktm: logos: Update trwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661356
[11:17:44] <icinga-wm>	 RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[11:20:22] <jbond42>	 !log update puppetlabs-stdlib to v6.6.0
[11:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:50] <wikibugs>	 (Abandoned) Volans: depool eqsin, availability issue [dns] - https://gerrit.wikimedia.org/r/661338 (owner: Volans)
[11:21:42] <wikibugs>	 SRE, Traffic: Investigate unusual media traffic pattern - https://phabricator.wikimedia.org/T273741 (Joe)
[11:21:52] <wikibugs>	 SRE, Traffic: Investigate unusual media traffic pattern - https://phabricator.wikimedia.org/T273741 (Joe) p:Triage→Medium
[11:22:47] <wikibugs>	 (CR) Jbond: [C: +1] profile::redis::multidc: hiera -> lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659392 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[11:25:55] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Hnowlan New buster master, not in use https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:55] <icinga-wm>	 ACKNOWLEDGEMENT - Maps HTTPS on maps1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.006 second response time Hnowlan New buster master, not in use https://wikitech.wikimedia.org/wiki/Maps/RunBook
[11:25:55] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra CQL 10.64.32.8:9042 on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 9042: Connection refused Hnowlan New buster master, not in use https://phabricator.wikimedia.org/T93886
[11:25:55] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra service on maps1009 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running Hnowlan New buster master, not in use https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:25:55] <icinga-wm>	 ACKNOWLEDGEMENT - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused Hnowlan New buster master, not in use https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:27:45] <wikibugs>	 (03CR) 10Jbond: "LGTM but wonder if we can just drop this script, adding moritz" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[11:29:00] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE ayounsi DE-CIX maintenance https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:35:07] <wikibugs>	 (03PS5) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294
[11:36:40] <wikibugs>	 (03CR) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro)
[11:38:58] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm)
[11:39:43] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[11:40:47] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] last-puppet-run: don't crash if puppet has not run yet [puppet] - 10https://gerrit.wikimedia.org/r/641207 (owner: 10David Caro)
[11:46:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Allow running tests on an image once it's built (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto)
[11:46:47] <wikibugs>	 (03CR) 10Muehlenhoff: ldap/ldaplist.py: Port for Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[11:55:06] <wikibugs>	 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Aklapper)
[11:55:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[11:56:00] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::prod_sites: move to dumb templates [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305)
[11:58:18] <icinga-wm>	 PROBLEM - SSH on mw2249.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:59:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27834/console" [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[11:59:32] <wikibugs>	 10Puppet, 10SRE, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond)
[11:59:41] <wikibugs>	 10Puppet, 10SRE, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) p:05Triage→03Medium
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1200).
[12:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[12:01:56] <wikibugs>	 10Puppet, 10SRE, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) A new function (`stdlib::ensure`) has now [[ https://github.com/puppetlabs/puppetlabs-stdlib/pull/1150 | been added ]] to stdlib; as such we can replace our `ensure_{servi...
[12:03:44] <wikibugs>	 10Puppet, 10SRE, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) Stdlib now has `Stdlib::HTTPStatus` [[ https://github.com/puppetlabs/puppetlabs-stdlib/pull/1132 | types ]]; as such we should update our code base to use them instead of `W...
[12:04:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::prod_sites: move to dumb templates [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[12:12:00] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add eventstreams-internal to service_catalog [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey)
[12:12:04] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.533 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[12:19:02] <moritzm>	 !log installing openldap security updates on LDAP replicas
[12:19:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:36] <wikibugs>	 (03CR) 10Jbond: "lg, minor nit inline, CI error seems related to https://phabricator.wikimedia.org/T272985" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott)
[12:22:32] <jbond42>	 !log disable puppet fleet wide to reboot puppetmaster,puppetdb
[12:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:46] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 59, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:25:22] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb1002.eqiad.wmnet
[12:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:33] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb2002.codfw.wmnet
[12:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backups: Use the wmcs-backup script for vms [puppet] - 10https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692) (owner: 10David Caro)
[12:28:03] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetdb1002.eqiad.wmnet
[12:28:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:22] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1001.eqiad.wmnet
[12:29:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:29] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1002.eqiad.wmnet
[12:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:36] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1003.eqiad.wmnet
[12:29:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:37] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1003.eqiad.wmnet
[12:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:09] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage db1173 [puppet] - 10https://gerrit.wikimedia.org/r/661359
[12:34:17] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1002.eqiad.wmnet
[12:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:21] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2003.codfw.wmnet
[12:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:32] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2002.codfw.wmnet
[12:34:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:01] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2002.codfw.wmnet
[12:35:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1173 [puppet] - 10https://gerrit.wikimedia.org/r/661359 (owner: 10Marostegui)
[12:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:05] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster1001.eqiad.wmnet
[12:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:21] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2001.codfw.wmnet
[12:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:30] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2003.codfw.wmnet
[12:38:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:50] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2002.codfw.wmnet
[12:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:31] <wikibugs>	 10Puppet, 10SRE, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10Aklapper)
[12:43:37] <wikibugs>	 (03CR) 10Muehlenhoff: ldap/ldaplist.py: Port for Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[12:46:10] <wikibugs>	 (03PS1) 10Jbond: tests: fix dependencies for tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661362
[12:46:37] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster2001.codfw.wmnet
[12:46:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:04] <jbond42>	 in that case I'll enable puppet fleet-wide after rebooting the master and db servers
[12:49:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] tests: fix dependencies for tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661362 (owner: 10Jbond)
[12:49:36] <wikibugs>	 (03PS9) 10Jbond: nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288)
[12:49:47] <wikibugs>	 (03PS3) 10Jbond: Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott)
[12:51:51] <wikibugs>	 (03CR) 10Jbond: "Ready for review" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond)
[12:52:05] <wikibugs>	 (03CR) 10Jbond: "CI tests fixed" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott)
[12:53:17] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: httpbb-tests: Fix body assertion on ombudswiki [puppet] - 10https://gerrit.wikimedia.org/r/661363
[13:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1300)
[13:02:51] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs.backups: Use the wmcs-backup script for vms [puppet] - 10https://gerrit.wikimedia.org/r/661348 (https://phabricator.wikimedia.org/T260692) (owner: 10David Caro)
[13:03:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb-tests: Fix body assertion on ombudswiki [puppet] - 10https://gerrit.wikimedia.org/r/661363 (owner: 10Giuseppe Lavagetto)
[13:06:29] <wikibugs>	 10Puppet, 10SRE, 10User-jbond: Replace crons in puppet with systemd timer - https://phabricator.wikimedia.org/T273753 (10Ladsgroup)
[13:06:54] <Amir1>	 jbond42: hey, I split the OKR ticket to the systemd timer so it'd be a little cleaner ^
[13:08:25] <jbond42>	 Amir1: mutante: already beat you to it :) https://phabricator.wikimedia.org/T273673
[13:08:38] <Amir1>	 🤦 sorry
[13:08:45] <jbond42>	 :) np
[13:09:05] <wikibugs>	 10Puppet, 10SRE, 10User-jbond: Replace crons in puppet with systemd timer - https://phabricator.wikimedia.org/T273753 (10Ladsgroup)
[13:09:07] <wikibugs>	 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Ladsgroup)
[13:10:16] <wikibugs>	 (03CR) 10Muehlenhoff: "The old cron needs to be absented." [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[13:10:36] <wikibugs>	 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Gilles) Might be worth looking at the full unprocessed request headers? Do you have an example?
[13:12:31] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader1001.wikimedia.org
[13:12:33] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader2001.wikimedia.org
[13:12:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:15] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1001.wikimedia.org
[13:14:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:02] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 69541976 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:16:45] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2001.wikimedia.org
[13:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:33] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host xhgui2001.codfw.wmnet
[13:17:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:42] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host xhgui1001.eqiad.wmnet
[13:17:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:02] <wikibugs>	 (03PS1) 10Jbond: trafficserver: migrate from wmflib::HttpStatus to Stdlib::HttpStatus [puppet] - 10https://gerrit.wikimedia.org/r/661365
[13:19:34] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 333976 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:20:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27835/console" [puppet] - 10https://gerrit.wikimedia.org/r/661365 (owner: 10Jbond)
[13:20:23] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] trafficserver: migrate from wmflib::HttpStatus to Stdlib::HttpStatus [puppet] - 10https://gerrit.wikimedia.org/r/661365 (owner: 10Jbond)
[13:27:46] <wikibugs>	 (03PS1) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367
[13:28:14] <wikibugs>	 (03CR) 10Jbond: "PCC running (https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27836)" [puppet] - 10https://gerrit.wikimedia.org/r/661367 (owner: 10Jbond)
[13:29:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui2001.codfw.wmnet
[13:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T266483', diff saved to https://phabricator.wikimedia.org/P14164 and previous config saved to /var/cache/conftool/dbconfig/20210203-132938-marostegui.json
[13:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:42] <stashbot>	 T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[13:30:24] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui1001.eqiad.wmnet
[13:30:24] <marostegui>	 !log Stop mysql on db1120 to enable report_host T266483
[13:30:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (owner: 10Jbond)
[13:30:57] <wikibugs>	 (03PS7) 10David Caro: puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008
[13:31:06] <icinga-wm>	 PROBLEM - Check systemd state on xhgui2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:14] <icinga-wm>	 PROBLEM - Check systemd state on xhgui1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14165 and previous config saved to /var/cache/conftool/dbconfig/20210203-133350-root.json
[13:33:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:00] <wikibugs>	 (03PS1) 10Jbond: wmflib: drop ensure_directory in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661368
[13:36:25] <wikibugs>	 (03CR) 10Jbond: "PCC running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27837" [puppet] - 10https://gerrit.wikimedia.org/r/661368 (owner: 10Jbond)
[13:38:08] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: use /monitoring/frontend for Swift's internal svc health checks [puppet] - 10https://gerrit.wikimedia.org/r/661369 (https://phabricator.wikimedia.org/T273453)
[13:40:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro)
[13:41:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: apply interface::rps to i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661054 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi)
[13:41:16] <wikibugs>	 (03PS2) 10Filippo Giunchedi: swift: apply interface::rps to i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661054 (https://phabricator.wikimedia.org/T271415)
[13:46:28] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I didn't test it but LGTM, it should work." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond)
[13:47:50] <wikibugs>	 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 9 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (10Marostegui)
[13:48:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14166 and previous config saved to /var/cache/conftool/dbconfig/20210203-134854-root.json
[13:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:02] <wikibugs>	 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (10Trizek-WMF) Wow, this will happen soon!   If I read it correctly, it will disturb ContentTranslation, Flow, Echo (all Echo notifications at en.wp and all X-wiki notificati...
[13:55:04] <wikibugs>	 (03PS8) 10David Caro: puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008
[13:58:43] <godog>	 !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837
[13:58:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:48] <stashbot>	 T272837:  Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[13:59:25] <wikibugs>	 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (10Marostegui) >>! In T273758#6800123, @Trizek-WMF wrote: > Wow, this will happen soon!   It will happen in 14 days.  >  > If I read it correctly, it will disturb ContentTran...
[14:00:04] <jouncebot>	 hashar and dancy: May I have your attention please! Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1400)
[14:03:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 20%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14167 and previous config saved to /var/cache/conftool/dbconfig/20210203-140357-root.json
[14:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:30] <wikibugs>	 (03PS2) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367
[14:07:32] <wikibugs>	 (03PS1) 10Jbond: zuul::server: Add types [puppet] - 10https://gerrit.wikimedia.org/r/661371
[14:19:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14168 and previous config saved to /var/cache/conftool/dbconfig/20210203-141901-root.json
[14:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:55] <godog>	 !log test memory limits on swift-object-replicator on ms-be2050 - T221904
[14:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:00] <stashbot>	 T221904: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904
[14:20:27] <moritzm>	 !log installing openldap security updates on serpens/seaborgium
[14:20:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:51] <wikibugs>	 (03PS1) 10Jbond: wmflib: drop ensure_link in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743)
[14:24:58] <wikibugs>	 (03PS2) 10Jbond: wmflib: drop ensure_directory in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661368 (https://phabricator.wikimedia.org/T273743)
[14:25:07] <wikibugs>	 (03PS3) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743)
[14:25:30] <wikibugs>	 (03CR) 10Jbond: "PCC running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27838" [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond)
[14:26:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 247223448 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:28:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 417912 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:28:44] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661373 (https://phabricator.wikimedia.org/T273554)
[14:28:46] <icinga-wm>	 RECOVERY - Long running screen/tmux on snapshot1009 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[14:28:46] <icinga-wm>	 RECOVERY - Long running screen/tmux on snapshot1010 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[14:31:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond)
[14:32:18] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Allow running tests on an image once it's built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto)
[14:32:35] <wikibugs>	 (03Merged) 10jenkins-bot: nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond)
[14:33:22] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add the 'uid' template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) (owner: 10Giuseppe Lavagetto)
[14:34:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14169 and previous config saved to /var/cache/conftool/dbconfig/20210203-143404-root.json
[14:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:48] <wikibugs>	 (03PS4) 10Jbond: Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott)
[14:36:30] <wikibugs>	 (03CR) 10Jbond: "Gonna merge this as I'm doing a release" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott)
[14:36:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott)
[14:37:27] <wikibugs>	 (03Merged) 10jenkins-bot: Detect hosts under .wikimedia.cloud as 'labs' VMs. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/661278 (https://phabricator.wikimedia.org/T273706) (owner: 10Andrew Bogott)
[14:38:30] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=True; selector: dnsdisc=similar-users
[14:38:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:06] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=True; selector: dnsdisc=linkrecommendation
[14:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:47] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: disc_desired_state: Add linkrecommendation/similar-users [puppet] - 10https://gerrit.wikimedia.org/r/661375
[14:41:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] disc_desired_state: Add linkrecommendation/similar-users [puppet] - 10https://gerrit.wikimedia.org/r/661375 (owner: 10Alexandros Kosiaris)
[14:43:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/661375 (owner: 10Alexandros Kosiaris)
[14:45:28] <wikibugs>	 (03CR) 10Jbond: "Need another PCC run on this to take account of the zuul::server ps" [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond)
[14:45:49] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "proposed health check endpoint behaves as expected and LVS config seems sane" [puppet] - 10https://gerrit.wikimedia.org/r/661369 (https://phabricator.wikimedia.org/T273453) (owner: 10Filippo Giunchedi)
[14:46:46] <wikibugs>	 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe)
[14:49:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14170 and previous config saved to /var/cache/conftool/dbconfig/20210203-144908-root.json
[14:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:44] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:14] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:08] <wikibugs>	 (03PS16) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922)
[15:04:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repool db1120 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14171 and previous config saved to /var/cache/conftool/dbconfig/20210203-150411-root.json
[15:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:18] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:13:31] <wikibugs>	 (03PS3) 10Jbond: (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066)
[15:14:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond)
[15:17:10] <wikibugs>	 (03PS1) 10DCausse: [cirrus] rename ores_articletopics -> weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661383 (https://phabricator.wikimedia.org/T273508)
[15:17:12] <wikibugs>	 (03PS1) 10DCausse: [cirrus] drop deprecated ores_articletopics config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661384 (https://phabricator.wikimedia.org/T273508)
[15:18:29] <moritzm>	 !log installing ca-certificates update for buster (reverting the Symantec CA blacklist, related to GeoTrust CA)
[15:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:57] <wikibugs>	 (03PS1) 10Elukey: Add eventstreams-internal VIP DNS config [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160)
[15:24:06] <wikibugs>	 (03CR) 10Zfilipin: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani)
[15:25:20] <wikibugs>	 (03PS2) 10Elukey: Add eventstreams-internal VIP DNS config [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160)
[15:26:24] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro)
[15:32:24] <wikibugs>	 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) 05Resolved→03Open a:05ayounsi→03Volans Re-opening as we're aiming to implement it this quarter.
[15:32:42] <volans>	 !log disabling puppet on install1003 for a quick test for T221388
[15:32:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:59] <stashbot>	 T221388: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388
[15:34:01] <wikibugs>	 (03PS1) 10Razzi: hadoop: Add hiera setting to symlink hadoop logs to /var/log/hadoop [puppet] - 10https://gerrit.wikimedia.org/r/661391 (https://phabricator.wikimedia.org/T265126)
[15:34:23] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM, but merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/661067/ first" [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey)
[15:37:18] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro)
[15:37:35] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add conftool data for eventstreams-internal (new VIP) [puppet] - 10https://gerrit.wikimedia.org/r/661067 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey)
[15:40:54] <wikibugs>	 (03PS3) 10Elukey: Add eventstreams-internal VIP DNS config [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160)
[15:41:05] <wikibugs>	 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10klausman) All machines are now base installed (puppet-runs done with `insetup`).
[15:41:18] <wikibugs>	 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10klausman) All machines are now base installed (puppet-runs done with `insetup`).
[15:41:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib: drop ensure_directory in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661368 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond)
[15:42:12] <wikibugs>	 (03PS4) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743)
[15:42:36] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:49] <wikibugs>	 (03PS5) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743)
[15:42:57] <wikibugs>	 (03PS2) 10Jbond: zuul::server: Add types [puppet] - 10https://gerrit.wikimedia.org/r/661371
[15:43:13] <wikibugs>	 (03PS6) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743)
[15:43:48] <wikibugs>	 (03CR) 10Jbond: "new pcc (running) https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27844/" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[15:44:54] <icinga-wm>	 PROBLEM - configured eth on sretest1001 is CRITICAL: ens2f1 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[15:45:03] <wikibugs>	 (03CR) 10Jbond: "PCC (running) https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27844/console" [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond)
[15:45:23] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[15:46:35] <hnowlan>	 !log one-off installing imposm3 on maps1009 
[15:46:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:49] <wikibugs>	 (03Merged) 10jenkins-bot: puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro)
[15:48:09] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
[15:48:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:20] <moritzm>	 !log draining ganeti4001 for eventual reboot
[15:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:23] <wikibugs>	 (03PS1) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/661396 (https://phabricator.wikimedia.org/T79922)
[15:50:02] <wikibugs>	 (03PS17) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922)
[15:50:36] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet
[15:50:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:51] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host labweb1001.wikimedia.org
[15:52:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:00] <wikibugs>	 (03PS18) 10Jcrespo: Bacula: Start using new storage/pools for es database content backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922)
[15:54:35] <icinga-wm>	 RECOVERY - Check systemd state on doc2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:38] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti4001.ulsfo.wmnet
[15:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:06] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host labweb1001.wikimedia.org
[15:57:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:12] <wikibugs>	 (03CR) 10Jcrespo: "I have split the starting-to-become-quite-large commit into 2 smaller ones. I would like to start deploying to check I am not breaking anything," [puppet] - 10https://gerrit.wikimedia.org/r/661396 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo)
[15:59:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4001.ulsfo.wmnet
[15:59:30] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host labweb1002.wikimedia.org
[15:59:31] <wikibugs>	 (03CR) 10Cwhite: stdlib: update to 6.6.0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661118 (owner: 10Jbond)
[15:59:47] <stashbot>	 jmm@cumin2001: Failed to log message to wiki. Somebody should check the error logs.
[15:59:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:04] <moritzm>	 !log draining ganeti4003 for eventual reboot
[16:00:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:01] <icinga-wm>	 PROBLEM - dhclient process on sretest1001 is CRITICAL: PROCS CRITICAL: 3 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[16:03:03] <wikibugs>	 (03PS1) 10Jbond: stdlib: fix metadata version [puppet] - 10https://gerrit.wikimedia.org/r/661402
[16:03:48] <wikibugs>	 (03CR) 10Jbond: stdlib: update to 6.6.0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661118 (owner: 10Jbond)
[16:04:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] stdlib: fix metadata version [puppet] - 10https://gerrit.wikimedia.org/r/661402 (owner: 10Jbond)
[16:05:02] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host labweb1002.wikimedia.org
[16:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:10] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add eventstreams-internal VIP DNS config [dns] - 10https://gerrit.wikimedia.org/r/661386 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey)
[16:06:02] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/661396 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo)
[16:06:53] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:06:56] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti4003.ulsfo.wmnet
[16:06:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:36] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams-internal
[16:08:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:17] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond)
[16:11:52] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond)  Wmflib::Php_version is probably a bit specific to go to stdlib but we should move it to the php module.
[16:12:06] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond)
[16:12:08] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4003.ulsfo.wmnet
[16:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:14] <volans>	 !log enabled puppet on install1003 after the test T221388
[16:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:19] <stashbot>	 T221388: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388
[16:13:50] <moritzm>	 !log failover ganeti master in ulsfo to ganeti4003
[16:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:12] <moritzm>	 !log draining ganeti4002 for eventual reboot
[16:16:21] <wikibugs>	 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) I tested the config with:  ` host sretest1001 {     host-identifier option agent.circuit-id "ge-3/0/15.0:private1-d-eqiad";      fixed-address sretest1001.eqiad.wmnet; } `  And it seemed to work as expected. I need to p...
[16:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:57] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti4002.ulsfo.wmnet
[16:18:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:29] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host miscweb1002.eqiad.wmnet
[16:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:02] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb1002.eqiad.wmnet
[16:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:14] <wikibugs>	 (03PS1) 10Jbond: stdlib: fix refrences and changelog [puppet] - 10https://gerrit.wikimedia.org/r/661406
[16:22:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] stdlib: fix refrences and changelog [puppet] - 10https://gerrit.wikimedia.org/r/661406 (owner: 10Jbond)
[16:22:55] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host planet2002.codfw.wmnet
[16:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:54] <wikibugs>	 (03CR) 10CRusnov: "Sounds good, marking this and ldapsupportlib.py willnotport. I'll leave these branches alive though just in case since the work is already" [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:24:07] <wikibugs>	 (03CR) 10CRusnov: [C: 04-1] ldap/ldaplist.py: Port for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:24:20] <wikibugs>	 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH)
[16:24:33] <wikibugs>	 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH)
[16:24:53] <wikibugs>	 (03CR) 10CRusnov: [C: 04-1] "see comments in https://gerrit.wikimedia.org/r/c/operations/puppet/+/658427" [puppet] - 10https://gerrit.wikimedia.org/r/658360 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:24:59] <wikibugs>	 (03PS1) 10Elukey: Remove dns-disc config for eventstreams-internal [dns] - 10https://gerrit.wikimedia.org/r/661407
[16:26:21] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Remove dns-disc config for eventstreams-internal [dns] - 10https://gerrit.wikimedia.org/r/661407 (owner: 10Elukey)
[16:26:26] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4002.ulsfo.wmnet
[16:26:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:49] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet2002.codfw.wmnet
[16:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:54] <wikibugs>	 (03CR) 10Jbond: "another pcc https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27848/" [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond)
[16:28:24] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host planet1002.eqiad.wmnet
[16:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:45] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet
[16:29:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:30] <wikibugs>	 (03PS3) 10Elukey: Add eventstreams-internal to service_catalog [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160)
[16:32:30] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet1002.eqiad.wmnet
[16:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:15] <icinga-wm>	 RECOVERY - dhclient process on sretest1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[16:33:16] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host peek2001.codfw.wmnet
[16:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add eventstreams-internal to service_catalog [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey)
[16:34:19] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet
[16:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:30] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host people2001.codfw.wmnet
[16:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:24] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in codfw [puppet] - 10https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904)
[16:35:37] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host peek2001.codfw.wmnet
[16:35:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:49] <wikibugs>	 (03PS2) 10Elukey: role::kubernetes::worker: add empty stanza for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661072 (https://phabricator.wikimedia.org/T269160)
[16:36:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::kubernetes::worker: add empty stanza for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661072 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey)
[16:36:53] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2001.codfw.wmnet
[16:36:57] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10georginaburnett-wmde)
[16:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:01] <logmsgbot>	 !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf] (nvironment): imposm Deploy Tilerator build for buster machines
[16:37:03] <logmsgbot>	 !log mbsantos@deploy1001 deploy aborted: imposm Deploy Tilerator build for buster machines (duration: 00m 03s)
[16:37:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:20] <logmsgbot>	 !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf] (imposm): Deploy Tilerator build for buster machines
[16:37:23] <logmsgbot>	 !log mbsantos@deploy1001 deploy aborted: Deploy Tilerator build for buster machines (duration: 00m 03s)
[16:37:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:48] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond)
[16:42:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Update account meta data for mraish [puppet] - 10https://gerrit.wikimedia.org/r/661410
[16:44:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update account meta data for mraish [puppet] - 10https://gerrit.wikimedia.org/r/661410 (owner: 10Muehlenhoff)
[16:44:33] <logmsgbot>	 !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf] (beta): (no justification provided)
[16:44:34] <logmsgbot>	 !log mbsantos@deploy1001 deploy aborted: (no justification provided) (duration: 00m 01s)
[16:44:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi)
[16:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:51] <logmsgbot>	 !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf] (imposm): (no justification provided)
[16:44:51] <logmsgbot>	 !log mbsantos@deploy1001 deploy aborted: (no justification provided) (duration: 00m 00s)
[16:44:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:50] <wikibugs>	 (03CR) 10Muehlenhoff: "Ack, then I'll update the patch to move you to the cn=wmf LDAP group instead." [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani)
[16:52:59] <wikibugs>	 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) Adding the `circuit-id prefix host-name` setting and removing the `remote-id` that we're not gonna use, the circuit ID includes the switch hostname too, so becoming `asw2-d-eqiad:ge-3/0/15.0:private1-d-eqiad`. That shou...
[16:53:06] <wikibugs>	 (03PS2) 10Muehlenhoff: Offboard zfilipin from Release Engineering [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani)
[16:53:55] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) Next step is https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_ba...
[16:55:54] <wikibugs>	 (03PS3) 10Muehlenhoff: Offboard zfilipin from Release Engineering [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani)
[16:56:52] <wikibugs>	 (03PS1) 10Jbond: nutcracker: drop use of to_milliseconds function [puppet] - 10https://gerrit.wikimedia.org/r/661414 (https://phabricator.wikimedia.org/T273743)
[16:56:54] <wikibugs>	 (03PS1) 10Jbond: wmflib: drop to_seconds and to_milliseconds [puppet] - 10https://gerrit.wikimedia.org/r/661415 (https://phabricator.wikimedia.org/T273743)
[16:58:02] <wikibugs>	 (03PS5) 10Jbond: (WIP) ssl: new ssl module intialy planned to replace ssl_ciphersuite() [puppet] - 10https://gerrit.wikimedia.org/r/640480 (https://phabricator.wikimedia.org/T273743)
[17:00:45] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.046 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[17:01:15] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams-internal
[17:01:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:53] <icinga-wm>	 PROBLEM - dhclient process on sretest1001 is CRITICAL: PROCS CRITICAL: 1 process with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[17:15:52] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+1] Offboard zfilipin from Release Engineering [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani)
[17:37:58] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[17:38:36] <wikibugs>	 (03PS1) 10Urbanecm: [WIP] Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126)
[17:38:38] <wikibugs>	 (03PS1) 10Hnowlan: conftool: restore maps1009 to kartotherian pool [puppet] - 10https://gerrit.wikimedia.org/r/661420
[17:41:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "DNM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) (owner: 10Urbanecm)
[17:41:21] <wikibugs>	 (03CR) 10Hashar: "I have deleted the couple of comments that mentioned an unrelated pcc run (that was a good excuse for me to try deleting a comment)." [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[17:44:10] * Urbanecm staging at mwdebug1003
[17:45:51] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10WMDE-leszek)
[17:46:00] * Urbanecm done
[17:48:54] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-Legal: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10WMDE-leszek) As an Engineering Manager at WMDE, I approve this request and confirm Georgina's affiliation with WMDE. Tagging #wmf-legal as well to ensure the requi...
[17:50:21] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-Legal: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10WMDE-leszek) @Aklapper this message above from Herald seems like something relatively new? Please advise if we should reach out to WMF Legal for NDA and related to...
[17:53:51] <wikibugs>	 (03CR) 10David Caro: "@Volans anything else needed for this?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro)
[17:58:28] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[18:02:25] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661426
[18:04:12] <icinga-wm>	 RECOVERY - SSH on mw2249.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:06:12] <wikibugs>	 (03PS3) 10Jbond: zuul::server: Add types [puppet] - 10https://gerrit.wikimedia.org/r/661371
[18:06:25] <wikibugs>	 (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[18:07:38] <wikibugs>	 (03CR) 10Jbond: ">  Class[Profile::Zuul::Server]:" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[18:10:06] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661426 (owner: 10PipelineBot)
[18:11:32] <wikibugs>	 (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661426 (owner: 10PipelineBot)
[18:13:50] <logmsgbot>	 !log dduvall@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[18:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm... and just a comment outside the scope of this, I wonder if a type for email addresses would make sense" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[18:16:14] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "> Patch Set 6:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro)
[18:18:23] <wikibugs>	 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) The config for the above test used:  ` circuit-id {     prefix {         host-name;     } } `
[18:22:19] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Thanks @razzi" [puppet] - 10https://gerrit.wikimedia.org/r/661209 (https://phabricator.wikimedia.org/T273004) (owner: 10Razzi)
[18:23:48] <logmsgbot>	 !log dduvall@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[18:23:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:43] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:26:24] <logmsgbot>	 !log dduvall@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[18:26:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:33] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:30:41] <mutante>	 volans: have you seen this one before?  sudo ipmi-chassis --get-chassis-status  -> "ipmi_cmd_get_chassis_status: bad completion code" ?
[18:30:54] <mutante>	 that's not the one we list as the "typical" error
[18:32:36] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test cluster: Reboot kafka nodes - razzi@cumin1001
[18:32:36] <mutante>	 the "diff" config command is also unusual, not empty and not a diff but instead "Unable to get Number of Users". this is broken in a new way
[18:32:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:18] <mutante>	 yea, I'm making a ticket because I also can't ssh to mgmt now. we'll see there
[18:38:37] <wikibugs>	 10ops-codfw, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn)
[18:38:59] <wikibugs>	 10ops-codfw, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn) p:05Triage→03Medium
[18:39:23] <wikibugs>	 (03PS3) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[18:39:31] <wikibugs>	 (03Abandoned) 10Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper)
[18:39:39] <wikibugs>	 10ops-codfw, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn)
[18:39:42] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn)
[18:41:08] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:41:10] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[18:41:37] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[18:43:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "Alright, I see. After seeing your recent ticket about upstreaming wmflib types to stdlib I guess it should be done upstream right away then." [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[18:44:29] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2281.codfw.wmnet with reason: REIMAGE
[18:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:40] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2282.codfw.wmnet with reason: REIMAGE
[18:44:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:47] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] profile::redis::multidc: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659392 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:44:56] <wikibugs>	 (03CR) 10CDanis: swift: limit rsync and swift-object-replicator memory to 5% in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi)
[18:45:13] <wikibugs>	 10SRE, 10ops-eqiad, 10User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6  (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (10Cmjohnson) I am going to be out that week...can you try and coordinate with @Jclark-ctr
[18:45:55] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Radar: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10Cmjohnson) 05Open→03Resolved done
[18:45:59] <wikibugs>	 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10Cmjohnson) ok
[18:46:34] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2281.codfw.wmnet with reason: REIMAGE
[18:46:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:43] <wikibugs>	 (03CR) 10Dzahn: "noop on mc2026" [puppet] - 10https://gerrit.wikimedia.org/r/659392 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:47:37] <wikibugs>	 (03PS3) 10Bstorm: Revert "dumps: fail over dumps web" [dns] - 10https://gerrit.wikimedia.org/r/660798
[18:47:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] labs_bootstrapvz: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/660953 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:48:13] <wikibugs>	 (03CR) 10Dzahn: "sure, no problem. Just in the cases where it's only <= 4 servers it seemed quicker to manually delete crons than doing a second change to " [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[18:48:30] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2282.codfw.wmnet with reason: REIMAGE
[18:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:37] <wikibugs>	 (03PS4) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[18:49:13] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] Revert "dumps: fail over dumps web" [dns] - 10https://gerrit.wikimedia.org/r/660798 (owner: 10Bstorm)
[18:50:39] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] Revert "dumps-dist: fail over labstore1006 to 1007" [puppet] - 10https://gerrit.wikimedia.org/r/660799 (owner: 10Bstorm)
[18:50:41] <wikibugs>	 (03Abandoned) 10Andrew Bogott: acme-chief designate-sync.py: set ttl to 0 for txt records [puppet] - 10https://gerrit.wikimedia.org/r/655476 (owner: 10Andrew Bogott)
[18:50:55] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Cloud instances: add duplicate hiera settings for profile::base::labs:: settings [puppet] - 10https://gerrit.wikimedia.org/r/661171 (owner: 10Andrew Bogott)
[18:51:36] <wikibugs>	 (03PS5) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[18:52:21] <wikibugs>	 (03PS1) 10Urbanecm: kowiki: Fix wgGEHelpPanelHelpDeskTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661435 (https://phabricator.wikimedia.org/T273799)
[18:53:17] <wikibugs>	 (03PS6) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[18:57:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] labs_bootstrapvz: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/660953 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:58:21] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] kowiki: Fix wgGEHelpPanelHelpDeskTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661435 (https://phabricator.wikimedia.org/T273799) (owner: 10Urbanecm)
[19:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1900).
[19:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:00:04] <jouncebot>	 hashar and dancy: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T1900).
[19:00:18] <Urbanecm>	 I'll sync the patch tg.r just +1'ed
[19:00:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] kowiki: Fix wgGEHelpPanelHelpDeskTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661435 (https://phabricator.wikimedia.org/T273799) (owner: 10Urbanecm)
[19:01:28] <wikibugs>	 (03Merged) 10jenkins-bot: kowiki: Fix wgGEHelpPanelHelpDeskTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661435 (https://phabricator.wikimedia.org/T273799) (owner: 10Urbanecm)
[19:02:02] <wikibugs>	 (03PS3) 10Dzahn: installserver::proxy: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673)
[19:06:54] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 56351f0434be36f4a639f98986d7785dd4d0b14d: kowiki: Fix wgGEHelpPanelHelpDeskTitle (T273799) (duration: 01m 10s)
[19:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:58] <stashbot>	 T273799: The help panel at kowiki does not load - https://phabricator.wikimedia.org/T273799
[19:07:00] * Urbanecm done
[19:07:45] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[19:08:11] <wikibugs>	 (03CR) 10Ryan Kemper: relforge: service impl of relforge100[3,4] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper)
[19:09:23] <wikibugs>	 (03CR) 10Ladsgroup: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[19:10:04] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1334.eqiad.wmnet with reason: REIMAGE
[19:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:11] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1334.eqiad.wmnet with reason: REIMAGE
[19:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:41] <wikibugs>	 (03CR) 10Dzahn: "These are good points, alright, amending to all." [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[19:14:09] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2282.codfw.wmnet'] `  an...
[19:17:09] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2282.codfw.wmnet
[19:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:27] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2281.codfw.wmnet'] `  an...
[19:18:24] <wikibugs>	 (03PS2) 10Dzahn: logging::mediawiki::udp2log: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673)
[19:21:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2281.codfw.wmnet
[19:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:23] <legoktm>	 I'm going to hack on mwdebug1003 for a bit
[19:31:01] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2282.codfw.wmnet
[19:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:53] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper)
[19:34:21] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2281.codfw.wmnet
[19:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:04] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:41:08] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "> >  Class[Zuul::Server]:    <----------- class is here" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond)
[19:45:19] <legoktm>	 ;(done)
[19:46:17] <wikibugs>	 (03CR) 10Gehel: [C: 04-2] "I am now convinced that we should not fix those tests, but fix the production code instead. It seems like a bad practice to do IO in a __s" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro)
[19:50:07] <wikibugs>	 10SRE, 10MediaWiki-Debug-Logger, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10thcipriani) One helpful step might be to have a log of what hostname...
[19:51:39] <wikibugs>	 (03PS1) 10Urbanecm: Set wgGEHelpPanelAskMentor to true for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661448 (https://phabricator.wikimedia.org/T272753)
[19:54:00] <wikibugs>	 (03PS2) 10Dzahn: debmonitor::client: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673)
[19:57:25] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1334.eqiad.wmnet'] `  an...
[20:00:04] <jouncebot>	 hashar and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T2000).
[20:01:33] <hashar>	 ^^ train is blocked
[20:01:54] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1334.eqiad.wmnet
[20:01:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:46] <hashar>	 due to https://phabricator.wikimedia.org/T273242
[20:03:12] <Majavah>	 why is no-one reviewing my patch :-(
[20:03:40] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1334.eqiad.wmnet
[20:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:55] <Majavah>	 hashar are you planning to backport that Echo logging followup?
[20:13:10] <apergos>	 Majavah: your patch was mentioned in the platform engineering element/slack channel and so at least people know. I'm not sure who would pick it up during US work hours, but we'll see.
[20:15:27] <DannyS712>	 Majavah what's the patch? I'm bored
[20:22:09] <Majavah>	 DannyS712: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/661357
[20:23:11] <DannyS712>	 oh, that one - I saw it and didn't understand how FeaturedFeeds worked. Does it do anything other than switch the parser options from user-based to anon?
[20:24:54] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:24:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:26] <Majavah>	 it stops caching a user object, using anon parser is a byproduct of lego.ktm's comment
[20:26:28] <legoktm>	 still pinged :p
[20:26:41] <Majavah>	 :D
[20:26:59] <Majavah>	 do I need to start doing something like le.gok.tm
[20:27:10] <legoktm>	 still pinged 
[20:27:48] <legoktm>	 Cool though, that should be better overall too
[20:34:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:33] <wikibugs>	 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr I am assigning this to @Jclark-ctr
[20:37:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Cmjohnson) @herron I am not sure yet, it's not in the rack. I need to see where it is and I'll get back to you this week w/a plan.
[20:38:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10Cmjohnson) a:05Cmjohnson→03RobH Rob, these are ready for you with the temp password.
[20:44:47] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] "Thanks Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani)
[20:45:57] <wikibugs>	 (03PS2) 10Bstorm: wikireplicas-proxy: add commented examples of depoolings for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476)
[20:47:17] <wikibugs>	 (03PS3) 10Bstorm: wikireplicas-proxy: add commented examples of depoolings for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476)
[20:48:25] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wikireplicas-proxy: add commented examples of depoolings for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm)
[20:50:37] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wikireplicas: deploy a cloud-based query sampler for the replicas [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm)
[20:57:18] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas: add a role for consistency on the querysampler service [puppet] - 10https://gerrit.wikimedia.org/r/661464 (https://phabricator.wikimedia.org/T272723)
[21:00:04] <jouncebot>	 chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210203T2100). Please do the needful.
[21:01:12] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wikireplicas: add a role for consistency on the querysampler service [puppet] - 10https://gerrit.wikimedia.org/r/661464 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm)
[21:03:06] <icinga-wm>	 RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2021-02-03 10:25:36 (1449 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:05:41] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test cluster: Reboot kafka nodes - razzi@cumin1001
[21:05:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:08] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10Cmjohnson) 05Open→03Resolved removed from rack
[21:06:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417 (10Cmjohnson) 05Open→03Resolved removed from rack
[21:06:32] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson)
[21:06:43] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10Cmjohnson)
[21:06:54] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10Cmjohnson) netbox updated and removed from rack
[21:07:10] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10Cmjohnson) 05Open→03Resolved
[21:07:31] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (10Cmjohnson)
[21:07:45] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (10Cmjohnson) Both have been removed from rack and netbox updated
[21:07:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (10Cmjohnson) 05Open→03Resolved
[21:08:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frdb1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T271739 (10Cmjohnson)
[21:08:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frdb1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T271739 (10Cmjohnson) 05Open→03Resolved netbox updated and removed from rack
[21:08:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436 (10Cmjohnson)
[21:09:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436 (10Cmjohnson) 05Open→03Resolved netbox updated and removed from rack
[21:10:09] <wikibugs>	 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10Cmjohnson) @ayounsi Can we do this Friday, 5 Feb 1500UTC?
[21:10:49] <wikibugs>	 10SRE, 10ops-eqiad, 10User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6  (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr Assigning this to @Jclark-ctr
[21:22:22] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas: remove error from the profile for query sampler [puppet] - 10https://gerrit.wikimedia.org/r/661472
[21:24:42] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wikireplicas: remove error from the profile for query sampler [puppet] - 10https://gerrit.wikimedia.org/r/661472 (owner: 10Bstorm)
[21:33:27] <chaomodus>	 !log rebooting Netbox cluster
[21:33:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:28] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas: fix one more typo in the new querysampler service [puppet] - 10https://gerrit.wikimedia.org/r/661478
[21:34:59] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single for host netboxdb2001.codfw.wmnet
[21:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:50] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wikireplicas: fix one more typo in the new querysampler service [puppet] - 10https://gerrit.wikimedia.org/r/661478 (owner: 10Bstorm)
[21:38:59] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Access to Product Superset for Rmurthy - https://phabricator.wikimedia.org/T273813 (10Peachey88)
[21:39:40] <logmsgbot>	 !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2001.codfw.wmnet
[21:39:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Cmjohnson) @Jgreen Do you have an IP identified for these?
[21:40:17] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single for host netbox2001.wikimedia.org
[21:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:54] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas: query sampler subscribe to correct config file. [puppet] - 10https://gerrit.wikimedia.org/r/661485
[21:44:04] <logmsgbot>	 !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox2001.wikimedia.org
[21:44:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:08] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single for host netboxdb1001.eqiad.wmnet
[21:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:46] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:20] <wikibugs>	 (03PS1) 10Wolfgang Kandek: Adding calculator-service to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/661491 (https://phabricator.wikimedia.org/T273151)
[21:50:24] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wikireplicas: query sampler subscribe to correct config file. [puppet] - 10https://gerrit.wikimedia.org/r/661485 (owner: 10Bstorm)
[21:51:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Adding calculator-service to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/661491 (https://phabricator.wikimedia.org/T273151) (owner: 10Wolfgang Kandek)
[21:53:05] <logmsgbot>	 !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1001.eqiad.wmnet
[21:53:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:32] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single for host netbox1001.wikimedia.org
[21:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:28] <wikibugs>	 (03CR) 10Bstorm: "A minor suggestion and one nit. Let me know what you think. Otherwise lgtm" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 (owner: 10David Caro)
[22:06:30] <logmsgbot>	 !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox1001.wikimedia.org
[22:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:18] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:50] <wikibugs>	 (03CR) 10Holger Knust: [C: 03+2] labs: Remove redundant apiportal config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin)
[22:28:10] <wikibugs>	 (03Merged) 10jenkins-bot: labs: Remove redundant apiportal config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin)
[22:29:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Jclark-ctr)
[22:32:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10RobH) a:05RobH→03herron @herron: before I image these, they have an odd hostname of logstash-be103[345], where the codfw logstash ordered in Q2 just have normal logstash2* ho...
[22:33:22] <wikibugs>	 (03PS2) 10Urbanecm: [WIP] Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126)
[22:42:53] <wikibugs>	 (03PS1) 10Urbanecm: bnwiki: wgGEHelpPanelLinks: Remove text in brackets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661522 (https://phabricator.wikimedia.org/T266020)
[22:49:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Jclark-ctr) @Cmjohnson  netbox is updated racked in C6 U40. port 39
[22:50:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Jclark-ctr)
[22:53:42] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka test cluster: Reboot kafka nodes - razzi@cumin1001
[22:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:40] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] installserver::proxy: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[23:12:26] <icinga-wm>	 RECOVERY - Check systemd state on kafka-test1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:07] <wikibugs>	 (03CR) 10Ladsgroup: logging::mediawiki::udp2log: replace cron with timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[23:14:08] <icinga-wm>	 RECOVERY - Check systemd state on kafka-test1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:36] <icinga-wm>	 RECOVERY - Check systemd state on kafka-test1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:00] <wikibugs>	 (03CR) 10Dzahn: logging::mediawiki::udp2log: replace cron with timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[23:24:52] <wikibugs>	 (03PS1) 10Razzi: site: add clouddb1021.eqiad.wmnet to insetup [puppet] - 10https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211)
[23:28:02] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:29:26] <wikibugs>	 (03PS1) 10Razzi: wikireplicas: Add configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211)
[23:29:33] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:31:05] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:31:37] <mutante>	 jouncebot: next
[23:31:37] <jouncebot>	 In 0 hour(s) and 28 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T0000)
[23:34:05] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:34:35] <wikibugs>	 (03PS2) 10Razzi: site: add clouddb1021.eqiad.wmnet to insetup [puppet] - 10https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211)
[23:35:46] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10Jclark-ctr) service pack tool is only available for in warranty devices {F34091626}.   Have reached out to Chris /papaul for...
[23:38:34] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10Dzahn) How about the unchecked boxes like wiping and updating netbox?  I still see it here:  https://netbox.wikimedia.org/dcim/devices/1444/
[23:39:48] <wikibugs>	 (03PS2) 10Razzi: wikireplicas: Add configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211)
[23:40:22] <wikibugs>	 (03PS3) 10Dzahn: logging::mediawiki::udp2log: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673)
[23:43:39] <wikibugs>	 (03CR) 10Razzi: "Once we have this, we should be able to reimage labsdb1012 and rename it to clouddb1021 any time in the next few weeks, the sooner the bet" [puppet] - 10https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi)
[23:46:28] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2280.codfw.wmnet with reason: REIMAGE
[23:46:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:00] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2279.codfw.wmnet with reason: REIMAGE
[23:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:29] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2280.codfw.wmnet with reason: REIMAGE
[23:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:47] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/27852/install2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[23:50:20] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1009 is CRITICAL: 2.768e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009
[23:50:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2279.codfw.wmnet with reason: REIMAGE
[23:50:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:00] <mutante>	 !log installservers: replacing squid proxy logrotate cron with systemd timer
[23:51:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:51] <wikibugs>	 (CR) Dzahn: "how to test after merge:" [puppet] - https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[23:56:59] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1318.eqiad.wmnet with reason: REIMAGE
[23:57:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:03] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1318.eqiad.wmnet with reason: REIMAGE
[23:59:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log