[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201211T0000).
[00:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:03:13] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/647849
[00:05:49] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:11] <wikibugs>	 (PS1) Andrew Bogott: Add cinder logs to central logging [puppet] - https://gerrit.wikimedia.org/r/647850 (https://phabricator.wikimedia.org/T269511)
[00:30:03] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:23] <wikibugs>	 (CR) Ahmon Dancy: [C: -1] New utility macros in templates/_mediawiki-common.tpl (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/647843 (owner: Ahmon Dancy)
[00:37:23] <wikibugs>	 (PS1) BBlack: Temporarily block certain IABot reqs that are broken and spammy [puppet] - https://gerrit.wikimedia.org/r/647854
[00:44:15] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1953602712 and 125 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7075067448 and 484 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8448067536 and 570 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1359659688 and 69 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 67242472 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3890210056 and 264 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2355267096 and 130 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5111040064 and 365 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:43] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 224000 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:57] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 434576 and 162 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:23] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 10504 and 188 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:11] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 47928 and 235 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:01] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 54800 and 286 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:27] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:03] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 30720 and 348 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:03] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 30720 and 348 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:06:04] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to wmf group for Matt Cleinman - https://phabricator.wikimedia.org/T269696 (MattCleinman)
[02:08:39] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to wmf group for Matt Cleinman - https://phabricator.wikimedia.org/T269696 (MattCleinman) Updated the ticket with (I believe) all the info needed. (And updated some erroneous documentation.) Thanks for pointing me in the right direction, @Aklapper   @JoeWalsh...
[02:32:41] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:22] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Add cinder logs to central logging [puppet] - https://gerrit.wikimedia.org/r/647850 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[04:04:25] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:17] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:23] <icinga-wm>	 PROBLEM - SSH on ms-be2032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:12:51] <icinga-wm>	 RECOVERY - SSH on ms-be2032 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:43:55] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:30] <wikibugs>	 (CR) DannyS712: [C: -1] Temporarily block certain IABot reqs that are broken and spammy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/647854 (owner: BBlack)
[05:57:43] <wikibugs>	 Operations, InternetArchiveBot, Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (Tgr)
[05:59:21] <wikibugs>	 Operations, InternetArchiveBot, Platform Engineering, Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (Tgr) Tagging Platform Engineering to get feedback about the optimal way of getting the page source.
[06:21:35] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:21:43] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:49] <icinga-wm>	 PROBLEM - ores on ores2008 is CRITICAL: connect to address 10.192.48.89 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:35:17] <icinga-wm>	 RECOVERY - ores on ores2008 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:40:13] <elukey>	 mmmm the Telia link between eqiad and codfw had maintenance ongoing since an hour ago, then they sent "work done" 
[06:41:02] <elukey>	 Laser receiver power                      :  0.0006 mW / -32.22 dBm
[06:41:10] <elukey>	 this is on the eqiad side
[06:42:28] <elukey>	 on the codfw one is better (reasonably in range)
[06:42:56] <elukey>	 I am going to wait a bit since in theory the maintenance is scheduled up to 8UTC, if still down I'll send an email to Telia
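For reference on the receiver power pasted at 06:41: the two figures are the same measurement in two units, since dBm is power expressed in decibels relative to 1 mW. The sketch below only reproduces that conversion; the function names are illustrative, and it makes no assumption about what range the optic on this link considers acceptable.

```python
import math


def mw_to_dbm(power_mw: float) -> float:
    """Convert a power in milliwatts to dBm (decibels relative to 1 mW)."""
    if power_mw <= 0:
        # log10 is undefined for zero or negative power
        raise ValueError("power must be positive to express it in dBm")
    return 10 * math.log10(power_mw)


def dbm_to_mw(power_dbm: float) -> float:
    """Convert a power in dBm back to milliwatts."""
    return 10 ** (power_dbm / 10)


# Value taken from the eqiad-side reading in the log above
reading_mw = 0.0006
print(f"{reading_mw} mW = {mw_to_dbm(reading_mw):.2f} dBm")
```

Run as-is, this prints `0.0006 mW = -32.22 dBm`, matching the pasted line.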
[06:45:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:51] <wikibugs>	 (PS1) Elukey: presto: reduce the max heap size from 110G to 100G [puppet] - https://gerrit.wikimedia.org/r/647999
[07:01:22] <wikibugs>	 (CR) Elukey: [C: +2] presto: reduce the max heap size from 110G to 100G [puppet] - https://gerrit.wikimedia.org/r/647999 (owner: Elukey)
[07:04:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.presto.roll-restart-workers
[07:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
[07:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:33] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:00] <elukey>	 sent an email to Telia
[07:34:37] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:27] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:04] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:52:19] <wikibugs>	 (PS1) Elukey: admin: remove user legoktm from 'researchers' [puppet] - https://gerrit.wikimedia.org/r/648112 (https://phabricator.wikimedia.org/T268801)
[07:52:48] <wikibugs>	 (CR) Elukey: [C: +2] admin: remove user legoktm from 'researchers' [puppet] - https://gerrit.wikimedia.org/r/648112 (https://phabricator.wikimedia.org/T268801) (owner: Elukey)
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201211T0800)
[08:07:26] <wikibugs>	 (PS1) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121
[08:13:21] <elukey>	 !log restart memcached on mwdebug1002 to pick up the correct port (11210 instead of the default 11211)
[08:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:21] <icinga-wm>	 RECOVERY - Memcached on mwdebug1002 is OK: TCP OK - 0.001 second response time on 10.64.0.46 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[08:34:46] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (toan) >>! In T269777#6683465, @KFrancis wrote: >>>! In T269777#6682253, @jbond wrote: >> @KFrancis Are you able to confirm NDA status for Tobias, thanks >...
[08:35:27] <wikibugs>	 (PS2) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[08:47:55] <wikibugs>	 (PS1) Elukey: Set bigtop 1.5 for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/648132 (https://phabricator.wikimedia.org/T269919)
[08:49:34] <wikibugs>	 (CR) Elukey: [C: +2] Set bigtop 1.5 for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/648132 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[08:51:11] <wikibugs>	 (PS2) Gehel: wdqs: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[08:51:56] <wikibugs>	 (CR) Gehel: "@ryan: I took the liberty to get started on the implementation. This isn't tested at all yet!" [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[08:52:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] wdqs: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[09:06:59] <wikibugs>	 (PS3) Gehel: wdqs: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[09:16:53] <wikibugs>	 (PS1) Elukey: aptrepo: add a bigtop15 component also for Buster [puppet] - https://gerrit.wikimedia.org/r/648138 (https://phabricator.wikimedia.org/T269919)
[09:20:36] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27088/console" [puppet] - https://gerrit.wikimedia.org/r/648138 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[09:23:33] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] aptrepo: add a bigtop15 component also for Buster [puppet] - https://gerrit.wikimedia.org/r/648138 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[09:26:25] <wikibugs>	 Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Gilles) Sure, @WDoranWMF, you can send me a meeting invite for next week. After that I'll be off for 3 wee...
[09:26:34] <elukey>	 !log add thirdparty/bigtop15 to buster-wikimedia
[09:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:50] <wikibugs>	 (CR) Volans: "I don't have the hadoop context to judge the procedure, but cookbook wise it looks ok, few minor/optional things inline. It would be nice " (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[09:38:34] <wikibugs>	 (CR) Gehel: [C: -1] "see comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[09:39:19] <wikibugs>	 Operations, ops-codfw, serviceops, Patch-For-Review, Sustainability (Incident Followup): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Script wmf-auto-reimage was launched b...
[09:52:26] <wikibugs>	 (PS1) JMeybohm: Add calico releases to admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653)
[09:53:46] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
[09:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:17] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE
[09:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:40] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
[09:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:43] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE
[09:57:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:07] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Add calico releases to admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[10:01:39] <icinga-wm>	 PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:53] <wikibugs>	 (PS2) JMeybohm: Add calico releases to admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653)
[10:01:56] <wikibugs>	 Operations, ops-codfw, serviceops, Patch-For-Review, Sustainability (Incident Followup): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['k...
[10:01:59] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:11] <wikibugs>	 (CR) JMeybohm: [C: +2] Add calico releases to admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[10:03:50] <wikibugs>	 (Merged) jenkins-bot: Add calico releases to admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[10:25:52] <wikibugs>	 (PS3) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[10:25:56] <wikibugs>	 (CR) Elukey: "Also tried to move the codebase to the class API! :)" (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[10:28:04] <icinga-wm>	 RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:49] <volans>	 elukey: so now it's impossible to check the diffs from the previous PS :-P
[10:29:58] <volans>	 smart!
[10:30:01] <volans>	 :D
[10:30:16] * volans joking, thanks for using the new APIs
[10:38:59] <wikibugs>	 (PS1) JMeybohm: Move non-common kubernetes staging values to DC specific files [puppet] - https://gerrit.wikimedia.org/r/648166 (https://phabricator.wikimedia.org/T244335)
[10:40:09] <wikibugs>	 (CR) Volans: [C: +1] "Nice! Thanks a lot for using the new API <3. LGTM cookbook-wise as before, see replies inline." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[10:48:34] <wikibugs>	 (PS3) Alexandros Kosiaris: profile::kubernetes::node: Remove old redundant code [puppet] - https://gerrit.wikimedia.org/r/646645
[10:48:36] <wikibugs>	 (PS3) Alexandros Kosiaris: k8s::node: Split staging cluster hieras [puppet] - https://gerrit.wikimedia.org/r/646646
[10:54:08] <wikibugs>	 (PS2) Alexandros Kosiaris: Move non-common kubernetes staging values to DC specific files [puppet] - https://gerrit.wikimedia.org/r/648166 (https://phabricator.wikimedia.org/T244335) (owner: JMeybohm)
[10:54:24] <wikibugs>	 (PS1) Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[10:54:36] <elukey>	 volans: I need to stop the cluster first soo --^ 
[10:54:38] <elukey>	 :D
[10:56:21] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27090/console" [puppet] - https://gerrit.wikimedia.org/r/646645 (owner: Alexandros Kosiaris)
[10:57:25] <volans>	 elukey: lol
[10:57:37] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27091/console" [puppet] - https://gerrit.wikimedia.org/r/646646 (owner: Alexandros Kosiaris)
[10:58:52] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27092/console" [puppet] - https://gerrit.wikimedia.org/r/648166 (https://phabricator.wikimedia.org/T244335) (owner: JMeybohm)
[11:01:40] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:01:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:03:16] <wikibugs>	 (CR) Volans: [C: +1] "<3 for the new API! LGTM, couple of nits inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925) (owner: Elukey)
[11:04:29] <wikibugs>	 (PS4) Alexandros Kosiaris: k8s::node: Split staging cluster hieras [puppet] - https://gerrit.wikimedia.org/r/646646
[11:05:29] <wikibugs>	 (Abandoned) Alexandros Kosiaris: Move non-common kubernetes staging values to DC specific files [puppet] - https://gerrit.wikimedia.org/r/648166 (https://phabricator.wikimedia.org/T244335) (owner: JMeybohm)
[11:07:12] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27093/console" [puppet] - https://gerrit.wikimedia.org/r/646646 (owner: Alexandros Kosiaris)
[11:07:42] <wikibugs>	 (PS2) Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[11:07:45] <wikibugs>	 (CR) Elukey: sre.hadoop.stop-cluster.py: move to class API (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925) (owner: Elukey)
[11:08:34] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1 C: +2] "PCC happy, the squashed commit https://gerrit.wikimedia.org/r/c/operations/puppet/+/648166/2 was also required, so good catch, merging" [puppet] - https://gerrit.wikimedia.org/r/646646 (owner: Alexandros Kosiaris)
[11:08:39] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1 C: +2] profile::kubernetes::node: Remove old redundant code [puppet] - https://gerrit.wikimedia.org/r/646645 (owner: Alexandros Kosiaris)
[11:10:02] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:14:52] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:00] <wikibugs>	 (PS4) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[11:18:02] <wikibugs>	 (PS3) Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[11:18:08] <wikibugs>	 (CR) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[11:18:34] <wikibugs>	 (PS1) Alexandros Kosiaris: k8s-staging-codfw: Switch calico_version to String [puppet] - https://gerrit.wikimedia.org/r/648182
[11:19:36] <wikibugs>	 (PS5) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[11:19:38] <wikibugs>	 (PS4) Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[11:20:02] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] k8s-staging-codfw: Switch calico_version to String [puppet] - https://gerrit.wikimedia.org/r/648182 (owner: Alexandros Kosiaris)
[11:21:50] <jbond42>	 volans: fyi i tested sre.puppet.renew-cert and it worked fine
[11:21:59] <volans>	 jbond42: <3 thanks a lot!
[11:23:41] <jbond42>	 np
[11:29:14] <wikibugs>	 (PS1) Alexandros Kosiaris: calico: Move calico-cni package inclusion to main class [puppet] - https://gerrit.wikimedia.org/r/648184
[11:30:49] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27098/console" [puppet] - https://gerrit.wikimedia.org/r/648184 (owner: Alexandros Kosiaris)
[11:31:36] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM and +1 to merging the classes!" [puppet] - https://gerrit.wikimedia.org/r/648184 (owner: Alexandros Kosiaris)
[11:32:04] <wikibugs>	 (PS4) Jbond: (WIP) spec test fixes [puppet] - https://gerrit.wikimedia.org/r/645187
[11:33:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] (WIP) spec test fixes [puppet] - https://gerrit.wikimedia.org/r/645187 (owner: Jbond)
[11:33:31] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1 C: +2] calico: Move calico-cni package inclusion to main class [puppet] - https://gerrit.wikimedia.org/r/648184 (owner: Alexandros Kosiaris)
[11:34:48] <wikibugs>	 (CR) David Caro: [C: +2] "Way easier to understand, thanks!" [puppet] - https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: Bstorm)
[11:35:30] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27099/console" [puppet] - https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185) (owner: Alexandros Kosiaris)
[11:37:45] <wikibugs>	 (PS2) Alexandros Kosiaris: kubestage2*: Assign role [puppet] - https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185)
[11:40:02] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27100/console" [puppet] - https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185) (owner: Alexandros Kosiaris)
[11:40:23] <wikibugs>	 (CR) Jbond: [C: +2] early_command: configure static, mapped ipv6 address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/645081 (owner: Jbond)
[11:46:34] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:30] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1 C: +2] "\o/" [puppet] - https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185) (owner: Alexandros Kosiaris)
[11:50:26] <wikibugs>	 (CR) Volans: "replies inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[11:51:35] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925) (owner: Elukey)
[11:57:43] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (Jclark-ctr) @Cmjohnson  Those came in with that large shipment of 8 8 6  S01720435 - 8 boxes on 1 pallet - T264584_PO1016 S01719765 - 8 boxes on 1 pallet - T264584_PO1016 S0172051...
[11:57:51] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (Jclark-ctr)
[12:00:40] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (Jclark-ctr) labstore1006 has firmware 6.6 for smart array controller  https://support.hpe.com/hpesc/public/docDisplay?docId=a00037929en_us&docLocale=...
[12:00:41] <wikibugs>	 (CR) Volans: [C: +1] "I didn't test it but the changes looks ok" [puppet] - https://gerrit.wikimedia.org/r/647369 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[12:00:46] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[12:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:42] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[12:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:56] <wikibugs>	 (03PS1) 10JMeybohm: Enable k8s-staging prometheus instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/648192 (https://phabricator.wikimedia.org/T244335)
[12:03:58] <wikibugs>	 (03PS1) 10JMeybohm: Add k8s-staging prometheus instance datasource [puppet] - 10https://gerrit.wikimedia.org/r/648193 (https://phabricator.wikimedia.org/T244335)
[12:07:38] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: kubelet: Remove --allow-privileged [puppet] - 10https://gerrit.wikimedia.org/r/648194
[12:07:40] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Allow setting allow_privileged [puppet] - 10https://gerrit.wikimedia.org/r/648195
[12:10:52] <wikibugs>	 10Operations, 10Traffic, 10Readers-Web-Backlog (Needs Product Owner Decisions): [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10dr0ptp4kt) 05Open→03Resolved I was able to reproduce the new behavior observed by @ckoerner on a number o...
[12:11:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27101/console" [puppet] - 10https://gerrit.wikimedia.org/r/648194 (owner: 10Alexandros Kosiaris)
[12:15:17] <wikibugs>	 (03PS1) 10Jbond: early_command: busy box doesn't have awk [puppet] - 10https://gerrit.wikimedia.org/r/648197
[12:15:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27102/console" [puppet] - 10https://gerrit.wikimedia.org/r/648195 (owner: 10Alexandros Kosiaris)
[12:16:42] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27103/console" [puppet] - 10https://gerrit.wikimedia.org/r/648195 (owner: 10Alexandros Kosiaris)
[12:17:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/27102/ says ok, merging and proceed. Thanks for the +1" [puppet] - 10https://gerrit.wikimedia.org/r/648195 (owner: 10Alexandros Kosiaris)
[12:17:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] kubelet: Remove --allow-privileged [puppet] - 10https://gerrit.wikimedia.org/r/648194 (owner: 10Alexandros Kosiaris)
[12:18:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] early_command: busy box doesn't have awk [puppet] - 10https://gerrit.wikimedia.org/r/648197 (owner: 10Jbond)
[12:20:51] <wikibugs>	 (03PS1) 10Ema: vcl: fix X-Cache-Status on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/648199 (https://phabricator.wikimedia.org/T269825)
[12:39:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 485025352 and 48 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:40:07] <wikibugs>	 (03PS2) 10Ema: vcl: fix X-Cache-Status on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/648199 (https://phabricator.wikimedia.org/T269825)
[12:40:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 196520 and 94 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:41:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 254577048 and 121 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:42:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72736 and 188 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:44:37] <wikibugs>	 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) Initial results of the 6.0.0 experiment on cp3054 are encouraging: for the past 12 hours [[ https://grafana.wikimedia.org/d/Lp_BDKJMz/em...
[12:48:00] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:01] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/648206
[12:54:03] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: specify default version numbers in a single place [puppet] - 10https://gerrit.wikimedia.org/r/648207
[12:54:05] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: kubedam: wmcs-k8s-node-upgrade.py: help message refresh [puppet] - 10https://gerrit.wikimedia.org/r/648208
[12:54:08] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: skip node if current version fails [puppet] - 10https://gerrit.wikimedia.org/r/648209
[12:54:10] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: cache status yaml [puppet] - 10https://gerrit.wikimedia.org/r/648210
[12:54:12] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - 10https://gerrit.wikimedia.org/r/648211
[12:54:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] kubeadm: wmcs-k8s-node-upgrade.py: specify default version numbers in a single place [puppet] - 10https://gerrit.wikimedia.org/r/648207 (owner: 10Arturo Borrero Gonzalez)
[12:55:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez)
[13:21:12] <wikibugs>	 (03PS1) 10Jbond: install_server: try to fix the ip address in late command [puppet] - 10https://gerrit.wikimedia.org/r/648222
[13:22:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] install_server: try to fix the ip address in late command [puppet] - 10https://gerrit.wikimedia.org/r/648222 (owner: 10Jbond)
[13:24:44] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10guergana.tzatchkova)
[13:25:50] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1001 is CRITICAL: DISK CRITICAL - free space: /data 894342 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops
[13:27:28] <volans>	 apergos: ^^^
[13:27:38] <volans>	 something known?
[13:27:56] <apergos>	 just me accumulating too much cruft from test runs
[13:27:58] <apergos>	 will clean up
[13:28:09] <volans>	 ok, there is also a 5% free space on / fwiw
[13:30:29] <apergos>	 mm that's harder, I'll see if there's anything that can be made to go away
[13:35:05] <apergos>	 uh 
[13:35:21] <apergos>	 on / there's lots available 
[13:35:32] <apergos>	 maybe you read avail for used when you were looking at that output?
[13:35:51] <volans>	 apergos: lol, my bad, eyes crossed columns
[13:36:04] <apergos>	 no worries! means it's fixed already :-D
[13:36:07] <volans>	 :D
[13:36:16] <volans>	 thanks for fixing it so quickly and efficiently! :-)
[13:36:28] <apergos>	 the alert should clear soon,  (yw :-P :-D)  I got rid of some junk
[13:36:45] <apergos>	 I'll be able to get rid of the rest once this job of wikidata completes in a few days
[13:36:52] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[13:36:53] <wikibugs>	 (03CR) 10Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey)
[13:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:13] <volans>	 np
[13:38:27] <wikibugs>	 (03PS1) 10Jbond: install_server: use correct token [puppet] - 10https://gerrit.wikimedia.org/r/648228
[13:38:37] <wikibugs>	 (03PS6) 10Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[13:38:39] <wikibugs>	 (03PS5) 10Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[13:38:57] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[13:38:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] install_server: use correct token [puppet] - 10https://gerrit.wikimedia.org/r/648228 (owner: 10Jbond)
[13:45:54] <icinga-wm>	 RECOVERY - Disk space on dumpsdata1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops
[13:53:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Going to merge and test, pretty sure I'll have to follow up with some bug :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey)
[13:53:35] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.stop-cluster.py: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey)
[13:57:03] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[13:57:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:06] <elukey>	 wooow
[13:58:20] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[13:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:21] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[14:00:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:04] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[14:03:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[14:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:25] <wikibugs>	 (03PS1) 10Phuedx: Disable Page Previews IRC alerts [puppet] - 10https://gerrit.wikimedia.org/r/648237
[14:14:01] <wikibugs>	 (03PS1) 10Jbond: late_command: add cidr bits [puppet] - 10https://gerrit.wikimedia.org/r/648238
[14:14:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] late_command: add cidr bits [puppet] - 10https://gerrit.wikimedia.org/r/648238 (owner: 10Jbond)
[14:15:06] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:26] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[14:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:32] <wikibugs>	 (03PS1) 10Phuedx: Update .mailmap to de-duplicate my email addresses [puppet] - 10https://gerrit.wikimedia.org/r/648239
[14:17:23] <wikibugs>	 (03PS1) 10JMeybohm: Add wmf-node-authorization ClusterRoleBinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/648240 (https://phabricator.wikimedia.org/T244335)
[14:18:26] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add wmf-node-authorization ClusterRoleBinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/648240 (https://phabricator.wikimedia.org/T244335) (owner: 10JMeybohm)
[14:18:38] <wikibugs>	 (03PS1) 10Elukey: hadoop: fix typo in package list [cookbooks] - 10https://gerrit.wikimedia.org/r/648241
[14:19:26] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:19:57] <wikibugs>	 (03Merged) 10jenkins-bot: Add wmf-node-authorization ClusterRoleBinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/648240 (https://phabricator.wikimedia.org/T244335) (owner: 10JMeybohm)
[14:20:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] hadoop: fix typo in package list [cookbooks] - 10https://gerrit.wikimedia.org/r/648241 (owner: 10Elukey)
[14:21:54] <elukey>	 in theory I should be able to re-run the cookbook and restart from where I left off, let's see
[14:23:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[14:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:00] <wikibugs>	 10Operations, 10InternetArchiveBot, 10Platform Engineering, 10Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10jbond) p:05Triage→03Medium
[14:26:28] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[14:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:50] <elukey>	 oh noes another typo
[14:26:54] * elukey cries in a corner
[14:27:37] <wikibugs>	 (03PS1) 10Elukey: hadoop: fix another typo in the package list [cookbooks] - 10https://gerrit.wikimedia.org/r/648244
[14:28:07] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[14:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:43] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond)
[14:29:27] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] hadoop: fix another typo in the package list [cookbooks] - 10https://gerrit.wikimedia.org/r/648244 (owner: 10Elukey)
[14:30:11] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[14:30:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:05] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) @WMDE-leszek are you able to approve this access request thanks @Ottomata are you able to approve Guergana access to `analytics-wmde-users` & `analytics-privatedata-u...
[14:32:31] <wikibugs>	 (03PS1) 10JMeybohm: helmfile needs: parameter requires a release namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/648245 (https://phabricator.wikimedia.org/T267653)
[14:34:21] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10Ottomata) Approved!  Guergana should also be in the `nda` LDAP group and be given a Kerberos principal.
[14:34:53] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] helmfile needs: parameter requires a release namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/648245 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[14:35:53] <wikibugs>	 (03PS1) 10Jbond: admin: add gtzatchkova to analytics-wmde-users & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/648246 (https://phabricator.wikimedia.org/T269930)
[14:36:25] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile needs: parameter requires a release namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/648245 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[14:37:23] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] presto: reduce the max heap size from 110G to 100G [puppet] - 10https://gerrit.wikimedia.org/r/647999 (owner: 10Elukey)
[14:38:48] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) @guergana.tzatchkova I have created the CR just waiting on Leszek's approval however i have used the username `gtzatchkova` as this was already...
[14:39:09] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond)
[14:42:29] <wikibugs>	 (03PS2) 10Hashar: doc: switch to scap DocumentRoot [take 2] [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924)
[14:42:31] <wikibugs>	 (03PS1) 10Hashar: doc: use an Apache Define for WMF_DOC_PATH [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924)
[14:42:33] <wikibugs>	 (03PS1) 10Hashar: doc: fix fallback to WMF_DOC_PATH files [puppet] - 10https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924)
[14:45:36] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[14:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:16] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[14:51:33] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10Aklapper)
[14:53:08] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) p:05Triage→03Medium
[14:53:47] <icinga-wm>	 PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:19] <jayme>	 probably me, checking
[14:54:33] <wikibugs>	 (03PS1) 10Volans: wmf-auto-reimage: remove hack to parse output [puppet] - 10https://gerrit.wikimedia.org/r/648250
[14:55:12] <wikibugs>	 (03CR) 10Hashar: "PPC shows nothing cause there is just a file change so one just have to look at the Gerrit change to find out what has changed :]" [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[14:55:44] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[14:55:49] <icinga-wm>	 RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] wmf-auto-reimage: remove hack to parse output [puppet] - 10https://gerrit.wikimedia.org/r/648250 (owner: 10Volans)
[14:57:00] <wikibugs>	 (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: remove hack to parse output [puppet] - 10https://gerrit.wikimedia.org/r/648250 (owner: 10Volans)
[14:59:43] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[14:59:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:47] <wikibugs>	 (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 (owner: 10Volans)
[15:06:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[15:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:42] <wikibugs>	 10Operations, 10InternetArchiveBot, 10Platform Engineering, 10Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) Thursday, as in yesterday?  I’m not aware of anything that should have been running to create that massive level of requests.
[15:07:17] <wikibugs>	 10Operations, 10InternetArchiveBot, 10Platform Engineering, 10Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) Especially to Wikidata.
[15:10:01] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[15:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:57] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[15:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:05] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[15:15:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:30] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I'm lacking part of the context,did a pass and didn't see anything obviously wrong. I didn't test it but if the test are running fine for " (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat)
[15:20:06] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[15:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:19] <wikibugs>	 (03PS8) 10Kormat: integration: Complete framework for running basic tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266)
[15:24:10] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Ship it" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat)
[15:24:40] <wikibugs>	 (03CR) 10Kormat: integration: Complete framework for running basic tests (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat)
[15:25:18] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] integration: Complete framework for running basic tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat)
[15:27:35] <wikibugs>	 (03Merged) 10jenkins-bot: integration: Complete framework for running basic tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat)
[15:30:33] <wikibugs>	 (03PS1) 10Jbond: install_server: slaac not slacc [puppet] - 10https://gerrit.wikimedia.org/r/648253
[15:31:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] install_server: slaac not slacc [puppet] - 10https://gerrit.wikimedia.org/r/648253 (owner: 10Jbond)
[15:33:25] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
[15:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:27] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
[15:35:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:33] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10WMDE-leszek) I approve this request.
[15:36:48] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10WMDE-leszek)
[15:37:15] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "> Patch Set 16:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov)
[15:38:01] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:44:34] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10guergana.tzatchkova) >>! In T269930#6684825, @jbond wrote: > @guergana.tzatchkova I have created the CR just w...
[15:44:35] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:11] <wikibugs>	 (03PS1) 10Papaul: DHCP: Add MAC address for ml-serve200[1234] [puppet] - 10https://gerrit.wikimedia.org/r/648258 (https://phabricator.wikimedia.org/T267670)
[15:47:43] <wikibugs>	 (03PS1) 10Volans: sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605)
[15:47:47] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for ml-serve200[1234] [puppet] - 10https://gerrit.wikimedia.org/r/648258 (https://phabricator.wikimedia.org/T267670) (owner: 10Papaul)
[15:51:28] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.upgrade-bigtop-distro.py: format standby only when in rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/648261 (https://phabricator.wikimedia.org/T269919)
[15:53:56] <wikibugs>	 (03PS1) 10Papaul: Add ml-serve200[1234] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/648262 (https://phabricator.wikimedia.org/T267670)
[15:54:50] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add ml-serve200[1234] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/648262 (https://phabricator.wikimedia.org/T267670) (owner: 10Papaul)
[15:57:28] <wikibugs>	 (03PS2) 10Elukey: sre.hadoop.upgrade-bigtop-distro.py: stop standby before format [cookbooks] - 10https://gerrit.wikimedia.org/r/648261 (https://phabricator.wikimedia.org/T269919)
[15:58:10] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul)
[15:58:31] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond)
[15:58:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: add gtzatchkova to analytics-wmde-users & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/648246 (https://phabricator.wikimedia.org/T269930) (owner: 10Jbond)
[15:59:22] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) a:05Papaul→03klausman @klausman this is done from my end
[16:00:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) (owner: 10Volans)
[16:01:10] <wikibugs>	 (03PS2) 10Volans: sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605)
[16:01:35] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) 05Open→03Resolved a:03jbond >>! In T269930#6684981, @guergana.tzatchkova wrote: >>>! In T269930#6...
[16:01:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.upgrade-bigtop-distro.py: stop standby before format [cookbooks] - 10https://gerrit.wikimedia.org/r/648261 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey)
[16:01:57] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10Papaul) @herron any update on this?
[16:02:49] <wikibugs>	 (03PS3) 10Volans: sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605)
[16:05:47] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) @klausman can you also please add the new naming to  https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions
[16:06:31] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/648206 (owner: 10Arturo Borrero Gonzalez)
[16:06:53] <wikibugs>	 (03PS1) 10Elukey: Revert "Set bigtop 1.5 for Hadoop test" [puppet] - 10https://gerrit.wikimedia.org/r/648269
[16:07:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "Set bigtop 1.5 for Hadoop test" [puppet] - 10https://gerrit.wikimedia.org/r/648269 (owner: 10Elukey)
[16:09:38] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[16:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:46] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) (owner: 10Volans)
[16:11:31] <wikibugs>	 (03PS2) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: hoist constants to top [puppet] - 10https://gerrit.wikimedia.org/r/648207 (owner: 10Arturo Borrero Gonzalez)
[16:12:21] <elukey>	 look at all this hadoop spam in the SAL with proper info, how lovely
[16:12:46] <elukey>	 (I'll have to do some more rounds of tests so please be patient :D)
[16:13:01] <volans>	 I think the message could be shortened
[16:13:24] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) (owner: 10Volans)
[16:13:29] <volans>	 and yes I need to work on the administrative.reason in spicerack to allow for more flexible usages where the user@host is not needed
[16:13:45] <elukey>	 volans: how dare you say that the hadoop message is too long, I am offended
[16:13:47] <wikibugs>	 (03PS3) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: hoist constants to top [puppet] - 10https://gerrit.wikimedia.org/r/648207 (owner: 10Arturo Borrero Gonzalez)
[16:13:48] <elukey>	 :D
[16:13:48] <volans>	 ahahah
[16:14:03] <volans>	 I think that this might be enough: Cookbook sre.hadoop.stop-cluster test: $reason
[16:15:02] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: hoist constants to top [puppet] - 10https://gerrit.wikimedia.org/r/648207 (owner: 10Arturo Borrero Gonzalez)
[16:15:32] <wikibugs>	 (03PS2) 10Bstorm: kubedam: wmcs-k8s-node-upgrade.py: help message refresh [puppet] - 10https://gerrit.wikimedia.org/r/648208 (owner: 10Arturo Borrero Gonzalez)
[16:15:36] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[16:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:39] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] kubedam: wmcs-k8s-node-upgrade.py: help message refresh [puppet] - 10https://gerrit.wikimedia.org/r/648208 (owner: 10Arturo Borrero Gonzalez)
[16:18:20] <wikibugs>	 (03PS2) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: skip node if current version fails [puppet] - 10https://gerrit.wikimedia.org/r/648209 (owner: 10Arturo Borrero Gonzalez)
[16:26:02] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: skip node if current version fails [puppet] - 10https://gerrit.wikimedia.org/r/648209 (owner: 10Arturo Borrero Gonzalez)
[16:28:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[16:28:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:02] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:47] <wikibugs>	 (03PS2) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: cache status yaml [puppet] - 10https://gerrit.wikimedia.org/r/648210 (owner: 10Arturo Borrero Gonzalez)
[16:35:44] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:57] <rzl>	 ^ T269693 again
[16:38:58] <stashbot>	 T269693: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693
[16:40:03] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[16:40:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:44] <volans>	 sad trombone.wav elukey :)
[16:41:01] <elukey>	 I knoooww
[16:41:13] <elukey>	 but it is something that I didn't take into account, good
[16:42:04] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:44] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:56] <icinga-wm>	 PROBLEM - Check systemd state on an-test-client1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:51:28] <wikibugs>	 (03CR) 10Volans: "> Patch Set 18: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov)
[17:01:04] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:50] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:13:18] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: Fix syslog filter of health checks [puppet] - 10https://gerrit.wikimedia.org/r/648296 (https://phabricator.wikimedia.org/T269511)
[17:15:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder: Fix syslog filter of health checks [puppet] - 10https://gerrit.wikimedia.org/r/648296 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[17:31:05] <wikibugs>	 (03PS1) 10Andrew Bogott: Glance: increase priority of rsyslog filter [puppet] - 10https://gerrit.wikimedia.org/r/648299
[17:31:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Glance: increase priority of rsyslog filter [puppet] - 10https://gerrit.wikimedia.org/r/648299 (owner: 10Andrew Bogott)
[17:34:02] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10Gilles)
[17:38:47] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.upgrade-bigtop-distro.py: stop active namenode before rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/648301
[17:40:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.upgrade-bigtop-distro.py: stop active namenode before rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/648301 (owner: 10Elukey)
[17:41:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[17:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:14] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[17:48:28] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[17:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:35] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder policy.yaml: fix typo that broke policy parsing [puppet] - 10https://gerrit.wikimedia.org/r/648302 (https://phabricator.wikimedia.org/T269511)
[17:51:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder policy.yaml: fix typo that broke policy parsing [puppet] - 10https://gerrit.wikimedia.org/r/648302 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[17:52:48] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm
[17:52:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:34] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.decommission
[17:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:34] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: a few tweaks to quiet log warnings [puppet] - 10https://gerrit.wikimedia.org/r/648303 (https://phabricator.wikimedia.org/T269511)
[18:00:26] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder: a few tweaks to quiet log warnings [puppet] - 10https://gerrit.wikimedia.org/r/648303 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:03:12] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) Distribution of images by initial: ` root@db1133.eqiad.wmnet...
[18:05:00] <wikibugs>	 (03PS1) 10Elukey: Revert "Revert "Set bigtop 1.5 for Hadoop test"" [puppet] - 10https://gerrit.wikimedia.org/r/648271
[18:05:26] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[18:05:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "Revert "Set bigtop 1.5 for Hadoop test"" [puppet] - 10https://gerrit.wikimedia.org/r/648271 (owner: 10Elukey)
[18:10:10] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] doc: use an Apache Define for WMF_DOC_PATH [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[18:12:10] <wikibugs>	 (03PS2) 10Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/647843
[18:12:12] <wikibugs>	 (03PS2) 10Ahmon Dancy: 0.2.0: Use a Job to set up the database [deployment-charts] - 10https://gerrit.wikimedia.org/r/647844
[18:12:14] <wikibugs>	 (03PS1) 10Ahmon Dancy: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304
[18:12:22] <wikibugs>	 (03CR) 10Dzahn: "ran puppet on doc1001, restarted apache2, https://doc.wikimedia.org/mediawiki-core/master/php/ and main page still working fine" [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar)
[18:12:27] <mutante>	 hashar: ^ done
[18:13:02] <mutante>	 !log doc1001 restarted apache2 just in case after DOC_PATH change
[18:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304 (owner: 10Ahmon Dancy)
[18:13:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[18:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:49] <hashar>	 mutante: ah yeah
[18:15:29] <hashar>	 mutante: so I checked again this morning and the setup we pushed yesterday was definitely working on my machine.  The issue is that I had the doc published files copied at both places so they were always found :]
[18:16:28] <mutante>	 hashar: ACK, i saw your comments about that on Gerrit. *nod*
[18:16:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[18:16:44] <mutante>	 hashar: this step was noop, all looks like before
[18:16:47] <wikibugs>	 (03PS2) 10Ahmon Dancy: Reorganized setup.sh and added db wait loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/647842
[18:16:49] <wikibugs>	 (03PS3) 10Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/647843
[18:16:51] <wikibugs>	 (03PS3) 10Ahmon Dancy: 0.2.0: Use a Job to set up the database [deployment-charts] - 10https://gerrit.wikimedia.org/r/647844
[18:16:52] <mutante>	 I know there will be a follow-up 
[18:16:53] <wikibugs>	 (03PS2) 10Ahmon Dancy: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304
[18:18:27] <hashar>	 mutante: yes that will be https://gerrit.wikimedia.org/r/c/operations/puppet/+/648248/1/modules/profile/files/doc/httpd-doc.wikimedia.org.conf
[18:18:29] <mutante>	 but you can use that variable now
[18:18:59] <hashar>	 I am setting up a WMCS instance to validate it and prove it is working
[18:19:18] <hashar>	 cause there is 50% chance that one will break stuff eventually
[18:19:19] <mutante>	 hashar: :) ok, cool!
[18:19:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[18:19:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:31] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[18:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:49] <mutante>	 yea, we don't want to cache broken redirects and +1 to using cloud
[18:20:21] <mutante>	 Let me also add a basic test to prod tests, i can do that.
[18:26:02] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10herron) I think we can go without it, we plan to replace these older hosts in the near future and also have some logstash refresh hardware that was just ordered.  Thanks!
[18:26:59] <wikibugs>	 (03PS4) 10Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/647843
[18:27:01] <wikibugs>	 (03PS4) 10Ahmon Dancy: 0.2.0: Use a Job to set up the database [deployment-charts] - 10https://gerrit.wikimedia.org/r/647844
[18:27:03] <wikibugs>	 (03PS3) 10Ahmon Dancy: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304
[18:27:30] <wikibugs>	 (03PS4) 10Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304
[18:29:37] <wikibugs>	 (03PS5) 10Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304
[18:30:30] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[18:30:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:30] <wikibugs>	 (03PS9) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980
[18:39:18] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[18:42:34] <wikibugs>	 (03PS1) 10Legoktm: admin: Update my (legoktm) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/648321
[18:47:47] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[18:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:00] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[18:56:26] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:15] <mutante>	 ^ caused by my change but nothing serious and looking at the new timer 
[18:57:33] <mutante>	 it's about a former cron deleting old reports
[19:02:10] <icinga-wm>	 PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:12] <wikibugs>	 (03PS1) 10Hashar: devtools: add integration/docroot.git on deploy server [puppet] - 10https://gerrit.wikimedia.org/r/648323
[19:05:03] <wikibugs>	 (03CR) 10Hashar: "That is for doc.devtools.eqiad1.wikimedia.cloud , it is a scap deployment target for integration/docroot and fails puppet with:" [puppet] - 10https://gerrit.wikimedia.org/r/648323 (owner: 10Hashar)
[19:12:37] <wikibugs>	 (03PS1) 10Dzahn: puppetmaster: run "remove old reports" job as root [puppet] - 10https://gerrit.wikimedia.org/r/648325 (https://phabricator.wikimedia.org/T265138)
[19:15:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] devtools: add integration/docroot.git on deploy server [puppet] - 10https://gerrit.wikimedia.org/r/648323 (owner: 10Hashar)
[19:15:46] <wikibugs>	 (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/648325" [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[19:17:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "/var/lib/puppet/reports# ls -hals" [puppet] - 10https://gerrit.wikimedia.org/r/648325 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[19:18:11] <logmsgbot>	 !log razzi@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
[19:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:37] <wikibugs>	 (03PS1) 10Dzahn: puppetmaster: remove code to remove crons, replaced by timer [puppet] - 10https://gerrit.wikimedia.org/r/648327 (https://phabricator.wikimedia.org/T265138)
[19:20:46] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:16] <wikibugs>	 (03CR) 10Dzahn: "Dec 11 19:20:48 puppetmaster1001 systemd[1]: remove_old_puppet_reports.service: Succeeded." [puppet] - 10https://gerrit.wikimedia.org/r/648325 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[19:21:27] <wikibugs>	 (03CR) 10Dzahn: "<+icinga-wm> RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wiki" [puppet] - 10https://gerrit.wikimedia.org/r/648325 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[19:27:37] <wikibugs>	 (03PS1) 10Dzahn: puppetmaster: ensure previously used cron is properly removed [puppet] - 10https://gerrit.wikimedia.org/r/648328 (https://phabricator.wikimedia.org/T265138)
[19:27:49] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Refine Growth schemas using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/647817 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns)
[19:28:06] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm
[19:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:41] <wikibugs>	 (03PS1) 10Ahmon Dancy: prometheus: collect zuul error mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/648329 (https://phabricator.wikimedia.org/T258821)
[19:29:47] <wikibugs>	 (03PS2) 10Dzahn: puppetmaster: ensure previously used cron is properly removed [puppet] - 10https://gerrit.wikimedia.org/r/648328 (https://phabricator.wikimedia.org/T265138)
[19:30:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] puppetmaster: ensure previously used cron is properly removed [puppet] - 10https://gerrit.wikimedia.org/r/648328 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[19:33:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 (owner: 10Jcrespo)
[19:35:19] <wikibugs>	 (03CR) 10Dzahn: "properly removed on puppetmaster1001 now.. waiting a bit to let it happen on all puppetmasters including those in cloud" [puppet] - 10https://gerrit.wikimedia.org/r/648328 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[19:44:24] <wikibugs>	 (03PS6) 10Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304
[19:57:39] <wikibugs>	 (03Abandoned) 10Dzahn: httpd: change default server admin from webmaster@ to noc@ [puppet] - 10https://gerrit.wikimedia.org/r/645431 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn)
[19:59:14] <wikibugs>	 (03PS1) 10Hashar: devtools: add dsh group for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/648333
[19:59:33] <wikibugs>	 (03CR) 10Dzahn: "I think I already added what I was able to contribute here and will leave this to observability and traffic folks." [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite)
[20:01:28] <wikibugs>	 (03PS1) 10Mforns: Do not refine HomepageVisit using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/648334 (https://phabricator.wikimedia.org/T267333)
[20:02:09] <wikibugs>	 (03CR) 10Dzahn: "@hashar Can we do this? just adding new keys is only one step that is not removing the old keys." [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox)
[20:02:33] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "This is a server side event, which is not ready for migration" [puppet] - 10https://gerrit.wikimedia.org/r/648334 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns)
[20:04:08] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:07:30] <wikibugs>	 (03PS1) 10Mforns: Remove HomepageVisit from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648336 (https://phabricator.wikimedia.org/T267333)
[20:08:56] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Remove HomepageVisit from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648336 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns)
[20:11:05] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Un-migrtate Growth EventLogging schema HomepageVisit back to EventLogging-backend on all wikis (this is a server side event which is not yet ready to migrate) - T267333 (duration: 00m 58s)
[20:11:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:09] <stashbot>	 T267333: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333
[20:15:26] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "That is to generate the dsh file for the deploy-1002 instance.  It is using the cloudinfra puppetmaster hence I can not cherry pick.   I h" [puppet] - 10https://gerrit.wikimedia.org/r/648333 (owner: 10Hashar)
[20:27:28] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[20:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:54] <wikibugs>	 (03PS1) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339
[20:29:02] <wikibugs>	 (03PS2) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924)
[20:29:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] devtools: add dsh group for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/648333 (owner: 10Hashar)
[20:30:13] <wikibugs>	 (03CR) 10Dzahn: "@hashar I already tested this on deploy1001 with the same yaml in my home dir and doc1001 passes all the assertions." [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[20:31:13] <wikibugs>	 (03CR) 10Dzahn: "Doing this because we have ongoing changes to the docroot etc on doc1001 which make tests nice to have." [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[20:31:37] <hashar>	 mutante: thanks :]
[20:31:51] <hashar>	 for the Gerrit ssh keys, yeah I guess I should look at the patches eventually
[20:32:02] <hashar>	 no idea what kind of breakage that will cause everywhere though
[20:32:02] <mutante>	 hashar: no problem, also see this:
[20:32:12] <mutante>	 [deploy1001:~] $ httpbb --hosts doc1001.eqiad.wmnet - < test_doc.yaml 
[20:32:14] <mutante>	 Sending to doc1001.eqiad.wmnet...
[20:32:17] <mutante>	 PASS: 1 request sent to doc1001.eqiad.wmnet. All assertions passed.
[20:32:27] <hashar>	 oh
[20:32:35] <mutante>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/648339/2/modules/profile/files/httpbb/doc/test_doc.yaml
[20:32:48] <mutante>	 tests the redirect from "without slash" to "/" etc
[20:33:02] <mutante>	 except that should be more than just 1 request
[20:33:06] <mutante>	 I think it means "1 host"
[20:33:13] <hashar>	 so hmm 
[20:33:35] <hashar>	 potentially we could add that to scap and use it to assert the deployment went well
[20:33:48] <wikibugs>	 (03PS1) 10Bstorm: toolforge: make timeouts for our slow etcd clusters configurable [puppet] - 10https://gerrit.wikimedia.org/r/648340 (https://phabricator.wikimedia.org/T267966)
[20:33:51] <hashar>	 or might even have the test_doc.yaml directly in the integration/docroot.git repo
[20:34:07] <mutante>	 hashar: yea, i think that is already an idea 
[20:34:34] <mutante>	 also it should have a comment in the yaml file which hosts this is for
[20:34:41] <mutante>	 beyond being in that subdir 
[20:34:43] <mutante>	 called "doc"
[20:34:50] <mutante>	 but it implies "any doc* host" 
[20:35:16] <hashar>	 or have profile::doc to ship the test file to httpbb
[20:35:20] <mutante>	 I talked about it with Reuven recently
[20:35:41] <mutante>	 hmm.. that's an interesting approach as well, yes
[20:36:21] <hashar>	 or directly inside integration/docroot ;]
[20:36:44] <hashar>	 then it is a bit of a mix cause that is affected by either an Apache config change or the  web app being changed
[20:36:48] <hashar>	 so I guess in puppet it is fine
[20:37:28] <hashar>	 in an ideal world, we would boot a docker container, provision it with puppet and run httpbb against the resulting web server
[20:37:36] <wikibugs>	 (03CR) 10Dzahn: "@RLazarus When I run this and see "1 request" shouldn't I expect at least one request per assertion in my yaml? Does it actually mean "som" [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[20:39:37] <mutante>	 hashar: I would manually run this when there are Apache config changes, so if there could be "check experimental" and "Hosts: " and jenkins bot saying -1 automatically if the existing tests fail after a change.. but it's probably not easy to detect which changes affect apache config.
[20:40:42] <mutante>	 and I can run it by myself from cumin or deploy server.. and not a huge difference that is worth spending a lot of effort 
[20:40:55] <mutante>	 I think it's more important for now we have moar test files :)
[20:42:55] <mutante>	 oh.. duh.. the issue is how I am using it with:
[20:43:14] <mutante>	 - < test_doc.yaml   this is only doing the first test
[20:43:30] <mutante>	 need to just use the full path to it
[20:44:37] <wikibugs>	 (03PS1) 10Razzi: kafka: Add kafka-test1008 - 1010 [puppet] - 10https://gerrit.wikimedia.org/r/648342 (https://phabricator.wikimedia.org/T268202)
[20:45:57] <wikibugs>	 (03CR) 10Hashar: "Thanks for starting this! There is a potential typo and I have a question about giving names to assert." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[20:46:44] <hashar>	 mutante: or maybe the entries are overridden
[20:46:50] <hashar>	 foo: [bar]
[20:46:54] <hashar>	 foo: [baz]
[20:46:58] <hashar>	 foo: [yo]
[20:46:58] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "@RLazarus Scratch my question, I am using it wrong "- < file" will only read the first test. I need to simply use full path to the yaml fi" [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[20:47:07] <hashar>	 might just yield the last one  (yo)
[20:47:27] <hashar>	 I don't know how yaml handles overlapping keys, it probably overwrites
[20:55:59] <wikibugs>	 (03PS3) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924)
[20:56:16] <mutante>	 hashar: ok, this version works now
[20:56:22] <mutante>	 PASS: 9 requests sent to doc1001.eqiad.wmnet. All assertions passed.
[20:58:17] <mutante>	 answer is to not repeat the key at all.. this was just from an example of a host that has multiple sites
[20:59:09] <wikibugs>	 (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1001/27109/tools-k8s-etcd-4.tools.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/648340 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm)
[21:00:16] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: make timeouts for our slow etcd clusters configurable [puppet] - 10https://gerrit.wikimedia.org/r/648340 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm)
[21:00:16] <hashar>	 mutante: yeah I guess right.  It picked either the first or the last entry :]
[21:01:00] <hashar>	 mutante: thanks for working on that bit!  I was too lazy to find the doc and set it up locally 
[21:02:05] <wikibugs>	 (03PS4) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924)
[21:02:28] <mutante>	 hashar: no problem, I was doing that for some other hosts like miscweb before.. and PS4 should be correct
[21:02:33] <mutante>	 going to cook now
[21:03:46] <mutante>	 will find out the answer to your question about 'labels'
[21:05:06] <rzl>	 hashar: no support for named assertions at this point but it's an interesting idea
[21:05:19] <rzl>	 it does support comments in the YAML though, that's what I would do :)
[21:07:15] <rzl>	 and, correct, don't reuse hosts, just combine them -- I might try to go back and add merging logic or at least print a warning, but IIRC that all takes place in library code so it's nontrivial
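Editor's note: the repeated-key problem discussed above (foo: [bar] / foo: [baz] / foo: [yo]) and the "print a warning" idea can be sketched roughly as follows. This is not httpbb's loader and does not use a YAML library; it uses Python's standard json module as a stand-in, since a JSON object with a repeated key shows the same "last one wins" behaviour that was suspected here. The names reject_duplicate_keys, load_lenient, load_strict and DOC are hypothetical, chosen only for this illustration.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises instead of silently keeping the last value."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


# Hypothetical stand-in for a test file that repeats the same host key.
DOC = '{"foo": ["bar"], "foo": ["baz"], "foo": ["yo"]}'


def load_lenient(text):
    # Default behaviour: later duplicates silently replace earlier ones.
    return json.loads(text)


def load_strict(text):
    # Same parse, but a repeated key is reported as an error.
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)
```

With the default loader only the last entry survives, which would explain a run that exercises fewer tests than the file appears to contain; the strict variant fails loudly instead. Whether a given YAML library exposes a comparable hook is something to check in its own documentation.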
[21:08:19] <wikibugs>	 (03PS1) 10Ladsgroup: kafkatee: Migrate hiera() to lookup() and set data type [puppet] - 10https://gerrit.wikimedia.org/r/648348 (https://phabricator.wikimedia.org/T209953)
[21:08:28] <rzl>	 cc mutante fyi ^
[21:08:38] <hashar>	 rzl: awesome thank you !
[21:10:44] <wikibugs>	 (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27110/" [puppet] - 10https://gerrit.wikimedia.org/r/648348 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[21:12:55] <wikibugs>	 (03CR) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[21:14:11] <wikibugs>	 (03CR) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[21:18:04] <mutante>	 rzl: thanks :) ack
[21:20:16] <wikibugs>	 (03PS1) 10Ladsgroup: mjolnir: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953)
[21:21:29] <wikibugs>	 (03CR) 10RLazarus: "Don't forget to add an httpbb::test_suite at modules/profile/manifests/httpbb.pp so this gets picked up. Thanks for adding this!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[21:21:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mjolnir: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[21:23:39] <wikibugs>	 (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[21:24:43] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/648351
[21:25:25] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/647849 (owner: 10PipelineBot)
[21:25:38] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/647831 (owner: 10PipelineBot)
[21:26:36] <wikibugs>	 (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27111/" [puppet] - 10https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[21:26:58] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/648351 (owner: 10PipelineBot)
[21:28:15] <wikibugs>	 (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/648351 (owner: 10PipelineBot)
[21:33:18] <logmsgbot>	 !log dduvall@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[21:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:45] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[21:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:29] <wikibugs>	 10Operations, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 4 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Urbanecm)
[21:36:40] <wikibugs>	 10Operations, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 4 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Urbanecm) Wrong tag, but keeping it, as it's related too.
[21:37:35] <wikibugs>	 (03PS1) 10Bstorm: etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966)
[21:38:45] <wikibugs>	 (03CR) 10Bstorm: "The defaults should make this a noop for existing servers, but it will allow us to use horizon to configure our VMs according to which clu" [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm)
[21:38:48] <logmsgbot>	 !log dduvall@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[21:38:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm)
[21:41:11] <logmsgbot>	 !log dduvall@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[21:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:38] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[21:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:18] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[21:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:42] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164 (10hashar) `DirectorySlash` redirecting to http instead of canonical https is #upstream Apach...
[21:45:59] <wikibugs>	 (03PS5) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924)
[21:46:09] <Amir1>	 !log Running schema changes on wikitech database for T269348
[21:46:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:12] <stashbot>	 T269348: wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348
[21:53:57] <wikibugs>	 (03PS1) 10Dduvall: Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/648275
[21:54:04] <mutante>	 hashar: re: Gerrit keys, I think we already went through the concerns about a specific thing that might break but you also tested it and it did not actually happen
[21:54:48] <mutante>	 but yea.. never say never
[21:56:05] <hashar>	 :]
[21:56:50] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/648275 (owner: 10Dduvall)
[21:57:06] <hashar>	 mutante: well I am off, thanks for the patches :]   Will look at finishing the install of the doc website on labs next week
[21:57:10] <akosiaris>	 !log add docker-ce_18.06.3~ce~3-0~debian_amd64.deb to apt.wikimedia.org stretch-wikimedia/thirdparty/k8s
[21:57:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:14] <mutante>	 hashar: see you! enjoy the weekend
[21:58:16] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/648275 (owner: 10Dduvall)
[21:58:28] <hashar>	 sanding and painting! :]
[21:58:44] <mutante>	 :p new car
[21:59:01] <hashar>	 na doors inside the house
[21:59:12] <mutante>	 :) gotcha
[21:59:17] <logmsgbot>	 !log dduvall@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[21:59:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:15] <logmsgbot>	 !log dduvall@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[22:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: k8s-staging-codfw: Use docker-ce 18.06.3~ce~3-0~debian [puppet] - 10https://gerrit.wikimedia.org/r/648356
[22:04:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:00] <logmsgbot>	 !log dduvall@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[22:05:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:31] <wikibugs>	 (03PS6) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924)
[22:11:25] <wikibugs>	 (03PS2) 10Bstorm: etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966)
[22:12:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27112/console" [puppet] - 10https://gerrit.wikimedia.org/r/648356 (owner: 10Alexandros Kosiaris)
[22:13:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm)
[22:13:27] <wikibugs>	 (03CR) 10Bstorm: "The CI failures appear to be from lack of a sane default to "$srv_dns = undef," rather than what I'm doing here. It seems to want a value " [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm)
[22:14:36] <wikibugs>	 (03PS1) 10Dzahn: puppetmaster: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/648358 (https://phabricator.wikimedia.org/T266479)
[22:15:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27113/console" [puppet] - 10https://gerrit.wikimedia.org/r/648356 (owner: 10Alexandros Kosiaris)
[22:15:54] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27113/kubestage2002.codfw.wmnet/index.html is as expected, so +2ing" [puppet] - 10https://gerrit.wikimedia.org/r/648356 (owner: 10Alexandros Kosiaris)
[22:17:20] <wikibugs>	 (03PS1) 10Dzahn: icinga: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/648359 (https://phabricator.wikimedia.org/T266479)
[22:39:03] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: cache status yaml [puppet] - 10https://gerrit.wikimedia.org/r/648210 (owner: 10Arturo Borrero Gonzalez)
[22:40:40] <wikibugs>	 (03PS2) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez)
[22:41:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez)
[22:41:11] <wikibugs>	 (03PS3) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez)
[22:41:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez)
[22:41:41] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "LGTM! Feel free to add comments as discussed, but I'm happy with this merging whenever." [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[22:49:39] <wikibugs>	 (03CR) 10Bstorm: "I'd expected the fail on CI was just the commit message (so I updated it), but it fails on:" [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez)
[22:52:03] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Also ship the following plugins which are included in the release [debs/calico] - 10https://gerrit.wikimedia.org/r/648363
[23:01:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn)
[23:07:42] <wikibugs>	 10Operations, 10vm-requests: eqiad: 1 VM request for doc - https://phabricator.wikimedia.org/T269977 (10Dzahn)
[23:07:54] <wikibugs>	 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn)
[23:08:18] <icinga-wm>	 RECOVERY - MegaRAID on es1023 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:08:30] <wikibugs>	 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn)
[23:08:32] <wikibugs>	 10Operations, 10Release-Engineering-Team, 10serviceops: replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (10Dzahn)
[23:08:54] <wikibugs>	 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn)
[23:10:49] <wikibugs>	 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10Dzahn) Now doc1002 should be created (T269977) to buster (T247653).  And we should also make doc2001 in codfw.
[23:21:35] <wikibugs>	 10Operations, 10vm-requests: codfw: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T269978 (10Dzahn)
[23:21:49] <wikibugs>	 10Operations, 10vm-requests: codfw:  1 VM request for doc.wikimedia.org (doc2001) - https://phabricator.wikimedia.org/T269978 (10Dzahn)
[23:22:06] <wikibugs>	 10Operations, 10vm-requests: codfw:  1 VM request for doc.wikimedia.org (doc2001) - https://phabricator.wikimedia.org/T269978 (10Dzahn)
[23:22:08] <wikibugs>	 10Operations, 10Release-Engineering-Team, 10serviceops: replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (10Dzahn)
[23:33:55] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) I ran all httpbb appserver tests on mwdebug1003:   ` [deploy1001:~] $ for testfile in /srv/deploymen...
[23:38:25] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: k8s-staging-codfw: Specify admission_controllers hiera [puppet] - 10https://gerrit.wikimedia.org/r/648377
[23:39:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27114/console" [puppet] - 10https://gerrit.wikimedia.org/r/648377 (owner: 10Alexandros Kosiaris)
[23:41:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "PCC happy at https://puppet-compiler.wmflabs.org/compiler1003/27114/kubestagemaster2001.codfw.wmnet/index.html, merging" [puppet] - 10https://gerrit.wikimedia.org/r/648377 (owner: 10Alexandros Kosiaris)
[23:42:31] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add missing directory for doc tests [puppet] - 10https://gerrit.wikimedia.org/r/648380
[23:43:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: add missing directory for doc tests [puppet] - 10https://gerrit.wikimedia.org/r/648380 (owner: 10Dzahn)
[23:45:18] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01061 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[23:45:48] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:43] <wikibugs>	 (03CR) 10Dzahn: "[deploy1001:~] $ httpbb --hosts doc1001.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/648380 (owner: 10Dzahn)
[23:48:12] <rzl>	 mutante: ah, sorry for not catching that
[23:48:28] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] httpbb: add missing directory for doc tests [puppet] - 10https://gerrit.wikimedia.org/r/648380 (owner: 10Dzahn)
[23:48:38] <rzl>	 as ever, I keep meaning to rethink how that file works :)
[23:48:43] <mutante>	 rzl: no problem at all. I did not even notice the puppet issue 
[23:49:04] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:17] <mutante>	 regarding the "widespread puppet failures" alert: it is always living on the edge, slightly under the threshold for what counts as widespread, heh
[23:49:53] <mutante>	 rzl: I kind of wanted to add the host names in comments :)
[23:50:36] <mutante>	 when making the next directory I will also follow the "cluster name" (host name without numbers), so NOT  ./parsoid/  but  ./parse/
[23:51:14] <rzl>	 hmm, yeah!
[23:51:28] <rzl>	 we'd have to change "appserver" to "mw" if we're keeping that, but I kind of like it
[23:51:39] <rzl>	 I wonder if there are any examples where it's not one-to-one, in either direction
[23:52:15] <mutante>	 files under /srv/deployment make me think "that's deployed by scap" and not by puppet, btw. but .. also not like it really matters a lot
[23:53:07] <mutante>	 rzl: right now it is for parsoid, parse vs wtp .. but unique and temp situation
[23:53:11] <rzl>	 hmm, I guess the appservers are already mw# and mwdebug# but that's not the worst
[23:53:27] <rzl>	 yeah, I wasn't counting that just because we're getting rid of it
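(Editor's sketch, not part of the channel log.) The naming rule mutante describes, "cluster name" meaning the host name without numbers, could be expressed roughly as follows. The function name `cluster_name` is hypothetical, and the sketch assumes the numbers always form a trailing run on the short host name, which holds for the hosts mentioned in this log but may not hold everywhere.

```python
import re


def cluster_name(hostname):
    """Return the host name without its domain and trailing digits."""
    # Keep only the short name, e.g. "doc1001" from "doc1001.eqiad.wmnet".
    short = hostname.split(".", 1)[0]
    # Assumption: the "numbers" are a trailing run of digits.
    return re.sub(r"\d+$", "", short)


for host in ("doc1001.eqiad.wmnet", "mwdebug1003", "kubestagemaster2001"):
    print(host, "->", cluster_name(host))
```

As rzl notes, the mapping is not obviously one-to-one: under this rule mw# and mwdebug# hosts would produce two different directory names even though both are appservers.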
[23:56:18] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add tests for parse (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/648383
[23:57:14] <mutante>	 just parking that there because I want to do another thing first
[23:59:50] <wikibugs>	 (03PS1) 10Dzahn: httpbb: auto-create directories for test suites [puppet] - 10https://gerrit.wikimedia.org/r/648385