[00:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201211T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:03:13] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/647849
[00:05:49] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:11] (PS1) Andrew Bogott: Add cinder logs to central logging [puppet] - https://gerrit.wikimedia.org/r/647850 (https://phabricator.wikimedia.org/T269511)
[00:30:03] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:23] (CR) Ahmon Dancy: [C: -1] New utility macros in templates/_mediawiki-common.tpl (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/647843 (owner: Ahmon Dancy)
[00:37:23] (PS1) BBlack: Temporarily block certain IABot reqs that are broken and spammy [puppet] - https://gerrit.wikimedia.org/r/647854
[00:44:15] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1953602712 and 125 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:29] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7075067448 and 484 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:29] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8448067536 and 570 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:39] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1359659688 and 69 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:05] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 67242472 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:05] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3890210056 and 264 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:05] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2355267096 and 130 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:31] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5111040064 and 365 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:43] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 224000 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:57] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 434576 and 162 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:23] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 10504 and 188 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:11] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 47928 and 235 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:01] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 54800 and 286 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:27] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:03] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 30720 and 348 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:03] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 30720 and 348 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:06:04] Operations, LDAP-Access-Requests: LDAP access to wmf group for Matt Cleinman - https://phabricator.wikimedia.org/T269696 (MattCleinman)
[02:08:39] Operations, LDAP-Access-Requests: LDAP access to wmf group for Matt Cleinman - https://phabricator.wikimedia.org/T269696 (MattCleinman) Updated the ticket with (I believe) all the info needed. (And updated some erroneous documentation.) Thanks for pointing me in the right direction, @Aklapper @JoeWalsh...
[02:32:41] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:22] (CR) Andrew Bogott: [C: +2] Add cinder logs to central logging [puppet] - https://gerrit.wikimedia.org/r/647850 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[04:04:25] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:17] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:23] PROBLEM - SSH on ms-be2032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:12:51] RECOVERY - SSH on ms-be2032 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:43:55] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:30] (CR) DannyS712: [C: -1] Temporarily block certain IABot reqs that are broken and spammy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/647854 (owner: BBlack)
[05:57:43] Operations, InternetArchiveBot, Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (Tgr)
[05:59:21] Operations, InternetArchiveBot, Platform Engineering, Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (Tgr) Tagging Platform Engineering to get feedback about the optimal way of getting the page source.
[06:21:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:21:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:49] PROBLEM - ores on ores2008 is CRITICAL: connect to address 10.192.48.89 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:35:17] RECOVERY - ores on ores2008 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:40:13] mmmm the Telia link between eqiad and codfw had maintenance ongoing since an hour ago, then they sent "work done"
[06:41:02] Laser receiver power : 0.0006 mW / -32.22 dBm
[06:41:10] this is on the eqiad side
[06:42:28] on the codfw one is better (reasonably in range)
[06:42:56] I am going to wait a bit since in theory the maintenance is scheduled up to 8UTC, if still down I'll send an email to Telia
[06:45:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:51] (PS1) Elukey: presto: reduce the max heap size from 110G to 100G [puppet] - https://gerrit.wikimedia.org/r/647999
[07:01:22] (CR) Elukey: [C: +2] presto: reduce the max heap size from 110G to 100G [puppet] - https://gerrit.wikimedia.org/r/647999 (owner: Elukey)
[07:04:04] !log elukey@cumin1001 START - Cookbook sre.presto.roll-restart-workers
[07:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
[07:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:33] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:00] sent an email to Telia
[07:34:37] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:27] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] ACKNOWLEDGEMENT - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] ACKNOWLEDGEMENT - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] ACKNOWLEDGEMENT - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:03] ACKNOWLEDGEMENT - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:04] ACKNOWLEDGEMENT - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel issue with prometheus exporter - https://phabricator.wikimedia.org/T269872 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:52:19] (PS1) Elukey: admin: remove user legoktm from 'researchers' [puppet] - https://gerrit.wikimedia.org/r/648112 (https://phabricator.wikimedia.org/T268801)
[07:52:48] (CR) Elukey: [C: +2] admin: remove user legoktm from 'researchers' [puppet] - https://gerrit.wikimedia.org/r/648112 (https://phabricator.wikimedia.org/T268801) (owner: Elukey)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201211T0800)
[08:07:26] (PS1) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121
[08:13:21] !log restart memcached on mwdebug1002 to pick up the correct port (11210 instead of the default 11211)
[08:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:21] RECOVERY - Memcached on mwdebug1002 is OK: TCP OK - 0.001 second response time on 10.64.0.46 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[08:34:46] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (toan) >>! In T269777#6683465, @KFrancis wrote: >>>! In T269777#6682253, @jbond wrote: >> @KFrancis Are you able to confirm NDA status for Tobias, thanks >...
[08:35:27] (PS2) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[08:47:55] (PS1) Elukey: Set bigtop 1.5 for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/648132 (https://phabricator.wikimedia.org/T269919)
[08:49:34] (CR) Elukey: [C: +2] Set bigtop 1.5 for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/648132 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[08:51:11] (PS2) Gehel: wdqs: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[08:51:56] (CR) Gehel: "@ryan: I took the liberty to get started on the implementation. This isn't tested at all yet!" [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[08:52:43] (CR) jerkins-bot: [V: -1] wdqs: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[09:06:59] (PS3) Gehel: wdqs: fix prom exporter's broken namespace [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[09:16:53] (PS1) Elukey: aptrepo: add a bigtop15 component also for Buster [puppet] - https://gerrit.wikimedia.org/r/648138 (https://phabricator.wikimedia.org/T269919)
[09:20:36] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27088/console" [puppet] - https://gerrit.wikimedia.org/r/648138 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[09:23:33] (CR) Elukey: [V: +1 C: +2] aptrepo: add a bigtop15 component also for Buster [puppet] - https://gerrit.wikimedia.org/r/648138 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[09:26:25] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Gilles) Sure, @WDoranWMF, you can send me a meeting invite for next week. After that I'll be off for 3 wee...
[09:26:34] !log add thirdparty/bigtop15 to buster-wikimedia
[09:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:50] (CR) Volans: "I don't have the hadoop context to judge the procedure, but cookbook wise it looks ok, few minor/optional things inline. It would be nice " (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[09:38:34] (CR) Gehel: [C: -1] "see comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/647774 (https://phabricator.wikimedia.org/T269872) (owner: Ryan Kemper)
[09:39:19] Operations, ops-codfw, serviceops, Patch-For-Review, Sustainability (Incident Followup): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Script wmf-auto-reimage was launched b...
[09:52:26] (PS1) JMeybohm: Add calico releases to admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653)
[09:53:46] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
[09:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:17] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE
[09:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:40] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
[09:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:43] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE
[09:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:07] (CR) Alexandros Kosiaris: [C: +1] Add calico releases to admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[10:01:39] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:53] (PS2) JMeybohm: Add calico releases to admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653)
[10:01:56] Operations, ops-codfw, serviceops, Patch-For-Review, Sustainability (Incident Followup): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['k...
[10:01:59] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:11] (CR) JMeybohm: [C: +2] Add calico releases to admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[10:03:50] (Merged) jenkins-bot: Add calico releases to admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/648143 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[10:25:52] (PS3) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[10:25:56] (CR) Elukey: "Also tried to move the codebase to the class API! :)" (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[10:28:04] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:49] elukey: so now it's impossible to check the diffs from the previous PS :-P
[10:29:58] smart!
[10:30:01] :D
[10:30:16] * volans joking, thanks for using the new APIs
[10:38:59] (PS1) JMeybohm: Move non-common kubernetes staging values to DC specific files [puppet] - https://gerrit.wikimedia.org/r/648166 (https://phabricator.wikimedia.org/T244335)
[10:40:09] (CR) Volans: [C: +1] "Nice! Thanks a lot for using the new API <3. LGTM cookbook-wise as before, see replies inline." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[10:48:34] (PS3) Alexandros Kosiaris: profile::kubernetes::node: Remove old redundant code [puppet] - https://gerrit.wikimedia.org/r/646645
[10:48:36] (PS3) Alexandros Kosiaris: k8s::node: Split staging cluster hieras [puppet] - https://gerrit.wikimedia.org/r/646646
[10:54:08] (PS2) Alexandros Kosiaris: Move non-common kubernetes staging values to DC specific files [puppet] - https://gerrit.wikimedia.org/r/648166 (https://phabricator.wikimedia.org/T244335) (owner: JMeybohm)
[10:54:24] (PS1) Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[10:54:36] volans: I need to stop the cluster first soo --^
[10:54:38] :D
[10:56:21] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27090/console" [puppet] - https://gerrit.wikimedia.org/r/646645 (owner: Alexandros Kosiaris)
[10:57:25] elukey: lol
[10:57:37] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27091/console" [puppet] - https://gerrit.wikimedia.org/r/646646 (owner: Alexandros Kosiaris)
[10:58:52] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27092/console" [puppet] - https://gerrit.wikimedia.org/r/648166 (https://phabricator.wikimedia.org/T244335) (owner: JMeybohm)
[11:01:40] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:01:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:03:16] (CR) Volans: [C: +1] "<3 for the new API! LGTM, couple of nits inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925) (owner: Elukey)
[11:04:29] (PS4) Alexandros Kosiaris: k8s::node: Split staging cluster hieras [puppet] - https://gerrit.wikimedia.org/r/646646
[11:05:29] (Abandoned) Alexandros Kosiaris: Move non-common kubernetes staging values to DC specific files [puppet] - https://gerrit.wikimedia.org/r/648166 (https://phabricator.wikimedia.org/T244335) (owner: JMeybohm)
[11:07:12] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27093/console" [puppet] - https://gerrit.wikimedia.org/r/646646 (owner: Alexandros Kosiaris)
[11:07:42] (PS2) Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[11:07:45] (CR) Elukey: sre.hadoop.stop-cluster.py: move to class API (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925) (owner: Elukey)
[11:08:34] (CR) Alexandros Kosiaris: [V: +1 C: +2] "PCC happy, the squashed commit https://gerrit.wikimedia.org/r/c/operations/puppet/+/648166/2 was also required, so good catch, merging" [puppet] - https://gerrit.wikimedia.org/r/646646 (owner: Alexandros Kosiaris)
[11:08:39] (CR) Alexandros Kosiaris: [V: +1 C: +2] profile::kubernetes::node: Remove old redundant code [puppet] - https://gerrit.wikimedia.org/r/646645 (owner: Alexandros Kosiaris)
[11:10:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:14:52] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:00] (PS4) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[11:18:02] (PS3) Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[11:18:08] (CR) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[11:18:34] (PS1) Alexandros Kosiaris: k8s-staging-codfw: Switch calico_version to String [puppet] - https://gerrit.wikimedia.org/r/648182
[11:19:36] (PS5) Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919)
[11:19:38] (PS4) Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925)
[11:20:02] (CR) Alexandros Kosiaris: [C: +2] k8s-staging-codfw: Switch calico_version to String [puppet] - https://gerrit.wikimedia.org/r/648182 (owner: Alexandros Kosiaris)
[11:21:50] volans: fyi i tested sre.puppet.renew-cert and it worked fine
[11:21:59] jbond42: <3 thanks a lot!
[11:23:41] np
[11:29:14] (PS1) Alexandros Kosiaris: calico: Move calico-cni package inclusion to main class [puppet] - https://gerrit.wikimedia.org/r/648184
[11:30:49] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27098/console" [puppet] - https://gerrit.wikimedia.org/r/648184 (owner: Alexandros Kosiaris)
[11:31:36] (CR) JMeybohm: [C: +1] "LGTM and +1 to merging the classes!" [puppet] - https://gerrit.wikimedia.org/r/648184 (owner: Alexandros Kosiaris)
[11:32:04] (PS4) Jbond: (WIP) spec test fixes [puppet] - https://gerrit.wikimedia.org/r/645187
[11:33:26] (CR) jerkins-bot: [V: -1] (WIP) spec test fixes [puppet] - https://gerrit.wikimedia.org/r/645187 (owner: Jbond)
[11:33:31] (CR) Alexandros Kosiaris: [V: +1 C: +2] calico: Move calico-cni package inclusion to main class [puppet] - https://gerrit.wikimedia.org/r/648184 (owner: Alexandros Kosiaris)
[11:34:48] (CR) David Caro: [C: +2] "Way easier to understand, thanks!" [puppet] - https://gerrit.wikimedia.org/r/647419 (https://phabricator.wikimedia.org/T269620) (owner: Bstorm)
[11:35:30] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27099/console" [puppet] - https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185) (owner: Alexandros Kosiaris)
[11:37:45] (PS2) Alexandros Kosiaris: kubestage2*: Assign role [puppet] - https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185)
[11:40:02] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27100/console" [puppet] - https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185) (owner: Alexandros Kosiaris)
[11:40:23] (CR) Jbond: [C: +2] early_command: configure static, mapped ipv6 address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/645081 (owner: Jbond)
[11:46:34] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:30] (CR) Alexandros Kosiaris: [V: +1 C: +2] "\o/" [puppet] - https://gerrit.wikimedia.org/r/647728 (https://phabricator.wikimedia.org/T252185) (owner: Alexandros Kosiaris)
[11:50:26] (CR) Volans: "replies inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[11:51:35] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925) (owner: Elukey)
[11:57:43] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (Jclark-ctr) @Cmjohnson Those came in with that large shipment of 8 8 6 S01720435 - 8 boxes on 1 pallet - T264584_PO1016 S01719765 - 8 boxes on 1 pallet - T264584_PO1016 S0172051...
[11:57:51] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (Jclark-ctr)
[12:00:40] Operations, ops-eqiad, cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (Jclark-ctr) labstore1006 has firmware 6.6 for smart array controller https://support.hpe.com/hpesc/public/docDisplay?docId=a00037929en_us&docLocale=...
[12:00:41] (CR) Volans: [C: +1] "I didn't test it but the changes looks ok" [puppet] - https://gerrit.wikimedia.org/r/647369 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[12:00:46] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[12:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:42] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE
[12:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:56] (PS1) JMeybohm: Enable k8s-staging prometheus instance in codfw [puppet] - https://gerrit.wikimedia.org/r/648192 (https://phabricator.wikimedia.org/T244335)
[12:03:58] (PS1) JMeybohm: Add k8s-staging prometheus instance datasource [puppet] - https://gerrit.wikimedia.org/r/648193 (https://phabricator.wikimedia.org/T244335)
[12:07:38] (PS1) Alexandros Kosiaris: kubelet: Remove --allow-privileged [puppet] - https://gerrit.wikimedia.org/r/648194
[12:07:40] (PS1) Alexandros Kosiaris: k8s::apiserver: Allow setting allow_privileged [puppet] - https://gerrit.wikimedia.org/r/648195
[12:10:52] Operations, Traffic, Readers-Web-Backlog (Needs Product Owner Decisions): [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (dr0ptp4kt) Open→Resolved I was able to reproduce the new behavior observed by @ckoerner on a number o...
[12:11:49] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27101/console" [puppet] - https://gerrit.wikimedia.org/r/648194 (owner: Alexandros Kosiaris)
[12:15:17] (PS1) Jbond: early_command: busy box doesn't have awk [puppet] - https://gerrit.wikimedia.org/r/648197
[12:15:38] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27102/console" [puppet] - https://gerrit.wikimedia.org/r/648195 (owner: Alexandros Kosiaris)
[12:16:42] (CR) JMeybohm: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27103/console" [puppet] - https://gerrit.wikimedia.org/r/648195 (owner: Alexandros Kosiaris)
[12:17:04] (CR) Alexandros Kosiaris: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27102/ says ok, merging and proceed. Thanks for the +1" [puppet] - https://gerrit.wikimedia.org/r/648195 (owner: Alexandros Kosiaris)
[12:17:11] (CR) Alexandros Kosiaris: [V: +1 C: +2] kubelet: Remove --allow-privileged [puppet] - https://gerrit.wikimedia.org/r/648194 (owner: Alexandros Kosiaris)
[12:18:17] (CR) Jbond: [C: +2] early_command: busy box doesn't have awk [puppet] - https://gerrit.wikimedia.org/r/648197 (owner: Jbond)
[12:20:51] (PS1) Ema: vcl: fix X-Cache-Status on deployment-prep [puppet] - https://gerrit.wikimedia.org/r/648199 (https://phabricator.wikimedia.org/T269825)
[12:39:28] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 485025352 and 48 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:40:07] (PS2) Ema: vcl: fix X-Cache-Status on deployment-prep [puppet] - https://gerrit.wikimedia.org/r/648199 (https://phabricator.wikimedia.org/T269825)
[12:40:52] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 196520 and 94 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:41:00] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 254577048 and 121 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:42:26] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72736 and 188 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:44:37] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) Initial results of the 6.0.0 experiment on cp3054 are encouraging: for the past 12 hours [[ https://grafana.wikimedia.org/d/Lp_BDKJMz/em...
[12:48:00] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:01] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/648206 [12:54:03] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: specify default version numbers in a single place [puppet] - 10https://gerrit.wikimedia.org/r/648207 [12:54:05] (03PS1) 10Arturo Borrero Gonzalez: kubedam: wmcs-k8s-node-upgrade.py: help message refresh [puppet] - 10https://gerrit.wikimedia.org/r/648208 [12:54:08] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: skip node if current version fails [puppet] - 10https://gerrit.wikimedia.org/r/648209 [12:54:10] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: cache status yaml [puppet] - 10https://gerrit.wikimedia.org/r/648210 [12:54:12] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 [12:54:40] (03CR) 10jerkins-bot: [V: 04-1] kubeadm: wmcs-k8s-node-upgrade.py: specify default version numbers in a single place [puppet] - 10https://gerrit.wikimedia.org/r/648207 (owner: 10Arturo Borrero Gonzalez) [12:55:40] (03CR) 10jerkins-bot: [V: 04-1] kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez) [13:21:12] (03PS1) 10Jbond: install_server: try to fix the ip address in late command [puppet] - 10https://gerrit.wikimedia.org/r/648222 [13:22:38] (03CR) 10Jbond: [C: 03+2] install_server: try to fix the ip address in late command [puppet] - 10https://gerrit.wikimedia.org/r/648222 (owner: 10Jbond) [13:24:44] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for guergana.tzatchkova - 
https://phabricator.wikimedia.org/T269930 (10guergana.tzatchkova) [13:25:50] PROBLEM - Disk space on dumpsdata1001 is CRITICAL: DISK CRITICAL - free space: /data 894342 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops [13:27:28] apergos: ^^^ [13:27:38] something known? [13:27:56] just me accumulating too much cruft from test runs [13:27:58] will clean up [13:28:09] ok, there is also a 5% free space on / fwiw [13:30:29] mm that's harder, I'll see if there's anything that can be made to go away [13:35:05] uh [13:35:21] on / there's lots available [13:35:32] maybe you read avail for used when you were looking at that output? [13:35:51] apergos: lol, my bad, eyes crossed columns [13:36:04] no worries! means it's fixed already :-D [13:36:07] :D [13:36:16] thanks for fixing it so quickly and efficiently! :-) [13:36:28] the alert should clear soon, (yw :-P :-D) I got rid of some junk [13:36:45] I'll be able to get rid of the rest once this job of wikidata completes in a few days [13:36:52] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [13:36:53] (03CR) 10Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [13:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:13] np [13:38:27] (03PS1) 10Jbond: install_server: use correct token [puppet] - 10https://gerrit.wikimedia.org/r/648228 [13:38:37] (03PS6) 10Elukey: Add the sre.hadoop.upgrade_bigtop_distro.py cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) [13:38:39] (03PS5) 10Elukey: sre.hadoop.stop-cluster.py: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/648172
(https://phabricator.wikimedia.org/T269925) [13:38:57] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [13:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:16] (03CR) 10Jbond: [C: 03+2] install_server: use correct token [puppet] - 10https://gerrit.wikimedia.org/r/648228 (owner: 10Jbond) [13:45:54] RECOVERY - Disk space on dumpsdata1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops [13:53:28] (03CR) 10Elukey: [C: 03+2] "Going to merge and test, pretty sure I'll have to follow up with some bug :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/648121 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [13:53:35] (03CR) 10Elukey: [C: 03+2] sre.hadoop.stop-cluster.py: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/648172 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [13:57:03] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [13:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:06] wooow [13:58:20] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [13:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [14:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. 
- elukey@cumin1001 [14:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [14:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:25] (03PS1) 10Phuedx: Disable Page Previews IRC alerts [puppet] - 10https://gerrit.wikimedia.org/r/648237 [14:14:01] (03PS1) 10Jbond: late_command: add cidr bits [puppet] - 10https://gerrit.wikimedia.org/r/648238 [14:14:46] (03CR) 10Jbond: [C: 03+2] late_command: add cidr bits [puppet] - 10https://gerrit.wikimedia.org/r/648238 (owner: 10Jbond) [14:15:06] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:26] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [14:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:32] (03PS1) 10Phuedx: Update .mailmap to de-duplicate my email addresses [puppet] - 10https://gerrit.wikimedia.org/r/648239 [14:17:23] (03PS1) 10JMeybohm: Add wmf-node-authorization ClusterRoleBinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/648240 (https://phabricator.wikimedia.org/T244335) [14:18:26] (03CR) 10JMeybohm: [C: 03+2] Add wmf-node-authorization ClusterRoleBinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/648240 (https://phabricator.wikimedia.org/T244335) (owner: 10JMeybohm) [14:18:38] (03PS1) 10Elukey: hadoop: fix typo in package list [cookbooks] - 10https://gerrit.wikimedia.org/r/648241 [14:19:26] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:57] (03Merged) 10jenkins-bot: Add wmf-node-authorization ClusterRoleBinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/648240 (https://phabricator.wikimedia.org/T244335) (owner: 10JMeybohm) [14:20:17] (03CR) 10Elukey: [C: 03+2] hadoop: fix typo in package list [cookbooks] - 10https://gerrit.wikimedia.org/r/648241 (owner: 10Elukey) [14:21:54] in theory I should be able to re-run the cookbook and restart from what I left it, let's see [14:23:04] !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:00] 10Operations, 10InternetArchiveBot, 10Platform Engineering, 10Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10jbond) p:05Triage→03Medium [14:26:28] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [14:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:50] oh noes another typo [14:26:54] * elukey cries in a corner [14:27:37] (03PS1) 10Elukey: hadoop: fix another typo in the package list [cookbooks] - 10https://gerrit.wikimedia.org/r/648244 [14:28:07] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [14:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) [14:29:27] (03CR) 10Elukey: [C: 03+2] hadoop: fix another typo in the package list [cookbooks] - 10https://gerrit.wikimedia.org/r/648244 (owner: 10Elukey) [14:30:11] !log jbond@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:05] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) @WMDE-leszek are you able to approve this access request thanks @Ottomata are you able to approve Guergana access to `analytics-wmde-users` & `analytics-privatedata-u... [14:32:31] (03PS1) 10JMeybohm: helmfile needs: parameter requires a release namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/648245 (https://phabricator.wikimedia.org/T267653) [14:34:21] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10Ottomata) Approved! Guergana should also be in the `nda` LDAP group and be given a Kerberos principal. [14:34:53] (03CR) 10JMeybohm: [C: 03+2] helmfile needs: parameter requires a release namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/648245 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [14:35:53] (03PS1) 10Jbond: admin: add gtzatchkova to analytics-wmde-users & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/648246 (https://phabricator.wikimedia.org/T269930) [14:36:25] (03Merged) 10jenkins-bot: helmfile needs: parameter requires a release namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/648245 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [14:37:23] (03CR) 10Ottomata: [C: 03+1] presto: reduce the max heap size from 110G to 100G [puppet] - 10https://gerrit.wikimedia.org/r/647999 (owner: 10Elukey) [14:38:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) @guergana.tzatchkova I have created the CR just waiting on 
Leszek's approval however i have used the username `gtzatchkova` as this was already... [14:39:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to RESOURCE for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) [14:42:29] (03PS2) 10Hashar: doc: switch to scap DocumentRoot [take 2] [puppet] - 10https://gerrit.wikimedia.org/r/647763 (https://phabricator.wikimedia.org/T149924) [14:42:31] (03PS1) 10Hashar: doc: use an Apache Define for WMF_DOC_PATH [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) [14:42:33] (03PS1) 10Hashar: doc: fix fallback to WMF_DOC_PATH files [puppet] - 10https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) [14:45:36] !log jayme@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:16] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [14:51:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10Aklapper) [14:53:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) p:05Triage→03Medium [14:53:47] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:19] probably me, checking [14:54:33] (03PS1) 10Volans: wmf-auto-reimage: remove hack to parse output [puppet] - 10https://gerrit.wikimedia.org/r/648250 [14:55:12] (03CR) 10Hashar: "PPC shows nothing cause there is just a file change so one just have to look at the Gerrit change to find out what has changed :]" [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [14:55:44] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/648248 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [14:55:49] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:41] (03CR) 10Jbond: [C: 03+1] wmf-auto-reimage: remove hack to parse output [puppet] - 10https://gerrit.wikimedia.org/r/648250 (owner: 10Volans) [14:57:00] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: remove hack to parse output [puppet] - 10https://gerrit.wikimedia.org/r/648250 (owner: 10Volans) [14:59:43] !log jayme@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:47] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/647657 (owner: 10Volans) [15:06:16] !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [15:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:42] 10Operations, 10InternetArchiveBot, 10Platform Engineering, 10Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) Thursday, as in yesterday? I’m not aware of anything that should have been running to create that massive level of requests. 
[15:07:17] 10Operations, 10InternetArchiveBot, 10Platform Engineering, 10Traffic: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) Especially to Wikidata. [15:10:01] !log jayme@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [15:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:57] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [15:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:05] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [15:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] (03CR) 10Volans: [C: 03+1] "I'm lacking part of the context,did a pass and didn't see anything obviously wrong. I didn't test it but if the test are running fine for " (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [15:20:06] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [15:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:19] (03PS8) 10Kormat: integration: Complete framework for running basic tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) [15:24:10] (03CR) 10Volans: [C: 03+1] "Ship it" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [15:24:40] (03CR) 10Kormat: integration: Complete framework for running basic tests (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) 
(owner: 10Kormat) [15:25:18] (03CR) 10Kormat: [C: 03+2] integration: Complete framework for running basic tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [15:27:35] (03Merged) 10jenkins-bot: integration: Complete framework for running basic tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [15:30:33] (03PS1) 10Jbond: install_server: slaac not slacc [puppet] - 10https://gerrit.wikimedia.org/r/648253 [15:31:10] (03CR) 10Jbond: [C: 03+2] install_server: slaac not slacc [puppet] - 10https://gerrit.wikimedia.org/r/648253 (owner: 10Jbond) [15:33:25] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [15:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [15:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10WMDE-leszek) I approve this request. 
[15:36:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10WMDE-leszek) [15:37:15] (03CR) 10Volans: [C: 04-1] "> Patch Set 16:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [15:38:01] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10guergana.tzatchkova) >>! In T269930#6684825, @jbond wrote: > @guergana.tzatchkova I have created the CR just w... [15:44:35] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:11] (03PS1) 10Papaul: DHCP: Add MAC address for ml-serve200[1234] [puppet] - 10https://gerrit.wikimedia.org/r/648258 (https://phabricator.wikimedia.org/T267670) [15:47:43] (03PS1) 10Volans: sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) [15:47:47] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for ml-serve200[1234] [puppet] - 10https://gerrit.wikimedia.org/r/648258 (https://phabricator.wikimedia.org/T267670) (owner: 10Papaul) [15:51:28] (03PS1) 10Elukey: sre.hadoop.upgrade-bigtop-distro.py: format standby only when in rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/648261 (https://phabricator.wikimedia.org/T269919) [15:53:56] (03PS1) 10Papaul: Add ml-serve200[1234] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/648262 (https://phabricator.wikimedia.org/T267670) [15:54:50] (03CR) 10Papaul: [C: 03+2] Add ml-serve200[1234] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/648262 (https://phabricator.wikimedia.org/T267670) (owner: 10Papaul) [15:57:28] (03PS2) 10Elukey: sre.hadoop.upgrade-bigtop-distro.py: stop standby before format [cookbooks] - 10https://gerrit.wikimedia.org/r/648261 (https://phabricator.wikimedia.org/T269919) [15:58:10] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) [15:58:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) [15:58:43] (03CR) 10Jbond: [C: 03+2] admin: add gtzatchkova to analytics-wmde-users & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/648246 (https://phabricator.wikimedia.org/T269930) (owner: 10Jbond) [15:59:22] 10Operations, 10ops-codfw, 
10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) a:05Papaul→03klausman @klausman this is done from my end [16:00:32] (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) (owner: 10Volans) [16:01:10] (03PS2) 10Volans: sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) [16:01:35] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-wmde-users, analytics-privatedata-users for guergana.tzatchkova - https://phabricator.wikimedia.org/T269930 (10jbond) 05Open→03Resolved a:03jbond >>! In T269930#6684981, @guergana.tzatchkova wrote: >>>! In T269930#6... [16:01:53] (03CR) 10Elukey: [C: 03+2] sre.hadoop.upgrade-bigtop-distro.py: stop standby before format [cookbooks] - 10https://gerrit.wikimedia.org/r/648261 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [16:01:57] 10Operations, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10Papaul) @herron any update on this? 
[16:02:49] (03PS3) 10Volans: sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) [16:05:47] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) @klausman can you also please add the new naming to https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [16:06:31] (03CR) 10Bstorm: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/648206 (owner: 10Arturo Borrero Gonzalez) [16:06:53] (03PS1) 10Elukey: Revert "Set bigtop 1.5 for Hadoop test" [puppet] - 10https://gerrit.wikimedia.org/r/648269 [16:07:39] (03CR) 10Elukey: [C: 03+2] Revert "Set bigtop 1.5 for Hadoop test" [puppet] - 10https://gerrit.wikimedia.org/r/648269 (owner: 10Elukey) [16:09:38] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:07] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. 
- elukey@cumin1001 [16:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:46] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) (owner: 10Volans) [16:11:31] (03PS2) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: hoist constants to top [puppet] - 10https://gerrit.wikimedia.org/r/648207 (owner: 10Arturo Borrero Gonzalez) [16:12:21] look at all this hadoop spam in the SAL with proper info, how lovely [16:12:46] (I'll have to do some more rounds of tests so please be patient :D) [16:13:01] I think the message could be shortened [16:13:24] (03Merged) 10jenkins-bot: sre.hosts.decommission: try to avoid Netbox issue [cookbooks] - 10https://gerrit.wikimedia.org/r/648259 (https://phabricator.wikimedia.org/T268605) (owner: 10Volans) [16:13:29] and yes I need to work on the administrative.reason in spicerack to allow for more flexible usages where the user@host is not needed [16:13:45] volans: how dare you saying that the hadoop message is too long, I am offended [16:13:47] (03PS3) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: hoist constants to top [puppet] - 10https://gerrit.wikimedia.org/r/648207 (owner: 10Arturo Borrero Gonzalez) [16:13:48] :D [16:13:48] ahahah [16:14:03] I think that this might be enough: Cookbook sre.hadoop.stop-cluster test: $reason [16:15:02] (03CR) 10Bstorm: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: hoist constants to top [puppet] - 10https://gerrit.wikimedia.org/r/648207 (owner: 10Arturo Borrero Gonzalez) [16:15:32] (03PS2) 10Bstorm: kubedam: wmcs-k8s-node-upgrade.py: help message refresh [puppet] - 10https://gerrit.wikimedia.org/r/648208 (owner: 10Arturo Borrero Gonzalez) [16:15:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. 
- elukey@cumin1001 [16:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:39] (03CR) 10Bstorm: [C: 03+2] kubedam: wmcs-k8s-node-upgrade.py: help message refresh [puppet] - 10https://gerrit.wikimedia.org/r/648208 (owner: 10Arturo Borrero Gonzalez) [16:18:20] (03PS2) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: skip node if current version fails [puppet] - 10https://gerrit.wikimedia.org/r/648209 (owner: 10Arturo Borrero Gonzalez) [16:26:02] (03CR) 10Bstorm: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: skip node if current version fails [puppet] - 10https://gerrit.wikimedia.org/r/648209 (owner: 10Arturo Borrero Gonzalez) [16:28:43] !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [16:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:47] (03PS2) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: cache status yaml [puppet] - 10https://gerrit.wikimedia.org/r/648210 (owner: 10Arturo Borrero Gonzalez) [16:35:44] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:57] ^ T269693 again [16:38:58] T269693: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 [16:40:03] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [16:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:44] sad trombone.wav elukey :) [16:41:01] I knoooww [16:41:13] but it is something that I didn't take into account, good [16:42:04] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:44] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:56] PROBLEM - Check systemd state on an-test-client1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:28] (03CR) 10Volans: "> Patch Set 18: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [17:01:04] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:50] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:18] (03PS1) 10Andrew Bogott: Cinder: Fix syslog filter of health checks [puppet] - 10https://gerrit.wikimedia.org/r/648296 (https://phabricator.wikimedia.org/T269511) [17:15:09] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: Fix syslog filter of health checks [puppet] - 10https://gerrit.wikimedia.org/r/648296 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [17:31:05] (03PS1) 10Andrew Bogott: Glance: increase priority of rsyslog filter [puppet] - 10https://gerrit.wikimedia.org/r/648299 [17:31:48] (03CR) 10Andrew Bogott: [C: 03+2] Glance: increase priority of rsyslog filter [puppet] - 10https://gerrit.wikimedia.org/r/648299 (owner: 10Andrew Bogott) [17:34:02] 10Operations, 10Performance-Team, 10Traffic: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10Gilles) [17:38:47] (03PS1) 10Elukey: sre.hadoop.upgrade-bigtop-distro.py: stop active namenode before rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/648301 [17:40:34] (03CR) 10Elukey: [C: 03+2] sre.hadoop.upgrade-bigtop-distro.py: stop active namenode before rollback [cookbooks] - 10https://gerrit.wikimedia.org/r/648301 (owner: 10Elukey) [17:41:53] !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [17:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:14] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:48:28] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [17:48:29] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:35] (03PS1) 10Andrew Bogott: Cinder policy.yaml: fix typo that broke policy parsing [puppet] - 10https://gerrit.wikimedia.org/r/648302 (https://phabricator.wikimedia.org/T269511) [17:51:11] (03CR) 10Andrew Bogott: [C: 03+2] Cinder policy.yaml: fix typo that broke policy parsing [puppet] - 10https://gerrit.wikimedia.org/r/648302 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [17:52:48] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm [17:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:34] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission [17:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:34] (03PS1) 10Andrew Bogott: Cinder: a few tweaks to quiet log warnings [puppet] - 10https://gerrit.wikimedia.org/r/648303 (https://phabricator.wikimedia.org/T269511) [18:00:26] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: a few tweaks to quiet log warnings [puppet] - 10https://gerrit.wikimedia.org/r/648303 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:03:12] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) Distribution of images by initial: ` root@db1133.eqiad.wmnet... 
[18:05:00] (03PS1) 10Elukey: Revert "Revert "Set bigtop 1.5 for Hadoop test"" [puppet] - 10https://gerrit.wikimedia.org/r/648271 [18:05:26] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:05:49] (03CR) 10Elukey: [C: 03+2] Revert "Revert "Set bigtop 1.5 for Hadoop test"" [puppet] - 10https://gerrit.wikimedia.org/r/648271 (owner: 10Elukey) [18:10:10] (03CR) 10Dzahn: [C: 03+2] doc: use an Apache Define for WMF_DOC_PATH [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [18:12:10] (03PS2) 10Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/647843 [18:12:12] (03PS2) 10Ahmon Dancy: 0.2.0: Use a Job to set up the database [deployment-charts] - 10https://gerrit.wikimedia.org/r/647844 [18:12:14] (03PS1) 10Ahmon Dancy: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304 [18:12:22] (03CR) 10Dzahn: "ran puppet on doc1001, restarted apache2, https://doc.wikimedia.org/mediawiki-core/master/php/ and main page still working fine" [puppet] - 10https://gerrit.wikimedia.org/r/648247 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [18:12:27] hashar: ^ done [18:13:02] !log doc1001 restarted apache2 just in case after DOC_PATH change [18:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:26] (03CR) 10jerkins-bot: [V: 04-1] add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304 (owner: 10Ahmon Dancy) [18:13:58] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. 
- elukey@cumin1001 [18:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:49] mutante: ah yeah [18:15:29] mutante: so I checked again this morning and the setup we pushed yesterday was definitely working on my machine. The issue is that I had the doc published files copied at both places so they were always found :] [18:16:28] hashar: ACK, i saw your comments about that on Gerrit. *nod* [18:16:30] (03CR) 10Dzahn: [C: 03+2] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [18:16:44] hashar: this step was noop, all looks like before [18:16:47] (03PS2) 10Ahmon Dancy: Reorganized setup.sh and added db wait loop [deployment-charts] - 10https://gerrit.wikimedia.org/r/647842 [18:16:49] (03PS3) 10Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/647843 [18:16:51] (03PS3) 10Ahmon Dancy: 0.2.0: Use a Job to set up the database [deployment-charts] - 10https://gerrit.wikimedia.org/r/647844 [18:16:52] I know there will be a follow-up [18:16:53] (03PS2) 10Ahmon Dancy: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304 [18:18:27] mutante: yes that will be https://gerrit.wikimedia.org/r/c/operations/puppet/+/648248/1/modules/profile/files/doc/httpd-doc.wikimedia.org.conf [18:18:29] but you can use that variable now [18:18:59] I am setting up a WMCS instance to validate it and prove it is working [18:19:18] cause there is 50% chance that one will break stuff eventually [18:19:19] hashar: :) ok, cool! [18:19:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. 
- elukey@cumin1001 [18:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:31] !log elukey@cumin1001 START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [18:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:49] yea, we don't want to cache broken redirects and +1 to using cloud [18:20:21] Let me also add a basic test to prod tests, i can do that. [18:26:02] 10Operations, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10herron) I think we can go without it, we plan to replace these older hosts in the near future and also have some logstash refresh hardware that was just ordered. Thanks! [18:26:59] (03PS4) 10Ahmon Dancy: New utility macros in templates/_mediawiki-common.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/647843 [18:27:01] (03PS4) 10Ahmon Dancy: 0.2.0: Use a Job to set up the database [deployment-charts] - 10https://gerrit.wikimedia.org/r/647844 [18:27:03] (03PS3) 10Ahmon Dancy: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304 [18:27:30] (03PS4) 10Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304 [18:29:37] (03PS5) 10Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304 [18:30:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [18:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:30] (03PS9) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [18:39:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is 
CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:42:34] (03PS1) 10Legoktm: admin: Update my (legoktm) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/648321 [18:47:47] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [18:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:00] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:56:26] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:15] ^ caused by my change but nothing serious and looking at the new timer [18:57:33] it's about a former cron deleting old reports [19:02:10] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:12] (03PS1) 10Hashar: devtools: add integration/docroot.git on deploy server [puppet] - 10https://gerrit.wikimedia.org/r/648323 [19:05:03] (03CR) 10Hashar: "That is for doc.devtools.eqiad1.wikimedia.cloud , it is a scap deployment target for integration/docroot and fails puppet with:" [puppet] - 10https://gerrit.wikimedia.org/r/648323 (owner: 10Hashar) [19:12:37] (03PS1) 10Dzahn: puppetmaster: run "remove old reports" job as root [puppet] - 10https://gerrit.wikimedia.org/r/648325 (https://phabricator.wikimedia.org/T265138) [19:15:06] (03CR) 10Dzahn: [C: 03+2] devtools: add integration/docroot.git on deploy server [puppet] - 10https://gerrit.wikimedia.org/r/648323 (owner: 10Hashar) [19:15:46] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/648325" [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:17:24] (03CR) 10Dzahn: [C: 03+2] "/var/lib/puppet/reports# ls -hals" [puppet] - 10https://gerrit.wikimedia.org/r/648325 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:18:11] !log razzi@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [19:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:37] (03PS1) 10Dzahn: puppetmaster: remove code to remove crons, replaced by timer [puppet] - 10https://gerrit.wikimedia.org/r/648327 (https://phabricator.wikimedia.org/T265138) [19:20:46] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:16] (03CR) 10Dzahn: "Dec 11 19:20:48 puppetmaster1001 systemd[1]: remove_old_puppet_reports.service: Succeeded." 
[puppet] - 10https://gerrit.wikimedia.org/r/648325 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:21:27] (03CR) 10Dzahn: "<+icinga-wm> RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wiki" [puppet] - 10https://gerrit.wikimedia.org/r/648325 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:27:37] (03PS1) 10Dzahn: puppetmaster: ensure previously used cron is properly removed [puppet] - 10https://gerrit.wikimedia.org/r/648328 (https://phabricator.wikimedia.org/T265138) [19:27:49] (03CR) 10Ottomata: [C: 03+2] Refine Growth schemas using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/647817 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [19:28:06] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm [19:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:41] (03PS1) 10Ahmon Dancy: prometheus: collect zuul error mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/648329 (https://phabricator.wikimedia.org/T258821) [19:29:47] (03PS2) 10Dzahn: puppetmaster: ensure previously used cron is properly removed [puppet] - 10https://gerrit.wikimedia.org/r/648328 (https://phabricator.wikimedia.org/T265138) [19:30:16] (03CR) 10Dzahn: [C: 03+2] puppetmaster: ensure previously used cron is properly removed [puppet] - 10https://gerrit.wikimedia.org/r/648328 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:33:04] (03CR) 10jerkins-bot: [V: 04-1] [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 (owner: 10Jcrespo) [19:35:19] (03CR) 10Dzahn: "properly removed on puppetmaster1001 now.. 
waiting a bit to let it happen on all puppetmasters including those in cloud" [puppet] - 10https://gerrit.wikimedia.org/r/648328 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:44:24] (03PS6) 10Ahmon Dancy: 0.3.0: add manually recached l10n CDB support [deployment-charts] - 10https://gerrit.wikimedia.org/r/648304 [19:57:39] (03Abandoned) 10Dzahn: httpd: change default server admin from webmaster@ to noc@ [puppet] - 10https://gerrit.wikimedia.org/r/645431 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn) [19:59:14] (03PS1) 10Hashar: devtools: add dsh group for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/648333 [19:59:33] (03CR) 10Dzahn: "I think I already added what I was able to contribute here and will leave this to observability and traffic folks." [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite) [20:01:28] (03PS1) 10Mforns: Do not refine HomepageVisit using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/648334 (https://phabricator.wikimedia.org/T267333) [20:02:09] (03CR) 10Dzahn: "@hashar Can we do this? just adding new keys is only one step that is not removing the old keys." 
[puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [20:02:33] (03CR) 10Ottomata: [C: 03+2] "This is a server side event, which is not ready for migration" [puppet] - 10https://gerrit.wikimedia.org/r/648334 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [20:04:08] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:30] (03PS1) 10Mforns: Remove HomepageVisit from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648336 (https://phabricator.wikimedia.org/T267333) [20:08:56] (03CR) 10Ottomata: [C: 03+2] Remove HomepageVisit from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/648336 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [20:11:05] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Un-migrtate Growth EventLogging schema HomepageVisit back to EventLogging-backend on all wikis (this is a server side event which is not yet ready to migrate) - T267333 (duration: 00m 58s) [20:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:09] T267333: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 [20:15:26] (03CR) 10Hashar: [C: 03+1] "That is to generate the dsh file for the deploy-1002 instance. It is using the cloudinfra puppetmaster hence I can not cherry pick. 
I h" [puppet] - 10https://gerrit.wikimedia.org/r/648333 (owner: 10Hashar) [20:27:28] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [20:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:54] (03PS1) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 [20:29:02] (03PS2) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) [20:29:33] (03CR) 10Dzahn: [C: 03+2] devtools: add dsh group for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/648333 (owner: 10Hashar) [20:30:13] (03CR) 10Dzahn: "@hashar I already tested this on deploy1001 with the same yaml in my home dir and doc1001 passes all the assertions." [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [20:31:13] (03CR) 10Dzahn: "Doing this because we have ongoing changes to the docroot etc on doc1001 which make tests nice to have." [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [20:31:37] mutante: thanks :] [20:31:51] for the Gerrit ssh keys, yeah I guess I should look at the patches eventually [20:32:02] no idea what kind of breakage that will cause everywhere though [20:32:02] hashar: no problem, also see this: [20:32:12] [deploy1001:~] $ httpbb --hosts doc1001.eqiad.wmnet - < test_doc.yaml [20:32:14] Sending to doc1001.eqiad.wmnet... [20:32:17] PASS: 1 request sent to doc1001.eqiad.wmnet. All assertions passed. 
[20:32:27] oh [20:32:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/648339/2/modules/profile/files/httpbb/doc/test_doc.yaml [20:32:48] tests the redirect from "without slash" to "/" etc [20:33:02] except that should be more than just 1 request [20:33:06] I think it means "1 host" [20:33:13] so hmm [20:33:35] potentially we could add that to scap and use it to assert the deployment went well [20:33:48] (03PS1) 10Bstorm: toolforge: make timeouts for our slow etcd clusters configurable [puppet] - 10https://gerrit.wikimedia.org/r/648340 (https://phabricator.wikimedia.org/T267966) [20:33:51] or might even have the test_doc.yaml directly in the integration/docroot.git repo [20:34:07] hashar: yea, i think that is already an idea [20:34:34] also it should have a comment in the yaml file which hosts this is for [20:34:41] beyond being in that subdir [20:34:43] called "doc" [20:34:50] but it implies "any doc* host" [20:35:16] or have profile::doc to ship the test file to httpbb [20:35:20] I talked about it with Reuven recently [20:35:41] hmm.. that's an interesting approach as well, yes [20:36:21] or directly inside integration/docroot ;] [20:36:44] then it is a bit of a mix cause that is affected by either an Apache config change or the web app being changed [20:36:48] so I guess in puppet it is fine [20:37:28] in an ideal world, we would boot a docker container, provision it with puppet and run httpbb against the resulting web server [20:37:36] (03CR) 10Dzahn: "@RLazarus When I run this and see "1 request" shouldn't I expect at least one request per assertion in my yaml? Does it actually mean "som" [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [20:39:37] hashar: I would manually run this when there are Apache config changes, so if there could be "check experimental" and "Hosts: " and jenkins bot saying -1 automatically if the existing tests fail after a change.. 
but it's probably not easy to detect which changes affect apache config. [20:40:42] and I can run it by myself from cumin or deploy server.. and not a huge difference that is worth spending a lot of effort [20:40:55] I think it's more important for now we have moar test files :) [20:42:55] oh.. duh.. the issue is how I am using it with: [20:43:14] - < test_doc.yaml this is only doing the first test [20:43:30] need to just use full path to it [20:44:37] (03PS1) 10Razzi: kafka: Add kafka-test1008 - 1010 [puppet] - 10https://gerrit.wikimedia.org/r/648342 (https://phabricator.wikimedia.org/T268202) [20:45:57] (03CR) 10Hashar: "Thanks for starting this! There is a potential typo and I have a question about giving names to assert." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [20:46:44] mutante: or maybe the entries are overridden [20:46:50] foo: [bar] [20:46:54] foo: [baz] [20:46:58] foo: [yo] [20:46:58] (03CR) 10Dzahn: [C: 04-1] "@RLazarus Scratch my question, I am using it wrong "- < file" will only read the first test. I need to simply use full path to the yaml fi" [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [20:47:07] might just yield the last one (yo) [20:47:27] I don't know how yaml manages key overlaps, it probably overwrites [20:55:59] (03PS3) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) [20:56:16] hashar: ok, this version works now [20:56:22] PASS: 9 requests sent to doc1001.eqiad.wmnet. All assertions passed. [20:58:17] answer is to not repeat the key at all.. 
this was just from an example of a host that has multiple sites [20:59:09] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1001/27109/tools-k8s-etcd-4.tools.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/648340 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [21:00:16] (03CR) 10Bstorm: [C: 03+2] toolforge: make timeouts for our slow etcd clusters configurable [puppet] - 10https://gerrit.wikimedia.org/r/648340 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [21:00:16] mutante: yeah I guess right. It picked either the first or the last entry :] [21:01:00] mutante: thanks for working on that bit! I was too lazy to find the doc and set it up locally [21:02:05] (03PS4) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) [21:02:28] hashar: no problem, I was doing that for some other hosts like miscweb before.. and PS4 should be correct [21:02:33] going to cook now [21:03:46] will find out the answer to your question about 'labels' [21:05:06] hashar: no support for named assertions at this point but it's an interesting idea [21:05:19] it does support comments in the YAML though, that's what I would do :) [21:07:15] and, correct, don't reuse hosts, just combine them -- I might try to go back and add merging logic or at least print a warning, but IIRC that all takes place in library code so it's nontrivial [21:08:19] (03PS1) 10Ladsgroup: kafkatee: Migrate hiera() to lookup() and set data type [puppet] - 10https://gerrit.wikimedia.org/r/648348 (https://phabricator.wikimedia.org/T209953) [21:08:28] cc mutante fyi ^ [21:08:38] rzl: awesome thank you ! 
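Note on the duplicate-key question discussed above: a minimal sketch, not httpbb's actual code, of why repeating a top-level key can lose entries. It uses Python's standard json module as a stand-in for a YAML loader (an assumption made here to stay dependency-free; YAML loaders differ, and some reject duplicate keys outright). The helper name merge_pairs is hypothetical and only illustrates the "merging logic or at least print a warning" idea mentioned in the chat.

```python
import json

# Same shape as the foo/bar/baz/yo example from the chat, written as JSON.
DOC = '{"foo": ["bar"], "foo": ["baz"], "foo": ["yo"]}'


def merge_pairs(pairs):
    """Hypothetical hook: concatenate list values of repeated keys and warn."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            print(f"warning: duplicate key {key!r}, merging entries")
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged


# Default behaviour of json.loads: a later duplicate silently replaces
# the earlier value for the same key.
last_wins = json.loads(DOC)

# With the hook, json.loads hands over every (key, value) pair, so nothing
# is dropped before we get a chance to combine them.
combined = json.loads(DOC, object_pairs_hook=merge_pairs)

print(last_wins)
print(combined)
```

Combining all entries under a single key in the file, as suggested in the chat, avoids the problem without needing any hook.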
[21:10:44] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27110/" [puppet] - 10https://gerrit.wikimedia.org/r/648348 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [21:12:55] (03CR) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [21:14:11] (03CR) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [21:18:04] rzl: thanks :) ack [21:20:16] (03PS1) 10Ladsgroup: mjolnir: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953) [21:21:29] (03CR) 10RLazarus: "Don't forget to add an httpbb::test_suite at modules/profile/manifests/httpbb.pp so this gets picked up. Thanks for adding this!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [21:21:47] (03CR) 10jerkins-bot: [V: 04-1] mjolnir: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [21:23:39] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [21:24:43] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/648351 [21:25:25] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/647849 (owner: 10PipelineBot) [21:25:38] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/647831 (owner: 10PipelineBot) [21:26:36] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27111/" [puppet] - 
10https://gerrit.wikimedia.org/r/648349 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [21:26:58] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/648351 (owner: 10PipelineBot) [21:28:15] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/648351 (owner: 10PipelineBot) [21:33:18] !log dduvall@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [21:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:45] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [21:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:29] 10Operations, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 4 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Urbanecm) [21:36:40] 10Operations, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 4 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Urbanecm) Wrong tag, but keeping it, as it's related too. [21:37:35] (03PS1) 10Bstorm: etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) [21:38:45] (03CR) 10Bstorm: "The defaults should make this a noop for existing servers, but it will allow us to use horizon to configure our VMs according to which clu" [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [21:38:48] !log dduvall@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[21:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:06] (03CR) 10jerkins-bot: [V: 04-1] etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [21:41:11] !log dduvall@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [21:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:38] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [21:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:18] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [21:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:42] 10Operations, 10Continuous-Integration-Infrastructure, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164 (10hashar) `DirectorySlash` redirecting to http instead of canonical https is #upstream Apach... [21:45:59] (03PS5) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) [21:46:09] !log Running schema changes on wikitech database for T269348 [21:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:12] T269348: wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 [21:53:57] (03PS1) 10Dduvall: Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/648275 [21:54:04] hashar: re: Gerrit keys, I think we already went through the concerns about a specific thing that might break but you also tested it and it did not actually happen [21:54:48] but yea.. 
never say never [21:56:05] :] [21:56:50] (03CR) 10Dduvall: [C: 03+2] Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/648275 (owner: 10Dduvall) [21:57:06] mutante: well I am off, thanks for the patches :] Will look at finishing the install of the doc website on labs next week [21:57:10] !log add docker-ce_18.06.3~ce~3-0~debian_amd64.deb to apt.wikimedia.org stretch-wikimedia/thirdparty/k8s [21:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:14] hashar: see you! enjoy the weekend [21:58:16] (03Merged) 10jenkins-bot: Revert "blubberoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/648275 (owner: 10Dduvall) [21:58:28] sanding and painting! :] [21:58:44] :p new car [21:59:01] na doors inside the house [21:59:12] :) gotcha [21:59:17] !log dduvall@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [21:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:15] !log dduvall@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [22:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:29] (03PS1) 10Alexandros Kosiaris: k8s-staging-codfw: Use docker-ce 18.06.3~ce~3-0~debian [puppet] - 10https://gerrit.wikimedia.org/r/648356 [22:04:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:00] !log dduvall@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[22:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:31] (03PS6) 10Dzahn: httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) [22:11:25] (03PS2) 10Bstorm: etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) [22:12:30] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27112/console" [puppet] - 10https://gerrit.wikimedia.org/r/648356 (owner: 10Alexandros Kosiaris) [22:13:10] (03CR) 10jerkins-bot: [V: 04-1] etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [22:13:27] (03CR) 10Bstorm: "The CI failures appear to be from lack of a sane default to "$srv_dns = undef," rather than what I'm doing here. It seems to want a value " [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [22:14:36] (03PS1) 10Dzahn: puppetmaster: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/648358 (https://phabricator.wikimedia.org/T266479) [22:15:25] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27113/console" [puppet] - 10https://gerrit.wikimedia.org/r/648356 (owner: 10Alexandros Kosiaris) [22:15:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:44] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27113/kubestage2002.codfw.wmnet/index.html is as expected, so +2ing" [puppet] - 10https://gerrit.wikimedia.org/r/648356 (owner: 10Alexandros Kosiaris) [22:17:20] 
(03PS1) 10Dzahn: icinga: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/648359 (https://phabricator.wikimedia.org/T266479) [22:39:03] (03CR) 10Bstorm: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: cache status yaml [puppet] - 10https://gerrit.wikimedia.org/r/648210 (owner: 10Arturo Borrero Gonzalez) [22:40:40] (03PS2) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez) [22:41:03] (03CR) 10jerkins-bot: [V: 04-1] kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same package checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez) [22:41:11] (03PS3) 10Bstorm: kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez) [22:41:39] (03CR) 10jerkins-bot: [V: 04-1] kubeadm: wmcs-k8s-node-upgrade.py: collapse ssh calls for same checks [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez) [22:41:41] (03CR) 10RLazarus: [C: 03+1] "LGTM! Feel free to add comments as discussed, but I'm happy with this merging whenever." 
[puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [22:49:39] (03CR) 10Bstorm: "I'd expected the fail on CI was just the commit message (so I updated it), but it fails on:" [puppet] - 10https://gerrit.wikimedia.org/r/648211 (owner: 10Arturo Borrero Gonzalez) [22:52:03] (03PS1) 10Alexandros Kosiaris: Also ship the following plugins which are included in the release [debs/calico] - 10https://gerrit.wikimedia.org/r/648363 [23:01:19] (03CR) 10Dzahn: [C: 03+2] httpbb/doc: add tests for doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/648339 (https://phabricator.wikimedia.org/T149924) (owner: 10Dzahn) [23:07:42] 10Operations, 10vm-requests: eqiad: 1 VM request for doc - https://phabricator.wikimedia.org/T269977 (10Dzahn) [23:07:54] 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn) [23:08:18] RECOVERY - MegaRAID on es1023 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:08:30] 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn) [23:08:32] 10Operations, 10Release-Engineering-Team, 10serviceops: replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (10Dzahn) [23:08:54] 10Operations, 10vm-requests: eqiad: 1 VM request for doc (doc1002) - https://phabricator.wikimedia.org/T269977 (10Dzahn) [23:10:49] 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (10Dzahn) Now doc1002 should be created (T269977) to buster (T247653). And we should also make doc2001 in codfw. 
[23:21:35] 10Operations, 10vm-requests: codfw: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T269978 (10Dzahn) [23:21:49] 10Operations, 10vm-requests: codfw: 1 VM request for doc.wikimedia.org (doc2001) - https://phabricator.wikimedia.org/T269978 (10Dzahn) [23:22:06] 10Operations, 10vm-requests: codfw: 1 VM request for doc.wikimedia.org (doc2001) - https://phabricator.wikimedia.org/T269978 (10Dzahn) [23:22:08] 10Operations, 10Release-Engineering-Team, 10serviceops: replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (10Dzahn) [23:33:55] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) I ran all httpbb appserver tests on mwdebug1003: ` [deploy1001:~] $ for testfile in /srv/deploymen... [23:38:25] (03PS1) 10Alexandros Kosiaris: k8s-staging-codfw: Specify admission_controllers hiera [puppet] - 10https://gerrit.wikimedia.org/r/648377 [23:39:49] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27114/console" [puppet] - 10https://gerrit.wikimedia.org/r/648377 (owner: 10Alexandros Kosiaris) [23:41:06] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "PCC happy at https://puppet-compiler.wmflabs.org/compiler1003/27114/kubestagemaster2001.codfw.wmnet/index.html, merging" [puppet] - 10https://gerrit.wikimedia.org/r/648377 (owner: 10Alexandros Kosiaris) [23:42:31] (03PS1) 10Dzahn: httpbb: add missing directory for doc tests [puppet] - 10https://gerrit.wikimedia.org/r/648380 [23:43:47] (03CR) 10Dzahn: [C: 03+2] httpbb: add missing directory for doc tests [puppet] - 10https://gerrit.wikimedia.org/r/648380 (owner: 10Dzahn) [23:45:18] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01061 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [23:45:48] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:43] (03CR) 10Dzahn: "[deploy1001:~] $ httpbb --hosts doc1001.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/648380 (owner: 10Dzahn) [23:48:12] mutante: ah, sorry for not catching that [23:48:28] (03CR) 10RLazarus: [C: 03+1] httpbb: add missing directory for doc tests [puppet] - 10https://gerrit.wikimedia.org/r/648380 (owner: 10Dzahn) [23:48:38] as ever, I keep meaning to rethink how that file works :) [23:48:43] rzl: no problem at all. I did not even notice the puppet issue [23:49:04] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:17] regarding the "widespread puppet failures" it is just always living on the edge slightly under the threshold what becomes widespread, heh [23:49:53] rzl: I kind of wanted to add the host names in comments :) [23:50:36] when making the next directory I will also follow the "cluster name" (host name without numbers), so NOT ./parsoid/ but ./parse/ [23:51:14] hmm, yeah! [23:51:28] we'd have to change "appserver" to "mw" if we're keeping that, but I kind of like it [23:51:39] I wonder if there are any examples where it's not one-to-one, in either direction [23:52:15] files under /srv/deployment make me think "that's deployed by scap" and not by puppet, btw. but .. also not like it really matters a lot [23:53:07] rzl: right now it is for parsoid, parse vs wtp .. 
but unique and temp situation [23:53:11] hmm, I guess the appservers are already mw# and mwdebug# but that's not the worst [23:53:27] yeah, I wasn't counting that just because we're getting rid of it [23:56:18] (03PS1) 10Dzahn: httpbb: add tests for parse (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/648383 [23:57:14] just parking that there because I want to do another thing first [23:59:50] (03PS1) 10Dzahn: httpbb: auto-create directories for test suites [puppet] - 10https://gerrit.wikimedia.org/r/648385