[00:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200219T0000). [00:00:04] tgr, kart_, and Volker_E: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:37] o/ [00:00:55] \o [00:01:01] RECOVERY - mediawiki-installation DSH group on mw1349 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:01:08] I'm here for SWAT too :) [00:02:10] thanks for the backport James_F! [00:02:36] can I add one for wmf.20 too? the SWAT window is already full... [00:02:36] tgr: Happy to help. Given that we got the wmf.18 train de-deployed, wanted to clean up any efforts in that direction. :-) [00:02:57] i dont expect it but if you see any errors about mw1349 during scap... let me know [00:02:58] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/572401 (owner: Gergő Tisza) [00:02:59] tgr: Niharika's call. Did the patch not make it into wmf.20? Oops, sorry, my screw-up. [00:03:11] it has just been added to mw-installation "dsh" group [00:03:21] tgr: I have 30 minutes to swat so I'll let you know if I have time. [00:03:33] nvm, it was merged before the branch cut [00:03:38] (Merged) jenkins-bot: Make the logstash and authmanager-statsd Monolog handlers compatible [mediawiki-config] - https://gerrit.wikimedia.org/r/572401 (owner: Gergő Tisza) [00:03:48] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (Papaul) [00:03:49] Ah, good.
[00:04:52] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/572502 (owner: Gergő Tisza) [00:05:09] Niharika: the three changes can go together [00:05:10] (CR) jerkins-bot: [V: -1] Update authmanager-statsd channel names [mediawiki-config] - https://gerrit.wikimedia.org/r/572502 (owner: Gergő Tisza) [00:06:24] and don't need testing (I'll see on the dashboard if they work, but that takes a while) [00:06:52] (PS3) Gergő Tisza: Update authmanager-statsd channel names [mediawiki-config] - https://gerrit.wikimedia.org/r/572502 [00:07:26] (PS1) BryanDavis: toolforge: increase nginx-ingress proxy_read_timeout to match dynamicproxy [puppet] - https://gerrit.wikimedia.org/r/573016 (https://phabricator.wikimedia.org/T245426) [00:07:34] tgr: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/572401/ is on mwdebug1002. [00:08:28] the wiki still works so I'll call that a success [00:08:41] !log creating mcrouter certs for mw1350 [00:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:52] the other one probably needs to be re+2-d [00:10:56] !log niharika29@deploy1001 Synchronized wmf-config/logging.php: Make the logstash and authmanager-statsd Monolog handlers compatible (duration: 01m 04s) [00:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:19] Niharika: You'll want to run `echo 'https://en.wikipedia.org/static/images/project-logos/trwiki.png' | mwscript purgeList.php` or similar from mwmaint1002.
[00:12:33] (CR) Niharika29: [C: +2] Update authmanager-statsd channel names [mediawiki-config] - https://gerrit.wikimedia.org/r/572502 (owner: Gergő Tisza) [00:13:29] (Merged) jenkins-bot: Update authmanager-statsd channel names [mediawiki-config] - https://gerrit.wikimedia.org/r/572502 (owner: Gergő Tisza) [00:15:13] (PS1) Dzahn: assign mw1350 through mw1355 as MediaWiki appservers [puppet] - https://gerrit.wikimedia.org/r/573019 (https://phabricator.wikimedia.org/T236437) [00:17:32] (PS1) Papaul: DHCP: Change MAC address of elastic20[5-6],elastic2059 and elastic2060 from 1G MAC to 10G MAC [puppet] - https://gerrit.wikimedia.org/r/573020 (https://phabricator.wikimedia.org/T241337) [00:18:45] (CR) jerkins-bot: [V: -1] DHCP: Change MAC address of elastic20[5-6],elastic2059 and elastic2060 from 1G MAC to 10G MAC [puppet] - https://gerrit.wikimedia.org/r/573020 (https://phabricator.wikimedia.org/T241337) (owner: Papaul) [00:19:57] (CR) Ppchelko: Migrate changeprop & cpjobqueue to kubernetes (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust) [00:20:23] (PS2) Dzahn: DHCP: Change MAC address of some elastic servers to 10G interface [puppet] - https://gerrit.wikimedia.org/r/573020 (https://phabricator.wikimedia.org/T241337) (owner: Papaul) [00:21:31] (PS2) Niharika29: Adjust MT Threshold for Assamese to 70% [mediawiki-config] - https://gerrit.wikimedia.org/r/572871 (https://phabricator.wikimedia.org/T245509) (owner: KartikMistry) [00:21:32] (CR) Niharika29: [C: +2] Adjust MT Threshold for Assamese to 70% [mediawiki-config] - https://gerrit.wikimedia.org/r/572871 (https://phabricator.wikimedia.org/T245509) (owner: KartikMistry) [00:21:43] !log niharika29@deploy1001 Synchronized wmf-config/logging.php: Update authmanager-statsd channel name (duration: 01m 03s) [00:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:57]
(CR) Dzahn: [C: +1] DHCP: Change MAC address of some elastic servers to 10G interface [puppet] - https://gerrit.wikimedia.org/r/573020 (https://phabricator.wikimedia.org/T241337) (owner: Papaul) [00:22:11] James_F: tgr: Is https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/573013/ good to sync? [00:22:55] Niharika: yup, thanks! [00:24:52] !log niharika29@deploy1001 Synchronized php-1.35.0-wmf.19/extensions/WikimediaEvents/: Follow up on authevents statsd changes in I7612b68fe (duration: 01m 03s) [00:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:21] (CR) Dzahn: "need to also adjust the icinga check that is still using http:// but besides that should be good" [puppet] - https://gerrit.wikimedia.org/r/572353 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [00:25:44] kart_: Your patch is on mwdebug1001. Please check. [00:25:52] (CR) Dzahn: [C: +2] DHCP: Change MAC address of some elastic servers to 10G interface [puppet] - https://gerrit.wikimedia.org/r/573020 (https://phabricator.wikimedia.org/T241337) (owner: Papaul) [00:26:06] Niharika: OK [00:26:08] (PS2) Niharika29: Remove unnecessary id from wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/571836 (owner: VolkerE) [00:26:21] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/571836 (owner: VolkerE) [00:27:16] (Merged) jenkins-bot: Remove unnecessary id from wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/571836 (owner: VolkerE) [00:30:12] Niharika: all good. Go ahead. [00:30:29] Operations, ops-codfw, Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` elastic2055.cod...
[00:30:50] (PS1) Dzahn: releases: change Icinga monitoring from HTTP to HTTPS [puppet] - https://gerrit.wikimedia.org/r/573023 [00:31:56] Operations, ops-codfw, Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` elastic2056.cod... [00:31:59] (PS2) BryanDavis: toolforge: increase nginx-ingress proxy_read_timeout to match dynamicproxy [puppet] - https://gerrit.wikimedia.org/r/573016 (https://phabricator.wikimedia.org/T245426) [00:32:32] (PS1) Dzahn: noc: change Icinga monitoring from HTTP to HTTPS [puppet] - https://gerrit.wikimedia.org/r/573024 [00:34:08] (PS1) Dzahn: peopleweb: change Icinga monitoring from HTTP to HTTPS [puppet] - https://gerrit.wikimedia.org/r/573025 [00:34:48] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Adjust MT Threshold for Assamese to 70% - T245509 (duration: 01m 04s) [00:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:52] T245509: Adjust the threshold for Assamese to prevent publishing when overall unmodified content is higher than 70% - https://phabricator.wikimedia.org/T245509 [00:34:53] (CR) BryanDavis: "Applied on the live cluster via `kubectl edit configmap nginx-configuration -n ingress-nginx` and verified in generated config using `kube" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/573016 (https://phabricator.wikimedia.org/T245426) (owner: BryanDavis) [00:35:22] (CR) Dzahn: [C: +2] assign mw1350 through mw1355 as MediaWiki appservers [puppet] - https://gerrit.wikimedia.org/r/573019 (https://phabricator.wikimedia.org/T236437) (owner: Dzahn) [00:35:35] Niharika: Thanks!!
[00:36:50] (PS7) Jforrester: Fix latin Wikipedia (VICIPÆDIA) wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) (owner: VolkerE) [00:36:59] !log niharika29@deploy1001 Synchronized static/images/mobile/copyright/: Remove unnecessary id from wordmark (duration: 01m 03s) [00:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:20] (PS8) Jforrester: Fix Latin Wikipedia (VICIPÆDIA) wordmark and set size correctly [mediawiki-config] - https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) (owner: VolkerE) [00:37:27] (CR) Jforrester: [C: +2] Fix Latin Wikipedia (VICIPÆDIA) wordmark and set size correctly [mediawiki-config] - https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) (owner: VolkerE) [00:37:59] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:44] (Merged) jenkins-bot: Fix Latin Wikipedia (VICIPÆDIA) wordmark and set size correctly [mediawiki-config] - https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) (owner: VolkerE) [00:40:18] !log mw1351 through mw1355 - initial puppet runs - new appservers [00:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:36] !log jforrester@deploy1001 Synchronized static/images/mobile/copyright/: T240728 Sync logo images (duration: 01m 04s) [00:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:40] T240728: [Bug] Latin Wikipedia has a stretched logo (was `Create and use the Latin wikipedia (VICIPÆDIA) wordmark on mobile site`) - https://phabricator.wikimedia.org/T240728 [00:43:28] !log Manually purged https://en.wikipedia.org/images/mobile/copyright/wikipedia-wordmark-la.svg and .png from Varnish for T240728 [00:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:07] (CR) Dzahn: [C: +2] releases: change Icinga monitoring from HTTP to HTTPS [puppet] - https://gerrit.wikimedia.org/r/573023 (owner: Dzahn) [00:45:31] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:25] (PS1) Dzahn: microsites: change Icinga monitoring from HTTP to HTTPS [puppet] - https://gerrit.wikimedia.org/r/573032 [00:48:25] (PS1) Dzahn: tendril: change Icinga monitoring from HTTP to HTTPS [puppet] - https://gerrit.wikimedia.org/r/573034 [00:48:46] PROBLEM - Check systemd state on mw1351 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:01] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:55] ACKNOWLEDGEMENT - Check systemd state on mw1351 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:55] ACKNOWLEDGEMENT - DPKG on mw1351 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [00:50:14] alerts for mw hosts starting with "mw13" can be ignored right now [00:50:26] new installs that get the new checks [00:50:37] and then will recover.. not pooled yet [00:51:42] PROBLEM - nutcracker process on mw1352 is CRITICAL: NRPE: Command check_nutcracker not defined https://wikitech.wikimedia.org/wiki/Nutcracker [00:51:42] PROBLEM - mcrouter process on mw1355 is CRITICAL: NRPE: Command check_mcrouter not defined https://wikitech.wikimedia.org/wiki/Mcrouter [00:53:45] 10Operations, 10ops-codfw, 10Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2055.codfw.wmnet'] ` and were **ALL** successful. 
[00:54:06] PROBLEM - mediawiki-installation DSH group on mw1355 is CRITICAL: Host mw1355 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:54:08] (03PS1) 10Dzahn: acme_chief: add apt[12]001 to authorized hosts for apt cert [puppet] - 10https://gerrit.wikimedia.org/r/573036 [00:54:08] PROBLEM - mcrouter process on mw1350 is CRITICAL: NRPE: Command check_mcrouter not defined https://wikitech.wikimedia.org/wiki/Mcrouter [00:54:08] PROBLEM - nutcracker socket on mw1352 is CRITICAL: NRPE: Command check_nutcracker_socket not defined https://wikitech.wikimedia.org/wiki/Nutcracker [00:54:08] PROBLEM - Check systemd state on mw1355 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:14] RECOVERY - nutcracker process on mw1352 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [00:55:14] RECOVERY - mcrouter process on mw1355 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [00:55:19] ^ me [00:55:20] RECOVERY - nutcracker socket on mw1352 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_eqiad.sock https://wikitech.wikimedia.org/wiki/Nutcracker [00:55:20] RECOVERY - mcrouter process on mw1350 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [00:56:32] PROBLEM - mediawiki-installation DSH group on mw1350 is CRITICAL: Host mw1350 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:56:32] PROBLEM - PHP opcache health on mw1351 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:56:38] RECOVERY - Check 
systemd state on mw1355 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:56:48] RECOVERY - Check systemd state on mw1351 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:48] RECOVERY - PHP opcache health on mw1351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:01:24] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:52] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T240728 Fix Latin Wikipedia (VICIPÆDIA) wordmark and set size correctly (duration: 01m 06s) [01:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:56] T240728: [Bug] Latin Wikipedia has a stretched logo (was `Create and use the Latin wikipedia (VICIPÆDIA) wordmark on mobile site`) - https://phabricator.wikimedia.org/T240728 [01:03:45] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:48] PROBLEM - Check systemd state on mw1353 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:12] PROBLEM - mediawiki-installation DSH group on mw1351 is CRITICAL: Host mw1351 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:08:30] 10Operations, 10ops-codfw, 10Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10Papaul) [01:08:33] 10Operations, 10ops-codfw, 10Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2056.codfw.wmnet'] ` and were **ALL** successful. [01:08:47] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1350 is CRITICAL: Host mw1350 is not in mediawiki-installation dsh group daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:08:47] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1351 is CRITICAL: Host mw1351 is not in mediawiki-installation dsh group daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:08:47] ACKNOWLEDGEMENT - Check systemd state on mw1353 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:47] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on mw1353 is CRITICAL: NRPE: Command check_mw_wikiversion_difference not defined daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Application_servers [01:08:47] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1355 is CRITICAL: Host mw1355 is not in mediawiki-installation dsh group daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:09:50] 10Operations, 10ops-codfw, 10Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` elastic2059.cod... [01:14:10] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1352.eqiad.wmnet [01:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:16] PROBLEM - mediawiki-installation DSH group on mw1354 is CRITICAL: Host mw1354 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:14:16] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1351.eqiad.wmnet [01:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:22] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1353.eqiad.wmnet [01:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:28] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1350.eqiad.wmnet [01:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:35] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1354.eqiad.wmnet [01:14:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:41] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1355.eqiad.wmnet [01:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:56] PROBLEM - mediawiki-installation DSH group on mw1352 is CRITICAL: Host mw1352 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:15:27] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1352.eqiad.wmnet [01:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:03] (PS1) CRusnov: reports/management.py: Fix for 2.7 [software/netbox-extras] - https://gerrit.wikimedia.org/r/573045 [01:16:10] (CR) CRusnov: [V: +2 C: +2] "This change is ready for review." [software/netbox-extras] - https://gerrit.wikimedia.org/r/573045 (owner: CRusnov) [01:16:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1351.eqiad.wmnet [01:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:36] (CR) CRusnov: [V: +2 C: +2] "Self merging, fixes issue with upgrade." [software/netbox-extras] - https://gerrit.wikimedia.org/r/573045 (owner: CRusnov) [01:16:53] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1350.eqiad.wmnet [01:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1355.eqiad.wmnet [01:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:15] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` elastic2060.codfw.wmnet ` The log can...
[01:19:58] RECOVERY - mediawiki-installation DSH group on mw1352 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:20:28] PROBLEM - Nginx local proxy to apache on mw1353 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:21:40] RECOVERY - Nginx local proxy to apache on mw1353 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.500 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:22:01] !log mw1353 - restarted apache (some race condition on new installs, 5 other servers did not have the issue) [01:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:06] RECOVERY - Check systemd state on mw1353 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:23:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1354.eqiad.wmnet [01:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:48] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:54] PROBLEM - mediawiki-installation DSH group on mw1353 is CRITICAL: Host mw1353 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:27:01] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1353.eqiad.wmnet [01:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:34] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:38] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Dzahn) The first 7: mw1349 through mw1355 have been added as regular appservers and are pooled now. But just with weight 10. We will change weights and add more (API) appse... [01:31:57] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [01:32:16] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:19] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [01:32:57] RECOVERY - mediawiki-installation DSH group on mw1350 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:32:59] RECOVERY - mediawiki-installation DSH group on mw1351 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:33:17] RECOVERY - mediawiki-installation DSH group on mw1355 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:33:19] RECOVERY - mediawiki-installation DSH group on mw1353 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:33:41] RECOVERY - mediawiki-installation DSH group on mw1354 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:33:46] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2059.codfw.wmnet'] ` and were **ALL** successful. 
[01:34:31] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:12] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2060.codfw.wmnet'] ` and were **ALL** successful. [01:40:10] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (Papaul) [01:40:44] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (Papaul) a:Papaul→Gehel @Gehel All yours let me know if you have any questions. [01:51:40] (CR) Ppchelko: Migrate changeprop & cpjobqueue to kubernetes (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust) [01:51:46] (CR) Ppchelko: [C: -1] Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust) [01:57:00] (CR) Ppchelko: [C: -1] "This is what the config is being rendered to for me https://gist.github.com/Pchelolo/d58fd72938c4e031f329dd4a635d6167" [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust) [02:12:02] Operations, ops-eqiad, DC-Ops: (Need By 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet.
- https://phabricator.wikimedia.org/T245567 (wiki_willy) [02:13:41] (CR) Ppchelko: [C: -1] "After the few modifications that I've mentioned in the review, I've got it to fail not from some config compilation issues, but from time " (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust) [02:30:01] (PS1) Papaul: DNS: Remove mgmt DNS for Bellatrix [dns] - https://gerrit.wikimedia.org/r/573059 [02:31:28] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for Bellatrix [dns] - https://gerrit.wikimedia.org/r/573059 (owner: Papaul) [02:32:28] Operations, DC-Ops, decommission, Patch-For-Review: decommission bellatrix.frack.codfw.wmnet - https://phabricator.wikimedia.org/T244743 (Papaul) [02:32:38] Operations, DC-Ops, decommission, Patch-For-Review: decommission bellatrix.frack.codfw.wmnet - https://phabricator.wikimedia.org/T244743 (Papaul) Open→Resolved complete [02:42:01] Operations, ops-codfw, netops: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces] - interface-range vlan-cloud-support1-b-codfw { - member ge-8/0/9; - mtu 9192; - unit 0 { - fami... [02:45:43] Operations, ops-codfw, netops: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (Papaul) @ayounsi since i deleted the interface range do you want me to delete also the VLAN cloud-support1-b-codfw [02:54:44] PROBLEM - dump of s3 in eqiad on db1115 is CRITICAL: dump for s3 at eqiad taken more than 8 days ago: Most recent backup 2020-02-11 02:23:28 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [04:46:38] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:03:09] ACKNOWLEDGEMENT - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Cas Rusnov These issues are being debugged. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:43] (PS3) Dave Pifke: Scrape webperf Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/572141 (https://phabricator.wikimedia.org/T175087) [05:34:29] (PS2) Dave Pifke: Add Swift user for ArcLamp [puppet] - https://gerrit.wikimedia.org/r/572129 (https://phabricator.wikimedia.org/T244776) [05:34:46] (CR) jerkins-bot: [V: -1] Scrape webperf Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/572141 (https://phabricator.wikimedia.org/T175087) (owner: Dave Pifke) [05:37:36] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:39] (PS4) Dave Pifke: Scrape webperf Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/572141 (https://phabricator.wikimedia.org/T175087) [06:08:53] (PS1) Marostegui: install_server: Pass bootif installer to new ES hosts [puppet] - https://gerrit.wikimedia.org/r/573069 (https://phabricator.wikimedia.org/T242481) [06:17:35] !log Compress new and empty watchlist_expiry table - T245358 [06:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:42] T245358: Compress table watchlist_expiry - https://phabricator.wikimedia.org/T245358 [06:35:11] !log Compress watchlist_expiry table on s3 (this will take hours as I have left a 60 seconds sleep between tables) - T245358 [06:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:15] T245358: Compress table watchlist_expiry - https://phabricator.wikimedia.org/T245358 [06:57:27] !log marostegui@cumin1001 dbctl commit (dc=all):
'Increase API weight for db1107 50 -> 100 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10454 and previous config saved to /var/cache/conftool/dbconfig/20200219-065726-marostegui.json [06:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:31] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [07:02:57] !log Remove wikiadmin2 user from es2 - T243512 [07:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:01] T243512: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 [07:08:22] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 41 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:15:18] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 38 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:16:03] 10Operations, 10Analytics, 10serviceops, 10vm-requests, 10Patch-For-Review: Create a ganeti VM in eqiad: an-tool1008 - https://phabricator.wikimedia.org/T244717 (10elukey) 05Open→03Stalled Setting this to stalled since I'd need to figure out exactly how much disk space this host needs. 
[07:26:28] (03CR) 10ArielGlenn: "recheck" [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: 10ArielGlenn) [07:31:58] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [07:40:10] (03Abandoned) 10Vgutierrez: Release 8.0.6-rc0-1wm1 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/571865 (owner: 10Vgutierrez) [07:51:32] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [07:52:08] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [07:57:26] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:02:36] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 32 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:05:43] (03PS1) 10Vgutierrez: Release 8.0.5-1wm16 [debs/trafficserver] (8.0.6) - 
10https://gerrit.wikimedia.org/r/573220 (https://phabricator.wikimedia.org/T244464) [08:05:47] (03PS1) 10Vgutierrez: Release 8.0.6-rc1-1wm1 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/573221 [08:05:54] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm16 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/573220 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [08:06:00] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-rc1-1wm1 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/573221 (owner: 10Vgutierrez) [08:06:04] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Release 8.0.5-1wm16 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/573220 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [08:06:06] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572141 (https://phabricator.wikimedia.org/T175087) (owner: 10Dave Pifke) [08:06:57] (03PS2) 10Vgutierrez: Release 8.0.6-rc1-1wm1 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/573221 [08:07:07] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-rc1-1wm1 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/573221 (owner: 10Vgutierrez) [08:08:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I'm not 100% sure whether we need to re-create the bootif-stretch tftpboot environment after the recent Debian 9.12 point rele" [puppet] - 10https://gerrit.wikimedia.org/r/573069 (https://phabricator.wikimedia.org/T242481) (owner: 10Marostegui) [08:09:46] (03CR) 10Filippo Giunchedi: [C: 03+2] Add Swift user for ArcLamp [puppet] - 10https://gerrit.wikimedia.org/r/572129 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [08:11:49] 10Operations, 10ORES, 10Scoring-platform-team, 10vm-requests: New node request: oresrdb[12]003 - https://phabricator.wikimedia.org/T210582 (10akosiaris) 05Stalled→03Declined It's been a year already with no move on this one. 
I'll mark as declined (that and parent task). We can always revisit and reopen. [08:14:09] !log roll restart swift proxies - T244776 [08:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:14] T244776: Swift container for performance flame graphs (ArcLamp) - https://phabricator.wikimedia.org/T244776 [08:21:05] (03CR) 10Marostegui: [V: 03+2 C: 03+2] "> Looks good. I'm not 100% sure whether we need to re-create the" [puppet] - 10https://gerrit.wikimedia.org/r/573069 (https://phabricator.wikimedia.org/T242481) (owner: 10Marostegui) [08:23:23] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1020.eqiad.wmnet']... [08:23:37] !log run mwscript deleteEqualMessages.php cswiki --delete [08:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:21] (03PS1) 10Fdans: Analytics refine: blacklist MobileWebMainMenuClickTracking [puppet] - 10https://gerrit.wikimedia.org/r/573224 [08:27:07] (03PS1) 10Elukey: Unify stat1004's and stat1005's roles into one [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) [08:29:39] (03CR) 10Elukey: [C: 03+2] Analytics refine: blacklist MobileWebMainMenuClickTracking [puppet] - 10https://gerrit.wikimedia.org/r/573224 (owner: 10Fdans) [08:32:02] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [08:32:38] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [08:32:57] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10Gehel) [08:35:31] (03CR) 10Muehlenhoff: Unify stat1004's and stat1005's roles into one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [08:35:59] (03PS10) 10Muehlenhoff: Add script to track OS migrations status [puppet] - 10https://gerrit.wikimedia.org/r/572251 [08:39:39] ACKNOWLEDGEMENT - dump of s3 in eqiad on db1115 is CRITICAL: dump for s3 at eqiad taken more than 8 days ago: Most recent backup 2020-02-11 02:23:28 Jcrespo s3 being repaired https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:41:41] !log Remove wikiadmin2 user from s7 - T243512 [08:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:45] T243512: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 [08:43:38] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:43:42] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [08:45:14] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1311 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:45:14] PROBLEM - Apache HTTP on mw1313 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:45:14] (03CR) 10Muehlenhoff: [C: 03+2] Add script to track OS migrations status [puppet] - 10https://gerrit.wikimedia.org/r/572251 (owner: 10Muehlenhoff) [08:50:48] !log Remove dbproxy1007 grants from m2 - T231280 [08:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:52] T231280: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 [08:53:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [08:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [08:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:10] PROBLEM - MariaDB Slave SQL: x1 on db1140 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:57:18] PROBLEM - MariaDB Slave SQL: s2 on db1140 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:57:25] ^ jynus you? [08:57:39] or downtime expired maybe? 
[08:58:08] it is me, but it is a bug on the package [08:58:12] PROBLEM - MariaDB Slave IO: x1 on db1140 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:58:20] On the package? [08:58:27] (03PS1) 10Marostegui: dbproxy1007: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/573227 (https://phabricator.wikimedia.org/T245385) [08:58:41] if you start a multi-instance on the new version, it wipes the socket dir [08:58:50] PROBLEM - MariaDB Slave IO: s2 on db1140 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:58:54] PROBLEM - MariaDB read only x1 on db1140 is CRITICAL: Could not connect to localhost:3320 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:58:54] PROBLEM - MariaDB read only s2 on db1140 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:59:22] or some combination of stop/start [08:59:40] Going to downtime the host, so it doesn't mess up with icinga [09:00:07] so you mean if you start a new instance it wipes the socket dir? 
[09:00:11] (or stop) [09:00:17] some combination of it [09:00:30] wow, not fun [09:00:33] that is because it was built with the buster options [09:00:45] Ah, I see [09:01:28] (03CR) 10jerkins-bot: [V: 04-1] dbproxy1007: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/573227 (https://phabricator.wikimedia.org/T245385) (owner: 10Marostegui) [09:02:47] (03PS1) 10Marostegui: wmnet: Remove production DNS entries for dbproxy1007 [dns] - 10https://gerrit.wikimedia.org/r/573228 (https://phabricator.wikimedia.org/T245385) [09:02:52] (03PS2) 10Marostegui: dbproxy1007: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/573227 (https://phabricator.wikimedia.org/T245385) [09:06:58] (03CR) 10Marostegui: [C: 03+2] dbproxy1007: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/573227 (https://phabricator.wikimedia.org/T245385) (owner: 10Marostegui) [09:07:26] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS entries for dbproxy1007 [dns] - 10https://gerrit.wikimedia.org/r/573228 (https://phabricator.wikimedia.org/T245385) (owner: 10Marostegui) [09:08:20] RECOVERY - MariaDB Slave IO: x1 on db1140 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:08:48] 10Operations, 10ops-eqiad, 10decommission: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10Marostegui) a:05Marostegui→03Jclark-ctr [09:08:53] 10Operations, 10ops-eqiad, 10decommission: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10Marostegui) [09:08:56] RECOVERY - MariaDB Slave IO: s2 on db1140 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:09:02] RECOVERY - MariaDB read only x1 on db1140 is OK: Version 10.1.43-MariaDB, Uptime 106s, read_only: True, 78.68 QPS, connection latency: 0.001607s, query latency: 0.000326s
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [09:09:02] RECOVERY - MariaDB read only s2 on db1140 is OK: Version 10.1.43-MariaDB, Uptime 94s, read_only: True, 271.86 QPS, connection latency: 0.002760s, query latency: 0.000386s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [09:09:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10Marostegui) Host ready for on-site steps [09:09:20] RECOVERY - MariaDB Slave SQL: x1 on db1140 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:09:28] RECOVERY - MariaDB Slave SQL: s2 on db1140 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:09:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10Marostegui) [09:09:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10Marostegui) [09:21:58] (03CR) 10Elukey: Unify stat1004's and stat1005's roles into one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [09:22:00] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [09:22:32] (03PS2) 10Elukey: Unify stat1004's and stat1005's roles into one [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) [09:23:52] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1020.eqiad.wmnet'] ` Of which those **FAILED**: 
` ['es1020.eqiad.wmnet'... [09:23:55] 10Operations, 10Mathoid, 10Wikimedia-Logstash, 10observability: Move mathoid to the logging pipeline - https://phabricator.wikimedia.org/T245516 (10akosiaris) p:05Triage→03Medium a:03akosiaris mathoid is already in the logging pipeline, e.g: https://logstash.wikimedia.org/goto/82706aa1d6f8767984775ab... [09:25:05] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [09:26:27] 10Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10Marostegui) I have checked the 10.4.12 mariadb client (and previous 10.4.11) on buster (on db1107) for the last few weeks without encountering any issues. [09:27:08] (03PS1) 10Muehlenhoff: Add system::role for role::logging::webrequest::ops [puppet] - 10https://gerrit.wikimedia.org/r/573230 [09:27:36] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) >>! In T241359#5896178, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['es1020.eqiad.wmnet'] > `...
[09:30:08] (03PS1) 10Alexandros Kosiaris: mathoid: Remove logstash logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/573231 (https://phabricator.wikimedia.org/T245516) [09:30:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/572995 (https://phabricator.wikimedia.org/T241096) (owner: 10Ottomata) [09:31:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: Remove logstash logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/573231 (https://phabricator.wikimedia.org/T245516) (owner: 10Alexandros Kosiaris) [09:33:05] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' . [09:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:08] (03PS1) 10Volans: puppetdb report: include offline VMs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/573232 [09:34:43] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' . [09:34:43] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' . 
[09:34:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) [09:34:45] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 74381 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:48] <_joe_> !log cleared opcache on mw1313 [09:34:50] <_joe_> sigh [09:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:53] <_joe_> effie: ^^ [09:35:01] <_joe_> this fixed the situation on that server [09:35:01] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10elukey) [09:35:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) 05Open→03Resolved [09:35:08] * _joe_ cries in a corner [09:35:09] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:35:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] puppetdb report: include offline VMs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/573232 (owner: 10Volans) [09:35:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoy: split base profile out of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/572831 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:36:20] (03CR) 10Volans: [C: 03+2] puppetdb report: include offline VMs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/573232 (owner: 10Volans) 
[09:40:17] (03CR) 10Gehel: [C: 03+1] "Oh yes! Slightly better hack than what we had previously! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/572684 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [09:41:45] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:43:05] !log test trafficserver 8.0.6-rc1 in cp40[26,32] [09:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:16] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'mathoid' for release 'production' . [09:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:29] PROBLEM - DPKG on cp4032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:48:53] hmm I've triggered that [09:49:05] (03CR) 10Jbond: [C: 04-1] airflow: Expand sudo rights to analytics-search user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/572997 (owner: 10EBernhardson) [09:49:25] !log T245516. 
Deploy mathoid chart version 0.0.27, removing logstash gelf configuration [09:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:29] T245516: Move mathoid to the logging pipeline - https://phabricator.wikimedia.org/T245516 [09:51:10] !log Depool db2089:3315, db2089:3316 for new package testing [09:51:11] RECOVERY - DPKG on cp4032 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:19] lovely [09:51:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2089:3315, db2089:3316 for new package testing', diff saved to https://phabricator.wikimedia.org/P10455 and previous config saved to /var/cache/conftool/dbconfig/20200219-095139-marostegui.json [09:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:12] (03PS5) 10Giuseppe Lavagetto: profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) [09:52:14] (03PS6) 10Giuseppe Lavagetto: mwdebug: enable envoy-based services proxy [puppet] - 10https://gerrit.wikimedia.org/r/572833 [09:52:57] 10Operations, 10Mathoid, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Move mathoid to the logging pipeline - https://phabricator.wikimedia.org/T245516 (10akosiaris) 05Open→03Resolved mathoid chart version 0.0.27, with the logstash gelf output disabled, has been deployed on all 3 cluster... 
[09:52:59] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10akosiaris) [09:53:12] (03CR) 10Jbond: [C: 03+2] query_service::common: ensure we dont run exec on every run [puppet] - 10https://gerrit.wikimedia.org/r/572684 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [09:53:20] !log stopping and upgrading db1140 instances [09:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:49] (03PS12) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [09:55:18] (03CR) 10Holger Knust: "Changes for PS9" (0347 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [09:57:18] (03CR) 10Holger Knust: "Last message should have read: changes for PS 9 and PS 10" [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [09:59:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add system::role for role::kubernetes::worker and role::kubernetes::master [puppet] - 10https://gerrit.wikimedia.org/r/572907 (owner: 10Muehlenhoff) [10:04:34] (03PS1) 10Ladsgroup: Add 1000 more items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573236 (https://phabricator.wikimedia.org/T225057) [10:08:33] (03PS3) 10Elukey: Unify stat1004's and stat1005's roles into one [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) [10:08:35] (03PS1) 10Elukey: profile::hive::site_hdfs: fix exec using bash -c to execute commands [puppet] - 10https://gerrit.wikimedia.org/r/573237 (https://phabricator.wikimedia.org/T240880) [10:09:57] !log updated tftpboot environment for stretch-bootif for the 9.12 point release T241359 [10:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:02] T241359: (Needed by 31st January) 
eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 [10:11:18] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/20880/stat1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/573237 (https://phabricator.wikimedia.org/T240880) (owner: 10Elukey) [10:11:21] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1020.eqiad.wmnet']... [10:11:34] !log jiji@cumin1001 conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=apache2,name=mw1349.eqiad.wmnet [10:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:45] !log jiji@cumin1001 conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1349.eqiad.wmnet [10:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:27] !log jiji@cumin1001 conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw135[0-5]*.eqiad.wmnet [10:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:33] 10Operations, 10Service-Architecture: Many objects in conftool have pooled=yes, weight=0 - https://phabricator.wikimedia.org/T245594 (10Joe) [10:12:51] 10Operations, 10Service-Architecture: Many objects in conftool have pooled=yes, weight=0 - https://phabricator.wikimedia.org/T245594 (10Joe) p:05Triage→03High a:03Joe [10:12:54] !log jiji@cumin1001 conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=apache2,name=mw135[0-5]*.eqiad.wmnet [10:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:01] !log stopping db2089 mariadb@s5 [10:17:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:49] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) Weights of mw1349-mw1355 were switched to 30 [10:18:43] (03PS4) 10Elukey: Unify stat1004's and stat1005's roles into one [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) [10:18:47] (03PS1) 10Elukey: role::statistics::explorer: remove config from hiera [puppet] - 10https://gerrit.wikimedia.org/r/573239 (https://phabricator.wikimedia.org/T243934) [10:22:35] 10Operations, 10Service-Architecture: Many objects in conftool have pooled=yes, weight=0 - https://phabricator.wikimedia.org/T245594 (10elukey) One question - some aqs nodes are listed in P10456, but I don't see them reported in https://config-master.wikimedia.org/pybal/eqiad/aqs with weight=0. I usually doubl... [10:24:04] 10Operations, 10RESTBase, 10Wikimedia-Logstash, 10observability: Move restrouter to the logging pipeline - https://phabricator.wikimedia.org/T245515 (10akosiaris) 05Open→03Declined Restrouter will get undeployed. 
See T242461 [10:24:06] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10akosiaris) [10:24:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [10:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:46] (03CR) 10Elukey: [C: 03+2] role::statistics::explorer: remove config from hiera [puppet] - 10https://gerrit.wikimedia.org/r/573239 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [10:26:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:28] (03PS1) 10Alexandros Kosiaris: cxserver: Remove logstash logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/573240 (https://phabricator.wikimedia.org/T219921) [10:27:42] (03CR) 10Elukey: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/20881/" [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [10:33:34] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1020.eqiad.wmnet'] ` and were **ALL** successful. 
[10:33:36] Quickly going to backport two metrics changes [10:38:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2089:3315, db2089:3316 after new package testing', diff saved to https://phabricator.wikimedia.org/P10457 and previous config saved to /var/cache/conftool/dbconfig/20200219-103806-marostegui.json [10:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) [10:41:36] (03PS1) 10Jbond: ores::base: myspell-nl is provided by hunspell-nl in buster [puppet] - 10https://gerrit.wikimedia.org/r/573243 (https://phabricator.wikimedia.org/T242910) [10:43:13] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) es1020 installed correctly: RAID10, 256k strip size, BBU and Cache policy right disk space, memory and cpus [10:44:46] (03PS2) 10Jcrespo: Revert "backups: Disable s3-eqiad backups until source host is restored" [puppet] - 10https://gerrit.wikimedia.org/r/572901 [10:45:52] !log upgrading mariadb client on cumin hosts [10:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1021.eqiad.wmnet',... 
[10:50:14] (03CR) 10Jcrespo: [C: 03+2] Revert "backups: Disable s3-eqiad backups until source host is restored" [puppet] - 10https://gerrit.wikimedia.org/r/572901 (owner: 10Jcrespo) [10:54:34] (03PS1) 10Muehlenhoff: Remove system::role from role::noc::site [puppet] - 10https://gerrit.wikimedia.org/r/573246 [10:56:02] (03PS1) 10Muehlenhoff: Remove system::role from role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/573247 [10:57:18] (03CR) 10jerkins-bot: [V: 04-1] Remove system::role from role::noc::site [puppet] - 10https://gerrit.wikimedia.org/r/573246 (owner: 10Muehlenhoff) [10:58:20] (03PS1) 10Alexandros Kosiaris: restrouter: undeploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/573248 (https://phabricator.wikimedia.org/T242461) [10:58:22] (03PS1) 10Alexandros Kosiaris: restrouter: Fully remove the helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/573249 (https://phabricator.wikimedia.org/T242461) [10:58:24] (03CR) 10jerkins-bot: [V: 04-1] Remove system::role from role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/573247 (owner: 10Muehlenhoff) [10:58:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [10:58:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [10:58:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [10:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:45] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:58:45] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:03] (03CR) 10Muehlenhoff: [C: 
03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/573243 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [11:01:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:17] (03PS1) 10Alexandros Kosiaris: admin: Remove calico restrouter rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/573250 (https://phabricator.wikimedia.org/T242461) [11:02:04] (03PS1) 10Alexandros Kosiaris: restrouter: Remove restrouter LVS icinga config [puppet] - 10https://gerrit.wikimedia.org/r/573253 (https://phabricator.wikimedia.org/T242461) [11:02:06] (03PS1) 10Alexandros Kosiaris: restrouter: Remove LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/573254 (https://phabricator.wikimedia.org/T242461) [11:02:09] (03PS1) 10Alexandros Kosiaris: restrouter: Remove from conftool [puppet] - 10https://gerrit.wikimedia.org/r/573255 (https://phabricator.wikimedia.org/T242461) [11:02:11] (03PS1) 10Alexandros Kosiaris: restrouter: Remove LVS IP from kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/573256 (https://phabricator.wikimedia.org/T242461) [11:02:13] (03PS1) 10Alexandros Kosiaris: restrouter: Remove k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/573257 (https://phabricator.wikimedia.org/T242461) [11:06:04] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.19/extensions/Wikibase/lib/includes/Store: Get rid of useless metrics in EntityTermLookupBase (T245592) (duration: 01m 12s) [11:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:10] T245592: EntityTermLookupBase imcrements wb_terms related metrics, but is now also used by the new term storage and should use different metric names. 
- https://phabricator.wikimedia.org/T245592 [11:07:08] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1022.eqiad.wmnet', 'es1023.eqiad.wmnet', 'es1021.eqiad.wmnet'] ` and we... [11:07:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] icinga: remove wmflabs.org HTTPS cert check [puppet] - 10https://gerrit.wikimedia.org/r/572665 (https://phabricator.wikimedia.org/T235252) (owner: 10Arturo Borrero Gonzalez) [11:08:12] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib/includes/Store: Get rid of useless metrics in EntityTermLookupBase (T245592) (duration: 01m 04s) [11:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:03] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) [11:11:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: increase nginx-ingress proxy_read_timeout to match dynamicproxy [puppet] - 10https://gerrit.wikimedia.org/r/573016 (https://phabricator.wikimedia.org/T245426) (owner: 10BryanDavis) [11:12:59] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) >>! In T241359#5896474, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['es1022.eqiad.wmnet', 'es1... [11:14:21] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) [11:14:41] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10RhinosF1) >>! 
In T245524#5894463, @Lookd_Up wrote: > Hi @Aklapper: Thanks for your quick reply! And apologies for not making this request using my WMF account. > > Would it b... [11:15:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Remove restrouter LVS icinga config [puppet] - 10https://gerrit.wikimedia.org/r/573253 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [11:17:20] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: eqiad1: use openstack service names instead of server names [puppet] - 10https://gerrit.wikimedia.org/r/573258 [11:22:27] (03CR) 10Jbond: [C: 03+2] ores::base: myspell-nl is provided by hunspell-nl in buster [puppet] - 10https://gerrit.wikimedia.org/r/573243 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [11:22:39] (03CR) 10Arturo Borrero Gonzalez: "PCC https://puppet-compiler.wmflabs.org/compiler1003/20883/" [puppet] - 10https://gerrit.wikimedia.org/r/573258 (owner: 10Arturo Borrero Gonzalez) [11:26:15] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) [11:28:36] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1024.eqiad.wmnet', 'es1025.eqiad.wmnet']... [11:29:44] (03PS2) 10Muehlenhoff: Remove system::role from role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/573247 [11:34:58] (03PS1) 10Jbond: profile::prometheus::ops: ensure rsync service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/573261 (https://phabricator.wikimedia.org/T242910) [11:36:25] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) >>! 
In T240684#5893489, @elukey wrote: > Very nice summary, thanks! > > A couple of questions: > >> FailoverWithExptimeRo... [11:36:28] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/573261 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [11:37:33] (03PS3) 10Ema: fifo-log-tailer: do not convert stdout to io.Writer [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/572905 [11:38:19] 10Operations, 10Wikifeeds, 10Wikimedia-Logstash, 10observability: Move wikifeeds to the logging pipeline - https://phabricator.wikimedia.org/T245604 (10fgiunchedi) [11:39:41] (03PS1) 10Volans: netbox: remove temporary config post-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/573262 (https://phabricator.wikimedia.org/T244291) [11:41:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [11:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of comments and questions. Overall this has progressed a lot." 
(035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [11:42:52] (03CR) 10Volans: [C: 03+2] netbox: remove temporary config post-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/573262 (https://phabricator.wikimedia.org/T244291) (owner: 10Volans) [11:43:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:57] (03PS1) 10Volans: netbox: better splay scripts in the hour [puppet] - 10https://gerrit.wikimedia.org/r/573263 (https://phabricator.wikimedia.org/T244291) [11:53:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloud: eqiad1: use openstack service names instead of server names [puppet] - 10https://gerrit.wikimedia.org/r/573258 (owner: 10Arturo Borrero Gonzalez) [11:54:22] (03CR) 10Volans: [C: 03+2] netbox: better splay scripts in the hour [puppet] - 10https://gerrit.wikimedia.org/r/573263 (https://phabricator.wikimedia.org/T244291) (owner: 10Volans) [11:54:47] (03PS2) 10Alexandros Kosiaris: restrouter: Remove LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/573254 (https://phabricator.wikimedia.org/T242461) [11:54:49] (03PS2) 10Alexandros Kosiaris: restrouter: Remove LVS IP from kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/573256 (https://phabricator.wikimedia.org/T242461) [11:54:51] (03PS2) 10Alexandros Kosiaris: restrouter: Remove from conftool [puppet] - 10https://gerrit.wikimedia.org/r/573255 (https://phabricator.wikimedia.org/T242461) [11:54:53] (03PS2) 10Alexandros Kosiaris: restrouter: Remove k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/573257 (https://phabricator.wikimedia.org/T242461) [11:56:41] !log better splay of periodic scripts that interact with Netbox - T244291 [11:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:46] T244291: Upgrade Netbox to 2.7 series - 
https://phabricator.wikimedia.org/T244291 [11:56:46] 10Operations, 10Analytics: missing pig package on an-tool1006.eqiad.wmnet & analytics1030.eqiad.wmnet - https://phabricator.wikimedia.org/T245605 (10jbond) [11:56:53] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [11:57:01] 10Operations, 10Analytics: missing pig package on an-tool1006.eqiad.wmnet & analytics1030.eqiad.wmnet - https://phabricator.wikimedia.org/T245605 (10jbond) [11:57:03] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Add check for changes applied at all runs - https://phabricator.wikimedia.org/T242910 (10jbond) [11:57:16] 10Operations, 10Analytics: missing pig package on an-tool1006.eqiad.wmnet & analytics1030.eqiad.wmnet - https://phabricator.wikimedia.org/T245605 (10jbond) p:05Triage→03Medium [11:57:52] (03PS1) 10Ema: cache: add role::cache::common [puppet] - 10https://gerrit.wikimedia.org/r/573264 [11:58:42] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200219T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:00:23] o/ [12:00:32] no gerrit patches AFAICS eitehr [12:00:35] *either [12:00:46] I have a patch to deploy but I wait for a bit [12:00:46] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) To proceed with testing, we will puppetise the following configuration, and roll it to a couple of canary servers, and bloc... 
[12:01:12] (03CR) 10jerkins-bot: [V: 04-1] cache: add role::cache::common [puppet] - 10https://gerrit.wikimedia.org/r/573264 (owner: 10Ema) [12:03:21] (03PS2) 10Ema: cache: add role::cache::common [puppet] - 10https://gerrit.wikimedia.org/r/573264 [12:05:55] (03PS1) 10Jbond: swift::swiftrepl: force directory removal if resource absent [puppet] - 10https://gerrit.wikimedia.org/r/573265 (https://phabricator.wikimedia.org/T242910) [12:06:13] (03CR) 10jerkins-bot: [V: 04-1] cache: add role::cache::common [puppet] - 10https://gerrit.wikimedia.org/r/573264 (owner: 10Ema) [12:07:54] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove system::role from role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/573247 (owner: 10Muehlenhoff) [12:08:23] (03PS3) 10Ema: cache: add role::cache::common [puppet] - 10https://gerrit.wikimedia.org/r/573264 [12:08:30] (03CR) 10Filippo Giunchedi: [C: 03+1] swift::swiftrepl: force directory removal if resource absent [puppet] - 10https://gerrit.wikimedia.org/r/573265 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [12:09:01] (03PS3) 10Arturo Borrero Gonzalez: cloud: refresh names for DNS servers in eqiad1/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/572213 (https://phabricator.wikimedia.org/T243766) [12:09:30] (03CR) 10Muehlenhoff: [C: 03+2] Remove system::role from role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/573247 (owner: 10Muehlenhoff) [12:11:39] (03Abandoned) 10Ema: cache: add role::cache::common [puppet] - 10https://gerrit.wikimedia.org/r/573264 (owner: 10Ema) [12:12:12] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573261 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [12:13:27] (03PS2) 10Muehlenhoff: Remove system::role from role::noc::site [puppet] - 10https://gerrit.wikimedia.org/r/573246 [12:15:39] (03PS2) 10Ladsgroup: Add 1000 more items to read from the new store in clients [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/573236 (https://phabricator.wikimedia.org/T225057) [12:15:47] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573236 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:15:56] (03CR) 10Vgutierrez: [C: 03+1] fifo-log-tailer: do not convert stdout to io.Writer [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/572905 (owner: 10Ema) [12:16:10] (03PS1) 10Ema: cache: enable cgroup accounting [puppet] - 10https://gerrit.wikimedia.org/r/573267 (https://phabricator.wikimedia.org/T183146) [12:16:25] (03CR) 10Ema: [C: 03+2] fifo-log-tailer: do not convert stdout to io.Writer [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/572905 (owner: 10Ema) [12:16:59] (03Merged) 10jenkins-bot: Add 1000 more items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573236 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:18:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 39 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:19:39] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 48 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:20:46] (03CR) 10Ema: "pcc looks sunny: https://puppet-compiler.wmflabs.org/compiler1001/20888/" [puppet] - 10https://gerrit.wikimedia.org/r/573267 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [12:21:20] (03PS1) 10Jbond: librenms: librenms and puppet managing files with different permissions [puppet] - 10https://gerrit.wikimedia.org/r/573268 (https://phabricator.wikimedia.org/T242910) 
[12:21:55] (03CR) 10Jbond: [C: 03+2] swift::swiftrepl: force directory removal if resource absent [puppet] - 10https://gerrit.wikimedia.org/r/573265 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [12:22:09] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:573236|Start reading for the new term store for clients up to Q2000 (T225057)]] (duration: 01m 06s) [12:22:11] (03CR) 10Vgutierrez: [C: 03+1] cache: enable cgroup accounting [puppet] - 10https://gerrit.wikimedia.org/r/573267 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [12:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:14] (03CR) 10Ema: [C: 03+2] prometheus: add cadvisor_exporter module and profile [puppet] - 10https://gerrit.wikimedia.org/r/572682 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [12:22:14] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [12:22:31] (03CR) 10Ema: [C: 03+2] cache: enable cgroup accounting [puppet] - 10https://gerrit.wikimedia.org/r/573267 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [12:23:39] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 33 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:25:09] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:573236|Start reading for the new term store for clients up to Q2000 (T225057)]], take II, the cache issue (duration: 01m 04s) [12:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:20] (03PS2) 10Jbond: profile::prometheus::ops: ensure rsync service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/573261 (https://phabricator.wikimedia.org/T242910) [12:26:31] (03CR) 10Jbond: 
profile::prometheus::ops: ensure rsync service is stopped (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573261 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [12:26:56] (03PS2) 10Ema: cache: add cadvisor exporter [puppet] - 10https://gerrit.wikimedia.org/r/572693 (https://phabricator.wikimedia.org/T183146) [12:28:54] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1024.eqiad.wmnet'] ` Of which those **FAILED**: ` ['es1024.eqiad.wmnet'] ` [12:29:41] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1002/20889/" [puppet] - 10https://gerrit.wikimedia.org/r/572693 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [12:31:18] 10Operations, 10ops-codfw, 10netops: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (10ayounsi) Nop, please keep the vlan. [12:31:23] (03CR) 10Jbond: [C: 03+2] profile::prometheus::ops: ensure rsync service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/573261 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [12:36:52] (03PS1) 10Ema: prometheus: add cadvisor jobs [puppet] - 10https://gerrit.wikimedia.org/r/573272 (https://phabricator.wikimedia.org/T183146) [12:38:24] (03CR) 10Vgutierrez: [C: 03+1] cache: add cadvisor exporter [puppet] - 10https://gerrit.wikimedia.org/r/572693 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [12:45:56] (03CR) 10Ema: [C: 03+2] cache: add cadvisor exporter [puppet] - 10https://gerrit.wikimedia.org/r/572693 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [12:47:37] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas 
[12:47:40] 10Operations, 10Puppet, 10User-jbond: reprepo user different on release1001 and release2001 - https://phabricator.wikimedia.org/T245612 (10jbond) [12:48:34] (03Abandoned) 10Ayounsi: Ignore MX104 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/571731 (owner: 10Ayounsi) [12:53:00] 10Operations, 10Puppet, 10User-jbond: reprepo user different on release1001 and release2001 - https://phabricator.wikimedia.org/T245612 (10jbond) p:05Triage→03Medium [12:55:18] (03PS1) 10Ema: cache: move traffic hiera settings away from horizon [puppet] - 10https://gerrit.wikimedia.org/r/573276 [12:56:50] (03CR) 10Vgutierrez: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/573276 (owner: 10Ema) [12:57:21] (03CR) 10Ema: [C: 03+2] cache: move traffic hiera settings away from horizon [puppet] - 10https://gerrit.wikimedia.org/r/573276 (owner: 10Ema) [13:03:59] (03PS1) 10Ema: tlsproxy: drop websocket_support [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) [13:04:03] 10Operations, 10Puppet, 10User-jbond: reprepo user different on release1001 and release2001 - https://phabricator.wikimedia.org/T245612 (10jbond) reprepo is installed on the following servers, with owned files listed. any change to the uid/gid will need to update the permissions of theses files release2001... 
[13:07:09] (03CR) 10Ema: "pcc looks fine: https://puppet-compiler.wmflabs.org/compiler1001/20890/" [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [13:07:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Remove LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/573254 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [13:11:53] (03PS1) 10ArielGlenn: add unit tests to check temp stub generation commands [dumps] - 10https://gerrit.wikimedia.org/r/573279 (https://phabricator.wikimedia.org/T242209) [13:13:01] RECOVERY - dump of s3 in eqiad on db1115 is OK: dump for s3 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2020-02-19 11:03:08 from db1140.eqiad.wmnet:3313 (99 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [13:14:13] PROBLEM - Host restrouter.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [13:14:35] sigh [13:14:37] paged [13:14:40] I wasn't fast enough sorry [13:14:46] expected? [13:14:50] yeah [13:14:51] ack [13:14:57] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.48:7231]) https://wikitech.wikimedia.org/wiki/PyBal [13:15:01] <_joe_> yes it's not in prod even [13:15:10] <_joe_> akosiaris: you should've set critical=false [13:15:11] well for some value of "in prod" [13:15:25] <_joe_> bblack: it's not used by any live traffic [13:15:42] _joe_: yes I should have. I was removing the entire stanza so I forgot about it [13:15:44] looking in [13:15:58] apergos: look out again :P [13:16:02] and looking back out ;-D [13:16:04] morning! sounds like nothing needed? 
[13:16:08] <3 [13:16:09] it's a beautiful day [13:16:11] <_joe_> rlazarus: correct [13:16:13] lol [13:16:19] see you in an hour or so [13:16:20] I might have ruined it a bit, sorry [13:16:36] <_joe_> rlazarus: wake up by pager, what a good way to start your morning [13:16:58] at least I was fast enough for eqiad [13:17:07] * akosiaris must see the silver lining somewhere [13:17:09] isn't that every morning? :) [13:17:39] it's usually by the time I try to get away from the computer these days [13:17:46] damn tzs [13:19:42] PROBLEM - gdnsd checkconf on dns5001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:19:44] (03PS1) 10Jbond: release: ensure reprepo uid/gid is the same on all servers [puppet] - 10https://gerrit.wikimedia.org/r/573282 (https://phabricator.wikimedia.org/T245612) [13:19:58] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:19:59] I am guessing this is discovery ^ [13:20:19] checking on dns5001 [13:20:37] (03CR) 10Jbond: "i picked 201 for the uid/gid some what arbitrarily if there is a better more standard value happy to use it instead" [puppet] - 10https://gerrit.wikimedia.org/r/573282 (https://phabricator.wikimedia.org/T245612) (owner: 10Jbond) [13:20:44] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.48:7231]) https://wikitech.wikimedia.org/wiki/PyBal [13:22:11] 10Operations, 10Traffic: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 (10Vgutierrez) [13:22:24] looking at dns5001 as well [13:22:29] 10Operations, 10Traffic: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 (10Vgutierrez) p:05Triage→03Medium [13:22:38] PROBLEM - Host db1084 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:59] somebody pissed 
of icinga? :) [13:23:05] *off [13:23:21] akosiaris: because the zonefile DNS entry wasn't removed before the service discovery dns stuff [13:23:25] error: plugin_geoip: Invalid resource name 'disc-restrouter' detected from zonefile lookup [13:23:28] error: Name 'restrouter.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-restrouter' [13:23:42] PROBLEM - gdnsd checkconf on dns2001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:23:42] (disc-restrouter is gone, but still referenced from zonefile) [13:24:30] yeah, removing it [13:24:31] (03PS1) 10Alexandros Kosiaris: restrouter: Remove all records [dns] - 10https://gerrit.wikimedia.org/r/573283 (https://phabricator.wikimedia.org/T242461) [13:24:58] there's a weird chicken and egg problem here though [13:25:20] if I wanted to avoid that alert I would have first to remove the discovery [13:25:20] (03CR) 10BBlack: [C: 03+1] restrouter: Remove all records [dns] - 10https://gerrit.wikimedia.org/r/573283 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [13:25:30] but if I removed the discovery then icinga would alert for the LVS? [13:25:54] yeah [13:26:00] ah I guess I could remove first just the discovery from the lvs stanza [13:26:04] lol [13:26:17] there's a general pattern that underlies some of the issues we have like this [13:26:18] removal of an LVS service: an intricate dance of 15 steps [13:26:55] akosiaris: you can dance, you can jive [13:26:56] I mean, look at this https://gerrit.wikimedia.org/r/#/q/topic:+restrouter+status:open [13:27:02] there's a basic conflict between two desires, right? One is to make pretty data structures that make sense and have all related code pull from them, and the other is getting all the operation switches to sequence things.... 
[13:27:22] and there are another 2 changes already merged [13:27:33] so if you make one big data structure called "services", and you have a bunch of infra bits (say dns and lvs and some ferm rules and ....) all pull from that to derive their config [13:27:48] it becomes hard to stage things in and out, because one data change impacts a bunch of dependent things all at once [13:27:50] yeah, for example kubernetes hosts are going to complain soon about puppet [13:28:03] cause they pull the config for the IPs from that data structure [13:28:09] you've lost some ability to say "well, define this only at layer X for now, but not at layer Y" [13:28:29] and I had no idea how to remove the rest without the IPs .. [13:28:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Remove all records [dns] - 10https://gerrit.wikimedia.org/r/573283 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [13:29:16] I guess we could perhaps get the best of both, if we're conscious of who all the consumers are, and put flags in the data structure to be able to suppress various consumers [13:29:45] maybe each service in some big services data structure has a bunch of optional flags like "suppress_lvs: true", and the lvs consumer ignores entries with that flag set, etc [13:29:59] db1084 real issue [13:29:59] that would at least allow the option to stage some things without affecting all layers (or unstage them) [13:30:08] RECOVERY - gdnsd checkconf on dns2001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:30:48] RECOVERY - gdnsd checkconf on dns5001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:30:58] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1084, lots of connection errors', diff saved to https://phabricator.wikimedia.org/P10458 and previous config saved to /var/cache/conftool/dbconfig/20200219-133057-jynus.json [13:31:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:39] as an upgrade to the "suppress_foo" pattern, you could also define a set of states and even a human-level state-machine diagram for them. [13:32:04] the states could map to the sequential bringup and teardown process, configuring X then Y then Z, special test-only modes, etc [13:32:33] and the consumers know they should be ignoring some services if they're not in the appropriate state-level [13:34:34] RECOVERY - Host db1084 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [13:35:18] 10Operations, 10DBA: db1084 reboot causing commonswiki connection errors (crash?) - https://phabricator.wikimedia.org/T245621 (10jcrespo) [13:35:25] that will page, most likely [13:36:16] can someone double check that mediawiki is ok now? [13:36:50] in theory there should not be user impact, but in practice there can be some issues [13:36:58] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:37:24] jynus: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now seems to say A OK [13:38:09] so there was some actual impact [13:38:26] those posts maybe db [13:38:36] it does correlate timing-wise [13:38:49] which is why in practice it takes some time for load balancer, etc. [13:39:03] plus all ongoing queries [13:39:34] (03PS1) 10Jbond: profile::microsites::static_rt: disable the rsync service [puppet] - 10https://gerrit.wikimedia.org/r/573287 (https://phabricator.wikimedia.org/T242910) [13:40:16] there is some noise due to duplicate keys (known), which makes debugging difficult [13:41:18] I created T245621 but because I have not yet seen why, leaving alerts disabled but not acked [13:41:20] T245621: db1084 reboot causing commonswiki connection errors (crash?) 
- https://phabricator.wikimedia.org/T245621 [13:41:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Remove LVS IP from kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/573256 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [13:41:40] <_joe_> akosiaris, bblack this will all be easier once I have finished the transition [13:41:47] <_joe_> and I have written the docs [13:41:59] <_joe_> removing a service will be way easier once that's done [13:42:18] <_joe_> I'm halfway through it, I got stopped by the shower of outages [13:42:32] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/20891" [puppet] - 10https://gerrit.wikimedia.org/r/573287 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [13:42:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Remove from conftool [puppet] - 10https://gerrit.wikimedia.org/r/573255 (https://phabricator.wikimedia.org/T242461) (owner: 10Alexandros Kosiaris) [13:42:41] <_joe_> it's not like I don't realize the issues we've had for so long [13:44:12] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10jbond) p:05Triage→03Medium [13:44:28] 10Operations, 10Wikifeeds, 10Wikimedia-Logstash, 10observability: Move wikifeeds to the logging pipeline - https://phabricator.wikimedia.org/T245604 (10jbond) p:05Triage→03Medium [13:44:54] 10Operations, 10MediaWiki-Debug-Logger, 10Traffic: noc.wikimedia.org doesn't route to the docroot when WikimediaDebug browser extension is live - https://phabricator.wikimedia.org/T245552 (10jbond) p:05Triage→03Low [13:46:19] (03PS2) 10Ottomata: Set krb: present for user aarora [puppet] - 10https://gerrit.wikimedia.org/r/572995 (https://phabricator.wikimedia.org/T241096) [13:50:13] (03CR) 10Ottomata: [C: 03+2] Set krb: present for user aarora [puppet] - 10https://gerrit.wikimedia.org/r/572995 (https://phabricator.wikimedia.org/T241096) 
(owner: 10Ottomata) [13:50:27] 10Operations, 10ops-eqiad, 10DC-Ops: (ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) [13:51:10] 10Operations, 10netops: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 (10ayounsi) Down BGP sessions disabled on our side until the remote side is fixed. [13:52:29] I think s4 is healthy, gone back to lunch, me or manuel will do fallout later [13:52:52] (03CR) 10Ayounsi: [C: 03+1] "Specific task: https://phabricator.wikimedia.org/T239412" [puppet] - 10https://gerrit.wikimedia.org/r/573268 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [13:53:00] 10Operations, 10Analytics: missing pig package on an-tool1006.eqiad.wmnet & analytics1030.eqiad.wmnet - https://phabricator.wikimedia.org/T245605 (10elukey) Hey John, yes I am aware of this, I am testing bigtop in Hadoop test and still wondering if users need pig or not. I'll find a solution very soon :) [13:53:12] 10Operations, 10ops-codfw, 10DC-Ops: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Jgreen) [13:53:47] !log disable puppet to upgrade postgresql [13:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:05] 10Operations, 10DBA: db1084 reboot causing commonswiki connection errors (crash?) - https://phabricator.wikimedia.org/T245621 (10Marostegui) Looks like BBU died: ` Battery/Capacitor Count: 0 ` ` /system1/log1/record15 Targets Properties number=15 severity=Caution date=02/19/2020 time=1... [13:56:01] (03CR) 10Ottomata: [C: 03+1] Unify stat1004's and stat1005's roles into one [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [13:56:43] 10Operations, 10DBA: db1084 reboot causing commonswiki connection errors (crash?) - https://phabricator.wikimedia.org/T245621 (10Marostegui) @wiki_willy do we have spare HP BBUs in eqiad? 
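[editor's note] The idea bblack floats between 13:29 and 13:32 (per-consumer suppress flags on a shared "services" data structure, upgraded to an ordered set of states that consumers check) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the state names, the consumer names, the `Service` class and the `services_for` helper are hypothetical and are not taken from the actual puppet or conftool code.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class State(IntEnum):
    """Hypothetical lifecycle stages, ordered from bring-up to fully live."""
    DEFINED = 0    # entry exists, no consumer acts on it yet
    LVS = 1        # load balancer may be configured
    DISCOVERY = 2  # discovery DNS records may be published
    MONITORED = 3  # alerting may be enabled


# Minimum state each (hypothetical) consumer needs before acting on a service.
CONSUMER_MIN_STATE = {
    "lvs": State.LVS,
    "dns_discovery": State.DISCOVERY,
    "icinga": State.MONITORED,
}


@dataclass
class Service:
    name: str
    state: State = State.DEFINED
    # Consumers to skip regardless of state, e.g. {"icinga"}.
    suppress: set = field(default_factory=set)


def services_for(consumer, services):
    """Return the names of the services this consumer should configure."""
    needed = CONSUMER_MIN_STATE[consumer]
    return [
        s.name
        for s in services
        if s.state >= needed and consumer not in s.suppress
    ]


services = [
    # Fully live, but with alerting suppressed ahead of a teardown.
    Service("restrouter", State.MONITORED, {"icinga"}),
    # Still being brought up: only the load balancer layer is staged.
    Service("example", State.LVS),
]
print(services_for("lvs", services))
print(services_for("dns_discovery", services))
print(services_for("icinga", services))
```

With a structure like this, a teardown would walk a service's state back down one step at a time (or set a suppress flag first), so each consumer drops the service in a separate change instead of all of them reacting to a single data edit.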
[13:57:18] 10Operations, 10DBA: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Marostegui) [13:57:33] 10Operations, 10DBA: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Marostegui) p:05Triage→03Medium [13:58:45] 10Operations, 10Analytics: missing pig package on an-tool1006.eqiad.wmnet & analytics1030.eqiad.wmnet - https://phabricator.wikimedia.org/T245605 (10jbond) >>! In T245605#5897105, @elukey wrote: > Hey John, > > yes I am aware of this, I am testing bigtop in Hadoop test and still wondering if users need pig or... [13:59:05] (03PS1) 10Marostegui: db1084: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/573288 (https://phabricator.wikimedia.org/T245621) [13:59:24] 10Operations, 10ops-eqiad: Degraded RAID on db1084 - https://phabricator.wikimedia.org/T245626 (10ops-monitoring-bot) [13:59:46] (03PS1) 10Giuseppe Lavagetto: conftool: remove useless cassandra service pool [puppet] - 10https://gerrit.wikimedia.org/r/573289 (https://phabricator.wikimedia.org/T245594) [13:59:48] (03PS1) 10Giuseppe Lavagetto: conftool: remove useless cassandra service pool [puppet] - 10https://gerrit.wikimedia.org/r/573290 (https://phabricator.wikimedia.org/T245594) [13:59:51] (03PS1) 10Giuseppe Lavagetto: conftool::scripts: remove compatibility, disable draining [puppet] - 10https://gerrit.wikimedia.org/r/573291 (https://phabricator.wikimedia.org/T245594) [13:59:52] (03PS1) 10Giuseppe Lavagetto: conftool::scripts: refuse to pool a server if the weight is 0 [puppet] - 10https://gerrit.wikimedia.org/r/573292 (https://phabricator.wikimedia.org/T245594) [13:59:57] 10Operations, 10DBA, 10Patch-For-Review: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Marostegui) [13:59:59] 10Operations, 10ops-eqiad: Degraded RAID on db1084 - https://phabricator.wikimedia.org/T245626 (10Marostegui) [14:00:27] (03CR) 10Marostegui: [C: 03+2] db1084: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/573288 (https://phabricator.wikimedia.org/T245621) (owner: 10Marostegui) [14:00:29] 10Operations, 10netops: Librenms sessions are stored inside the deployment directory - https://phabricator.wikimedia.org/T239412 (10jbond) [14:00:33] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Add check for changes applied at all runs - https://phabricator.wikimedia.org/T242910 (10jbond) [14:00:44] (03PS2) 10Jbond: librenms: librenms and puppet managing files with different permissions [puppet] - 10https://gerrit.wikimedia.org/r/573268 (https://phabricator.wikimedia.org/T239412) [14:02:22] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es1024.eqiad.wmnet'] ` The log can be foun... [14:02:43] !log Start mysql on db1084 without replication - T245621 [14:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:48] T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 [14:03:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] conftool: remove useless cassandra service pool [puppet] - 10https://gerrit.wikimedia.org/r/573290 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [14:03:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] conftool: remove useless cassandra service pool [puppet] - 10https://gerrit.wikimedia.org/r/573289 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [14:03:25] (03CR) 10Jbond: [C: 03+2] librenms: librenms and puppet managing files with different permissions [puppet] - 10https://gerrit.wikimedia.org/r/573268 (https://phabricator.wikimedia.org/T239412) (owner: 10Jbond) [14:04:00] (03PS2) 10Giuseppe Lavagetto: conftool::scripts: remove compatibility, disable draining [puppet] - 
10https://gerrit.wikimedia.org/r/573291 (https://phabricator.wikimedia.org/T245594) [14:04:02] (03PS2) 10Giuseppe Lavagetto: conftool::scripts: refuse to pool a server if the weight is 0 [puppet] - 10https://gerrit.wikimedia.org/r/573292 (https://phabricator.wikimedia.org/T245594) [14:04:05] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) es1025: RAID10, 256k strip size, BBU and Cache policy right disk space, memory and cpus looking good. [14:04:46] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) [14:05:58] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) @Cmjohnson can you double check es1024's link? It cannot PXE boot: ` Booting from PXE Device 1: Integrated NIC 1 Port 1 Partition 1 PXE: No m... 
[14:06:30] 10Operations, 10netops, 10Patch-For-Review: Librenms sessions are stored inside the deployment directory - https://phabricator.wikimedia.org/T239412 (10jbond) 05Open→03Resolved a:03jbond I have excluded the files in this directory from puppet management [14:06:33] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Add check for changes applied at all runs - https://phabricator.wikimedia.org/T242910 (10jbond) [14:07:19] !log Upgrade and reboot db1084 - T245621 [14:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC is fine per https://puppet-compiler.wmflabs.org/compiler1002/20892/" [puppet] - 10https://gerrit.wikimedia.org/r/572960 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [14:08:46] (03PS2) 10Alexandros Kosiaris: Add new LVS services for new eventgate-main and eventgate-analytics ports [puppet] - 10https://gerrit.wikimedia.org/r/572960 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [14:10:23] <_joe_> akosiaris: can you wait a sec before merging?
[14:10:47] _joe_: sure [14:10:57] <_joe_> akosiaris: let me check authdns, I kinda have a tingling feeling I forgot a .unique somewhere [14:11:09] the patch btw has 4 LVS services for you to test your changes on [14:11:14] marked as _to_delete [14:11:20] so we will be removing them [14:12:18] (03PS2) 10Cmjohnson: updating snapshot1010 to raid1-lvm-ext4 cfg [puppet] - 10https://gerrit.wikimedia.org/r/572978 (https://phabricator.wikimedia.org/T241794) [14:12:35] <_joe_> Duplicate declaration: Confd::File[/var/lib/gdnsd/discovery-eventgate-analytics.state] [14:12:37] <_joe_> bingo [14:13:05] <_joe_> akosiaris: either we remove the dnsdisc from the other service stanza, or (better) you let me fix the problem [14:15:38] (03PS1) 10Giuseppe Lavagetto: dns::auth::discovery: ensure unique identifiers [puppet] - 10https://gerrit.wikimedia.org/r/573294 [14:16:40] _joe_: how did you get the duplicate declaration? [14:16:42] pcc doesn't [14:16:54] <_joe_> akosiaris: on authdns1001 it does [14:17:09] ? [14:17:17] I hadn't merged the change yet?
[14:17:18] <_joe_> https://puppet-compiler.wmflabs.org/compiler1003/20893/authdns1001.wikimedia.org/ [14:17:28] <_joe_> yes, it's the compiler [14:17:32] ah, it's duplicate declaration [14:17:34] yeah yeah sorry [14:17:44] <_joe_> https://gerrit.wikimedia.org/r/#/c/573294/ fixes the problem [14:17:51] somehow I mixed them up with dependency loops [14:18:00] which PCC doesn't catch [14:18:12] lemme rebase on it [14:18:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/20894/" [puppet] - 10https://gerrit.wikimedia.org/r/573294 (owner: 10Giuseppe Lavagetto) [14:19:40] (03PS2) 10Clarakosi: Add EventBus Run Job API permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) [14:20:03] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add cadvisor jobs [puppet] - 10https://gerrit.wikimedia.org/r/573272 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [14:20:20] (03PS3) 10Alexandros Kosiaris: Add new LVS services for new eventgate-main and eventgate-analytics ports [puppet] - 10https://gerrit.wikimedia.org/r/572960 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [14:21:17] (03CR) 10Cmjohnson: [C: 03+2] updating snapshot1010 to raid1-lvm-ext4 cfg [puppet] - 10https://gerrit.wikimedia.org/r/572978 (https://phabricator.wikimedia.org/T241794) (owner: 10Cmjohnson) [14:23:17] <_joe_> cmjohnson1: can I merge your patch? [14:23:38] <_joe_> I assume I can [14:24:34] 10Operations, 10ops-eqiad, 10decommission, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) Ping? [14:25:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor UX comment" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573292 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [14:28:59] cmjohnson1: re: the partman recipe for snapshot1010, did you run into trouble with the standard one ? 
[14:29:31] !log Data checksum on db1084 T245621 [14:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:35] T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 [14:29:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [14:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:00] godog: yes, and the task send to use the lvm-raid1 [14:30:11] stated to use lvm-raid1 [14:31:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:19] cmjohnson1: ah, ok I'm interested to know if there's a bug somewhere we can't use the standard recipe for some reason, do you have the output or remember the error ? if not that's fine too [14:32:35] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:33:48] I do not have the output, I tried running it last night and I kept being prompted for dns setup during the install and then it failed in the partitioner [14:34:45] ah yeah I think because netboot.cfg was broken last night, dns2001 install ran into trouble too [14:36:01] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 53 connections established with conf2001.codfw.wmnet:2379 (min=57) https://wikitech.wikimedia.org/wiki/PyBal [14:36:29] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 53 connections established with conf2001.codfw.wmnet:2379 (min=57) https://wikitech.wikimedia.org/wiki/PyBal [14:36:51] akosiaris: is that still you ? 
[14:37:16] yup [14:37:25] it's 4 LVS services being added [14:37:30] (03Abandoned) 10Giuseppe Lavagetto: profile::services_proxy: add temporarily entries for k8s services [puppet] - 10https://gerrit.wikimedia.org/r/570306 (owner: 10Giuseppe Lavagetto) [14:37:37] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 64 connections established with conf1004.eqiad.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [14:37:44] (03PS5) 10Elukey: Unify stat1004's and stat1005's roles into one [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) [14:38:14] akosiaris: :D [14:38:39] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.45:4492, 10.2.1.42:35192, 10.2.1.42:4592, 10.2.1.45:34192]) https://wikitech.wikimedia.org/wiki/PyBal [14:38:41] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.45:34192, 10.2.2.42:4592, 10.2.2.42:35192, 10.2.2.45:4492]) https://wikitech.wikimedia.org/wiki/PyBal [14:38:48] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1024.eqiad.wmnet'] ` and were **ALL** successful. 
[14:39:17] cmjohnson1: if you have time I'm happy to help and revert to the standard recipe and see if now completes [14:39:26] 10Operations, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Cmjohnson) [14:39:33] (03PS3) 10Giuseppe Lavagetto: profile::cache::base: remove the useless inclusion of lvs::configuration [puppet] - 10https://gerrit.wikimedia.org/r/570072 [14:39:44] (03CR) 10Elukey: [C: 03+2] Unify stat1004's and stat1005's roles into one [puppet] - 10https://gerrit.wikimedia.org/r/573225 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [14:40:00] godog: sure, that works [14:40:37] ack, I'll send out a patch [14:41:00] 10Operations, 10Traffic, 10netops: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [14:41:03] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 57 connections established with conf2001.codfw.wmnet:2379 (min=57) https://wikitech.wikimedia.org/wiki/PyBal [14:41:06] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Vgutierrez) [14:41:29] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:41:33] 10Operations, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) es1024: RAID10, 256k strip size, BBU and Cache policy right disk space, memory and cpus looking good. 
[14:41:35] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 57 connections established with conf2001.codfw.wmnet:2379 (min=57) https://wikitech.wikimedia.org/wiki/PyBal [14:41:45] 10Operations, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) [14:41:59] 10Operations, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) 05Open→03Resolved All hosts have been installed successfully. Thanks! [14:42:07] (03PS1) 10Filippo Giunchedi: Revert "updating snapshot1010 to raid1-lvm-ext4 cfg" [puppet] - 10https://gerrit.wikimedia.org/r/573300 [14:42:27] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:42:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20896/" [puppet] - 10https://gerrit.wikimedia.org/r/570072 (owner: 10Giuseppe Lavagetto) [14:42:31] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "updating snapshot1010 to raid1-lvm-ext4 cfg" [puppet] - 10https://gerrit.wikimedia.org/r/573300 (owner: 10Filippo Giunchedi) [14:42:37] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 68 connections established with conf1004.eqiad.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [14:43:14] (03CR) 10Holger Knust: "Made the changes locally that I commented on. Waiting for Petr to respond before I create a new PS" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [14:43:17] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Revert "updating snapshot1010 to raid1-lvm-ext4 cfg" [puppet] - 10https://gerrit.wikimedia.org/r/573300 (owner: 10Filippo Giunchedi) [14:43:24] what are folks doing with 'my' snapshot host? 
:-P [14:43:49] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:43:51] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:44:39] hehhe see if the standard partman recipe is as standard as it claims [14:46:04] cmjohnson1: ok all done, I can try the install again [14:46:34] * apergos will follow along in here [14:48:18] (03PS3) 10Ottomata: New eventgate-analytics-external instance using remote EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) [14:48:20] (03PS1) 10Elukey: Remove Hadoop Pig from puppet codebase [puppet] - 10https://gerrit.wikimedia.org/r/573301 (https://phabricator.wikimedia.org/T245605) [14:49:45] 10Operations, 10User-Elukey: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094 (10Aklapper) @MoritzMuehlenhoff: All patches in Gerrit have been merged or abandoned. Is there more to do in this task? Asking as you are set as task assignee. (Yo... [14:50:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool: remove useless cassandra service pool [puppet] - 10https://gerrit.wikimedia.org/r/573289 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [14:52:51] <_joe_> what's this bot? [14:54:07] <_joe_> p858snake: any idea? [14:55:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool: remove useless cassandra service pool [puppet] - 10https://gerrit.wikimedia.org/r/573290 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [14:55:44] 10Operations, 10Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385 (10Aklapper) @fgiunchedi: The patch in Gerrit has been merged. Can this task be resolved (via {nav name=Add Action... 
> Change Status} in the dropdown menu), or is there more to do... [14:56:03] 10Operations, 10netops: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (10ayounsi) a:03faidon In https://github.com/wikimedia/puppet/blob/59ae7b6aa0f8413b4a9a0479089b69b823aca532/modules/nagios_common/files/check_commands/check_bgp#L299 the script ignores the... [14:56:53] (03PS1) 10Marostegui: mariadb: Productionize es1020 [puppet] - 10https://gerrit.wikimedia.org/r/573303 (https://phabricator.wikimedia.org/T243052) [14:58:05] cmjohnson1 apergos it is installing now, will ping when it is done [14:58:27] that's great news! [14:58:28] 👍 [14:58:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es1020 [puppet] - 10https://gerrit.wikimedia.org/r/573303 (https://phabricator.wikimedia.org/T243052) (owner: 10Marostegui) [14:59:53] !log Stop mysql on es2021 - T243052 [14:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:00] T243052: Productionize es1020-es1025, es2020-es2025 - https://phabricator.wikimedia.org/T243052 [15:06:39] apergos: you should be able to login to snapshot1010, LMK if the partitions/fs looks good [15:07:53] (03PS1) 10Ayounsi: Icinga check_bgp consider Idle as a failure [puppet] - 10https://gerrit.wikimedia.org/r/573305 (https://phabricator.wikimedia.org/T239256) [15:08:29] 10Operations, 10netops, 10Patch-For-Review: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (10ayounsi) a:05faidon→03ayounsi From Faidon: > idle IIRC was when the other side shut down their sessions > but feel free to remove that elsif and see what happens [15:10:26] godog: looks fine to me [15:10:48] sweet, thank you [15:11:43] cmjohnson1: host is installed and puppet ran (no role::spare yet tho), LGTM [15:12:08] thanks apergos and godog! 
[15:12:49] you're welcome [15:13:04] \o/ [15:20:38] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [15:21:20] (03PS1) 10Ottomata: Use new LVS port for eventgate-analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573307 (https://phabricator.wikimedia.org/T245203) [15:22:38] (03PS1) 10Marostegui: install_server: Do not reimage es1020 [puppet] - 10https://gerrit.wikimedia.org/r/573308 (https://phabricator.wikimedia.org/T243052) [15:23:54] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage es1020 [puppet] - 10https://gerrit.wikimedia.org/r/573308 (https://phabricator.wikimedia.org/T243052) (owner: 10Marostegui) [15:27:04] (03PS2) 10Ema: cache: use Connection:KA for varnish-ats checks [puppet] - 10https://gerrit.wikimedia.org/r/571472 (https://phabricator.wikimedia.org/T244464) [15:28:45] (03CR) 10Ema: [C: 03+2] prometheus: add cadvisor jobs [puppet] - 10https://gerrit.wikimedia.org/r/573272 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [15:28:52] (03PS2) 10Ema: prometheus: add cadvisor jobs [puppet] - 10https://gerrit.wikimedia.org/r/573272 (https://phabricator.wikimedia.org/T183146) [15:29:33] (03CR) 10Ema: "pcc seems fine: https://puppet-compiler.wmflabs.org/compiler1003/20898/" [puppet] - 10https://gerrit.wikimedia.org/r/571472 (https://phabricator.wikimedia.org/T244464) (owner: 10Ema) [15:30:19] (03PS2) 10ArielGlenn: add unit tests to check temp stub generation commands [dumps] - 10https://gerrit.wikimedia.org/r/573279 (https://phabricator.wikimedia.org/T242209) [15:35:53] (03CR) 10ArielGlenn: [C: 03+2] write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: 10ArielGlenn) [15:39:03] (03CR) 10ArielGlenn: [C: 03+2] 
properly handle failure of writing of temp stubs for page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) (owner: 10ArielGlenn) [15:39:21] (03Merged) 10jenkins-bot: properly handle failure of writing of temp stubs for page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) (owner: 10ArielGlenn) [15:40:05] 10Operations: sshd warning on cache nodes: Deprecated option UsePrivilegeSeparation - https://phabricator.wikimedia.org/T245635 (10ema) [15:40:14] 10Operations: sshd warning on cache nodes: Deprecated option UsePrivilegeSeparation - https://phabricator.wikimedia.org/T245635 (10ema) p:05Triage→03Lowest [15:42:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/573305 (https://phabricator.wikimedia.org/T239256) (owner: 10Ayounsi) [15:46:28] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [15:46:29] (03CR) 10ArielGlenn: [C: 03+2] add unit tests to check temp stub generation commands [dumps] - 10https://gerrit.wikimedia.org/r/573279 (https://phabricator.wikimedia.org/T242209) (owner: 10ArielGlenn) [15:48:20] !log ariel@deploy1001 Started deploy [dumps/dumps@b42acb5]: fix temp stub generation, add pagerangeinfo cache, some unit tests [15:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:23] !log ariel@deploy1001 Finished deploy [dumps/dumps@b42acb5]: fix temp stub generation, add pagerangeinfo cache, some unit tests (duration: 00m 03s) [15:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:17] (03PS2) 10Elukey: Add ports and codfw LVS IP to term eventgate-analytics in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/562842 (https://phabricator.wikimedia.org/T245203) [15:51:45] (03CR) 10Ottomata: [C: 03+1] "THANK YOU" [homer/public] - 10https://gerrit.wikimedia.org/r/562842 (https://phabricator.wikimedia.org/T245203) (owner: 10Elukey) [15:54:17] (03PS9) 10Herron: WIP mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [15:54:34] (03CR) 10Ayounsi: [C: 03+1] Add ports and codfw LVS IP to term eventgate-analytics in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/562842 (https://phabricator.wikimedia.org/T245203) (owner: 10Elukey) [15:55:59] (03CR) 10Elukey: [C: 03+2] Add ports and codfw LVS IP to term eventgate-analytics in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/562842 (https://phabricator.wikimedia.org/T245203) (owner: 10Elukey) [15:56:34] (03PS1) 10DCausse: [wdqs] use https and 4592 for eventgate-analytics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/573317 (https://phabricator.wikimedia.org/T245203) 
[15:58:37] (03CR) 10Ayounsi: [C: 03+2] Icinga check_bgp consider Idle as a failure [puppet] - 10https://gerrit.wikimedia.org/r/573305 (https://phabricator.wikimedia.org/T239256) (owner: 10Ayounsi) [16:01:06] 10Operations, 10SRE-swift-storage: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10Aklapper) @fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via {nav name=Add Action... > Change Status} in the dropdown menu), or is the... [16:05:20] !log Update analytics-in4 filter term eventgate for T245203 on cr1/cr2 eqiad [16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:25] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [16:06:35] 10Operations, 10netops, 10Patch-For-Review: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (10ayounsi) 05Open→03Resolved Tested and works as expected, will re-open if any false positive. [16:08:34] (03CR) 10Ottomata: [C: 03+2] [wdqs] use https and 4592 for eventgate-analytics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/573317 (https://phabricator.wikimedia.org/T245203) (owner: 10DCausse) [16:10:34] (03PS1) 10Marostegui: dbproxy: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/573322 [16:10:37] 10Operations, 10ops-codfw, 10DC-Ops: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Papaul) [16:11:55] 10Operations, 10observability: Monitor resource usage on a per-cgroup basis - https://phabricator.wikimedia.org/T183146 (10ema) 05Open→03Resolved a:03ema This is now done for cache nodes, see https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?orgId=1&fullscreen&panelId=80&from=now-15m&to=now... 
[16:12:34] (03CR) 10Marostegui: [C: 03+2] dbproxy: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/573322 (owner: 10Marostegui) [16:12:54] (03PS1) 10DCausse: [wdqs] fix eventgate url [puppet] - 10https://gerrit.wikimedia.org/r/573323 [16:13:02] !log Depool labsdb1011 to help replication to catch up [16:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:24] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] fix eventgate url [puppet] - 10https://gerrit.wikimedia.org/r/573323 (owner: 10DCausse) [16:13:44] (03PS2) 10DCausse: [wdqs] fix eventgate url [puppet] - 10https://gerrit.wikimedia.org/r/573323 [16:13:49] (03PS3) 10Ottomata: [wdqs] fix eventgate url [puppet] - 10https://gerrit.wikimedia.org/r/573323 (owner: 10DCausse) [16:13:53] hehe oops [16:13:56] :) [16:15:20] (03CR) 10Ottomata: [C: 03+2] [wdqs] fix eventgate url [puppet] - 10https://gerrit.wikimedia.org/r/573323 (owner: 10DCausse) [16:16:17] (03PS3) 10Ema: cache: use Connection:KA for varnish-ats checks [puppet] - 10https://gerrit.wikimedia.org/r/571472 (https://phabricator.wikimedia.org/T244464) [16:17:58] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Cmjohnson) [16:18:04] (03CR) 10Vgutierrez: [C: 03+1] cache: use Connection:KA for varnish-ats checks [puppet] - 10https://gerrit.wikimedia.org/r/571472 (https://phabricator.wikimedia.org/T244464) (owner: 10Ema) [16:18:37] 10Operations, 10observability, 10serviceops, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10herron) I've cherry picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/571239/ on deployment-puppetmaster04.deployment-prep.eqiad.w... 
[16:18:48] 10Operations, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Cmjohnson) a:05Cmjohnson→03ArielGlenn Removing ops-eqiad tag, assigned to @ArielGlenn [16:19:03] (03PS4) 10Ema: cache: use Connection:KA for varnish-ats checks [puppet] - 10https://gerrit.wikimedia.org/r/571472 (https://phabricator.wikimedia.org/T244464) [16:20:18] 10Operations, 10SRE-swift-storage: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10fgiunchedi) 05Open→03Stalled >>! In T123918#5897578, @Aklapper wrote: > @fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via {nav na... [16:21:36] 10Operations, 10Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385 (10fgiunchedi) 05Open→03Declined Yes resolvable, graphite is on its way out eventually [16:21:41] 10Operations, 10Graphite, 10Patch-For-Review: put additional graphite machines in service - https://phabricator.wikimedia.org/T134889 (10fgiunchedi) [16:23:54] (03CR) 10Muehlenhoff: "< 499 is reserved for system users, we should choose something between 500 and 999, given that our human users start with 1000. 
We have so" [puppet] - 10https://gerrit.wikimedia.org/r/573282 (https://phabricator.wikimedia.org/T245612) (owner: 10Jbond) [16:24:55] !log otto@deploy1001 Started deploy [analytics/refinery@e23918a]: Updating eventgate-analytics port (T245203) and also eventlogging whitelist [16:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:59] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [16:27:44] (03CR) 10Ema: [C: 03+2] cache: use Connection:KA for varnish-ats checks [puppet] - 10https://gerrit.wikimedia.org/r/571472 (https://phabricator.wikimedia.org/T244464) (owner: 10Ema) [16:27:50] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:29:46] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:30:00] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:32:25] !log depool cp4026, 5xx [16:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:58] (03PS1) 10Volans: netbox: disable keepalive between Apache and uWSGI [puppet] - 10https://gerrit.wikimedia.org/r/573330 (https://phabricator.wikimedia.org/T244291) [16:37:22] !log otto@deploy1001 Finished deploy [analytics/refinery@e23918a]: Updating eventgate-analytics port (T245203) and also eventlogging whitelist (duration: 12m 27s) [16:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:26] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [16:38:07] (03CR) 10Vgutierrez: [C: 03+1] netbox: disable keepalive between Apache and uWSGI [puppet] - 10https://gerrit.wikimedia.org/r/573330 (https://phabricator.wikimedia.org/T244291) (owner: 10Volans) [16:39:11] (03CR) 10CRusnov: [C: 03+1] "cripes." [puppet] - 10https://gerrit.wikimedia.org/r/573330 (https://phabricator.wikimedia.org/T244291) (owner: 10Volans) [16:41:00] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10ssastry) >>! In T241961#5894853, @bd808 wrote: > > We should be able to deploy Parsoi... 
[16:43:40] (03CR) 10Volans: [C: 03+2] netbox: disable keepalive between Apache and uWSGI [puppet] - 10https://gerrit.wikimedia.org/r/573330 (https://phabricator.wikimedia.org/T244291) (owner: 10Volans) [16:44:15] !log replacing ps1-a8-codfw mgmt in rack A8 will go down [16:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:24] PROBLEM - Host elastic2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:24] PROBLEM - Host elastic2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:50] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 6 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:49:10] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:49:20] 10Operations, 10Mobile-Content-Service, 10Wikimedia-Logstash, 10observability, and 4 others: Move mobile apps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10LGoto) [16:49:42] 10Operations, 10Maps, 10Wikimedia-Logstash, 10observability, and 4 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10LGoto) [16:49:55] 10Operations, 10Proton, 10Wikimedia-Logstash, 10observability, and 4 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10LGoto) [16:50:04] PROBLEM - Host db2106.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:50:06] PROBLEM - Host db2091.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:50:10] PROBLEM - Juniper alarms on asw-a-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:50:28] PROBLEM - Host mc2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:50:50] PROBLEM 
- Host re0.cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:50:51] (03PS1) 10Hnowlan: mediawiki: install phpdbg on mwdebg hosts. [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) [16:51:00] ^ expected [16:51:58] PROBLEM - Host heze.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:52:54] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:53:16] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:53:43] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: install phpdbg on mwdebg hosts. [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [16:53:54] RECOVERY - Host db2106.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.53 ms [16:54:32] RECOVERY - Host elastic2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms [16:54:32] RECOVERY - Host elastic2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.24 ms [16:55:15] (03CR) 10Muehlenhoff: mediawiki: install phpdbg on mwdebg hosts. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [16:56:12] RECOVERY - Host db2091.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.83 ms [16:56:32] RECOVERY - Host mc2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [16:56:56] RECOVERY - Host re0.cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms [16:57:16] (03PS2) 10Hnowlan: mediawiki: install phpdbg on mwdebg hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) [16:57:40] (03CR) 10Ppchelko: Migrate changeprop & cpjobqueue to kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [16:58:04] RECOVERY - Host heze.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.96 ms [16:58:05] (03CR) 10Hnowlan: mediawiki: install phpdbg on mwdebg hosts. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [16:58:18] RECOVERY - Juniper alarms on asw-a-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:59:29] (03CR) 10Elukey: [C: 03+2] Remove Hadoop Pig from puppet codebase [puppet] - 10https://gerrit.wikimedia.org/r/573301 (https://phabricator.wikimedia.org/T245605) (owner: 10Elukey) [17:00:07] 10Operations, 10DBA: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10wiki_willy) @Marostegui - we have a few spare BBUs in the process of being shipped onsite, one of them for T244958, which should be arriving early next week. You can just shoot open a dc-ops task with us, and... [17:00:32] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: install phpdbg on mwdebg hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [17:00:58] (03CR) 10Hnowlan: "pcc run: https://puppet-compiler.wmflabs.org/compiler1003/20902/" [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [17:05:12] (03PS1) 10Jbond: ferm: add a very basic status check [puppet] - 10https://gerrit.wikimedia.org/r/573335 (https://phabricator.wikimedia.org/T206951) [17:05:52] (03CR) 10Jbond: "this is not a great fix but i think its better then what we have no but more then open to how we can improve this" [puppet] - 10https://gerrit.wikimedia.org/r/573335 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [17:06:00] 10Operations, 10Analytics, 10Patch-For-Review: missing pig package on an-tool1006.eqiad.wmnet & analytics1030.eqiad.wmnet - https://phabricator.wikimedia.org/T245605 (10elukey) 05Open→03Resolved [17:06:01] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Add check for changes applied at all runs - https://phabricator.wikimedia.org/T242910 (10elukey) [17:07:16] (03PS1) 10Ema: cache: revert Connection:KA probe experiment on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/573337 (https://phabricator.wikimedia.org/T244464) [17:07:52] (03CR) 10Vgutierrez: [C: 03+1] cache: revert Connection:KA probe experiment on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/573337 (https://phabricator.wikimedia.org/T244464) (owner: 10Ema) [17:08:29] 10Operations, 10ops-eqiad: cp1088 - https://phabricator.wikimedia.org/T245645 (10RobH) p:05Triage→03Medium [17:08:39] 10Operations, 10ops-eqiad, 10Traffic: cp1088 - https://phabricator.wikimedia.org/T245645 (10RobH) [17:10:01] (03CR) 10Ppchelko: "General question - which Kafka cluster would the deployment in 'staging' connect to? 
We probably want the staging one to not mix up consum" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [17:10:28] 10Operations, 10ops-eqiad, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [17:10:39] (03CR) 10Ema: [C: 03+2] cache: revert Connection:KA probe experiment on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/573337 (https://phabricator.wikimedia.org/T244464) (owner: 10Ema) [17:12:36] !log cp1088 returned to service, cp1089 & cp1090 offline for firmware update via T243167 [17:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:41] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 [17:14:10] jouncebot: now [17:14:10] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [17:14:13] jouncebot: next [17:14:13] In 1 hour(s) and 45 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200219T1900) [17:14:44] (03PS2) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [17:14:59] !log cp4026: repool after probe Connection:keep-alive experiment revert https://gerrit.wikimedia.org/r/573337 [17:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:06] (03CR) 10Ottomata: "> Patch Set 12:" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [17:16:04] (03CR) 10Ottomata: "> not mix up consumer groups with the production one right?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [17:16:58] (03PS3) 10Hnowlan: mediawiki: install phpdbg on mwdebg hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) [17:17:14] (03PS1) 10Addshore: From 2k->4k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573340 (https://phabricator.wikimedia.org/T225057) [17:18:00] (03CR) 10Ppchelko: "Oh yeah, we have dc_name setting. currently it's set to Values.main_app.site which will be 'staging' for staging??? We can add a separate " [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [17:18:10] any objections if I deploy this here small config change now ^^ outside of swat? :) [17:19:07] James_F: I see your around so I'll ask you ^^ :) [17:19:23] addshore: Go for it. [17:19:30] (03CR) 10Ppchelko: Migrate changeprop & cpjobqueue to kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [17:19:30] ack! :) [17:19:36] (03CR) 10Addshore: [C: 03+2] From 2k->4k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573340 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [17:20:23] (03PS3) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [17:20:35] RECOVERY - Host ps1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.69 ms [17:21:01] (03Merged) 10jenkins-bot: From 2k->4k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573340 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [17:21:30] (03CR) 10Ppchelko: [C: 03+1] "LGTM. We can deploy it at any time regardless of the train deployment of the actual code (preferrably before). 
I'll put it in one of the S" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) (owner: 10Clarakosi) [17:22:24] 10Operations, 10ops-eqiad, 10DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (10Marostegui) [17:22:59] PROBLEM - ps1-a8-codfw-infeed-load-tower-A-phase-X on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:05] PROBLEM - ps1-a8-codfw-infeed-load-tower-A-phase-Y on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:14] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Add check for changes applied at all runs - https://phabricator.wikimedia.org/T242910 (10jbond) this is what we have left * flowspec1001.eqiad.wmnet: currently down * mwdebug1001.eqiad.wmnet: puppet disabled * mwdebug2001.codfw.wmnet: puppet disabl... 
[17:24:03] RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2020-02-19 16:03:57 from db1140.codfw.wmnet:3313 (805 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [17:25:03] (03PS4) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [17:25:32] syncing [17:26:29] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q4000 (T225057) (duration: 01m 01s) [17:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:33] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [17:26:37] lovely [17:27:13] 10Operations, 10ops-eqiad, 10DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (10wiki_willy) a:03Jclark-ctr Hey @Jclark-ctr - can you use the 3rd spare BBU (that's arriving on 2/22) for this host? Much appreciated. Thanks, Willy [17:28:34] (03CR) 10Ottomata: "BTW, you might want to enable canary releases." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [17:29:29] PROBLEM - ps1-a8-codfw-infeed-load-tower-A-phase-Z on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:30:49] (03PS5) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [17:30:57] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-X on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-X 488 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:31:05] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-Y on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-Y 225 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:31:08] (03CR) 10Effie Mouzeli: [C: 04-1] "Sadly, this will install this package on all canary appservers, for example: https://puppet-compiler.wmflabs.org/compiler1003/20907/mw1261" [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [17:31:21] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-Z on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-Z 425 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:32:29] (03PS1) 10ArielGlenn: turn snapshot1010 into an xml dumps testbed [puppet] - 10https://gerrit.wikimedia.org/r/573343 (https://phabricator.wikimedia.org/T241794) [17:33:00] (03CR) 10Effie Mouzeli: [C: 04-1] "Please hold this patch for another week as we'd like to use this feature to run yet another experiment. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/573335 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [17:33:57] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [17:34:17] 10Operations, 10ops-eqiad, 10DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (10Marostegui) @Jclark-ctr if possible, let me know with 24h in advance when you want to switch the BBU so I can have the host ready for you (mysql off, power off..) Thank you! [17:35:18] (03PS6) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [17:37:12] (03CR) 10Ppchelko: "Interesting. Thank you @Ottomata. I think we should get this into a working state right now first and then probably incorporate canaries a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [17:37:42] (03CR) 10Giuseppe Lavagetto: "looks like we're getting there" [puppet] - 10https://gerrit.wikimedia.org/r/572215 (owner: 10Giuseppe Lavagetto) [17:39:18] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q4000 (T225057) (just incase of cache issue) (duration: 01m 04s) [17:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:22] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [17:39:29] James_F: any idea when we can stop doig secondary syncs because of that fun odd cache thing? 
:P [17:40:58] !log starting data check between db1078 and db1140:3313 T244958 [17:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:02] T244958: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 [17:48:14] (03PS3) 10Effie Mouzeli: WIP hieradata: test streaming apache logs to logstash from mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) [17:48:16] !log cp1089 cp1090 returned to service via T243167 [17:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:20] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 [17:50:37] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 70 probes of 523 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:27] 10Operations, 10ops-eqiad, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [17:51:30] (03PS1) 10Zoranzoki21: Add throttle rule for Vancouver Community College library event on 2020-03-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573345 (https://phabricator.wikimedia.org/T245323) [17:51:48] 10Operations, 10observability, 10serviceops, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) It appears that on beta the variable `$server_role = $::_role.split('/')[-1` is not evaluated properly, while in production, it looks... 
[17:52:21] (03PS2) 10Bstorm: toolschecker: check node ready status on new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/562000 (owner: 10BryanDavis) [17:53:26] 10Operations, 10ops-eqiad, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) Please note as of now, all eqiad cp sysetms have been updated to the latest bios revision. If these hosts experience any further crashes, i... [17:53:36] addshore: Containers. [17:54:04] 10Operations, 10ops-eqiad, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [17:54:08] James_F: answer to everything [17:54:23] Yup. [17:54:28] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [17:55:51] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 32 probes of 523 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:56:00] (03CR) 10Bstorm: [C: 03+2] toolschecker: check node ready status on new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/562000 (owner: 10BryanDavis) [17:56:21] (03PS2) 10Zoranzoki21: Add throttle rule for Vancouver Community College library event on 2020-03-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573345 (https://phabricator.wikimedia.org/T245323) [17:57:25] (03CR) 10jerkins-bot: [V: 04-1] Add throttle rule for Vancouver Community College library event on 2020-03-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573345 (https://phabricator.wikimedia.org/T245323) (owner: 10Zoranzoki21) [17:58:19] (03CR) 10Dzahn: [C: 03+2] peopleweb: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573025 (owner: 10Dzahn) 
[17:58:27] (03PS2) 10Dzahn: peopleweb: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573025 [18:00:25] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=releases1001&service=HTTPS+releases.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/573023 (owner: 10Dzahn) [18:11:52] (03PS2) 10Dzahn: microsites: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573032 [18:15:43] (03CR) 10Dzahn: [C: 03+2] microsites: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573032 (owner: 10Dzahn) [18:16:33] (03PS4) 10Hnowlan: mediawiki: install phpdbg on mwdebg hosts. [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) [18:16:43] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=people1001&service=HTTPS-peopleweb" [puppet] - 10https://gerrit.wikimedia.org/r/573025 (owner: 10Dzahn) [18:17:02] 10Operations, 10ops-eqiad, 10DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (10RobH) p:05Triage→03Medium [18:19:07] 10Operations, 10ops-eqiad, 10DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (10RobH) Please note this imbalance is what is triggering the email alerts: Alert for device ps1-a5-eqiad.mgmt.eqiad.wmnet - Sensor over limit [18:19:11] !log removing problem ACK from Icinga alerts for wikitech-static MediaWiki version. 
comments were about things in 2019 [18:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:04] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10mmodell) [18:21:06] (03PS2) 10Dzahn: noc: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573024 [18:21:54] (03CR) 10Hnowlan: "Refactored to only run on mwdebug hosts via a toggle in profile::mediawiki::php." [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [18:22:31] !log phab2001 - upgrading mariadb client package versions [18:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:41] !log phab2001 - installing package upgrades, incl. openssh, PHP version [18:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:59] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:25:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:26:23] !log phab2001 - upgraded ssh-server, kept locally modified config; apt autoremove removes python3-debconf [18:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:49] (03CR) 10Dzahn: [C: 03+2] noc: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573024 (owner: 10Dzahn) [18:32:44] (03PS2) 10Dzahn: tendril: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573034 [18:33:44] 10Operations, 10ops-codfw, 10DC-Ops: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Papaul) [18:35:41] (03PS1) 10ArielGlenn: weekly dump of machine vision tables from commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/573351 (https://phabricator.wikimedia.org/T236431) [18:36:17] (03PS3) 10Dzahn: tendril: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573034 [18:38:07] (03CR) 10jerkins-bot: [V: 04-1] weekly dump of machine vision tables from commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/573351 (https://phabricator.wikimedia.org/T236431) (owner: 10ArielGlenn) [18:38:43] !log mwmaint1002 - removing Icinga ACK for systemd state - comments for it were from HHVM removal in Oct 2019 [18:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:03] !log mwmaint1002 - sudo systemctl reset-failed to clear systemd alerts [18:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:28] (03PS2) 10ArielGlenn: weekly dump of machine vision tables from commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/573351 (https://phabricator.wikimedia.org/T236431) [18:40:11] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:59] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=vega&service=Static+Bugzilla+HTTPS" [puppet] - 10https://gerrit.wikimedia.org/r/573032 (owner: 10Dzahn) [18:44:05] !log reprepro: upload gdnsd 3.2.2-1~wmf1 to buster-wikimedia [18:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:23] !log dns4001 - upgraded to gdnsd-3.2.2 [18:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:25] (03PS3) 10ArielGlenn: weekly dump of machine vision tables from commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/573351 (https://phabricator.wikimedia.org/T236431) [18:48:35] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [18:49:06] (03PS1) 10RLazarus: site: Assign mw13[56-62] as mediawiki::appserver::api. [puppet] - 10https://gerrit.wikimedia.org/r/573352 (https://phabricator.wikimedia.org/T236437) [18:49:20] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw:fundraising single-cpu misc servers - https://phabricator.wikimedia.org/T244950 (10Papaul) [18:50:37] (03CR) 10Dzahn: [C: 03+1] site: Assign mw13[56-62] as mediawiki::appserver::api. [puppet] - 10https://gerrit.wikimedia.org/r/573352 (https://phabricator.wikimedia.org/T236437) (owner: 10RLazarus) [18:50:59] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) In this topic branch i am also switching monitoring of these services from HTTP to HTTPS: https://gerrit.wikimedia.org/r/q/topic:%22icinga-http-https%22+(status:op... 
[18:51:11] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10LDickinsonWMF) Hi, @jbond, @Aklapper, @RhinosF1, and @Varnent: This is my Wikimedia account. My username orginally was LDickinson (WMF), but MediaWiki required me to remove th... [18:51:47] (03CR) 10RLazarus: [C: 03+2] site: Assign mw13[56-62] as mediawiki::appserver::api. [puppet] - 10https://gerrit.wikimedia.org/r/573352 (https://phabricator.wikimedia.org/T236437) (owner: 10RLazarus) [18:53:38] (03PS3) 10Zoranzoki21: Add throttle rule for Vancouver Community College library event on 2020-03-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573345 (https://phabricator.wikimedia.org/T245323) [18:57:35] Hi, beta-mediawiki-config-update-eqiad is stuck again [19:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200219T1900). [19:00:04] Jdlrobson: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:24] Bot forgot on me :/ [19:00:53] o/ around [19:01:21] Hello Jdlrobson :) [19:02:06] I can SWAT today! [19:02:07] hey Zoranzoki21 thanks for all the patches you've been submitting! [19:02:29] Hey, sorry, no SWAT right now. [19:02:42] (03CR) 10Muehlenhoff: [C: 04-1] acme_chief: add apt[12]001 to authorized hosts for apt cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573036 (owner: 10Dzahn) [19:02:42] I'm deploying a UBN train blocker. [19:02:42] James_F: okay [19:02:50] Jdlrobson: YW [19:02:53] (Isn't life fun?) 
[19:03:28] James_F: Not always [19:06:00] (03PS4) 10ArielGlenn: weekly dump of machine vision tables from commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/573351 (https://phabricator.wikimedia.org/T236431) [19:06:12] > I'm deploying a UBN train blocker. < is this the one i put in swat @james_F ? [19:06:28] Jdlrobson: No, different one. [19:06:43] I can do yours too, once I've ensured this one is fixed? [19:07:03] Jdlrobson: This one, I think https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/573339/ [19:07:08] James_F: sounds good [19:08:12] Urbanecm: OK if I just do SWAT for you? :-) [19:08:20] absolutely! [19:08:21] Sorry to steal your limelight. [19:08:27] Kk, let's get stuff done. [19:09:11] (03PS2) 10Jforrester: [trwiki] Enable the WikidataPageBanner extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570691 (https://phabricator.wikimedia.org/T244369) [19:09:17] (03CR) 10Jforrester: [C: 03+2] [trwiki] Enable the WikidataPageBanner extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570691 (https://phabricator.wikimedia.org/T244369) (owner: 10Jforrester) [19:10:57] (03Merged) 10jenkins-bot: [trwiki] Enable the WikidataPageBanner extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570691 (https://phabricator.wikimedia.org/T244369) (owner: 10Jforrester) [19:11:31] PROBLEM - Check systemd state on mw1356 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:53] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.20/includes/resourceloader/dependencystore/SqlModuleDependencyStore.php: T245570 resourceloader: fix SqlDependencyModuleStore::setMulti() to use upsert() (duration: 01m 01s) [19:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:57] T245570: Duplicate entry 'ext.uls.pt-vector|en' for key 'PRIMARY' - https://phabricator.wikimedia.org/T245570 [19:12:02] ^mw13 servers = new installs [19:12:06] we got it [19:12:20] Thanks, mutante. [19:12:39] it's hard to avoid that because the checks get added in the moment we apply the role the first time.. before they are pooled [19:12:42] OK, T245570 looks fixed, which is nice. [19:12:57] and it's also somewhat nice to see them turn green after initial run [19:13:04] will silence IRC part a bit though [19:14:33] Jdlrobson: Any need to test the trwiki one before deploying? [19:14:51] (03PS2) 10Jforrester: Disable MobileFrontend mainpage special casing on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570982 (https://phabricator.wikimedia.org/T244577) (owner: 10Ammarpad) [19:14:58] (03CR) 10Jforrester: [C: 03+2] Disable MobileFrontend mainpage special casing on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570982 (https://phabricator.wikimedia.org/T244577) (owner: 10Ammarpad) [19:15:57] (03Merged) 10jenkins-bot: Disable MobileFrontend mainpage special casing on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570982 (https://phabricator.wikimedia.org/T244577) (owner: 10Ammarpad) [19:15:59] _joe_: its and expected bot, it does some reporting stuff amongst other stuff [19:17:18] LGTM, deploying. 
[19:17:41] (03PS4) 10Jforrester: Add throttle rule for Vancouver Community College library event on 2020-03-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573345 (https://phabricator.wikimedia.org/T245323) (owner: 10Zoranzoki21) [19:17:42] PROBLEM - Check systemd state on mw1359 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:44] PROBLEM - Check systemd state on mw1357 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:13] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T244369 [trwiki] Enable the WikidataPageBanner extension (duration: 01m 05s) [19:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:17] (03CR) 10Jforrester: [C: 03+2] Add throttle rule for Vancouver Community College library event on 2020-03-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573345 (https://phabricator.wikimedia.org/T245323) (owner: 10Zoranzoki21) [19:18:18] T244369: Deploy WikidataPageBanner extension on trwiki - https://phabricator.wikimedia.org/T244369 [19:18:39] James_F: You can merge my patch directly [19:19:08] PROBLEM - Check systemd state on mw1360 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:11] (03Merged) 10jenkins-bot: Add throttle rule for Vancouver Community College library event on 2020-03-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573345 (https://phabricator.wikimedia.org/T245323) (owner: 10Zoranzoki21) [19:19:58] PROBLEM - Check systemd state on mw1362 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:49] Zoranzoki21: Yeah, LGTM. [19:21:01] !log jforrester@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: T244577 [metawiki] Disable MobileFrontend mainpage special casing (duration: 01m 04s) [19:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:05] T244577: Request to disable MFSpecialCaseMainPage for metawiki - https://phabricator.wikimedia.org/T244577 [19:21:06] RECOVERY - Check systemd state on mw1362 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:14] PROBLEM - PHP7 rendering on mw1358 is CRITICAL: connect to address 10.64.48.200 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:22:30] James_F: sorry no testing on tr needed [19:22:38] sorry im listening to roan talk at the same time [19:22:57] Jdlrobson: Ha. [19:23:15] i'll need to test trwiki after it goes live [19:23:36] Jdlrobson: It went live 7 minutes ago. [19:23:48] (It looked OK to me.) [19:24:42] James_F: yup works ! https://tr.wikipedia.org/wiki/Kullan%C4%B1c%C4%B1:Jdlrobson/draft [19:24:46] RECOVERY - Check systemd state on mw1359 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:04] Success. [19:25:45] Aha, the MF code has finally landed. 
[19:26:06] PROBLEM - Apache HTTP on mw1357 is CRITICAL: connect to address 10.64.48.199 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [19:26:28] PROBLEM - mediawiki-installation DSH group on mw1357 is CRITICAL: Host mw1357 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:26:28] PROBLEM - mediawiki-installation DSH group on mw1361 is CRITICAL: Host mw1361 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:26:34] RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 200 OK - 74387 bytes in 0.974 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:26:58] ACKNOWLEDGEMENT - Apache HTTP on mw1357 is CRITICAL: connect to address 10.64.48.199 and port 80: Connection refused daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Application_servers [19:26:58] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1357 is CRITICAL: Host mw1357 is not in mediawiki-installation dsh group daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:26:58] ACKNOWLEDGEMENT - Apache HTTP on mw1361 is CRITICAL: connect to address 10.64.48.203 and port 80: Connection refused daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Application_servers [19:26:58] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1361 is CRITICAL: Host mw1361 is not in mediawiki-installation dsh group daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:27:31] k i have a headphone in my ear so ca n test when synced [19:27:42] rather when it's on debug [19:27:54] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.20/skins/MinervaNeue/includes/MinervaHooks.php: T245162 Check title value before proceeding to check if user page (duration: 01m 04s) [19:28:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:01] T245162: Job unable to create file page "Fatal: Call function inNamespace() on null" (via MinervaHooks) - https://phabricator.wikimedia.org/T245162 [19:28:03] (03CR) 10Papaul: [C: 03+1] site: add new codfw mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572993 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [19:29:11] Jdlrobson: It's live in prod for wmf.20 (nothing seemed to break); on mwdebug1001 if you want to poke for wmf.19. [19:29:25] will poke 19 now [19:30:11] James_F: yeh not seeing any issues. i think this is good [19:30:19] Syncing. [19:31:10] PROBLEM - mediawiki-installation DSH group on mw1358 is CRITICAL: Host mw1358 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:31:21] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.19/skins/MinervaNeue/includes/MinervaHooks.php: T245162 Check title value before proceeding to check if user page (duration: 01m 04s) [19:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:28] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1358 is CRITICAL: Host mw1358 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:32:28] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1360 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.008 second response time daniel_zahn new install https://wikitech.wikimedia.org/wiki/Application_servers [19:33:14] (03CR) 10RLazarus: [C: 03+1] site: add new codfw mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572993 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [19:35:31] (03CR) 10Dzahn: [C: 03+2] site: add new codfw mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572993 (https://phabricator.wikimedia.org/T241852) (owner: 
10Dzahn) [19:35:48] PROBLEM - mediawiki-installation DSH group on mw1362 is CRITICAL: Host mw1362 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:35:48] PROBLEM - PHP opcache health on mw1360 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:35:49] (03PS3) 10Dzahn: site: add new codfw mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572993 (https://phabricator.wikimedia.org/T241852) [19:36:07] thanks for the swats james [19:36:23] ACKNOWLEDGEMENT - PHP opcache health on mw1360 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:36:23] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1362 is CRITICAL: Host mw1362 is not in mediawiki-installation dsh group daniel_zahn new installs https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:36:59] SWAT done, sorry. 
[19:37:06] RECOVERY - Apache HTTP on mw1357 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.469 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:37:14] RECOVERY - Check systemd state on mw1357 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:28] !log initial puppet run on new hosts mw231* [19:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:06] RECOVERY - PHP opcache health on mw1360 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:41:26] RECOVERY - Check systemd state on mw1360 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:12] RECOVERY - Check systemd state on mw1356 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:02] (03PS1) 10Dzahn: add fake keys for new codfw appserver certs [labs/private] - 10https://gerrit.wikimedia.org/r/573357 (https://phabricator.wikimedia.org/T241852) [19:50:59] !log generating mcrouter certs for new codfw mw appservers [19:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:33] (03PS1) 10Papaul: DNS: ADD mgmt DNS for frdb2001, payments200[1-3]-a [dns] - 10https://gerrit.wikimedia.org/r/573358 [19:51:54] (03CR) 10jerkins-bot: [V: 04-1] DNS: ADD mgmt DNS for frdb2001, payments200[1-3]-a [dns] - 10https://gerrit.wikimedia.org/r/573358 (owner: 10Papaul) [19:54:07] !log scap pull on new api servers mw13[56-62] [19:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:20] (03PS2) 10Papaul: DNS: ADD mgmt DNS for frdb2001, payments200[1-3]-a [dns] - 10https://gerrit.wikimedia.org/r/573358 [19:55:38] longma: I'm feeling optimistic about the train; you? 
[19:56:58] James_F: you are now, I always start optimistic :) [19:57:06] * James_F grins. [19:58:15] James_F: It’s how we all finish that counts. [19:58:47] James_F: oh good, I'll follow your lead [19:59:10] Ha. [20:00:05] James_F and longma: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200219T2000). [20:00:16] (03PS1) 10Jforrester: group1 wikis to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573361 [20:00:18] (03CR) 10Jforrester: [C: 03+2] group1 wikis to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573361 (owner: 10Jforrester) [20:00:19] Good luck! [20:00:38] Fingers crossed and all that. [20:01:44] Umm. [20:01:50] Yep [20:01:55] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573361 (owner: 10Jforrester) [20:01:55] (03PS1) 10Ottomata: Add eventgate-analytics-external.svc entries [dns] - 10https://gerrit.wikimedia.org/r/573362 (https://phabricator.wikimedia.org/T233629) [20:02:04] Amir1: You're back-porting to Wikidata during the train?! [20:02:13] Amir1: That is very not cool. [20:02:35] !log rzl@cumin1001 conftool action : set/weight=10; selector: name=mw13(5[6-9]|6[0-2]).eqiad.wmnet [20:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:47] James_F: it can wait, it's a metric fix [20:02:59] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw13(5[6-9]|6[0-2]).eqiad.wmnet [20:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:02] we are getting wrong metrics [20:03:16] Amir1: Only UBNs should get emergency deployed (outside of a window), and should be co-ordinated with RelEng and SRE. [20:03:16] sorry if I stepped on your toes [20:03:21] Fine, will proceed. [20:03:28] Syncing now. 
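Side note on the conftool selectors logged above: `name=mw13(5[6-9]|6[0-2]).eqiad.wmnet` reads as a regular expression. A minimal Python sketch follows to check which hostnames it would cover. It assumes the selector is matched against the whole hostname, which this log does not confirm, and the candidate pool is made up for illustration.

```python
import re

# Selector copied from the conftool log lines. Treating it as a regex matched
# against the whole hostname is an assumption made for this illustration.
SELECTOR = r"mw13(5[6-9]|6[0-2]).eqiad.wmnet"


def matching_hosts(selector, candidates):
    """Return the candidate hostnames that the selector matches in full."""
    pattern = re.compile(selector)
    return [host for host in candidates if pattern.fullmatch(host)]


# Hypothetical candidate pool: mw1300 through mw1399 in eqiad.
candidates = [f"mw13{n:02d}.eqiad.wmnet" for n in range(100)]
selected = matching_hosts(SELECTOR, candidates)
print(selected[0], "...", selected[-1], f"({len(selected)} hosts)")
```

Under those assumptions the selection is the same set as the `mw13[56-62]` shorthand used in the earlier `scap pull` log entry.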
[20:04:22] Manual testing on canaries seem OK to me. [20:04:37] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.20 [20:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:35] This looks good [20:05:36] Usual minor spike in timeouts. [20:05:41] !log jforrester@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.20 (duration: 01m 03s) [20:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:17] OK, declaring the train moved to group1, at least for now. [20:07:19] (03CR) 10Ottomata: "We can abandon, ya?" [puppet] - 10https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) (owner: 10Elukey) [20:07:24] Amir1: Over to you. [20:07:29] thank you! [20:07:46] (03CR) 10Ottomata: airflow: Expand sudo rights to analytics-search user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572997 (owner: 10EBernhardson) [20:09:06] James_F: well done, that was too calm! [20:09:49] If only all train deploys went so smoothly. [20:09:57] (03CR) 10RLazarus: [C: 03+1] add fake keys for new codfw appserver certs [labs/private] - 10https://gerrit.wikimedia.org/r/573357 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [20:10:25] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.19/extensions/Wikibase/lib: Fix stastd metric for StatsdMissRecordingSimpleCache (wb_terms work) (duration: 01m 05s) [20:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:33] James_F: all deployments would be even nicer! [20:11:28] Well, indeed. 
[20:12:34] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.19/extensions/Wikibase/lib: Fix stastd metric for StatsdMissRecordingSimpleCache (wb_terms work) (duration: 01m 06s) [20:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:31] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake keys for new codfw appserver certs [labs/private] - 10https://gerrit.wikimedia.org/r/573357 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [20:13:56] !log rzl@cumin1001 conftool action : set/weight=30; selector: name=mw13(5[6-9]|6[0-2]).eqiad.wmnet [20:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:30] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib: Fix stastd metric for StatsdMissRecordingSimpleCache (wb_terms work) (duration: 01m 06s) [20:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:36] I'm done o/ [20:16:43] Sorry again for this [20:16:43] (03PS1) 10Ottomata: Add LVS for eventgate-analytics-external on port 4692 [puppet] - 10https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) [20:16:45] (03PS1) 10Ottomata: Add discovery for eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/573366 (https://phabricator.wikimedia.org/T233629) [20:17:03] (03PS1) 10Ottomata: Add discovery for eventgate-analytics-external [dns] - 10https://gerrit.wikimedia.org/r/573367 (https://phabricator.wikimedia.org/T233629) [20:18:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of inline comments. 
We also lack k8s tokens and namespaces, I can work on those tomorrow though" (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [20:18:46] ACKNOWLEDGEMENT - Host flowspec1001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn WIP [20:22:25] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn) a:05Papaul→03Dzahn [20:22:26] RECOVERY - mediawiki-installation DSH group on mw1357 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:26:46] RECOVERY - mediawiki-installation DSH group on mw1361 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:28:49] (03PS1) 10Ottomata: Route intake-analytics.wm.org to eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/573369 (https://phabricator.wikimedia.org/T233629) [20:30:35] (03PS1) 10Dzahn: site: add 6 new codfw appservers in rack B3 with mw role [puppet] - 10https://gerrit.wikimedia.org/r/573370 (https://phabricator.wikimedia.org/T241852) [20:31:38] RECOVERY - mediawiki-installation DSH group on mw1358 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:31:42] (03PS2) 10Dzahn: site: add 6 new codfw appservers in rack B3 with mw role [puppet] - 10https://gerrit.wikimedia.org/r/573370 (https://phabricator.wikimedia.org/T241852) [20:32:13] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 5 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [20:32:14] (03CR) 10RLazarus: [C: 03+1] site: add 6 new codfw appservers in rack B3 with mw role [puppet] - 10https://gerrit.wikimedia.org/r/573370 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [20:32:51] (03CR) 10jerkins-bot: [V: 04-1] site: add 6 new codfw 
appservers in rack B3 with mw role [puppet] - 10https://gerrit.wikimedia.org/r/573370 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [20:33:30] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [20:33:31] (03PS3) 10Dzahn: site: add 7 new codfw appservers in rack B3 with mw role [puppet] - 10https://gerrit.wikimedia.org/r/573370 (https://phabricator.wikimedia.org/T241852) [20:33:54] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) a:03Ottomata [20:34:43] (03CR) 10jerkins-bot: [V: 04-1] site: add 7 new codfw appservers in rack B3 with mw role [puppet] - 10https://gerrit.wikimedia.org/r/573370 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [20:36:24] RECOVERY - mediawiki-installation DSH group on mw1362 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:36:57] (03PS4) 10Dzahn: site: add 7 new codfw appservers in rack B3 with mw role [puppet] - 10https://gerrit.wikimedia.org/r/573370 (https://phabricator.wikimedia.org/T241852) [20:41:03] (03PS1) 10Herron: add forward/reverse ipv4/ipv6 records for lists1001 VM [dns] - 10https://gerrit.wikimedia.org/r/573373 [20:44:11] (03PS2) 10Herron: add forward/reverse ipv4/ipv6 records for lists1001 VM [dns] - 10https://gerrit.wikimedia.org/r/573373 (https://phabricator.wikimedia.org/T224586) [20:46:24] (03PS4) 10Ottomata: New eventgate-analytics-external instance using remote EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) [20:49:34] (03CR) 10Ottomata: New eventgate-analytics-external instance using remote EventStreamConfig 
API (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [20:49:37] (03CR) 10Herron: [C: 03+2] add forward/reverse ipv4/ipv6 records for lists1001 VM [dns] - 10https://gerrit.wikimedia.org/r/573373 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [20:50:32] (03PS5) 10Ottomata: New eventgate-analytics-external instance using remote EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) [20:51:07] (03CR) 10Ottomata: New eventgate-analytics-external instance using remote EventStreamConfig API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [20:53:26] (03PS2) 10Jforrester: Bump php pointer from 1.35.0-wmf.19 to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573006 [20:53:32] (03CR) 10Jforrester: [C: 03+2] Bump php pointer from 1.35.0-wmf.19 to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573006 (owner: 10Jforrester) [20:54:41] (03Merged) 10jenkins-bot: Bump php pointer from 1.35.0-wmf.19 to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573006 (owner: 10Jforrester) [20:59:11] (03CR) 10Dzahn: [C: 03+2] site: add 7 new codfw appservers in rack B3 with mw role [puppet] - 10https://gerrit.wikimedia.org/r/573370 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [21:00:05] cscott, arlolra, subbu, halfak, and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200219T2100). 
[21:00:52] (03PS13) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [21:01:42] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [21:04:31] (03CR) 10Holger Knust: "Here are the changes based on the discussion." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [21:04:47] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Dzahn) etherpad with role layout and status: https://etherpad.wikimedia.org/p/T236437 [21:05:43] (03PS1) 10RLazarus: site: Assign mw13[64-73,84] as mediawiki::appserver. [puppet] - 10https://gerrit.wikimedia.org/r/573379 [21:07:00] 10Operations, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (10Jdforrester-WMF) I propose that we Decline this, given that it's alrea... [21:07:33] (03CR) 10Jforrester: [C: 04-1] "We're not absolutely sure right now; see task." [puppet] - 10https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) (owner: 10Jforrester) [21:09:04] (03CR) 10Dzahn: [C: 03+1] site: Assign mw13[64-73,84] as mediawiki::appserver. [puppet] - 10https://gerrit.wikimedia.org/r/573379 (owner: 10RLazarus) [21:09:19] (03PS1) 10Jhedden: openstack: Update cloud-init for virtio-scsi devices [puppet] - 10https://gerrit.wikimedia.org/r/573381 [21:09:24] (03CR) 10RLazarus: [C: 03+2] site: Assign mw13[64-73,84] as mediawiki::appserver. 
[puppet] - 10https://gerrit.wikimedia.org/r/573379 (owner: 10RLazarus) [21:09:57] (03PS2) 10RLazarus: site: Assign mw13[64-73,84] as mediawiki::appserver. [puppet] - 10https://gerrit.wikimedia.org/r/573379 [21:10:36] andrewbogott: Is now a good moment for me to deploy the "make wgServer protocol relative for Wikitech"? [21:10:45] (03PS2) 10Jforrester: Set wgServer to protocol-relative for Wikitech and Test Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571786 [21:11:33] ebernhardson: Now OK for me to revert the more_like redirection into codfw? It's been over 24 hours. [21:11:45] 10Operations, 10ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Jclark-ctr) @wiki_willy going by dell support based on service tag Part number: PR5D1 DIMM,32GB,2133,2RX4,8G,DDR4,R 2 [21:15:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Jclark-ctr) @Andrew This is already located in 10g rack just needs Dac cables and connected to 10g nic. I have talked to @JHedd... [21:18:28] (03PS1) 10Dzahn: conftool-data: add 7 new codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/573386 (https://phabricator.wikimedia.org/T241852) [21:19:34] PROBLEM - Check systemd state on mw2315 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:23:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:26] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2315.codfw.wmnet ` [21:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:33] 10Operations, 10ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10wiki_willy) [21:25:41] 10Operations, 10ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10wiki_willy) Created T245670 to have a replacement DIMM ordered and delivered. Thanks, Willy [21:26:01] (03CR) 10Jforrester: Revert "cirrus: redirect more_like to codfw to rebuild query cache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572932 (owner: 10Jforrester) [21:29:08] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [21:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:58] rlazarus: downtiming a ton of hosts or forcing a puppet run on the icinga host? 
[21:30:01] :-P [21:30:08] the latter :D new hosts [21:30:33] I didn't see the reimage script message [21:31:06] not reimage, new-new hosts [21:31:18] oh hmmm I guess the puppet run wasn't necessary though, the only thing I'm doing right now is changing their role [21:31:22] so icinga probably already knows about them [21:31:30] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:35] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by rzl@cumin1001 on 11 host(s) and their services with reason: new installs ` mw[1364-1373,1384].eqiad.wmnet ` [21:31:39] if they didn't have a role assigned, no [21:31:51] they were spare::system until just now [21:32:00] PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:05] so there is no way to avoid the spam [21:32:14] PROBLEM - Check systemd state on mw2316 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:21] you can force a puppet run on all the hosts via cumin [21:32:26] and then another on the icinga host [21:32:28] PROBLEM - Check systemd state on mw2313 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:33] with the downtime [21:32:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:32:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:43] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2313.codfw.wmnet ` [21:32:47] ahhh I always forget icinga uses the compiled output, that's so counterintuitive [21:32:54] but yeah of course you're right, that's the order to do it in [21:32:54] thanks [21:32:56] or disable puppet on the icinga host before merging the patch that changes the role [21:32:58] oh, ACK. puppet run is neede on hosts and icinga [21:32:59] right [21:33:22] let puppet run everywhere, and then do downtime with the force puppet-rin [21:33:25] *run [21:33:40] PROBLEM - Check systemd state on mw2310 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:44] yes, they are all exported resources, so they need the double run, totally counterintuitive [21:34:06] icinga already knows about the hosts and has the base checks but the role change adds moaar checks [21:34:16] (on a change from spare to prod) [21:34:46] before that it's similar with "no role" to spare [21:35:02] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw2311 is CRITICAL: NRPE: Command check_check_php7.2-fpm_check_restart_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:35:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:35:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:17] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2311.codfw.wmnet ` [21:35:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:35:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:40] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2314.codfw.wmnet ` [21:35:44] RECOVERY - Check systemd state on mw2313 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:12] PROBLEM - mediawiki-installation DSH group on mw2312 is CRITICAL: Host mw2312 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:36:14] PROBLEM - mediawiki-installation DSH group on mw2316 is CRITICAL: Host mw2316 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:36:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:36:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:27] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2316.codfw.wmnet ` [21:36:47] if you have more of those to do my suggestion is to temporarily disable puppet on icinga [21:37:42] hmm. 
but we also want to watch the checks in the web UI [21:38:17] downtime for 1h, then merge the puppet patch, either wait 30 min or force a run on the new hosts (*not* all together though, -b 15 is a good one), then re-enable puppet on icinga and run the downtime with the force-puppet-run [21:38:40] PROBLEM - nutcracker process on mw2312 is CRITICAL: NRPE: Command check_nutcracker not defined https://wikitech.wikimedia.org/wiki/Nutcracker [21:38:42] the double downtime is because the puppet apply might take down things that were already checked [21:38:53] so you want to downtime the existing checks and then the new ones later on [21:39:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:39:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:18] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 3 host(s) and their services with reason: new_install ` mw[2310-2312].codfw.wmnet ` [21:39:21] if you re-enable puppet late enough you don't need the downtime as all should already be green at the first attempt [21:39:24] using better cumin query to reduce the number of those [21:41:57] volans: ACK, thanks. 
we should put that somewhere in docs [21:42:24] it's a pity Icinga doesn't allow to downtime a host including any new check that will be added later [21:42:42] yea, that was what i was just talking about with Reuven [21:42:50] can't add downtime for something before it exists [21:43:02] icinga just reads from the named pipe commandfile and tries to match things [21:44:06] also the fact that a puppet run on icinga is soooo slow doesn't help to coordinate stuff [21:44:59] my 2 cents is to reimage the servers directly into their final role if that doesn't have side effects (things like auto-discovery) [21:45:40] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw2311 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:46:47] i dumped that into https://wikitech.wikimedia.org/wiki/Icinga#Avoid_Icinga_spam_on_new_server_installs for right now [21:46:59] will enhance..it's wiki [21:48:03] !log all authdns servers - upgrade to gdnsd-3.2.2 [21:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:46] thanks mutante :) [21:52:09] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [21:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:01] (03CR) 10Dzahn: [C: 03+2] conftool-data: add 7 new codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/573386 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [21:54:25] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:29] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by rzl@cumin1001 on 11 host(s) and their services with reason: new installs ` mw[1364-1373,1384].eqiad.wmnet ` 
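The ordering volans describes earlier in this log (disable puppet on the Icinga host, downtime, merge, run puppet on the new hosts in batches, re-enable puppet on Icinga, downtime again) can be written down as a small checklist generator. This is only a sketch in Python: the step wording paraphrases the chat, the function and field names are hypothetical, and it prints a plan instead of calling cumin, puppet, or the downtime cookbook, whose exact invocations are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    where: str   # which machine(s) the step applies to
    action: str  # what to do there, in plain words


def role_change_plan(hosts, batch_size=15):
    """Ordered steps for a role change that should not spam Icinga alerts.

    Paraphrases the procedure from the chat; all names are illustrative.
    batch_size mirrors the "-b 15" suggestion made in the discussion.
    """
    host_list = ", ".join(hosts)
    return [
        Step("icinga host", "disable puppet"),
        Step(host_list, "downtime the existing checks for 1h"),
        Step("puppet repo", "merge the patch that changes the role"),
        Step(host_list,
             f"wait 30 min or force a puppet run in batches of {batch_size}"),
        Step("icinga host", "re-enable puppet"),
        Step(host_list,
             "run the downtime again with the forced puppet run "
             "so the newly added checks are covered"),
    ]


plan = role_change_plan(["mw2310.codfw.wmnet", "mw2311.codfw.wmnet"])
for number, step in enumerate(plan, start=1):
    print(f"{number}. [{step.where}] {step.action}")
```

The second downtime is kept as its own step because, as noted in the chat, Icinga cannot downtime checks that do not exist yet.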
[21:56:54] RECOVERY - Check systemd state on mw2316 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:04] 10Operations, 10ops-codfw, 10netops: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (10Papaul) 05Open→03Resolved a:03Papaul Complete [21:57:22] RECOVERY - nutcracker process on mw2312 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [21:57:46] RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:51] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Papaul) [21:57:52] RECOVERY - Check systemd state on mw2315 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:27] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10cloud-services-team (Hardware): decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10Papaul) [21:58:30] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 7 host(s) and their services with reason: new_install ` mw[2310-2316].codfw.wmnet ` [21:58:48] RECOVERY - Check systemd state on mw2310 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:53] (03PS1) 10Ottomata: Set Superset UPLOAD_FOLDER to /tmp/superset_uploads/ [puppet] - 10https://gerrit.wikimedia.org/r/573393 (https://phabricator.wikimedia.org/T245679) [22:08:27] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2314.codfw.wmnet [22:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:49] !log rzl@cumin1001 conftool action : set/weight=30; selector: name=mw13(6[4-9]|7[0-3]|84).eqiad.wmnet [22:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:01] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw13(6[4-9]|7[0-3]|84).eqiad.wmnet [22:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:19] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw231([0-6]).codfw.wmnet [22:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:04] PROBLEM - IPMI Sensor Status on es2022 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [22:14:07] (03PS2) 10SBassett: Revert "Also log authevents channel." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572895 [22:14:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw231([0-6]).codfw.wmnet [22:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:45] (03CR) 10Krinkle: "Which side-effets? The dash was fixed separately." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572895 (owner: 10SBassett) [22:18:49] James_F: I'm having IRC issues -- you pinged me about that wikitech patch but I can't tell when you pinged [22:18:59] anyway... 
if for some reason you want to do it now that would be fine [22:21:52] (03CR) 10SBassett: "> Meh, now I think maybe this should just be reverted." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477005 (owner: 10Brian Wolff) [22:23:34] !log phabricator - upgrading PHP packages [22:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10Dwisehaupt) @jbond Thanks. Both of these workarounds look good to me at this point. I appreciate taking the time to do it right and give us segmented permissio... [22:27:33] (03CR) 10Gergő Tisza: "Yeah, I don't think it's causing any problems at this point. OTOH given that captcha logging was moved to a different channel, is the auth" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572895 (owner: 10SBassett) [22:29:12] (03PS2) 10Dzahn: acme_chief: add apt[12]001 to authorized hosts for apt cert [puppet] - 10https://gerrit.wikimedia.org/r/573036 [22:29:27] (03CR) 10Dzahn: acme_chief: add apt[12]001 to authorized hosts for apt cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573036 (owner: 10Dzahn) [22:32:15] (03PS1) 10Samwilson: Enable password-reset (requireemail pref) on test WD and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573397 (https://phabricator.wikimedia.org/T245660) [22:32:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573036 (owner: 10Dzahn) [22:34:52] (03CR) 10SBassett: "> Yeah, I don't think it's causing any problems at this point. 
OTOH given that captcha logging was moved to a different channel, is the au" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572895 (owner: 10SBassett) [22:35:47] (03CR) 10Dzahn: acme_chief: add apt[12]001 to authorized hosts for apt cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573036 (owner: 10Dzahn) [22:37:18] RECOVERY - mediawiki-installation DSH group on mw2316 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:37:42] andrewbogott: Sorry, yeah, doing now. [22:37:52] !log taking cp3050 & cp3051 offline for firmware update via T243167 [22:37:54] (03CR) 10Jforrester: [C: 03+2] Set wgServer to protocol-relative for Wikitech and Test Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571786 (owner: 10Jforrester) [22:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:56] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 [22:37:58] (03PS1) 10Ayounsi: Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 [22:38:57] (03Merged) 10jenkins-bot: Set wgServer to protocol-relative for Wikitech and Test Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571786 (owner: 10Jforrester) [22:38:59] (03PS1) 10Dzahn: site/conftooldata: add more new eqiad MW API appservers [puppet] - 10https://gerrit.wikimedia.org/r/573402 (https://phabricator.wikimedia.org/T236437) [22:40:21] (03PS3) 10Dzahn: acme_chief: add apt[12]001 to authorized hosts for apt cert [puppet] - 10https://gerrit.wikimedia.org/r/573036 [22:40:29] (03CR) 10Dzahn: acme_chief: add apt[12]001 to authorized hosts for apt cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573036 (owner: 10Dzahn) [22:41:03] andrewbogott: All looks good from my end; syncing. 
[22:41:15] (03PS2) 10Ayounsi: Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 [22:41:18] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [22:42:01] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wgServer to protocol-relative for Wikitech and Test Wikitech (duration: 01m 05s) [22:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:23] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [22:44:52] (03CR) 10Dzahn: [C: 03+2] acme_chief: add apt[12]001 to authorized hosts for apt cert [puppet] - 10https://gerrit.wikimedia.org/r/573036 (owner: 10Dzahn) [22:45:09] (03PS2) 10EBernhardson: airflow: Expand sudo rights to analytics-search user [puppet] - 10https://gerrit.wikimedia.org/r/572997 [22:45:26] (03CR) 10Dzahn: [C: 03+2] site/conftooldata: add more new eqiad MW API appservers [puppet] - 10https://gerrit.wikimedia.org/r/573402 (https://phabricator.wikimedia.org/T236437) (owner: 10Dzahn) [22:48:50] (03CR) 10Ppchelko: [C: 03+1] "Woooohoooo! if I also install Kafka-dev chart and set proper addresses, this now runs locally." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [22:49:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:46] 10Operations, 10serviceops, 10Patch-For-Review: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 11 host(s) and their services with reason: new_install ` mw[1363,1374-1383].eqia... [22:50:00] James_F: wikitech lgtm. Thanks! [22:54:57] !log cp3050 & cp3051 returned to service via T243167 [22:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:01] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 [22:56:00] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [22:56:19] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@c16c63a]: articletopic thresholding for ores scores and eventgate port update [22:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:16] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@c16c63a]: articletopic thresholding for ores scores and eventgate port update (duration: 00m 57s) [22:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:54] (03PS3) 10Ayounsi: Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 [22:59:11] (03CR) 10EBernhardson: [C: 03+1] "shouldn't require anything special to ship" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/572932 (owner: 10Jforrester) [22:59:43] (03CR) 10Dzahn: [C: 03+2] profile::microsites::static_rt: disable the rsync service [puppet] - 10https://gerrit.wikimedia.org/r/573287 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [22:59:51] (03PS2) 10Dzahn: profile::microsites::static_rt: disable the rsync service [puppet] - 10https://gerrit.wikimedia.org/r/573287 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [23:04:14] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/572343" [puppet] - 10https://gerrit.wikimedia.org/r/573246 (owner: 10Muehlenhoff) [23:04:57] (03CR) 10Dzahn: [C: 03+1] Add system::role for role::logging::webrequest::ops [puppet] - 10https://gerrit.wikimedia.org/r/573230 (owner: 10Muehlenhoff) [23:05:41] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Add check for changes applied at all runs - https://phabricator.wikimedia.org/T242910 (10Dzahn) flowspec1001: setup in progress (Arzhel) mwdebug: tests in progress (Effie) mwmaint: not disabled anymore, ran puppet vega: change merged , ran puppet [23:08:42] RECOVERY - mediawiki-installation DSH group on mw2312 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:09:26] (03PS2) 10Dzahn: role::noc::site: refactor role/profile, stop duplicate include [puppet] - 10https://gerrit.wikimedia.org/r/572343 [23:09:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:40] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:09:40] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 10 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (10Krinkle) [23:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:52] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 10 others: 
Define an official thumb API - https://phabricator.wikimedia.org/T66214 (10Krinkle) [23:10:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:40] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:14] (03CR) 10Mooeypoo: [C: 03+1] Enable password-reset (requireemail pref) on test WD and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573397 (https://phabricator.wikimedia.org/T245660) (owner: 10Samwilson) [23:13:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:25] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 11 host(s) and their services with reason: new_install ` mw[1363,1374-1383].eqiad.wmnet ` [23:13:41] (03PS3) 10Jforrester: Revert "cirrus: redirect more_like to codfw to rebuild query cache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572932 [23:13:47] (03CR) 10Jforrester: [C: 03+2] Revert "cirrus: redirect more_like to codfw to rebuild query cache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572932 (owner: 10Jforrester) [23:14:58] (03Merged) 10jenkins-bot: Revert "cirrus: redirect more_like to codfw to rebuild query cache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572932 (owner: 10Jforrester) [23:16:43] (03PS3) 10Dzahn: role::noc::site: refactor role/profile, stop duplicate include [puppet] - 10https://gerrit.wikimedia.org/r/572343 [23:17:42] (03CR) 10RLazarus: [C: 
03+1] "Only reviewed the Envoy configs but they seem like a good place to start." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [23:19:21] (03PS4) 10Dzahn: role::noc::site: refactor role/profile, stop duplicate include [puppet] - 10https://gerrit.wikimedia.org/r/572343 [23:19:42] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/20915/" [puppet] - 10https://gerrit.wikimedia.org/r/572343 (owner: 10Dzahn) [23:20:53] (03CR) 10Dzahn: [C: 04-1] "let's merge the refactor change right away ^" [puppet] - 10https://gerrit.wikimedia.org/r/573246 (owner: 10Muehlenhoff) [23:22:01] (03CR) 10Dzahn: [C: 03+1] DNS: ADD mgmt DNS for frdb2001, payments200[1-3]-a [dns] - 10https://gerrit.wikimedia.org/r/573358 (owner: 10Papaul) [23:22:44] (03CR) 10Dzahn: [C: 03+2] add grafana-labs.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/572385 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [23:23:02] (03PS2) 10Dzahn: add grafana-labs.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/572385 (https://phabricator.wikimedia.org/T210411) [23:23:09] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cirrus: redirect more_like from codfw back to eqiad (duration: 01m 04s) [23:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:52] (03CR) 10Dzahn: [C: 03+2] add graphite-labs.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/572387 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [23:23:57] (03PS2) 10Dzahn: add graphite-labs.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/572387 (https://phabricator.wikimedia.org/T210411) [23:25:39] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw1363.eqiad.wmnet [23:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:07] !log dzahn@cumin1001 conftool action : set/weight=30; selector: 
name=mw137[4-9].eqiad.wmnet [23:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:35] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw138[0-3].eqiad.wmnet [23:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:22] !log jforrester@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: cirrus: Reduce CirrusSearch-MoreLike cache workers and queue back to normal (duration: 01m 03s) [23:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:23] ebernhardson: All seems OK from here. [23:32:50] (03CR) 10Dwisehaupt: [C: 03+1] "Looks good." [dns] - 10https://gerrit.wikimedia.org/r/573358 (owner: 10Papaul) [23:36:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1363.eqiad.wmnet [23:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw137[4-9].eqiad.wmnet [23:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:29] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw138[0-3].eqiad.wmnet [23:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:19] James_F: yup, basically just works :) [23:47:24] good once in awhile.. [23:52:14] 10Operations, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Should 'doc' machines (i.e. doc1001) have contint-roots as a group? - https://phabricator.wikimedia.org/T245691 (10Jdforrester-WMF) [23:52:31] 10Operations, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Should 'doc' machines (i.e. doc1001) have contint-roots as a group? - https://phabricator.wikimedia.org/T245691 (10Jdforrester-WMF) Created following chatting with @greg earlier. 
[23:52:57] ebernhardson: Well, soon enough we'll be rotating data centres, so… [23:53:36] in cirrus that part is also auto-magic ;) although without a warmup it's a little bumpy at first... [23:53:45] Yeah. :-( [23:53:57] Do we split the traffic over a day ahead of time? [23:54:30] James_F: not usually, it gates on $wmfDatacenter [23:54:48] basically when web requests hit codfw, the codfw app servers will hit the codfw elasticsearch cluster [23:55:41] Maybe we should? [23:56:01] Though I guess doing special work for a planned switch-over doesn't test our emergency switch-over processes much. [23:56:30] it can, but the warmup on elasticsearch only takes a minute and basically amounts to issuing ~100 queries the same way mwgrep does (so, queries all indices at once) [23:56:41] Hmm. [23:57:39] also it's not the end of the world without a warmup, but the latency will be at 2x or maybe 3x for the first bit [23:58:58] worst case, pool counter drops a few requests until the hot parts of data are pulled off disk