[00:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T0000). [00:00:04] hmonroy: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:01:43] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:15] I can deploy today! [00:02:28] hmonroy: hey, are you around? [00:03:08] yes [00:03:11] ready! [00:03:13] :) [00:03:40] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:38] (03PS2) 10Urbanecm: Enable watchlist expiry feature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643554 (https://phabricator.wikimedia.org/T266875) (owner: 10HMonroy) [00:05:44] (03CR) 10Urbanecm: [C: 03+2] Enable watchlist expiry feature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643554 (https://phabricator.wikimedia.org/T266875) (owner: 10HMonroy) [00:06:37] (03Merged) 10jenkins-bot: Enable watchlist expiry feature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643554 (https://phabricator.wikimedia.org/T266875) (owner: 10HMonroy) [00:07:19] hmonroy: pulled onto mwdebug1001, can you test, please? [00:07:30] taking a look [00:08:57] thanks [00:10:42] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2142.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2142.codfw.wmnet'] ` [00:13:13] Urbanecm: Looks good! 
[00:13:20] thanks, deploying [00:14:48] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c73f0bf0d1cdc1c7441261ffb1ad8ae12aa92ec9: Enable watchlist expiry feature on all wikis (T266875) (duration: 01m 07s) [00:14:50] !log created views and wikireplicas indexes on clouddb10[13-19] sans s1 T268312 [00:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:56] T266875: Watchlist Expiry: enable the feature on all remaining wikis [DEC 1] - https://phabricator.wikimedia.org/T266875 [00:14:56] hmonroy: should be live everywhere :) [00:15:00] anything else? [00:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:03] T268312: Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 [00:15:25] awesome! Thank you:) [00:16:15] no problem :) [00:16:20] !log Evening B&C window done [00:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:17] ACKNOWLEDGEMENT - HP RAID on ms-be1022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:2 - Failed: 2I:4:1 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269186 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Ra [00:19:17] thering [00:19:24] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269186 (10ops-monitoring-bot) [00:29:37] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1050101696 and 224 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:29:45] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1588898032 and 232 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:29:49] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 284598008 and 236 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:29:55] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1043336720 and 241 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:32:05] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 368122096 and 373 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:33:19] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1689760 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:37:05] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 681526352 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:39:09] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 651611552 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:49] (03PS1) 10Cwhite: Generate Logstash ECS allow filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 [00:40:57] (03CR) 10jerkins-bot: [V: 04-1] Generate Logstash ECS allow filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 (owner: 10Cwhite) [00:41:02] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [00:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:11] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY 
CRITICAL: DB template1 (host:localhost) 208267256 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:35] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1787408 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:43:32] (03PS4) 10Cwhite: First attempt at a JSONSchema template generator utility. [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 [00:43:40] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [00:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:25] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 7888 and 103 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 390756088 and 309 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:55] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 739711888 and 329 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:17] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 39304 and 155 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:27] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 868162952 and 362 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:27] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 83717368 and 362 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:55] PROBLEM - Postgres Replication Lag 
on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 269500720 and 391 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:27] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 40376 and 225 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:41] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 87288 and 239 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:42] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [00:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:37] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 59174232 and 492 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:16] (03PS2) 10Cwhite: Generate Logstash ECS allow filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 [00:50:24] (03CR) 10jerkins-bot: [V: 04-1] Generate Logstash ECS allow filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 (owner: 10Cwhite) [00:50:41] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 488434064 and 556 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:13] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1347520 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:23] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1740960 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:53:03] RECOVERY - Postgres 
Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1467096 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:33] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2034472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:43] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1791144 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:45] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1390320 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:49] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1419192 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:58:55] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [00:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:46] (03PS5) 10Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) [01:04:59] (03CR) 10Mstyles: Add new helm chart for rdf-streaming-updater (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [01:05:39] (03CR) 10jerkins-bot: [V: 04-1] Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [01:07:41] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 247649232 and 10 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:08:21] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 470760168 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:09:25] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3624 and 38 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:10:03] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 15144 and 77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:10:06] (03CR) 10Mstyles: "@akosiaris" [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [01:10:49] (03CR) 10Mstyles: "Also I have no idea why CI is failing. Will look into that" [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [01:18:26] (03CR) 10Jeena Huneidi: Add new helm chart for rdf-streaming-updater (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [01:30:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:35:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:50:50] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269193 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [01:50:54] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269193 (10ops-monitoring-bot) [01:59:19] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10RobH) Jin went ahead and swapped the old fan for the new one, and all documentation on https://www.juniper.net/documentation/en_US/release-independent/junos/topics/topic-map/mx204-maintain-cooling-compo... [02:20:32] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:42] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:35:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:55:02] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269195 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [02:55:05] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269195 (10ops-monitoring-bot) [03:29:14] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:06] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269193 (10Reedy) [03:30:08] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269195 (10Reedy) [03:37:54] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269197 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [03:37:58] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269197 (10ops-monitoring-bot) [03:41:30] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:20] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:27] (03CR) 10KartikMistry: [C: 03+1] Add apertium namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/644530 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [04:09:19] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [04:09:25] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [04:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:34] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [04:09:38] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [04:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:00] !log T259588 Beginning wdqs categories data-reload on the following instances (one each from `[public, internal] x [eqiad, codfw]`): `wdqs1006`, `wdqs2003`, `wdqs1011`, `wdqs2006` [04:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:07] T259588: Reload categories once 1.36.0-wmf.3 is running on all groups - https://phabricator.wikimedia.org/T259588 [04:30:26] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:32:25] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [04:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:31] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:19] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269198 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [04:40:23] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269198 (10ops-monitoring-bot) [04:42:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (deploy2002, ...), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:42:04] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269193 (10Reedy) [04:42:08] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269197 (10Reedy) [04:42:10] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269198 (10Reedy) [04:48:13] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:53] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:38] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [05:13:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:41] (03CR) 10Andrew Bogott: [C: 03+2] Make clouddvirt1030 a ceph-enabled hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/644621 (https://phabricator.wikimedia.org/T261132) (owner: 10Andrew Bogott) [05:14:49] (03PS2) 10Andrew Bogott: Make clouddvirt1030 a ceph-enabled hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/644621 (https://phabricator.wikimedia.org/T261132) [05:15:08] (03PS1) 10Andrew Bogott: Move cloudvirt200x-dev hosts to Buster [puppet] - 10https://gerrit.wikimedia.org/r/644671 [05:16:09] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudvirt200x-dev hosts to Buster [puppet] - 10https://gerrit.wikimedia.org/r/644671 (owner: 10Andrew Bogott) [05:38:00] (03PS1) 10Razzi: superset: add cached to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268219) [05:39:18] (03PS2) 10Razzi: superset: add cached to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [05:41:23] (03PS3) 10Razzi: superset: add cached to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [05:44:01] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [05:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:37] (03PS4) 10Razzi: superset: add cached to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [05:44:44] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [05:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [05:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:41] (03PS5) 10Razzi: superset: add cached to an-tool1010 [puppet] - 
10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [05:47:29] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26822/console" [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [05:49:05] (03PS6) 10Razzi: superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [06:14:08] (03CR) 10Marostegui: "I am going to merge this as the grants were applied on the DB already. The comment about the puppetization is still valid and needs to be " [puppet] - 10https://gerrit.wikimedia.org/r/644456 (https://phabricator.wikimedia.org/T267214) (owner: 10Marostegui) [06:14:14] (03CR) 10Marostegui: [C: 03+2] production-m2: Add grants for mwaddlink new database [puppet] - 10https://gerrit.wikimedia.org/r/644456 (https://phabricator.wikimedia.org/T267214) (owner: 10Marostegui) [06:16:16] (03PS1) 10Marostegui: mariadb: Decommission es1017 [puppet] - 10https://gerrit.wikimedia.org/r/644675 (https://phabricator.wikimedia.org/T268825) [06:16:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:13] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [06:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:40] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [06:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:56] (03PS1) 10Andrew Bogott: Additional hiera settings for moving cloudvirt2003-dev to ceph [puppet] - 10https://gerrit.wikimedia.org/r/644676 
[06:43:08] (03CR) 10Andrew Bogott: [C: 03+2] Additional hiera settings for moving cloudvirt2003-dev to ceph [puppet] - 10https://gerrit.wikimedia.org/r/644676 (owner: 10Andrew Bogott) [06:54:42] (03PS2) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [06:54:50] !log Remove es1017 from tendril and zarcillo T268825 [06:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:58] T268825: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 [06:55:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission es1017 [puppet] - 10https://gerrit.wikimedia.org/r/644675 (https://phabricator.wikimedia.org/T268825) (owner: 10Marostegui) [07:07:46] (03PS3) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [07:08:02] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [07:10:42] RECOVERY - Device not healthy -SMART- on ms-be1022 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops [07:15:30] (03PS1) 10Elukey: install_server: fix dhcp config for kafka-test1006 [puppet] - 10https://gerrit.wikimedia.org/r/644679 (https://phabricator.wikimedia.org/T268202) [07:16:09] (03CR) 10Elukey: [C: 03+2] install_server: fix dhcp config for kafka-test1006 [puppet] - 10https://gerrit.wikimedia.org/r/644679 (https://phabricator.wikimedia.org/T268202) (owner: 10Elukey) [07:20:00] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:40] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:21:08] Telia transport between ulsfo and eqord --^ [07:21:42] but I see it scheduled, so all good [07:28:16] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:29:02] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:18] (03PS1) 10Elukey: admin: improve the analytics-privatedata-users comment [puppet] - 10https://gerrit.wikimedia.org/r/644733 [07:43:32] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:46:26] (03CR) 10Elukey: [C: 03+2] Set oozie.service.coord.default.max.timeout to 13 months [puppet] - 10https://gerrit.wikimedia.org/r/644535 
(https://phabricator.wikimedia.org/T264358) (owner: 10Ottomata) [07:52:42] PROBLEM - Device not healthy -SMART- on ms-be1022 is CRITICAL: cluster=swift device=None instance=ms-be1022 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops [07:56:31] (03CR) 10Elukey: Ensure /tmp/sqoop-jars/ is present (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [08:01:32] (03CR) 10Elukey: [C: 03+2] admin: improve the analytics-privatedata-users comment [puppet] - 10https://gerrit.wikimedia.org/r/644733 (owner: 10Elukey) [08:02:55] (03PS4) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [08:03:13] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [08:05:41] (03CR) 10Florianschmidtwelzow: Move disabling sitenotice on wikimedia wikis to mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173) (owner: 10Florianschmidtwelzow) [08:10:40] (03PS5) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [08:10:56] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [08:19:51] (03CR) 10Muehlenhoff: admin: improve the analytics-privatedata-users comment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644733 (owner: 10Elukey) [08:19:54] (03PS1) 10Marostegui: production-m2.sql.erb: Add sockpuppet users [puppet] - 
10https://gerrit.wikimedia.org/r/644745 (https://phabricator.wikimedia.org/T268505) [08:25:08] (03PS6) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [08:25:32] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [08:25:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:26:22] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:26:40] (03PS7) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [08:26:58] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [08:28:46] (03CR) 10JMeybohm: coredns: Create a wmfcoredns copy in charts dir (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936 (owner: 10JMeybohm) [08:29:32] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:35] (03CR) 10Elukey: admin: improve the analytics-privatedata-users comment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644733 (owner: 10Elukey) [08:31:04] (03CR) 10JMeybohm: [C: 03+2] Add helm chart for calico CRDs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643974 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [08:32:13] (03Merged) 10jenkins-bot: Add helm chart for calico CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/643974 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [08:33:29] (03CR) 10JMeybohm: Add calico helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [08:35:10] (03PS8) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [08:35:27] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [08:36:22] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:20] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:43] (03CR) 10Muehlenhoff: admin: improve the analytics-privatedata-users comment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644733 (owner: 10Elukey) [08:40:26] (03PS9) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [08:41:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [08:42:59] 10Operations, 10Puppet: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) p:05High→03Unbreak! The previous patch did not have an effect- backups continue to be executed on eqiad, but codfw ones (cumin2001) do not run at all- they are... [08:45:37] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: parametrize rsync max_connections based on backends [puppet] - 10https://gerrit.wikimedia.org/r/643503 (https://phabricator.wikimedia.org/T266016) (owner: 10Filippo Giunchedi) [08:47:32] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:10] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:39] 10Operations, 10Puppet: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) I am not anti-systemd timer, but for practical reasons (not having to run the backups manually) could I revert to a cron until this is sorted out? Manual running b... 
[08:50:19] (03PS1) 10Jcrespo: Revert "remote-backup-mariadb: update cron to systemd::timer::job" [puppet] - 10https://gerrit.wikimedia.org/r/644662 [08:50:33] (03CR) 10jerkins-bot: [V: 04-1] Revert "remote-backup-mariadb: update cron to systemd::timer::job" [puppet] - 10https://gerrit.wikimedia.org/r/644662 (owner: 10Jcrespo) [08:52:38] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:15] (03PS2) 10Jcrespo: Revert "remote-backup-mariadb: update cron to systemd::timer::job" [puppet] - 10https://gerrit.wikimedia.org/r/644662 [08:53:39] jynus: I have just manually started the job on cumin2001 and think it's running [08:53:51] currently troubleshooting why the timer is not starting it [08:54:04] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Use more obvious names for kubernetes API host and port [deployment-charts] - 10https://gerrit.wikimedia.org/r/643933 (owner: 10JMeybohm) [08:55:05] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435 [08:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:13] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [08:55:14] (03Merged) 10jenkins-bot: admin_ng: Use more obvious names for kubernetes API host and port [deployment-charts] - 10https://gerrit.wikimedia.org/r/643933 (owner: 10JMeybohm) [08:55:26] jbond42: please don't [08:55:29] I started it myself [08:55:36] we may have 2 backups at the same time [08:55:54] that is bad [08:56:05] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/643301 (owner: 10Alexandros Kosiaris) [08:56:12] jynus: I didn't see anything else running or logging [08:56:49] 10Operations, 10Puppet: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jbond) >>!
In T268974#6662092, @jcrespo wrote: > The previous patch did not have an effect- backups continue to be executed on eqiad, but codfw ones (cumin2001) do not run... [08:57:06] I will try to kill all processes [08:57:10] and restart it [08:57:56] ack, however note there is nothing in the log before 08:49:35 [08:59:16] * jayme looking into the chartmuseum alerts [09:00:32] my fault..fixing asap [09:01:14] jayme: transfer.py is lock safe, so no issue with that, but it would create too much load on the mysqls/network and arzhel would shout at us 0:-) [09:01:26] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 118.99, 101.96, 62.36 https://wikitech.wikimedia.org/wiki/Swift [09:01:30] sorry, wrong j, this ^ was for jbond42 [09:01:32] jynus: wrong handle? [09:01:35] ack [09:02:20] jbond42: from what I can see it was run 3 times, not 2 [09:02:37] ack ok, jynus did you run the command manually with `remote-backup-mariadb` directly or by using systemd? [09:02:46] directly on a screen [09:03:17] one run on 08-44-16 another on 08-49-35 and a third one on 08-49-54 [09:03:22] ok that's probably why the logs are not in systemd [09:03:45] the funny thing is cumin1001 works as intended :-P [09:03:56] so could it be a host-only issue? [09:04:09] like it needing a restart or something [09:04:13] yes, I think systemd has got itself into a strange state [09:04:36] my gut feeling is it will work tonight and that manually kicking the systemd script has kicked it into shape enough [09:04:37] my bet is to coordinate a restart, if that doesn't work, revert to a cron [09:04:44] but not sure why puppet wasn't able to do that [09:04:45] 10Operations, 10serviceops, 10Kubernetes: Migrate Chartmuseum (python3-docker-report) to use helm3 - https://phabricator.wikimedia.org/T268743 (10JMeybohm) a:03JMeybohm helm2 is not capable of packaging helm apiVersion v2 charts ofc. ` Dec 02 08:58:02 chartmuseum2001 sh[22971]: helm package exited with e...
[09:04:48] it is quite taxing to run backups manually for me [09:05:11] jbond42: we don't know, maybe software updates or something weird [09:05:34] not sure, I'll have a proper look today [09:05:35] I am just up to anything (including a cron) to get this out of the way [09:06:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (10Marostegui) Ready for #dc-ops [09:07:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (10Marostegui) For the record, the homer run: ` # homer asw2-c-eqiad* commit "T268825" INFO:homer.devices:Initialized 35 devices INFO:homer:Committing config for query a... [09:13:16] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 121.54, 124.75, 97.18 https://wikitech.wikimedia.org/wiki/Swift [09:14:34] 10Operations, 10Puppet: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10MoritzMuehlenhoff) Another data point: The systemd::timer::job for clear_gerrit_logs also changed its command and it worked fine there (both gerrit1001/2001) [09:20:04] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 126.86, 126.90, 107.78 https://wikitech.wikimedia.org/wiki/Swift [09:23:55] (03PS1) 10JMeybohm: Package helm charts with helm3 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644768 (https://phabricator.wikimedia.org/T268743) [09:25:41] (03CR) 10jerkins-bot: [V: 04-1] Package helm charts with helm3 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644768 (https://phabricator.wikimedia.org/T268743) (owner: 10JMeybohm) [09:30:58] (03PS1) 10JMeybohm: chartmuseum: Install helm3 [puppet] - 10https://gerrit.wikimedia.org/r/644769 (https://phabricator.wikimedia.org/T268743) [09:35:01] (03CR) 10JMeybohm: [C:
03+2] chartmuseum: Install helm3 [puppet] - 10https://gerrit.wikimedia.org/r/644769 (https://phabricator.wikimedia.org/T268743) (owner: 10JMeybohm) [09:36:16] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:36:35] (03PS1) 10Marostegui: Revert "db1106: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/644663 [09:37:10] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:37:29] (03CR) 10Marostegui: [C: 03+2] Revert "db1106: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/644663 (owner: 10Marostegui) [09:38:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: After cloning clouddb hosts', diff saved to https://phabricator.wikimedia.org/P13512 and previous config saved to /var/cache/conftool/dbconfig/20201202-093838-root.json [09:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:44] ACKNOWLEDGEMENT - HP RAID on ms-be1022 is CRITICAL: CRITICAL: Slot 3: Failed: 2I:4:1 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269209 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Ra [09:38:44] thering [09:38:48] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269209 (10ops-monitoring-bot) [09:41:08] (03PS1) 10Ladsgroup: hadoop: Migrate hiera() to lookup() and set datatype [puppet] - 10https://gerrit.wikimedia.org/r/644770 (https://phabricator.wikimedia.org/T209953) 
[09:41:14] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:03] (03PS1) 10Jbond: keyholder: fix initscript [puppet] - 10https://gerrit.wikimedia.org/r/644773 (https://phabricator.wikimedia.org/T268974) [09:43:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/644773 (https://phabricator.wikimedia.org/T268974) (owner: 10Jbond) [09:47:20] (03CR) 10Jbond: [C: 03+2] keyholder: fix initscript [puppet] - 10https://gerrit.wikimedia.org/r/644773 (https://phabricator.wikimedia.org/T268974) (owner: 10Jbond) [09:47:36] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26824/" [puppet] - 10https://gerrit.wikimedia.org/r/644770 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [09:50:15] 10Operations, 10serviceops, 10Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (10Volans) > you were thinking about rewriting the warmup script in Python -- you were kind enough to let me talk you out of doing that right before the 2020 switchover, but now would be... 
[09:51:28] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: After cloning clouddb hosts', diff saved to https://phabricator.wikimedia.org/P13513 and previous config saved to /var/cache/conftool/dbconfig/20201202-095341-root.json [09:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:12] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:56:30] (03CR) 10Elukey: admin: improve the analytics-privatedata-users comment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644733 (owner: 10Elukey) [09:56:38] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:51] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) I've disabled the handler to avoid further duplicate tasks, we'll need to remember to enable it post-maintenance https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-be1030&service... [09:57:44] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi) I've disabled the handler to avoid further duplicate tasks, we'll need to remember to enable it post-maintenance https://icinga.wikimedia.o... 
[09:58:00] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:58:24] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269186 (10fgiunchedi) [09:58:27] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi) [09:58:28] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:29] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269209 (10fgiunchedi) [09:58:33] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi) [09:59:45] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269193 (10fgiunchedi) [09:59:47] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) [09:59:58] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269181 (10fgiunchedi) [10:00:00] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) [10:00:12] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269166 (10fgiunchedi) [10:00:14] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) [10:01:32] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:38] PROBLEM - Check systemd state on chartmuseum1001 
is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: After cloning clouddb hosts', diff saved to https://phabricator.wikimedia.org/P13515 and previous config saved to /var/cache/conftool/dbconfig/20201202-100845-root.json [10:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:22] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10hdothiduc) @Dzahn, I checked with Greg and he said it's fine to keep it internal until January. But no need to sync it up super close to the 15th.... [10:13:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:14:10] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:14:54] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 143.09, 147.69, 145.68 https://wikitech.wikimedia.org/wiki/Swift [10:15:02] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:15:51] (03PS2) 10JMeybohm: Package helm charts with helm3 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644768 (https://phabricator.wikimedia.org/T268743) [10:15:53] (03PS1) 10JMeybohm: Need to mock docker.from_env() now [docker-images/docker-report] - 
10https://gerrit.wikimedia.org/r/644775 (https://phabricator.wikimedia.org/T268743) [10:17:19] 10Operations, 10Wikidata, 10Wikidata Query UI, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Addshore) Notes from the call: * Branches for gui deploy repo - for WDQS and WCQS * in the meantime, keep both approaches to deployment of GUI * WCQS should have micros... [10:17:38] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:18:17] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10fgiunchedi) (braindumping) we had a similar case in the past (namely adding an account to mw containers), i.e. thumbor. Steps off the top of my head... [10:23:45] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) Of course, I hadn't remembered that it should keep working for newly created wikis. Thanks for pointing that out. I will have a look at thumbor...
[10:23:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: After cloning clouddb hosts', diff saved to https://phabricator.wikimedia.org/P13516 and previous config saved to /var/cache/conftool/dbconfig/20201202-102348-root.json [10:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:23] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) p:05Triage→03Medium [10:26:27] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) a:03jcrespo [10:27:02] (03CR) 10Muehlenhoff: [C: 03+2] Update vips library hints [puppet] - 10https://gerrit.wikimedia.org/r/644555 (owner: 10Muehlenhoff) [10:28:37] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) Relevant ticket: T169144 and children. [10:29:38] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:51] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Gilles) [10:33:12] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:45] hmm...I thought I had acknowledged you [10:42:14] icinga disagrees :-P [10:42:26] (03CR) 10JMeybohm: [C: 03+2] Need to mock docker.from_env() now [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644775 (https://phabricator.wikimedia.org/T268743) (owner: 10JMeybohm) [10:42:29] (03CR) 10JMeybohm: [C: 03+2] Package helm charts with helm3 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644768 (https://phabricator.wikimedia.org/T268743) (owner: 10JMeybohm) [10:44:03] (03Merged) 10jenkins-bot: Need to mock docker.from_env() now [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644775 (https://phabricator.wikimedia.org/T268743) (owner: 10JMeybohm) [10:44:05] (03Merged) 10jenkins-bot: Package helm charts with helm3 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644768 (https://phabricator.wikimedia.org/T268743) (owner: 10JMeybohm) [10:44:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:45:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:47:12] 10Operations, 10fundraising-tech-ops, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10ayounsi) Thanks! let me know if I can help! Another useful thing would be to duplicate the production "networking" puppet fact, to have LLDP, IPs, etc. exposed there as well.
[10:49:04] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Speed up Homer by fixing fetch_device_circuits() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/644563 (owner: 10Ayounsi) [10:50:10] (03PS1) 10Daniel Kinzler: Don't cache output that is not safe to cache [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644665 (https://phabricator.wikimedia.org/T269154) [10:50:24] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:36] (03PS3) 10Hnowlan: maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644611 (https://phabricator.wikimedia.org/T260271) (owner: 10Ryan Kemper) [10:53:47] (03CR) 10Hnowlan: [C: 03+1] maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644611 (https://phabricator.wikimedia.org/T260271) (owner: 10Ryan Kemper) [10:54:14] (03CR) 10Ayounsi: turnilo: add export mappings for network devices via query_resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: 10Jbond) [10:55:20] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:32] (03CR) 10Hnowlan: "Good catch! 
My bad" [puppet] - 10https://gerrit.wikimedia.org/r/644603 (https://phabricator.wikimedia.org/T260269) (owner: 10Ryan Kemper) [10:55:38] (03PS2) 10Hnowlan: maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644603 (https://phabricator.wikimedia.org/T260269) (owner: 10Ryan Kemper) [10:56:05] (03CR) 10Hnowlan: [C: 03+2] maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644611 (https://phabricator.wikimedia.org/T260271) (owner: 10Ryan Kemper) [10:56:31] (03CR) 10Hnowlan: [C: 03+2] maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644603 (https://phabricator.wikimedia.org/T260269) (owner: 10Ryan Kemper) [10:56:49] (03PS4) 10Hnowlan: maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644611 (https://phabricator.wikimedia.org/T260271) (owner: 10Ryan Kemper) [10:59:36] (03CR) 10Ladsgroup: [C: 03+1] k8s: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/644363 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [10:59:52] (03CR) 10Ladsgroup: [C: 03+1] conftool: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [11:04:44] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:05:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:07:08] (03PS1) 10JMeybohm: Bump version to 0.0.9 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644781 (https://phabricator.wikimedia.org/T268743) [11:08:08] (03CR) 10JMeybohm: [C: 03+2] Bump version to 
0.0.9 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644781 (https://phabricator.wikimedia.org/T268743) (owner: 10JMeybohm) [11:08:18] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Bump version to 0.0.9 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/644781 (https://phabricator.wikimedia.org/T268743) (owner: 10JMeybohm) [11:10:53] 10Operations, 10fundraising-tech-ops, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10jbond) >>! In T268802#6662467, @ayounsi wrote: > Another useful thing would be to duplicate the production "networking" puppet fact, to have LLDP, IPs, etc. exposed there as well.... [11:11:52] !log imported docker-report 0.0.9-1 to buster-wikimedia [11:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:17] 10Operations, 10Discovery-Search, 10Elasticsearch: Port elasticsearch support scripts to cookbooks - https://phabricator.wikimedia.org/T269218 (10Gehel) [11:15:18] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:38] 10Operations, 10Puppet, 10observability, 10Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10jbond) >>! In T229397#6650923, @jbond wrote: > > I think we could probably include all devices under something like > > ` > lang=yaml > netbox::n... 
[11:16:09] !log updated docker-report to 0.0.9-1 on chartmuseum* and deneb [11:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:04] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:40] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) [11:18:53] 10Operations, 10serviceops, 10Kubernetes: Migrate Chartmuseum (python3-docker-report) to use helm3 - https://phabricator.wikimedia.org/T268743 (10JMeybohm) 05Open→03Resolved docker-report 0.0.9 builds helm charts with helm3 now. Rolled the fix out to chartmuseum hosts (and patched puppet to install helm3... [11:21:05] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/644606 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [11:21:26] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:22:06] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:24:19] (03PS1) 10Ayounsi: Fix duplicate license key [homer/public] - 10https://gerrit.wikimedia.org/r/644782 [11:24:21] (03CR) 10Volans: "I don't know the logic underneath but cumin-wise looks good, just one comment inline. No need to wait for my review if fixed." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [11:28:03] (03CR) 10Ayounsi: [C: 03+2] Fix duplicate license key [homer/public] - 10https://gerrit.wikimedia.org/r/644782 (owner: 10Ayounsi) [11:28:30] (03Merged) 10jenkins-bot: Fix duplicate license key [homer/public] - 10https://gerrit.wikimedia.org/r/644782 (owner: 10Ayounsi) [11:28:51] (03PS1) 10Ayounsi: Add Lumen transit to ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/644783 (https://phabricator.wikimedia.org/T268691) [11:29:40] (03CR) 10Ayounsi: [C: 03+2] Add Lumen transit to ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/644783 (https://phabricator.wikimedia.org/T268691) (owner: 10Ayounsi) [11:30:09] (03Merged) 10jenkins-bot: Add Lumen transit to ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/644783 (https://phabricator.wikimedia.org/T268691) (owner: 10Ayounsi) [11:30:11] (03PS2) 10Kormat: test: Unpack db tarballs in integration env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644561 [11:31:55] (03PS3) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [11:32:06] (03CR) 10jerkins-bot: [V: 04-1] test: Unpack db tarballs in integration env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644561 (owner: 10Kormat) [11:32:30] (03CR) 10jerkins-bot: [V: 04-1] [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 (owner: 10Jcrespo) [11:33:06] (03PS1) 10Jbond: systemd: add validate command for unit files [puppet] - 10https://gerrit.wikimedia.org/r/644784 (https://phabricator.wikimedia.org/T268974) [11:34:19] !log add Lumen transit to cr3-ulsfo - T268691 [11:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:29] (03CR) 10jerkins-bot: [V: 04-1] systemd: add validate command 
for unit files [puppet] - 10https://gerrit.wikimedia.org/r/644784 (https://phabricator.wikimedia.org/T268974) (owner: 10Jbond) [11:35:37] (03PS2) 10Jbond: systemd: add validate command for unit files [puppet] - 10https://gerrit.wikimedia.org/r/644784 (https://phabricator.wikimedia.org/T268974) [11:36:03] (03CR) 10Jbond: [C: 03+2] profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638678 (owner: 10Jbond) [11:36:55] (03PS7) 10Kormat: test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) [11:37:01] (03PS3) 10Jbond: P:analytics::cluster::packages::common: demo spec test [puppet] - 10https://gerrit.wikimedia.org/r/644312 (https://phabricator.wikimedia.org/T261693) [11:37:03] (03CR) 10jerkins-bot: [V: 04-1] systemd: add validate command for unit files [puppet] - 10https://gerrit.wikimedia.org/r/644784 (https://phabricator.wikimedia.org/T268974) (owner: 10Jbond) [11:38:26] (03CR) 10jerkins-bot: [V: 04-1] test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [11:38:30] (03PS8) 10Kormat: test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) [11:38:44] (03PS3) 10Jbond: systemd: add validate command for unit files [puppet] - 10https://gerrit.wikimedia.org/r/644784 (https://phabricator.wikimedia.org/T268974) [11:40:00] (03CR) 10jerkins-bot: [V: 04-1] P:analytics::cluster::packages::common: demo spec test [puppet] - 10https://gerrit.wikimedia.org/r/644312 (https://phabricator.wikimedia.org/T261693) (owner: 10Jbond) [11:41:16] (03Abandoned) 10Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644537 (owner: 10Kormat) [11:41:26] (03Abandoned) 10Kormat: test: Unpack db tarballs 
in integration env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644561 (owner: 10Kormat) [11:41:30] (03PS4) 10Jbond: P:analytics::cluster::packages::common: Add simple spec test [puppet] - 10https://gerrit.wikimedia.org/r/644312 (https://phabricator.wikimedia.org/T261693) [11:41:32] (03PS1) 10Jbond: P:analytics::cluster::packages::common: demostrate failing spec test [puppet] - 10https://gerrit.wikimedia.org/r/644786 (https://phabricator.wikimedia.org/T261693) [11:42:18] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/644312 (https://phabricator.wikimedia.org/T261693) (owner: 10Jbond) [11:43:18] (03PS2) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [11:44:32] (03CR) 10jerkins-bot: [V: 04-1] P:analytics::cluster::packages::common: demostrate failing spec test [puppet] - 10https://gerrit.wikimedia.org/r/644786 (https://phabricator.wikimedia.org/T261693) (owner: 10Jbond) [11:44:41] (03PS4) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [11:45:41] 10Operations, 10Puppet, 10puppet-compiler, 10Patch-For-Review: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10jbond) > however note that that spec test is dependent on a general refactor of the profile spect test which is in relation chain The dependency has... 
[11:46:38] (03PS3) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [11:48:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [11:50:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/644591 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:50:39] (03CR) 10Jbond: [C: 03+2] "merging" [puppet] - 10https://gerrit.wikimedia.org/r/643249 (owner: 10Muehlenhoff) [11:51:16] (03PS4) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [11:52:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:53:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:54:15] (03PS3) 10Jbond: Add a script to manage the ssh configuration [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/639913 (owner: 10Giuseppe Lavagetto) [11:54:47] (03CR) 10Jbond: [V: 03+2 C: 03+2] Add a script to manage the ssh configuration [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/639913 (owner: 10Giuseppe Lavagetto) [11:55:16] (03PS5) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [11:56:07] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) @Dzahn is there something we need to fix on puppet ? [11:57:41] (03PS3) 10JMeybohm: coredns: Create a wmfcoredns copy in charts dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936 [11:57:43] (03PS1) 10JMeybohm: More generalization and a bit of cleanup for admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. 
[12:00:56] (03PS11) 10Jbond: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [12:01:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/644784 (https://phabricator.wikimedia.org/T268974) (owner: 10Jbond) [12:02:09] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26829/console" [puppet] - 10https://gerrit.wikimedia.org/r/641862 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [12:04:25] (03PS12) 10Jbond: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [12:04:27] (03CR) 10Jbond: [V: 03+1 C: 03+2] "will merge" [puppet] - 10https://gerrit.wikimedia.org/r/641862 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [12:04:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26830/console" [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [12:05:08] (03PS13) 10Jbond: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [12:05:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26831/console" [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [12:05:41] (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [12:05:56] (03CR) 10Jbond: [V: 03+1 C: 03+2] "will merged" [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 
10Dzahn) [12:07:29] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10daniel) Back to the PET inbox per @WDoranWMF. We need to figure out where this fits in our process/roadmap. [12:07:51] (03PS3) 10Jbond: mariadb: Cleanup old cron deletions after some time after deploy [puppet] - 10https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138) (owner: 10Jcrespo) [12:09:43] (03CR) 10Jbond: [C: 03+1] "LGTM suspect enough time has passed now" [puppet] - 10https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138) (owner: 10Jcrespo) [12:09:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/636392 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [12:10:32] (03CR) 10Jbond: "Just going through gerrit queue, is this still needed. as there is no description im guessing it may just be a left over from an incident" [homer/public] - 10https://gerrit.wikimedia.org/r/627920 (owner: 10CDanis) [12:11:01] (03PS2) 10Jbond: admin: also remove the old ed25519 key for the time being [puppet] - 10https://gerrit.wikimedia.org/r/635497 (owner: 10Giuseppe Lavagetto) [12:11:06] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:18] (03CR) 10Jbond: [C: 03+1] admin: also remove the old ed25519 key for the time being [puppet] - 10https://gerrit.wikimedia.org/r/635497 (owner: 10Giuseppe Lavagetto) [12:12:45] (03CR) 10Jbond: [C: 03+1] "Similar wondering if this is still needed" [puppet] - 10https://gerrit.wikimedia.org/r/609475 (owner: 10CDanis) [12:14:33] (03Abandoned) 10Jbond: wikidough: improve naming of hiera keys and class variables [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [12:18:06] (03PS2) 10Jbond: Remove Hiera option to enable adduser config [puppet] - 10https://gerrit.wikimedia.org/r/602288 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:19:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26832/console" [puppet] - 10https://gerrit.wikimedia.org/r/602288 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:19:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] WIP: Add apertium helm chart (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [12:21:21] (03CR) 10Jbond: "I notice that the ticket got closed so wonder if this is still needed (probably not considering its age)?" 
[puppet] - 10https://gerrit.wikimedia.org/r/585200 (https://phabricator.wikimedia.org/T249037) (owner: 10Jcrespo) [12:21:33] (03CR) 10Jbond: [V: 03+1 C: 03+2] Remove Hiera option to enable adduser config [puppet] - 10https://gerrit.wikimedia.org/r/602288 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:22:50] (03PS1) 10Muehlenhoff: Fix typo in Ubuntu mirror timer [puppet] - 10https://gerrit.wikimedia.org/r/644792 [12:23:15] (03CR) 10Volans: [C: 03+1] "LGTM as a first implementation" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/641995 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [12:25:51] (03PS6) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [12:27:24] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 113565696 and 186 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:27:38] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 868078448 and 200 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:27:52] (03PS1) 10Jbond: Revert "Remove Hiera option to enable adduser config" [puppet] - 10https://gerrit.wikimedia.org/r/644807 [12:28:06] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 24974304 and 228 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:28:32] (03CR) 10Jbond: [C: 03+1] Fix typo in Ubuntu mirror timer [puppet] - 10https://gerrit.wikimedia.org/r/644792 (owner: 10Muehlenhoff) [12:28:34] (03PS7) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [12:29:00] (03CR) 
10Muehlenhoff: [C: 03+2] Fix typo in Ubuntu mirror timer [puppet] - 10https://gerrit.wikimedia.org/r/644792 (owner: 10Muehlenhoff) [12:29:22] (03PS2) 10JMeybohm: admin_ng: Generalization, prod values anf fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434) [12:29:58] (03PS8) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [12:30:57] (03CR) 10Jbond: [C: 03+2] Revert "Remove Hiera option to enable adduser config" [puppet] - 10https://gerrit.wikimedia.org/r/644807 (owner: 10Jbond) [12:31:18] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 121143688 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] coredns: Create a wmfcoredns copy in charts dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936 (owner: 10JMeybohm) [12:31:38] (03PS1) 10Jbond: Remove Hiera option to enable adduser config [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) [12:32:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/643301 (owner: 10Alexandros Kosiaris) [12:32:50] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 128017160 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:27] (03Merged) 10jenkins-bot: eventrouter: Switch chart to public repo, deploys to internal repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/643301 (owner: 10Alexandros Kosiaris) [12:33:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - 10https://gerrit.wikimedia.org/r/644234 (owner: 10Alexandros Kosiaris) [12:33:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/644234 (owner: 10Alexandros Kosiaris) [12:34:18] (03PS3) 10Jbond: Enable managed adduser/sysusers config also for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/602286 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:34:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/644237 (owner: 10Alexandros Kosiaris) [12:34:34] (03CR) 10jerkins-bot: [V: 04-1] Remove Hiera option to enable adduser config [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [12:35:31] (03PS2) 10Jbond: Remove Hiera option to enable adduser config [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) [12:35:58] (03PS3) 10Jbond: Remove Hiera option to enable adduser config [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) [12:36:30] (03CR) 10Jbond: [C: 03+1] "Ok fixed and noticed you did already have the dependency specified sorry missed that" [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [12:36:42] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1563551048 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:18] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 212531688 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:34] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 203417584 and 208 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:34] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 713558952 and 208 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:58] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1237942664 and 234 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:12] PROBLEM - Postgres 
Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 709152480 and 246 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:26] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 92512512 and 260 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:32] (03PS9) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [12:39:50] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 164128 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:12] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 26416 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:14] PROBLEM - tileratorui on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [12:41:46] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 451168 and 182 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:48] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 203656 and 184 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:42:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/26835/" [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [12:43:40] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB 
template1 (host:localhost) 157494368 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:04] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 262551344 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:46:30] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1876776 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:50:02] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 25128 and 71 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:50:30] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2578096800 and 228 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:50:46] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2640 and 116 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:51:34] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 92224 and 164 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:52:04] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 8248 and 194 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:52:06] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 347064 and 196 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:58:13] (03Abandoned) 10Jcrespo: admin: Add Peter.ovchyn to the list of privileged ldap users [puppet] - 10https://gerrit.wikimedia.org/r/585200 (https://phabricator.wikimedia.org/T249037) (owner: 
10Jcrespo) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T1300) [13:00:14] (03CR) 10Jcrespo: [C: 03+2] mariadb: Cleanup old cron deletions after some time after deploy [puppet] - 10https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138) (owner: 10Jcrespo) [13:00:36] (03CR) 10TK-999: "Looks like our corrections will land in next week's MaxMind DB update. I'll check it out once it's live to make sure everything is looking" [dns] - 10https://gerrit.wikimedia.org/r/643983 (owner: 10TK-999) [13:01:02] (03CR) 10Jcrespo: "Waiting for tomorrow's backup scheduled run." [puppet] - 10https://gerrit.wikimedia.org/r/644662 (owner: 10Jcrespo) [13:04:22] (03PS1) 10Alexandros Kosiaris: helm: Split repo update in 2 systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/644793 [13:06:48] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . [13:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:54] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . [13:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:00] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . 
[13:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:16] !log installing brotli security updates [13:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:54] (03PS1) 10Muehlenhoff: Add library hint for brotli [puppet] - 10https://gerrit.wikimedia.org/r/644797 [13:17:40] (03PS9) 10Kormat: test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) [13:21:34] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 100.16, 100.50, 95.29 https://wikitech.wikimedia.org/wiki/Swift [13:22:22] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26836/console" [puppet] - 10https://gerrit.wikimedia.org/r/644363 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [13:22:24] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for brotli [puppet] - 10https://gerrit.wikimedia.org/r/644797 (owner: 10Muehlenhoff) [13:22:26] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/644363 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [13:22:50] akosiaris: shall I merge along? [13:22:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Allow using cergen [puppet] - 10https://gerrit.wikimedia.org/r/644238 (owner: 10Alexandros Kosiaris) [13:25:52] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:59] (03Abandoned) 10CDanis: prepend esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/627920 (owner: 10CDanis) [13:31:02] (03PS1) 10Muehlenhoff: Extend MOU for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/644799 [13:34:56] 10Operations, 10Puppet, 10Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) I just got a new behaviour: yesterday didn't work on cumin1001 either (this is new), after adding the full path. The timer there says: ` On... [13:35:00] 10Operations, 10Traffic, 10Maps (Kartographer): Purge maps.wikimedia.org/geoshapes and /geoline cache to fix map markers not always being displayed - https://phabricator.wikimedia.org/T268927 (10MSantos) [13:38:03] 10Operations, 10Traffic, 10Maps (Kartographer): Purge maps.wikimedia.org/geoshapes and /geoline cache to fix map markers not always being displayed - https://phabricator.wikimedia.org/T268927 (10MSantos) @The_Equalizer it looks like there are still some cache from the maps outage, I'm tagging #traffic to hel... [13:39:26] 10Operations, 10Traffic, 10Maps (Kartographer): Purge maps.wikimedia.org/geoshapes and /geoline cache to fix map markers not always being displayed - https://phabricator.wikimedia.org/T268927 (10CDanis) The max lifetime of any object in the Traffic CDN is 24 hours. Are you sure they're being cached there? [13:40:14] 10Operations, 10Puppet, 10Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) Could it be the underscore? https://github.com/coreos/fleet/issues/579 I saw some people mentioning issues with systemd and underscore names... 
[13:40:15] (03PS2) 10Marostegui: production-m2.sql.erb: Add sockpuppet users [puppet] - 10https://gerrit.wikimedia.org/r/644745 (https://phabricator.wikimedia.org/T268505) [13:44:34] (03PS1) 10Jcrespo: mariadb-backups: Rename snapshoting systemd unit to avoid underscores [puppet] - 10https://gerrit.wikimedia.org/r/644801 (https://phabricator.wikimedia.org/T268974) [13:44:40] (03CR) 10Volans: [C: 03+1] "I'm not familiar with various bits of the whole process but the change looks sane. Also it's only affecting integration testing so I'd say" (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [13:44:51] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/644799 (owner: 10Muehlenhoff) [13:46:07] (03PS1) 10Filippo Giunchedi: grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/644802 (https://phabricator.wikimedia.org/T256954) [13:46:18] (03CR) 10Jcrespo: "We lose nothing, and I prefer this name (it is a tiny more explicit). ¯\_(ツ)_/¯" [puppet] - 10https://gerrit.wikimedia.org/r/644801 (https://phabricator.wikimedia.org/T268974) (owner: 10Jcrespo) [13:47:06] 10Operations, 10Maps (Kartographer): Purge maps.wikimedia.org/geoshapes and /geoline cache to fix map markers not always being displayed - https://phabricator.wikimedia.org/T268927 (10MSantos) >>! In T268927#6662840, @CDanis wrote: > The max lifetime of any object in the Traffic CDN is 24 hours. Are you sure... 
[13:52:35] (03PS1) 10Papaul: Add new ms-be nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/644803 (https://phabricator.wikimedia.org/T265419) [13:53:58] (03PS10) 10Kormat: test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) [13:54:18] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 122.61, 100.52, 93.57 https://wikitech.wikimedia.org/wiki/Swift [13:55:19] (03CR) 10Papaul: [C: 03+2] Add new ms-be nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/644803 (https://phabricator.wikimedia.org/T265419) (owner: 10Papaul) [13:55:25] (03PS2) 10Papaul: Add new ms-be nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/644803 (https://phabricator.wikimedia.org/T265419) [13:55:27] (03CR) 10jerkins-bot: [V: 04-1] test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [13:55:29] (03CR) 10Papaul: [V: 03+2 C: 03+2] Add new ms-be nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/644803 (https://phabricator.wikimedia.org/T265419) (owner: 10Papaul) [13:56:50] (03PS1) 10Arturo Borrero Gonzalez: openstack: l3 agent: fix conntrackd hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/644805 (https://phabricator.wikimedia.org/T268335) [13:57:00] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:30] (03CR) 10Hnowlan: [C: 03+1] production-m2.sql.erb: Add sockpuppet users [puppet] - 10https://gerrit.wikimedia.org/r/644745 (https://phabricator.wikimedia.org/T268505) (owner: 10Marostegui) [13:58:22] 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10MSantos) [14:00:04] hashar and 
twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T1400). [14:01:22] (03PS11) 10Kormat: test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) [14:01:53] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `... [14:01:57] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] ` [14:02:08] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `... [14:02:09] (03CR) 10Kormat: "Updated, now with 300% more tuple-isation." 
(031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [14:02:28] (03PS1) 10Urbanecm: Add all subdomains of artsdatabanken.no to the wgCopyUploadsDomains allowlist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644826 (https://phabricator.wikimedia.org/T267784) [14:02:48] (03PS2) 10Arturo Borrero Gonzalez: openstack: l3 agent: fix conntrackd hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/644805 (https://phabricator.wikimedia.org/T268335) [14:03:29] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644827 [14:03:31] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644827 (owner: 10Hashar) [14:04:15] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644827 (owner: 10Hashar) [14:04:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/644802 (https://phabricator.wikimedia.org/T256954) (owner: 10Filippo Giunchedi) [14:05:05] (03PS3) 10Arturo Borrero Gonzalez: openstack: l3 agent: fix conntrackd hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/644805 (https://phabricator.wikimedia.org/T268335) [14:05:22] RECOVERY - Ensure local MW versions match expected deployment on deploy1002 is OK: OKAY: Not alerting due to fresh production wikiversions: 518 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:06:07] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.20 [14:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:16] jouncebot: now [14:07:16] For the next 1 hour(s) and 52 minute(s): Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T1400) 
[14:07:26] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.20 (duration: 01m 18s) [14:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:15] (03PS4) 10Arturo Borrero Gonzalez: openstack: l3 agent: fix conntrackd hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/644805 (https://phabricator.wikimedia.org/T268335) [14:09:33] (03CR) 10Volans: [C: 03+1] "LGTM" (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [14:10:22] !log Start of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=ptwiki; T246539) [14:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:28] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [14:11:34] (03CR) 10Kormat: [C: 03+2] test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [14:12:01] !log Start of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=commonswiki; T246539) [14:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:03] (03Merged) 10jenkins-bot: test: Start implementation of integration-env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [14:17:22] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [14:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:03] (03CR) 10Ottomata: "One naming nit, but LGTM otherwise!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [14:19:17] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:54] (03CR) 10Ottomata: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/644535 (https://phabricator.wikimedia.org/T264358) (owner: 10Ottomata) [14:20:05] (03PS1) 10Elukey: hive: allow multiple metastores in the hive-site.xml config [puppet] - 10https://gerrit.wikimedia.org/r/644832 [14:21:21] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/644802 (https://phabricator.wikimedia.org/T256954) (owner: 10Filippo Giunchedi) [14:22:59] (03CR) 10Elukey: "One comment on cache size, +1 to what Andrew said!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [14:23:10] 10Operations, 10Puppet, 10puppet-compiler, 10Patch-For-Review: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10Ottomata) Interesting, thank you! It's good to have that on the profile class. I think we were hoping this could be done as a more general thing on... 
[14:24:57] (03PS5) 10Arturo Borrero Gonzalez: openstack: l3 agent: fix conntrackd hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/644805 (https://phabricator.wikimedia.org/T268335) [14:25:30] (03CR) 10Ottomata: zookeeper: configure test-eqiad single-node cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [14:26:19] (03CR) 10CDanis: [C: 04-1] puppet-merge: readd check for unbounded variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643746 (owner: 10Jbond) [14:26:43] (03PS2) 10Elukey: WIP - hive: allow multiple metastores in the hive-site.xml config [puppet] - 10https://gerrit.wikimedia.org/r/644832 [14:28:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26843/console" [puppet] - 10https://gerrit.wikimedia.org/r/644832 (owner: 10Elukey) [14:29:39] (03PS1) 10Muehlenhoff: CAS support for debmonitor, step 1 (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/644833 [14:30:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:06] (03PS6) 10Arturo Borrero Gonzalez: openstack: l3 agent: fix conntrackd hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/644805 (https://phabricator.wikimedia.org/T268335) [14:33:45] ottomata: We are hoping to test drive the new networkpolicy thing in the next 2 weeks. After that is deemed successful, it's just a recreation of all the clusters thing, which should be a couple of days per cluster in the next quarter. [14:34:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:15] (03PS7) 10Arturo Borrero Gonzalez: openstack: l3 agent: fix conntrackd hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/644805 (https://phabricator.wikimedia.org/T268335) [14:36:24] RECOVERY - very high load average likely xfs on ms-be2031 is OK: OK - load average: 60.56, 68.25, 79.09 https://wikitech.wikimedia.org/wiki/Swift [14:36:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26845/" [puppet] - 10https://gerrit.wikimedia.org/r/644805 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [14:38:02] akosiaris: ok cool i just noticed some discrepancies between what is real and what is in the per service values.yaml [14:38:04] want to be careful :) [14:39:03] (03PS15) 10Elukey: kerberos::exec: enable kerberos by default [puppet] - 10https://gerrit.wikimedia.org/r/641958 (https://phabricator.wikimedia.org/T268220) [14:39:10] ottomata: ouch. We'll be definitely test driving everything in the new cluster, but thanks for that. I am hoping we'll have this done before more divergence shows up [14:39:46] (03PS1) 10Mholloway: Event Platform: Rename mw_session_tick to mediawiki.client.session_tick [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644839 [14:41:03] (03PS1) 10Hashar: prometheus: collect ci master hosts [puppet] - 10https://gerrit.wikimedia.org/r/644840 [14:41:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: l3_agent: fix typo in hiera [puppet] - 10https://gerrit.wikimedia.org/r/644841 (https://phabricator.wikimedia.org/T268335) [14:42:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: l3_agent: fix typo in hiera [puppet] - 10https://gerrit.wikimedia.org/r/644841 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [14:42:22] (03CR) 10Hashar: "As pointed by Chris yesterday. 
I have followed Filippo guidances :]" [puppet] - 10https://gerrit.wikimedia.org/r/644840 (owner: 10Hashar) [14:42:31] (03CR) 10JMeybohm: [C: 03+2] helm: Split repo update in 2 systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/644793 (owner: 10Alexandros Kosiaris) [14:42:58] (03CR) 10Bearloga: [C: 03+2] Event Platform: Rename mw_session_tick to mediawiki.client.session_tick [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644839 (owner: 10Mholloway) [14:43:00] (03CR) 10JMeybohm: [C: 03+1] helm: Split repo update in 2 systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/644793 (owner: 10Alexandros Kosiaris) [14:43:19] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: collect ci master hosts [puppet] - 10https://gerrit.wikimedia.org/r/644840 (owner: 10Hashar) [14:44:23] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: collect ci master hosts [puppet] - 10https://gerrit.wikimedia.org/r/644840 (owner: 10Hashar) [14:44:33] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on labstore1006 - https://phabricator.wikimedia.org/T268281 (10Cmjohnson) @RobH Can you attempt to pull the ADU report off this server. I cannot get into the U/I through mgmt. Let me know if you can and I will forward you the ftp sit... [14:44:43] (03CR) 10Reedy: "I just rebased this to fix the conflicts added by 1e34f4bcea3d1ef247e37e0d9fd6631d2150e411 / I6ee83512b8a3c8e1662752e627b5ab37497874cb" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) (owner: 10Krinkle) [14:45:30] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] ` and were **ALL** successful. 
[14:46:00] (03Merged) 10jenkins-bot: Event Platform: Rename mw_session_tick to mediawiki.client.session_tick [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644839 (owner: 10Mholloway) [14:47:02] (03PS4) 10Reedy: [WIP] logging: Remove useMicrosecondTimestamps(false) calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) (owner: 10Krinkle) [14:48:55] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Event Platform: Rename mw_session_tick stream to mediawiki.client.session_tick (duration: 01m 07s) [14:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:07] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26846/console" [puppet] - 10https://gerrit.wikimedia.org/r/641958 (https://phabricator.wikimedia.org/T268220) (owner: 10Elukey) [14:49:50] (03CR) 10Ottomata: [C: 03+1] Event Platform: Rename mw_session_tick to mediawiki.client.session_tick [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644839 (owner: 10Mholloway) [14:51:14] (03CR) 10Ottomata: [C: 03+1] kerberos::exec: enable kerberos by default [puppet] - 10https://gerrit.wikimedia.org/r/641958 (https://phabricator.wikimedia.org/T268220) (owner: 10Elukey) [14:56:08] !log Promoting group0 to 1.36.0-wmf.20 since I haven't done so yesterday :-\ # T263186 [14:56:12] (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644845 [14:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:14] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644845 (owner: 10Hashar) [14:56:14] T263186: 1.36.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T263186 [14:57:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/644801 
(https://phabricator.wikimedia.org/T268974) (owner: 10Jcrespo) [14:58:41] (03PS1) 10Filippo Giunchedi: site: assign roles for all ms-be / ms-fe hosts [puppet] - 10https://gerrit.wikimedia.org/r/644847 (https://phabricator.wikimedia.org/T265419) [14:58:45] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644845 (owner: 10Hashar) [15:00:26] (03CR) 10Elukey: [V: 03+1 C: 03+2] kerberos::exec: enable kerberos by default [puppet] - 10https://gerrit.wikimedia.org/r/641958 (https://phabricator.wikimedia.org/T268220) (owner: 10Elukey) [15:00:31] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.20 [15:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:08] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Rename snapshoting systemd unit to avoid underscores [puppet] - 10https://gerrit.wikimedia.org/r/644801 (https://phabricator.wikimedia.org/T268974) (owner: 10Jcrespo) [15:01:34] elukey: do you merge or I do? [15:01:58] (mine is ready) [15:02:29] jynus: do you mind if I puppet-merge? So I can re-check my patch, it is a jumbo one :) [15:02:37] sure, ping me when done [15:02:48] (mine is not in a hurry) [15:03:27] merging! 
[15:03:41] ah, I thought it was going to take more :-D [15:04:10] nono I reviewed it 100 times in gerrit, I just needed to verify a couple of things [15:04:14] :D [15:04:27] I was like, ok, as long as it takes less than 1 h [15:04:33] :-)) [15:05:03] ahhahah yes yes :D [15:05:30] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26851/console" [puppet] - 10https://gerrit.wikimedia.org/r/644847 (https://phabricator.wikimedia.org/T265419) (owner: 10Filippo Giunchedi) [15:06:18] (03CR) 10JMeybohm: [C: 03+2] coredns: Create a wmfcoredns copy in charts dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936 (owner: 10JMeybohm) [15:06:35] (03PS3) 10JMeybohm: Run a helm repo update before linting [deployment-charts] - 10https://gerrit.wikimedia.org/r/644474 [15:06:59] (03PS1) 10Jcrespo: Revert "mariadb-backups: Rename snapshoting systemd unit to avoid underscores" [puppet] - 10https://gerrit.wikimedia.org/r/644809 [15:07:32] (03CR) 10Jcrespo: "I will wait for jbond's +1 to retry merging." [puppet] - 10https://gerrit.wikimedia.org/r/644809 (owner: 10Jcrespo) [15:07:42] (03Merged) 10jenkins-bot: coredns: Create a wmfcoredns copy in charts dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936 (owner: 10JMeybohm) [15:09:17] (03PS1) 10Jbond: late_command: configure the run, var and fact paths to match puppet conf [puppet] - 10https://gerrit.wikimedia.org/r/644850 [15:09:20] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/644809 (owner: 10Jcrespo) [15:09:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A number of comments inline. Mostly LGTM though." 
(036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [15:10:13] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb-backups: Rename snapshoting systemd unit to avoid underscores" [puppet] - 10https://gerrit.wikimedia.org/r/644809 (owner: 10Jcrespo) [15:10:17] (03Abandoned) 10Jcrespo: [WIP] Change backup hosts into using the package version of scripts [puppet] - 10https://gerrit.wikimedia.org/r/620312 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [15:10:47] (03PS2) 10Jbond: late_command: configure the run, var and fact paths to match puppet conf [puppet] - 10https://gerrit.wikimedia.org/r/644850 [15:11:24] (03PS2) 10Jcrespo: Revert "mariadb-backups: Rename snapshoting systemd unit to avoid ..." [puppet] - 10https://gerrit.wikimedia.org/r/644809 [15:13:08] (03CR) 10Volans: "LGTM but I'm unsure on one point" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644850 (owner: 10Jbond) [15:14:01] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/644809 (owner: 10Jcrespo) [15:14:31] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb-backups: Rename snapshoting systemd unit to avoid ..." [puppet] - 10https://gerrit.wikimedia.org/r/644809 (owner: 10Jcrespo) [15:15:15] (03PS3) 10Jcrespo: Revert "mariadb-backups: Rename snapshoting systemd unit to avoid ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/644809 [15:15:20] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/644809 (owner: 10Jcrespo) [15:15:53] (03PS3) 10Jbond: late_command: configure the run, var and fact paths to match puppet conf [puppet] - 10https://gerrit.wikimedia.org/r/644850 (https://phabricator.wikimedia.org/T269187) [15:18:13] (03PS1) 10Jcrespo: mariadb-backups: Rename snapshoting systemd unit to avoid underscores [puppet] - 10https://gerrit.wikimedia.org/r/644851 (https://phabricator.wikimedia.org/T268974) [15:18:27] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Rename snapshoting systemd unit to avoid underscores [puppet] - 10https://gerrit.wikimedia.org/r/644851 (https://phabricator.wikimedia.org/T268974) (owner: 10Jcrespo) [15:18:57] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb-backups: Rename snapshoting systemd unit to avoid ..." [puppet] - 10https://gerrit.wikimedia.org/r/644809 (owner: 10Jcrespo) [15:20:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Sigh, this leads to a duplicate declaration of the timer. Fixing." 
[puppet] - 10https://gerrit.wikimedia.org/r/644793 (owner: 10Alexandros Kosiaris) [15:22:22] (03PS2) 10Jcrespo: mariadb-backups: Rename snapshoting systemd unit to avoid underscores [puppet] - 10https://gerrit.wikimedia.org/r/644851 (https://phabricator.wikimedia.org/T268974) [15:23:40] (03CR) 10Jcrespo: "The second time is the good one 0:-)" [puppet] - 10https://gerrit.wikimedia.org/r/644851 (https://phabricator.wikimedia.org/T268974) (owner: 10Jcrespo) [15:25:42] (03PS1) 10Jbond: Revert "profile: migrate to shared spec_test" [puppet] - 10https://gerrit.wikimedia.org/r/644810 [15:27:03] (03PS4) 10JMeybohm: Add calico helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) [15:27:05] !log restarting turnilo [15:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:42] (03CR) 10Jbond: [C: 03+2] Revert "profile: migrate to shared spec_test" [puppet] - 10https://gerrit.wikimedia.org/r/644810 (owner: 10Jbond) [15:29:56] !log installing libproxy security updates on Buster [15:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:19] (03PS4) 10Jbond: late_command: configure the run, var and fact paths to match puppet conf [puppet] - 10https://gerrit.wikimedia.org/r/644850 (https://phabricator.wikimedia.org/T269187) [15:30:31] (03CR) 10JMeybohm: [C: 04-1] "Thanks for looking into this!" 
(036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [15:31:03] (03PS2) 10Alexandros Kosiaris: helm: Split repo update in 2 systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/644793 [15:32:18] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/644793 (owner: 10Alexandros Kosiaris) [15:38:02] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (10Cmjohnson) Ticket created with Dell, Sending a new disk to equinix. [15:38:04] (03PS12) 10Elukey: Enable kerberos in kerberos::systemd_timer by default [puppet] - 10https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220) [15:40:09] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2061.codfw.wmnet ` The log can be found in `... [15:40:11] 10Operations, 10ops-eqiad: mw1304.mgmt down - https://phabricator.wikimedia.org/T269050 (10Cmjohnson) a:03Jclark-ctr okay, it may not be the cable at all, it may very need to be powered off and back on. Assigning to @jclark-ctr to try replacing the green mgmt cable first. 
[15:40:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/644850 (https://phabricator.wikimedia.org/T269187) (owner: 10Jbond) [15:40:37] (03CR) 10Jbond: [C: 03+2] late_command: configure the run, var and fact paths to match puppet conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644850 (https://phabricator.wikimedia.org/T269187) (owner: 10Jbond) [15:41:02] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10Papaul) [15:42:01] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26853/console" [puppet] - 10https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220) (owner: 10Elukey) [15:42:37] (03PS1) 10Muehlenhoff: Add Tyler as approval contact for Gerrit/contint [puppet] - 10https://gerrit.wikimedia.org/r/644856 [15:43:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/644851 (https://phabricator.wikimedia.org/T268974) (owner: 10Jcrespo) [15:44:27] (03PS1) 10Jcrespo: Move section script from software/dbtools to wmfmariapy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644857 [15:45:39] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Rename snapshoting systemd unit to avoid underscores [puppet] - 10https://gerrit.wikimedia.org/r/644851 (https://phabricator.wikimedia.org/T268974) (owner: 10Jcrespo) [15:47:05] (03CR) 10Jbond: [C: 03+2] systemd: add validate command for unit files [puppet] - 10https://gerrit.wikimedia.org/r/644784 (https://phabricator.wikimedia.org/T268974) (owner: 10Jbond) [15:47:20] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Cmjohnson) I will have to take a look at the server and get back to you. I am assuming this is okay to take down since the disks are not being seen. 
Sounds lik... [15:48:01] 10Operations, 10Puppet, 10Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) This looks promising enough: ` Active: active (waiting) Started Periodic execution of database-backups-snapshots. ` [15:48:11] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10elukey) @Cmjohnson yes please take it down anytime :) [15:49:29] (03PS1) 10Jcrespo: mariadb-backups: Remove old scheduled job disabling [puppet] - 10https://gerrit.wikimedia.org/r/644861 [15:59:01] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:15] 10Operations, 10Puppet, 10puppet-compiler, 10Patch-For-Review: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10jbond) >>! In T261693#6663005, @Ottomata wrote: > Interesting, thank you! It's good to have that on the profile class. unfortunately i had to revert... 
[16:03:16] (03PS2) 10Jbond: puppet-merge: readd check for unbounded variables [puppet] - 10https://gerrit.wikimedia.org/r/643746 [16:03:19] (03CR) 10Jbond: "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643746 (owner: 10Jbond) [16:03:21] (03CR) 10Jbond: [C: 03+1] Add Tyler as approval contact for Gerrit/contint [puppet] - 10https://gerrit.wikimedia.org/r/644856 (owner: 10Muehlenhoff) [16:03:24] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26854/console" [puppet] - 10https://gerrit.wikimedia.org/r/644793 (owner: 10Alexandros Kosiaris) [16:04:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2061.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2061.codfw.wmnet'] ` [16:06:49] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2061.codfw.wmnet ` The log can be found in `... 
[16:07:26] (03PS13) 10Jbond: turnilo: add export mappings for network devices via query_resources [puppet] - 10https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) [16:08:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26855/console" [puppet] - 10https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: 10Jbond) [16:09:05] (03CR) 10KartikMistry: WIP: Add apertium helm chart (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) (owner: 10KartikMistry) [16:09:08] (03CR) 10Jbond: [V: 03+1] "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: 10Jbond) [16:09:18] (03CR) 10Ayounsi: [C: 03+1] turnilo: add export mappings for network devices via query_resources [puppet] - 10https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: 10Jbond) [16:09:31] (03PS10) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [16:09:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] turnilo: add export mappings for network devices via query_resources [puppet] - 10https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: 10Jbond) [16:09:40] (03PS14) 10Jbond: turnilo: add export mappings for network devices via query_resources [puppet] - 10https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) [16:12:22] XioNoX: fyi i just merged the turnilo change [16:12:25] PROBLEM - Ensure local MW versions match expected deployment on deploy1002 is CRITICAL: CRITICAL: 646 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [16:12:35] jbond42: thx! 
[16:12:45] np [16:17:53] (03PS3) 10JMeybohm: admin_ng: Generalization, prod values anf fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434) [16:18:49] (03CR) 10Jbond: [C: 03+2] (WIP) icinga_status: add schedualed downtime status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641995 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [16:19:05] (03PS8) 10Jbond: (WIP) icinga_status: add schedualed downtime status [puppet] - 10https://gerrit.wikimedia.org/r/641995 (https://phabricator.wikimedia.org/T268211) [16:19:18] (03CR) 10Jdlrobson: [C: 04-1] Move disabling sitenotice on wikimedia wikis to mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173) (owner: 10Florianschmidtwelzow) [16:19:50] (03CR) 10jerkins-bot: [V: 04-1] (WIP) icinga_status: add schedualed downtime status [puppet] - 10https://gerrit.wikimedia.org/r/641995 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [16:21:20] (03PS7) 10Razzi: superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [16:22:23] (03PS9) 10Jbond: icinga_status: add schedualed downtime status [puppet] - 10https://gerrit.wikimedia.org/r/641995 (https://phabricator.wikimedia.org/T268211) [16:23:28] (03CR) 10Jbond: [C: 03+2] icinga_status: add schedualed downtime status [puppet] - 10https://gerrit.wikimedia.org/r/641995 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [16:23:39] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:23:43] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:33] (03PS11) 10KartikMistry: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [16:25:36] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:59] (03CR) 10Ayounsi: Run Homer during the decom cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [16:26:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (10wiki_willy) a:05wiki_willy→03Cmjohnson [16:26:23] (03PS10) 10Jbond: icinga_status: add scheduled downtime status [puppet] - 10https://gerrit.wikimedia.org/r/641995 (https://phabricator.wikimedia.org/T268211) [16:26:28] thanks jbond42 :] [16:28:00] mforns: you're welcome :D what for [16:28:20] merging my patch [16:28:31] ahh no probs [16:29:03] (03CR) 10Jbond: [C: 03+2] disable-puppet: add username to disable message [puppet] - 10https://gerrit.wikimedia.org/r/643945 (owner: 10Jbond) [16:29:20] (03PS1) 10MSantos: mobileapps: bump to 2020-12-02-162223-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644867 [16:32:24] (03PS1) 10Jbond: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/644869 [16:33:14] (03PS1) 10MSantos: wikifeeds: bump to 2020-12-02-143136-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644870 [16:34:01] (03CR) 10Jbond: [C: 03+2] test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/644869 (owner: 10Jbond) [16:34:37] (03PS1) 10Jbond: Revert "test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/644811 [16:34:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "test puppet-merge" [puppet] - 
10https://gerrit.wikimedia.org/r/644811 (owner: 10Jbond) [16:36:19] (03CR) 10Dzahn: "Isn't this just another case of "needs full path in command line" like I saw passing by for other cron->systemd conversions?" [puppet] - 10https://gerrit.wikimedia.org/r/644662 (owner: 10Jcrespo) [16:37:34] (03CR) 10Clarakosi: "Will do!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [16:37:48] (03CR) 10Volans: [C: 04-1] "Looks nice, minor improvements inline" (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [16:37:50] (03CR) 10Dzahn: "I said that because in https://gerrit.wikimedia.org/r/c/operations/puppet/+/636410/5/modules/profile/manifests/mariadb/backup/transfer.pp " [puppet] - 10https://gerrit.wikimedia.org/r/644662 (owner: 10Jcrespo) [16:41:53] (03PS1) 10Ayounsi: .gitignore .venv [software/homer] - 10https://gerrit.wikimedia.org/r/644871 [16:41:55] (03PS1) 10Ayounsi: Add per device _get_vlans() [software/homer] - 10https://gerrit.wikimedia.org/r/644872 [16:42:21] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Patch-For-Review, 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10Dzahn) `python3-pil` is now installed across the board: https://debmonitor.wikimedia.org/packages/python3-pil [16:44:31] (03PS7) 10Jbond: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [16:44:47] (03CR) 10Dzahn: "I noticed there is a typo in the command line "ubunut"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [16:46:50] (03CR) 10Jbond: [C: 03+1] "Sorry this one slipped through the gaps lgtm now" [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [16:47:51] (03PS2) 10Ayounsi: Add per device 
_get_vlans() [software/homer] - 10https://gerrit.wikimedia.org/r/644872 [16:48:19] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:42] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/644871 (owner: 10Ayounsi) [16:49:15] (03CR) 10Dzahn: zookeeper: configure test-eqiad single-node cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [16:49:37] (03CR) 10Ayounsi: [C: 03+2] .gitignore .venv [software/homer] - 10https://gerrit.wikimedia.org/r/644871 (owner: 10Ayounsi) [16:50:01] (03PS14) 10Jbond: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [16:50:18] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] helm: Split repo update in 2 systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/644793 (owner: 10Alexandros Kosiaris) [16:51:09] (03CR) 10Volans: [C: 04-1] "Missing tests, looks good otherwise, small nits inline." 
(032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/644872 (owner: 10Ayounsi) [16:52:25] jbond42: confirmed working as expected [16:52:48] (03PS4) 10Ssingh: Initial commit of the knead-wikidough test suite [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/639838 (https://phabricator.wikimedia.org/T267424) [16:52:52] (03PS15) 10Jbond: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [16:52:56] XioNoX: awesome :D [16:53:01] (03CR) 10jerkins-bot: [V: 04-1] Initial commit of the knead-wikidough test suite [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/639838 (https://phabricator.wikimedia.org/T267424) (owner: 10Ssingh) [16:53:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26858/console" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [16:54:13] (03CR) 10jerkins-bot: [V: 04-1] labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [16:54:49] PROBLEM - Host ms-be2061 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:25] RECOVERY - Host ms-be2061 is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [16:55:58] (03CR) 10Ssingh: "This is still failing as expected because of `assert ssl.HAS_TLSv1_3` and T241195. Let's not merge before that; we will run "recheck" once" [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/639838 (https://phabricator.wikimedia.org/T267424) (owner: 10Ssingh) [16:57:08] (03PS16) 10Jbond: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [16:57:14] (03PS3) 10Ayounsi: Add per device _get_vlans() [software/homer] - 10https://gerrit.wikimedia.org/r/644872 [16:57:29] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:06] jouncebot: now [17:01:06] No deployments scheduled for the next 1 hour(s) and 58 minute(s) [17:01:59] (03CR) 10Jbond: "need to pop to a meeting so this was a quick pass" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [17:05:28] 10Operations, 10Puppet, 10Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10elukey) >>! In T268974#6663282, @gerritbot wrote: > Change 644784 **merged** by Jbond: > [operations/puppet@production] systemd: add validate command... [17:05:37] (03PS1) 10Jdlrobson: Disable QuickSurvey tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644873 (https://phabricator.wikimedia.org/T269053) [17:06:07] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01011 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:09:29] (03CR) 10Muehlenhoff: "The typo was fixed in a followup patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [17:11:48] !log uploading scap 3.16.0-1 [17:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:02] end of year = when everyone tries to get their backlog and queues smaller = lots of merges of stuff that had been waiting a while [17:12:57] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:47] (03PS5) 10JMeybohm: Add calico helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) [17:17:08] (03CR) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [17:17:47] (03CR) 10JMeybohm: "> The chart is currently still missing all the necessary config objects like IPPools and BGPConfig" [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [17:20:57] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop 'master' https://puppet-compiler.wmflabs.org/compiler1001/26859/ noop 'state' https://puppet-compiler.wmflabs.org/compiler1003/26860" [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:24:18] (03CR) 10Dzahn: "noop on puppetmaster1001, pybal-test2003, mwmaint1002, mwdebug1002.." 
[puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:25:05] (03CR) 10Bstorm: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/643337 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [17:25:30] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1010.eqiad.wmnet [17:25:32] jbond42: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/644784/4/modules/systemd/manifests/unit.pp#41 systemd-analyze is in /usr/bin not /bin afaics :| [17:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:18] (03CR) 10Bstorm: [C: 03+2] wmcs_wheel_of_misfortune: add mysql to exempt shells [puppet] - 10https://gerrit.wikimedia.org/r/643337 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [17:28:05] (03PS7) 10Dzahn: httpd/simplelamp2: add parameter to not purge manual config [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) [17:29:32] (03CR) 10Filippo Giunchedi: "Change is breaking puppet on a newly provisioned ms-be host: systemd-analyze is in /usr/bin. even with a bandaid then verify fails. e.g." 
[puppet] - 10https://gerrit.wikimedia.org/r/644784 (https://phabricator.wikimedia.org/T268974) (owner: 10Jbond) [17:29:37] godog: ack thanks, in a meeting at the moment but I think that CR needs to be reverted; elukey also reported a separate issue [17:29:59] jbond42: ack, ok yeah revert sounds good to me too [17:30:34] I'll revert in the meantime [17:30:48] (03PS1) 10Filippo Giunchedi: Revert "systemd: add validate command for unit files" [puppet] - 10https://gerrit.wikimedia.org/r/644812 [17:31:02] looking for +1s on ^ [17:31:04] ah nice finding godog, I didn't see the path issue :( [17:31:52] yeah, looking forward to merged /usr [17:31:58] (03CR) 10Dzahn: [C: 03+1] Revert "systemd: add validate command for unit files" [puppet] - 10https://gerrit.wikimedia.org/r/644812 (owner: 10Filippo Giunchedi) [17:32:18] apt install usrmerge if you are curious [17:32:32] mutante: thanks! [17:32:50] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "systemd: add validate command for unit files" [puppet] - 10https://gerrit.wikimedia.org/r/644812 (owner: 10Filippo Giunchedi) [17:34:01] np, yea, simple revert is the safest when touching like all systemd units theoretically [17:34:40] indeed [17:34:53] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2061.codfw.wmnet'] ` and were **ALL** successful. [17:36:45] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2143.codfw.wmnet ` The log can be found in `/var/log/wmf-... 
[17:37:47] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10Papaul) [17:39:29] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) @Jclark-ctr were you able to contact HP again? Host is again down so it can be managed at any time. [17:40:49] godog: thx [17:41:14] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10Papaul) 05Open→03Resolved @fgiunchedi complete [17:41:35] jbond42: np, there's likely other issues with validate afaics besides the path [17:42:33] 10Operations, 10ops-codfw: codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T269266 (10Papaul) [17:44:38] 10Operations, 10ops-codfw: codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T269266 (10Papaul) p:05Triage→03Medium [17:44:45] (03PS1) 10Bstorm: wmcs_wheel_of_misfortune: avoid race condition with proc info [puppet] - 10https://gerrit.wikimedia.org/r/644878 (https://phabricator.wikimedia.org/T266300) [17:46:08] yes, I think it may do some file extension checking, but I need to double check. thanks for the revert, will take another look tomorrow [17:46:57] (03CR) 10Dzahn: "ACK, and thanks for merging this!" [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [17:47:23] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:35] (03CR) 10Dzahn: "did we talk about the change from user "mirror" to "root"? I may have forgotten." 
[puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [17:49:52] (03CR) 10Dzahn: mirrors: replace cron jobs with systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [17:49:52] XioNoX: we're about to start netflow migration from wmf database to event database [17:50:07] ok! [17:50:31] 10Operations, 10Platform Team Workboards (Green): Move maps20[05-10] to production - https://phabricator.wikimedia.org/T266820 (10hnowlan) 05Open→03Resolved [17:50:34] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10hnowlan) [17:51:10] (03PS1) 10Ladsgroup: Remove var_dump() left by mistake [extensions/UrlShortener] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644814 (https://phabricator.wikimedia.org/T269264) [17:51:27] (03CR) 10Daimona Eaytoy: [C: 03+1] Remove var_dump() left by mistake [extensions/UrlShortener] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644814 (https://phabricator.wikimedia.org/T269264) (owner: 10Ladsgroup) [17:51:29] (03CR) 10Ladsgroup: [C: 03+2] Remove var_dump() left by mistake [extensions/UrlShortener] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644814 (https://phabricator.wikimedia.org/T269264) (owner: 10Ladsgroup) [17:51:50] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [17:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:59] (03CR) 10Dzahn: "we also need to manually remove cronjobs (unless we first set them to absent in code). 
In this case it's just 1 server, sodium, so manual " [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [17:52:26] jouncebot: now [17:52:26] No deployments scheduled for the next 1 hour(s) and 7 minute(s) [17:52:50] (03PS2) 10Urbanecm: Add all subdomains of artsdatabanken.no to the wgCopyUploadsDomains allowlist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644826 (https://phabricator.wikimedia.org/T267784) [17:53:18] (03CR) 10Urbanecm: [C: 03+2] Add all subdomains of artsdatabanken.no to the wgCopyUploadsDomains allowlist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644826 (https://phabricator.wikimedia.org/T267784) (owner: 10Urbanecm) [17:53:28] (03PS8) 10Razzi: superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [17:53:35] !log sodium - commenting "sync ubuntu mirror / sync tails mirror" cronjobs in the crontab of user 'mirror' after they were replaced by systemd timers by gerrit:636082 [17:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:44] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:08] Amir1: uh, sorry, I just +2'ed a config patch (fix for url upload allowlist), fyi [17:54:27] no worries [17:54:32] I'm sorry [17:54:45] I'm trying to figure out why 1.20 is not a git repo [17:55:03] (03Merged) 10jenkins-bot: Add all subdomains of artsdatabanken.no to the wgCopyUploadsDomains allowlist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644826 (https://phabricator.wikimedia.org/T267784) (owner: 10Urbanecm) [17:55:05] np, your backport will get probably merged later anyway, just informing so we don't clash :) [17:55:10] (wdym by 1.20 not being a git repo?) 
[17:55:13] RECOVERY - tileratorui on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [17:55:36] `Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).` [17:55:44] this is the error, it's git [17:56:15] Amir1: it looks like a valid git repo to me [17:56:18] are you in the staging dir? [17:56:22] this is my output https://www.irccloud.com/pastebin/2jLkAEes/ [17:56:54] 10Operations, 10observability, 10CAS-SSO: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10RLazarus) [17:56:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10RobH) [17:57:03] syncing mine [17:57:08] (03Merged) 10jenkins-bot: Remove var_dump() left by mistake [extensions/UrlShortener] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644814 (https://phabricator.wikimedia.org/T269264) (owner: 10Ladsgroup) [17:57:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10RobH) So the checklist wasn't updated for this task, but I'm told it's been moved, with its ports set up and puppet updated. However, it is... [17:57:32] of course, it was /srv/mediawiki/ [17:57:35] thanks [17:57:39] no problem :) [17:57:39] It's been a long day [17:58:01] (03PS1) 10Ppchelko: Use ParserOutput::extensionData instead of dynamic properties. 
[extensions/CategoryTree] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644815 (https://phabricator.wikimedia.org/T269235) [17:58:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 24da542256f7c4cc955365ccd9739354f7162cc5: Add all subdomains of artsdatabanken.no to the wgCopyUploadsDomains allowlist for commonswiki (T267784) (duration: 01m 06s) [17:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:09] T267784: Add artsdatabanken.no to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T267784 [17:58:12] Amir1: okay, I'm done - over to you :) [17:58:34] cool [17:58:36] thanks [17:58:56] (03PS1) 10RobH: updating cloudvirt1025 mac address [puppet] - 10https://gerrit.wikimedia.org/r/644881 (https://phabricator.wikimedia.org/T266187) [17:59:47] (03PS1) 10Ppchelko: Parser: use setter instead of accessing ParserOutput property [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644816 (https://phabricator.wikimedia.org/T269236) [18:00:00] (03CR) 10RobH: [C: 03+2] updating cloudvirt1025 mac address [puppet] - 10https://gerrit.wikimedia.org/r/644881 (https://phabricator.wikimedia.org/T266187) (owner: 10RobH) [18:00:05] (03PS1) 10Ppchelko: ParserOutput: temporary undeprecate getting dynamic properties. 
[core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644817 (https://phabricator.wikimedia.org/T263851) [18:00:20] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job: Move netflow to event database [puppet] - 10https://gerrit.wikimedia.org/r/643765 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns) [18:00:24] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.20/extensions/UrlShortener/includes/UrlShortenerUtils.php: [[gerrit:644879|Remove var_dump() left by mistake (duration: 01m 09s) [18:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:31] (03PS4) 10Elukey: analytics::refinery::job: Move netflow to event database [puppet] - 10https://gerrit.wikimedia.org/r/643765 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns) [18:00:43] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2143.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2143.codfw.wmnet'] ` [18:01:42] (03CR) 10Elukey: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/643765 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns) [18:03:44] I'm done too [18:03:48] see you later [18:04:04] see you! [18:04:06] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10Papaul) [18:08:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10RobH) [18:11:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10RobH) a:05Cmjohnson→03Andrew >>! 
In T266187#6663977, @gerritbot wrote: > Change 644881 **merged** by RobH: > [operations/puppet@produc... [18:12:53] (03CR) 10Bstorm: "> Patch Set 22:" [puppet] - 10https://gerrit.wikimedia.org/r/642570 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [18:13:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:13:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10RobH) 05Open→03Resolved a:05Andrew→03RobH I'm closing this onsite task and creating a reimage task. [18:14:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10RobH) [18:16:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10RobH) [18:16:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:20:03] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [18:20:03] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [18:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:35] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:22] XioNoX: we finished the migration successfully :] [18:21:30] congrats! 
[18:22:25] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:38] (03PS1) 10RobH: cloudvirt10[26-30] mac updates [puppet] - 10https://gerrit.wikimedia.org/r/644886 (https://phabricator.wikimedia.org/T216195) [18:28:02] (03CR) 10RobH: [C: 03+2] cloudvirt10[26-30] mac updates [puppet] - 10https://gerrit.wikimedia.org/r/644886 (https://phabricator.wikimedia.org/T216195) (owner: 10RobH) [18:30:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1029 with 10G interfaces - https://phabricator.wikimedia.org/T266206 (10RobH) 05Open→03Resolved a:05Cmjohnson→03RobH Please note I was requested to take a look at the batch of 10G moves for cloudvirt10[... [18:30:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:31:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1029 with 10G interfaces - https://phabricator.wikimedia.org/T266206 (10RobH) [18:32:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1026 with 10G interfaces - https://phabricator.wikimedia.org/T266281 (10RobH) 05Open→03Resolved a:05Cmjohnson→03RobH Please note I was requested to take a look at the batch of 10G moves for cloudvirt10[... [18:32:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:32:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1027 with 10G interfaces - https://phabricator.wikimedia.org/T266369 (10RobH) 05Open→03Resolved a:05Cmjohnson→03RobH Please note I was requested to take a look at the batch of 10G moves for cloudvirt10[... 
[18:32:15] (03CR) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris) [18:32:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:32:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1028 with 10G interfaces - https://phabricator.wikimedia.org/T266514 (10RobH) 05Open→03Resolved a:05Cmjohnson→03RobH Please note I was requested to take a look at the batch of 10G moves for cloudvirt10[... [18:32:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:32:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10RobH) 05Open→03Resolved a:05Cmjohnson→03RobH Please note I was requested to take a look at the batch of 10G moves for cloudvirt10[... 
[18:32:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:33:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1026 with 10G interfaces - https://phabricator.wikimedia.org/T266281 (10RobH) [18:33:31] (03PS1) 10Mforns: analytics::refinery::job::druid_load.pp: correct druid_datasource param [puppet] - 10https://gerrit.wikimedia.org/r/644889 (https://phabricator.wikimedia.org/T231339) [18:33:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1028 with 10G interfaces - https://phabricator.wikimedia.org/T266514 (10RobH) [18:33:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10RobH) [18:34:02] (03CR) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [18:34:08] just spam a bit why dont ya wikibugs =P [18:34:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:34:36] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::druid_load.pp: correct druid_datasource param [puppet] - 10https://gerrit.wikimedia.org/r/644889 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns) [18:37:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) Please note I've corrected the MAC entries for cloudvirt10[25-30], they were all off by two characters. This was likely caused by polling the wrong interfac... 
[18:41:07] (03CR) 10jerkins-bot: [V: 04-1] Parser: use setter instead of accessing ParserOutput property [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644816 (https://phabricator.wikimedia.org/T269236) (owner: 10Ppchelko) [18:41:47] (03CR) 10Ppchelko: "recheck" [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644816 (https://phabricator.wikimedia.org/T269236) (owner: 10Ppchelko) [18:42:32] (03PS1) 10Volans: wmf-auto-reimage: force a second puppet run [puppet] - 10https://gerrit.wikimedia.org/r/644892 (https://phabricator.wikimedia.org/T269187) [18:43:11] (03PS1) 10Mforns: analytics::refinery::job: Bump up jar versions for netflow [puppet] - 10https://gerrit.wikimedia.org/r/644893 (https://phabricator.wikimedia.org/T231339) [18:44:09] (03PS9) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [18:44:11] (03PS9) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 [18:44:13] (03PS3) 10Alexandros Kosiaris: package_from_component: Move to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/644509 [18:44:15] (03PS6) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 [18:44:20] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock provisioning" [puppet] - 10https://gerrit.wikimedia.org/r/644892 (https://phabricator.wikimedia.org/T269187) (owner: 10Volans) [18:44:47] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job: Bump up jar versions for netflow [puppet] - 10https://gerrit.wikimedia.org/r/644893 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns) [18:51:10] (03PS4) 10Ppchelko: Add EventStream config for link recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643230 (https://phabricator.wikimedia.org/T261407) (owner: 10Gergő Tisza) [18:51:23] (03CR) 10Ppchelko: [C: 03+1] Add EventStream config for link 
recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643230 (https://phabricator.wikimedia.org/T261407) (owner: 10Gergő Tisza) [18:52:45] (03PS10) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [18:54:43] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26862/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [18:59:23] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on labstore1006 - https://phabricator.wikimedia.org/T268281 (10RobH) >>! In T268281#6663112, @Cmjohnson wrote: > @RobH Can you attempt to pull the ADU report off this server. I cannot get into the U/I through mgmt. Let me know if you... [19:00:04] hashar and twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T1900). [19:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. 
[19:06:46] I forgot to add https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/643230 to the config window; I'll deploy it now [19:07:03] it's a no-op, nothing is using the stream yet [19:08:02] tgr_: +1 [19:08:03] (03CR) 10Gergő Tisza: [C: 03+2] Add EventStream config for link recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643230 (https://phabricator.wikimedia.org/T261407) (owner: 10Gergő Tisza) [19:08:53] (03Merged) 10jenkins-bot: Add EventStream config for link recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643230 (https://phabricator.wikimedia.org/T261407) (owner: 10Gergő Tisza) [19:09:42] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26863/console" [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [19:10:48] (03CR) 10Alexandros Kosiaris: [V: 03+1] prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [19:14:10] (03PS10) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 [19:14:12] (03PS4) 10Alexandros Kosiaris: package_from_component: Move to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/644509 [19:14:15] (03PS7) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 [19:14:49] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10WMDE-leszek) [19:17:12] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:643230|Add EventStream config for link recommendations (T261407)]] (duration: 01m 06s) [19:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:19] T261407: Add a link engineering: Create event 
for event gate to update search index after obtaining link recommendations - https://phabricator.wikimedia.org/T261407 [19:17:28] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26864/console" [puppet] - 10https://gerrit.wikimedia.org/r/644235 (owner: 10Alexandros Kosiaris) [19:19:33] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26865/console" [puppet] - 10https://gerrit.wikimedia.org/r/644509 (owner: 10Alexandros Kosiaris) [19:20:59] !log restarted turnilo to clear deleted datasource [19:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:33] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26866/console" [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris) [19:22:45] (03PS2) 10Mforns: analytics::refinery::job::druid_load.pp: add fields to netflow long term [puppet] - 10https://gerrit.wikimedia.org/r/644862 (https://phabricator.wikimedia.org/T254332) [19:23:17] (03CR) 10Dzahn: "there are still some issues here but debugging them:" [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:23:57] ottomata: can you please look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/644862 and merge if OK? [19:24:28] it's just adding new fields to be ingested into druid for netflow long term [19:26:19] (03PS9) 10Razzi: superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [19:34:50] or razzi! can you look at my puppet patch please? 
^ [19:34:53] (03PS10) 10Razzi: superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [19:36:20] mforns: do you need someone to review for correctness wrt druid, or just someone with +2 on ops/puppet? :) [19:37:21] heh cdanis I think just +2 (and check for typos!), the fields have been discussed and approved on Phab task [19:38:12] 10Operations, 10Puppet, 10Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10jcrespo) p:05Unbreak!→03Medium Good news: ` journalctl -u database-backups-snapshots Dec 02 19:00:01 cumin2001 systemd[1]: Started Generate mysq... [19:38:51] (03CR) 10CDanis: [C: 03+2] analytics::refinery::job::druid_load.pp: add fields to netflow long term [puppet] - 10https://gerrit.wikimedia.org/r/644862 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [19:39:20] akosiaris: okay to merge your change? [19:39:45] ah they're just labs/private [19:40:08] mforns: done :) [19:40:24] cdanis: thanks a lot! :] [19:45:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10ayounsi) [19:45:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1029 with 10G interfaces - https://phabricator.wikimedia.org/T266206 (10ayounsi) 05Resolved→03Open I see: ` [edit interfaces interface-range vlan-cloud-storage1-eqiad] member xe-0/0/30; member xe-0/... 
[19:47:59] (03PS10) 10Razzi: zookeeper: configure test-eqiad single-node cluster [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) [19:49:43] (03PS1) 10Dzahn: mirrors: use "1h" instead of "hourly" for interval [puppet] - 10https://gerrit.wikimedia.org/r/644905 (https://phabricator.wikimedia.org/T265138) [19:50:35] (03CR) 10Dzahn: [C: 03+2] mirrors: use "1h" instead of "hourly" for interval [puppet] - 10https://gerrit.wikimedia.org/r/644905 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:51:14] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26867/console" [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [19:51:21] (03PS1) 10Bstorm: wikireplicas: fix error in the logic for multiinstance selections [puppet] - 10https://gerrit.wikimedia.org/r/644906 (https://phabricator.wikimedia.org/T268312) [19:52:02] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2144.codfw.wmnet ` The log can be found in `/var/log/wmf-... [19:56:00] !log sodium - systemctl restart update-tails-mirror.timer [19:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] hashar and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T2000). 
[20:00:28] (03PS1) 10Dzahn: mirrors: use 'mirror' user instead of root for periodic jobs [puppet] - 10https://gerrit.wikimedia.org/r/644908 [20:01:11] (03CR) 10Bstorm: [C: 03+2] wikireplicas: fix error in the logic for multiinstance selections [puppet] - 10https://gerrit.wikimedia.org/r/644906 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [20:01:27] (03CR) 10Dzahn: [C: 03+2] mirrors: use 'mirror' user instead of root for periodic jobs [puppet] - 10https://gerrit.wikimedia.org/r/644908 (owner: 10Dzahn) [20:02:17] mutante: can I merge your patch? [20:02:17] bstorm: yes, you can merge multiple :) [20:02:22] cool [20:02:23] :) [20:02:40] I could tell from the lock file message, heh. thx [20:07:32] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:00] (03CR) 10Dzahn: "checked after this if any files under /srv/mirrors are root-owned, but it's just the index.html, all ok" [puppet] - 10https://gerrit.wikimedia.org/r/644908 (owner: 10Dzahn) [20:08:08] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:08:11] !log sodium systemctl reset-failed [20:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:00] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:26] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2144.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2144.codfw.wmnet'] ` [20:13:40] (03CR) 10Ottomata: [C: 03+1] zookeeper: configure test-eqiad single-node cluster [puppet] - 10https://gerrit.wikimedia.org/r/642497 
(https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [20:14:17] (03CR) 10Ottomata: [C: 03+1] superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [20:14:19] !log sodium - started update-ubuntu-mirror systemd timer - debugging why it fails; manually syncing with sudo -u mirror /usr/local/sbin/update-ubuntu-mirror [20:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:07] (03PS11) 10Razzi: zookeeper: configure test-eqiad single-node cluster [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) [20:16:08] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:12] ACKNOWLEDGEMENT - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn debugging in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:30] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2020-12-02-143136-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644870 (owner: 10MSantos) [20:18:46] (03Merged) 10jenkins-bot: wikifeeds: bump to 2020-12-02-143136-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644870 (owner: 10MSantos) [20:19:15] (03CR) 10Razzi: [C: 03+2] zookeeper: configure test-eqiad single-node cluster [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [20:19:30] (03PS1) 10Ottomata: eventgate-analytics-external - use refactored batch based stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/644912 (https://phabricator.wikimedia.org/T266573) [20:20:03] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [20:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:40] (03PS2) 10Ottomata: eventgate-analytics-external - use refactored batch based stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/644912 (https://phabricator.wikimedia.org/T266573) [20:21:21] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[20:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:49] (03CR) 10Dzahn: "still issues after this:" [puppet] - 10https://gerrit.wikimedia.org/r/644908 (owner: 10Dzahn) [20:21:54] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - use refactored batch based stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/644912 (https://phabricator.wikimedia.org/T266573) (owner: 10Ottomata) [20:22:39] (03Merged) 10jenkins-bot: eventgate-analytics-external - use refactored batch based stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/644912 (https://phabricator.wikimedia.org/T266573) (owner: 10Ottomata) [20:22:55] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [20:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:10] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2020-12-02-162223-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644867 (owner: 10MSantos) [20:23:27] (03PS11) 10Razzi: superset: add cache to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/644672 (https://phabricator.wikimedia.org/T268784) [20:23:56] (03CR) 10Dzahn: "ahaha. this has a "#/bin/dash" shebang" [puppet] - 10https://gerrit.wikimedia.org/r/644908 (owner: 10Dzahn) [20:24:22] (03Merged) 10jenkins-bot: mobileapps: bump to 2020-12-02-162223-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644867 (owner: 10MSantos) [20:25:37] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[20:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:29] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10ssingh) p:05Triage→03Medium a:03ssingh [20:27:35] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [20:27:35] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [20:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:51] (03PS1) 10Ottomata: eventgate-analytics-external - only use remote schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/644914 [20:32:11] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [20:32:11] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [20:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:26] (03PS2) 10Ottomata: eventgate-analytics-external - only use remote schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/644914 [20:33:53] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2144.codfw.wmnet ` The log can be found in `/var/log/wmf-... 
[20:34:03] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:34:42] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10Ottomata) Approved. @WMDE-leszek will also need `nda` (or `wmf`?) LDAP group membership, as well as [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a... [20:35:02] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - only use remote schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/644914 (owner: 10Ottomata) [20:35:09] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:17] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:36:17] (03Merged) 10jenkins-bot: eventgate-analytics-external - only use remote schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/644914 (owner: 10Ottomata) [20:38:12] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [20:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:55] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [20:41:56] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[20:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:54] (03PS1) 10Dzahn: mirrors: fix shebang line in update-ubuntu-mirror script [puppet] - 10https://gerrit.wikimedia.org/r/644920 (https://phabricator.wikimedia.org/T265138) [20:44:21] (03PS1) 10RLazarus: httpbb: Rearrange appserver tests. [puppet] - 10https://gerrit.wikimedia.org/r/644921 [20:44:43] (03CR) 10CDanis: [C: 03+1] "looks good thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/643746 (owner: 10Jbond) [20:45:36] (03CR) 10CDanis: [C: 03+1] site: assign roles for all ms-be / ms-fe hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644847 (https://phabricator.wikimedia.org/T265419) (owner: 10Filippo Giunchedi) [20:46:15] (03Abandoned) 10CDanis: depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/622875 (owner: 10CDanis) [20:46:23] (03Abandoned) 10CDanis: prepend eqiad/eqord [homer/public] - 10https://gerrit.wikimedia.org/r/627923 (owner: 10CDanis) [20:48:09] (03CR) 10Dzahn: [C: 03+2] mirrors: fix shebang line in update-ubuntu-mirror script [puppet] - 10https://gerrit.wikimedia.org/r/644920 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [20:49:13] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [20:49:13] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[20:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:13] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:23] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:31] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2144.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2144.codfw.wmnet'] ` [20:53:06] (03CR) 10Dzahn: "needed 3 follow-ups all for different reasons:" [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [20:53:08] (03CR) 10Clarakosi: [C: 03+2] Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [20:54:33] (03Merged) 10jenkins-bot: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [20:54:43] (03PS1) 10Ottomata: eventgate-{main,analytics,logging-external} - bump to 2020-12-02-151648-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644922 (https://phabricator.wikimedia.org/T266573) [20:56:22] !log deploying backports for 1.36.0-wmf.20 refs T263186 [20:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:30] T263186: 1.36.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T263186 [20:57:44] twentyafterfour: mind deploying some more backports? https://phabricator.wikimedia.org/T263186#6664022 [21:00:04] chrisalbon and accraze: #bothumor I � Unicode. 
All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201202T2100). [21:00:51] (03PS1) 10Ottomata: eventstreams - bump to 2020-12-02-204016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644923 [21:01:15] (03CR) 10Dzahn: [C: 03+1] "makes sense to me. things are just being moved around and it seems logical to me what is separated from the "main" ones" [puppet] - 10https://gerrit.wikimedia.org/r/644921 (owner: 10RLazarus) [21:01:24] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:33] (03CR) 10Ottomata: [C: 03+2] eventstreams - bump to 2020-12-02-204016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644923 (owner: 10Ottomata) [21:01:35] (03PS1) 10Ssingh: admin: add wmde-leszek to analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/644924 (https://phabricator.wikimedia.org/T269284) [21:01:54] (03PS2) 10Clarakosi: Update change-prop to 0.10.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/644261 (owner: 10Ppchelko) [21:02:28] (03PS3) 10Ottomata: Add new service eventstreams-internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) [21:02:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10ssingh) [21:02:47] (03Merged) 10jenkins-bot: eventstreams - bump to 2020-12-02-204016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644923 (owner: 10Ottomata) [21:02:52] Pchelolo: sure [21:03:04] thank you! [21:03:27] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:55] (03CR) 10RLazarus: [C: 03+2] httpbb: Rearrange appserver tests. 
[puppet] - 10https://gerrit.wikimedia.org/r/644921 (owner: 10RLazarus) [21:04:39] (03PS3) 10Clarakosi: Update change-prop to 0.10.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/644261 (owner: 10Ppchelko) [21:06:23] (03CR) 10Clarakosi: [C: 03+2] Update change-prop to 0.10.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/644261 (owner: 10Ppchelko) [21:07:41] (03Merged) 10jenkins-bot: Update change-prop to 0.10.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/644261 (owner: 10Ppchelko) [21:07:59] (03CR) 1020after4: [C: 03+2] Use ParserOutput::extensionData instead of dynamic properties. [extensions/CategoryTree] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644815 (https://phabricator.wikimedia.org/T269235) (owner: 10Ppchelko) [21:09:11] (03CR) 1020after4: [C: 03+2] Parser: use setter instead of accessing ParserOutput property [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644816 (https://phabricator.wikimedia.org/T269236) (owner: 10Ppchelko) [21:09:54] (03CR) 1020after4: [C: 03+2] ParserOutput: temporary undeprecate getting dynamic properties. [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644817 (https://phabricator.wikimedia.org/T263851) (owner: 10Ppchelko) [21:11:25] !log clarakosi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [21:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:58] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10ssingh) [21:12:53] (03Merged) 10jenkins-bot: Use ParserOutput::extensionData instead of dynamic properties. 
[extensions/CategoryTree] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644815 (https://phabricator.wikimedia.org/T269235) (owner: 10Ppchelko) [21:14:27] !log clarakosi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . [21:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:27] !log clarakosi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [21:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:26] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [21:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:13] 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10The_Equalizer) Sounds like you chaps have a very quick handle on things - great stuff and keep up the good work. [21:19:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10ssingh) Change has been requested by a WMDE manager themselves (https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#wmde_access) and approved by Andrew.... [21:20:23] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [21:20:23] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[21:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:39] (03PS11) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 [21:22:41] (03PS5) 10Alexandros Kosiaris: package_from_component: Move to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/644509 [21:22:43] (03PS8) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 [21:23:25] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [21:23:25] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [21:23:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC is as expected, latest is a rebase, merging" [puppet] - 10https://gerrit.wikimedia.org/r/644235 (owner: 10Alexandros Kosiaris) [21:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC at previous patch set was as expected, latest is a rebase, merging" [puppet] - 10https://gerrit.wikimedia.org/r/644509 (owner: 10Alexandros Kosiaris) [21:24:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC at previous patchset was as expected, latest is a rebase, merging" [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris) [21:25:34] Pchelolo: mediawiki-quibble-apitests-vendor-docker failed on 644817,1 [21:26:07] not sure what that's about, it says there is an error with lodash when running `npm ci` https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-docker/25335/ [21:29:20] (03CR) 10CDanis: [C: 03+1] admin: add wmde-leszek to analytics groups [puppet] - 
10https://gerrit.wikimedia.org/r/644924 (https://phabricator.wikimedia.org/T269284) (owner: 10Ssingh) [21:30:00] (03CR) 10Ssingh: [C: 03+2] admin: add wmde-leszek to analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/644924 (https://phabricator.wikimedia.org/T269284) (owner: 10Ssingh) [21:31:39] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:32:11] ^ oops that's mine, fixing [21:34:59] (03PS1) 10RLazarus: httpbb: Remove httpbb::test_suite for baseurls.yaml [puppet] - 10https://gerrit.wikimedia.org/r/644926 [21:35:01] (03Merged) 10jenkins-bot: Parser: use setter instead of accessing ParserOutput property [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644816 (https://phabricator.wikimedia.org/T269236) (owner: 10Ppchelko) [21:35:08] (03CR) 10jerkins-bot: [V: 04-1] ParserOutput: temporary undeprecate getting dynamic properties. [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644817 (https://phabricator.wikimedia.org/T263851) (owner: 10Ppchelko) [21:35:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [21:35:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for WMDE-leszek - https://phabricator.wikimedia.org/T269284 (10ssingh) @WMDE-leszek: This request has been merged, you should have access! I will leave this task open for another day or so but please let me know if t... [21:35:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1029 with 10G interfaces - https://phabricator.wikimedia.org/T266206 (10RobH) 05Open→03Resolved >>! In T266206#6664557, @ayounsi wrote: > I see: > ` > [edit interfaces interface-range vlan-cloud-storage1-eq... 
[21:39:14] (03CR) 10Dzahn: [C: 03+1] httpbb: Remove httpbb::test_suite for baseurls.yaml [puppet] - 10https://gerrit.wikimedia.org/r/644926 (owner: 10RLazarus) [21:39:44] (03CR) 10RLazarus: [C: 03+2] httpbb: Remove httpbb::test_suite for baseurls.yaml [puppet] - 10https://gerrit.wikimedia.org/r/644926 (owner: 10RLazarus) [21:40:31] (03PS1) 10Alexandros Kosiaris: k8s-staging: Add the ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/644929 [21:40:44] (03CR) 1020after4: "recheck" [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644816 (https://phabricator.wikimedia.org/T269236) (owner: 10Ppchelko) [21:41:19] (03CR) 1020after4: [C: 03+2] "recheck" [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644817 (https://phabricator.wikimedia.org/T263851) (owner: 10Ppchelko) [21:41:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s-staging: Add the ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/644929 (owner: 10Alexandros Kosiaris) [21:44:51] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10Papaul) [21:44:54] 10Operations, 10Librarization, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848 (10kaldari) Note that the LandingCheck extension only does its own geolocation lookup as a fallback (if Geo.country was not passed to it... 
[21:45:26] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10Papaul) 05Open→03Resolved @Marostegui all yours [21:46:55] (03PS1) 10Papaul: DHCP: Add MAC address for logstash203[345] [puppet] - 10https://gerrit.wikimedia.org/r/644930 (https://phabricator.wikimedia.org/T267420) [21:48:13] 10Operations, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/pipermail/wikija-l/ has broken encoding - https://phabricator.wikimedia.org/T269301 (10Urbanecm) [21:50:30] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for logstash203[345] [puppet] - 10https://gerrit.wikimedia.org/r/644930 (https://phabricator.wikimedia.org/T267420) (owner: 10Papaul) [21:51:25] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install logstash203[345] - https://phabricator.wikimedia.org/T267420 (10Papaul) [21:53:32] (03PS1) 10Alexandros Kosiaris: apt:package_from_component: Fix the ordering relationships [puppet] - 10https://gerrit.wikimedia.org/r/644932 [21:55:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] apt:package_from_component: Fix the ordering relationships [puppet] - 10https://gerrit.wikimedia.org/r/644932 (owner: 10Alexandros Kosiaris) [21:58:31] the puppet failure I know about is fixed, one or more unrelated things are still broken [22:03:19] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-depoly200[1-4] - https://phabricator.wikimedia.org/T267957 (10RobH) 05Open→03Invalid duplicate of T267670 [22:07:18] (03Merged) 10jenkins-bot: ParserOutput: temporary undeprecate getting dynamic properties. 
[core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644817 (https://phabricator.wikimedia.org/T263851) (owner: 10Ppchelko) [22:18:10] (03PS1) 10Bstorm: wikireplicas: Fix another mistake in the script [puppet] - 10https://gerrit.wikimedia.org/r/644934 (https://phabricator.wikimedia.org/T268312) [22:18:38] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26861/" [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) (owner: 10Dzahn) [22:24:06] (03CR) 10Bstorm: [C: 03+2] wikireplicas: Fix another mistake in the script [puppet] - 10https://gerrit.wikimedia.org/r/644934 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [22:27:57] 10Puppet, 10Cloud-VPS, 10Patch-For-Review: role::simplelamp takes ownership of all content in /etc/apache2/sites-enabled - https://phabricator.wikimedia.org/T169368 (10Dzahn) @bd808 @Freddy2001 I finally merged the proposed fix above. Now the httpd class has a new parameter that lets you toggle the purge be... [22:34:51] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) @jijiki Not sure, I think in the past the "official" instructions included rebooting once and that would also fix it. [22:38:46] !log twentyafterfour@deploy1001 Synchronized php-1.36.0-wmf.20/includes/parser/: Deploying backports for wmf.20 refs T263186 (duration: 01m 08s) [22:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:55] T263186: 1.36.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T263186 [22:39:32] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10Dzahn) @hdothiduc Alright, thank you. You're welcome. To clarify for us it's actually easiest to complete it now but either way is no problem. It... 
[22:43:05] !log twentyafterfour@deploy1001 Synchronized php-1.36.0-wmf.20/extensions/CategoryTree/: Deploying backport f6c2d74259b9 to wmf.20, bug: T269235 refs T263186 (duration: 01m 07s) [22:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:14] T269235: [X8ehRwpAICgAALHNCUsAAACI] /wiki/clamor ErrorException from line 157 of /srv/mediawiki/php-1.36.0-wmf.20/extensions/CategoryTree/includes/CategoryTreeHooks.php: ParserOutput::mCategoryTreeTag dynamic property write access deprecated [Called from CategoryTreeHooks::parserHook] - https://phabricator.wikimedia.org/T269235 [22:46:50] (03PS6) 10Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) [22:47:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [22:47:43] (03CR) 10jerkins-bot: [V: 04-1] Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [22:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:43] (03PS1) 10Reedy: Correctly forwardport LogstashFormatter from monolog/monolog 1.25.3 [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644822 (https://phabricator.wikimedia.org/T269286) [22:49:07] jouncebot: now [22:49:08] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [22:49:09] jouncebot: next [22:49:09] In 1 hour(s) and 10 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201203T0000) [22:49:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:58] Reedy: I'm done with backports [22:50:10] Cheers [22:50:33] I'll wait for CI on that, and confirm it fixes T269286 in prod [22:50:33] T269286: Monolog update removes 
exception object from logstash - https://phabricator.wikimedia.org/T269286 [23:00:20] (03CR) 10CDanis: "Thanks! Overall I think LGTM although I have some questions -- here's a first pass." (033 comments) [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite) [23:03:01] (03Abandoned) 10Florianschmidtwelzow: Move disabling sitenotice on wikimedia wikis to mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173) (owner: 10Florianschmidtwelzow) [23:06:25] 10Puppet, 10Cloud-VPS, 10Patch-For-Review: role::simplelamp takes ownership of all content in /etc/apache2/sites-enabled - https://phabricator.wikimedia.org/T169368 (10Dzahn) 05Open→03Resolved a:03Dzahn claiming it's resolved .. per IRC chat [23:07:48] (03PS1) 10Dzahn: httpbb: add test for 20.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/644942 (https://phabricator.wikimedia.org/T264367) [23:12:39] (03PS1) 10Andrew Bogott: Cloudvirt1025-1030: Update nic names for 10Gb/Buster [puppet] - 10https://gerrit.wikimedia.org/r/644943 (https://phabricator.wikimedia.org/T216195) [23:13:22] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt1025-1030: Update nic names for 10Gb/Buster [puppet] - 10https://gerrit.wikimedia.org/r/644943 (https://phabricator.wikimedia.org/T216195) (owner: 10Andrew Bogott) [23:13:46] (03CR) 10Reedy: [C: 03+2] Correctly forwardport LogstashFormatter from monolog/monolog 1.25.3 [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644822 (https://phabricator.wikimedia.org/T269286) (owner: 10Reedy) [23:16:51] (03CR) 10RLazarus: [C: 03+1] httpbb: add test for 20.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/644942 (https://phabricator.wikimedia.org/T264367) (owner: 10Dzahn) [23:18:51] 10Operations, 10Domains, 10Traffic, 10Patch-For-Review: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10Dzahn) @hdothiduc Just fyi, with 
the change above I added a formal test to proof that things work as expected even before the... [23:20:01] (03PS5) 10Cwhite: First attempt at a JSONSchema template generator utility. [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 [23:20:16] (03CR) 10Dzahn: [C: 03+2] httpbb: add test for 20.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/644942 (https://phabricator.wikimedia.org/T264367) (owner: 10Dzahn) [23:27:05] I encounter serious performance issue visiting MetaWiki from France. Am I the only one? [23:27:42] No one else has complained [23:27:45] Just meta? [23:30:03] In fact, fr.wikipedia too: images from Commons do not load [23:30:50] It lasts for about 20 minutes. [23:33:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [23:34:42] Pols12: Can you get https://wikitech.wikimedia.org/wiki/Report_connectivity_issue to load? [23:37:37] (03PS1) 10RLazarus: httpbb: Add test_suites for the new files [puppet] - 10https://gerrit.wikimedia.org/r/644946 [23:38:31] (03Merged) 10jenkins-bot: Correctly forwardport LogstashFormatter from monolog/monolog 1.25.3 [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644822 (https://phabricator.wikimedia.org/T269286) (owner: 10Reedy) [23:40:39] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26868/console" [puppet] - 10https://gerrit.wikimedia.org/r/644946 (owner: 10RLazarus) [23:41:10] (03CR) 10Dzahn: [C: 03+1] httpbb: Add test_suites for the new files [puppet] - 10https://gerrit.wikimedia.org/r/644946 (owner: 10RLazarus) [23:42:03] !log reedy@deploy1001 Synchronized php-1.36.0-wmf.20/includes/debug/logger/monolog/LogstashFormatter.php: T269286 (duration: 01m 07s) [23:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:11] T269286: Monolog update removes exception object from logstash - 
https://phabricator.wikimedia.org/T269286 [23:42:12] (03CR) 10Dzahn: [C: 03+2] httpbb: Add test_suites for the new files [puppet] - 10https://gerrit.wikimedia.org/r/644946 (owner: 10RLazarus) [23:43:14] (03PS2) 10Bstorm: wmcs_wheel_of_misfortune: avoid race condition with proc info [puppet] - 10https://gerrit.wikimedia.org/r/644878 (https://phabricator.wikimedia.org/T266300) [23:43:48] (03CR) 10Bstorm: [C: 03+2] wmcs_wheel_of_misfortune: avoid race condition with proc info [puppet] - 10https://gerrit.wikimedia.org/r/644878 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [23:43:52] This seems to be better now. Thanks for the link: if I meet a performance issue again, I am knowing how to report it. [23:44:30] (03CR) 10Cwhite: "Thanks for the feedback!" (033 comments) [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite) [23:47:00] Pols12: there is also https://performance.wikimedia.org btw [23:54:14] (03CR) 10Bstorm: [C: 03+2] wikireplicas: extend maintain_dbusers to multiinstance replicas [puppet] - 10https://gerrit.wikimedia.org/r/642570 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [23:55:53] (03PS1) 10Bstorm: Revert "wikireplicas: extend maintain_dbusers to multiinstance replicas" [puppet] - 10https://gerrit.wikimedia.org/r/644824 [23:57:52] (03CR) 10Bstorm: "The patch was getting:" [puppet] - 10https://gerrit.wikimedia.org/r/644824 (owner: 10Bstorm) [23:57:57] (03CR) 10Bstorm: [C: 03+2] Revert "wikireplicas: extend maintain_dbusers to multiinstance replicas" [puppet] - 10https://gerrit.wikimedia.org/r/644824 (owner: 10Bstorm)