[00:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T0000). [00:00:04] hmonroy: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:10] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:45] RoanKattouw: sorry can we stop that [00:02:00] I meant to schedule for tomorrow [00:02:26] hnowlan: no problem, I'll remove it from the schedule :) [00:02:46] sorry about that :) [00:02:58] can happen :) [00:03:02] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0101 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:03:28] PROBLEM - snapshot of x1 in eqiad on alert1001 is CRITICAL: snapshot for x1 at eqiad taken more than 3 days ago: Most recent backup 2020-11-27 23:31:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:04:52] removed [00:05:16] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:28] urbanecm: thank you! 
[00:05:36] np [00:08:48] PROBLEM - Number of messages locally queued by purged for processing on cp1083 is CRITICAL: cluster=cache_text instance=cp1083 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083 [00:08:52] PROBLEM - Number of messages locally queued by purged for processing on cp3064 is CRITICAL: cluster=cache_text instance=cp3064 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064 [00:08:56] PROBLEM - Number of messages locally queued by purged for processing on cp1075 is CRITICAL: cluster=cache_text instance=cp1075 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075 [00:09:32] PROBLEM - Number of messages locally queued by purged for processing on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [00:09:44] PROBLEM - Number of messages locally queued by purged for processing on cp1089 is CRITICAL: cluster=cache_text instance=cp1089 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089 [00:09:58] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [00:10:43] (03CR) 10CRusnov: "I'm not sure who to get to review this. Also, to make it pass tox necessitated catching actual exceptions, but I am not sure of the implic" [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:11:50] PROBLEM - Number of messages locally queued by purged for processing on cp2027 is CRITICAL: cluster=cache_text instance=cp2027 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [00:13:08] RECOVERY - Number of messages locally queued by purged for processing on cp1089 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089 [00:13:19] (03PS2) 10CRusnov: Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) [00:13:24] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [00:13:32] RECOVERY - Number of messages locally queued by purged for processing on cp2027 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [00:13:56] RECOVERY - Number of messages locally queued by purged for processing on cp1083 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083 [00:14:00] RECOVERY - Number of messages locally queued by purged for processing on cp3064 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064 [00:14:28] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [00:14:42] RECOVERY - Number of messages locally queued by purged for processing on cp1077 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [00:15:46] RECOVERY - Number of messages locally queued by purged for processing on cp1075 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075 [00:16:14] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [00:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:22] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [00:17:22] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [00:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:36] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [00:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:13] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [00:20:18] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [00:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:36] !log T259588 Beginning wdqs categories data-reload on the following instances (one each from `[public, internal] x [eqiad, codfw]`): `wdqs1004`, `wdqs2001`, `wdqs1003`, `wdqs2004` [00:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:42] T259588: Reload categories once 1.36.0-wmf.3 is running on all groups - https://phabricator.wikimedia.org/T259588 [00:29:38] PROBLEM - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [00:29:42] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:46] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1641139888 and 272 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:33:04] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2058.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [00:33:08] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2058.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2058.codfw.wmnet'] ` [00:33:28] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2058.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [00:34:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 512519248 and 413 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:30] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:46:26] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41759016 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:30] (03CR) 10CRusnov: "I'll also note that I've tested this and it appears to work as expected against en.wikipedia.org / en.m.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:49:00] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2012128 and 144 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:38] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4120 and 183 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:28] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 46866784 and 613 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:38] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1268056 and 624 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:52] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 216864 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:42] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17279032 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:46] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 
(host:localhost) 23515016 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:00:00] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:04] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:10] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1302816 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:00:14] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1513368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:05:14] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:54] PROBLEM - mediawiki-installation DSH group on deploy1002 is CRITICAL: Host deploy1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:14:08] (03PS1) 10Bstorm: wikireplicas: make meta_p work on multiinstance and automatically [puppet] - 10https://gerrit.wikimedia.org/r/644375 [01:15:50] (03CR) 10jerkins-bot: [V: 04-1] wikireplicas: make meta_p work on multiinstance and automatically [puppet] - 10https://gerrit.wikimedia.org/r/644375 (owner: 10Bstorm) [01:16:41] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:38] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:23:05] (03Abandoned) 10CRusnov: Port drac.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [01:34:56] (03PS1) 10Aklapper: Phabricator monthly email: [Hopefully] fix priority median calculation [puppet] - 10https://gerrit.wikimedia.org/r/644383 [01:42:00] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10AntiCompositeNumber) The OSM tile servers are designed to support osm.org only, and do not support all features. The Si... 
[01:46:54] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 66 probes of 574 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:48:05] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [01:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:38] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:02] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [01:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:41] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10Papaul) [01:51:55] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 49 probes of 574 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:55:01] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [01:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:25] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [01:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:35] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2058.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2058.codfw.wmnet'] ` [02:00:37] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) 
rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10Papaul) @fgiunchedi I re-imaged ms-be2059 and ms-be2058, puppet is not happy ` WARNING: Puppet has 1 failures. Last run 42 seconds ago with 1 failures. Failed resources (up to... [02:07:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644387 [02:17:49] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:24:49] (03PS1) 10Papaul: Add db214[234] and logstash203[345] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/644391 (https://phabricator.wikimedia.org/T267041) [02:24:51] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [02:26:39] (03CR) 10Papaul: [C: 03+2] Add db214[234] and logstash203[345] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/644391 (https://phabricator.wikimedia.org/T267041) (owner: 10Papaul) [02:26:57] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:27:07] (03CR) 10jerkins-bot: [V: 04-1] Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644387 (owner: 10TrainBranchBot) [02:31:10] (03CR) 10Ladsgroup: [C: 03+1] thumbor: move thumbor mediawiki role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [02:43:21] (03PS1) 10Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) [02:48:13] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:49:06] (03PS2) 10Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype in spark2 [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) [02:50:32] (03CR) 10Ladsgroup: "So far it looks good: https://puppet-compiler.wmflabs.org/compiler1001/26802/ just another host that's failing completely:" [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [03:03:31] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644387 (https://phabricator.wikimedia.org/T263186) (owner: 10TrainBranchBot) [03:32:55] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [03:43:37] (03PS3) 10Gergő Tisza: Add EventStream config for link recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643230 (https://phabricator.wikimedia.org/T261407) [04:00:59] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:06:07] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:13:23] !log resetting elukey's jenkins API token (T268978) [04:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:36] (03CR) 10C. Scott Ananian: "Parsoid was a little late for the train. We'll need to cherry-pick I55e467133763345203c5f7083c999762e9203206 to the wmf.20 branch of medi" [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644387 (https://phabricator.wikimedia.org/T263186) (owner: 10TrainBranchBot) [04:14:35] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) [05:28:57] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:00:17] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:03:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1017 and es1018 for reboot', diff saved to https://phabricator.wikimedia.org/P13487 and previous config saved to /var/cache/conftool/dbconfig/20201201-060313-marostegui.json [06:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:05:16] (03PS1) 10Marostegui: es1017,es1018: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/644400 (https://phabricator.wikimedia.org/T264154) [06:05:27] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:06:01] (03CR) 10Marostegui: [C: 03+2] es1017,es1018: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/644400 (https://phabricator.wikimedia.org/T264154) (owner: 10Marostegui) [06:08:41] (03PS1) 10Marostegui: es1018: Remove it from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/644401 (https://phabricator.wikimedia.org/T269069) [06:12:03] (03CR) 10Marostegui: [C: 03+2] es1018: Remove it from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/644401 (https://phabricator.wikimedia.org/T269069) (owner: 10Marostegui) [06:13:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es1018 from dbctl T269069', diff saved to https://phabricator.wikimedia.org/P13488 and previous config saved to /var/cache/conftool/dbconfig/20201201-061321-marostegui.json [06:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:30] T269069: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 [06:15:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:19] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10elukey) @razzi can you copy/paste in here what failed for the dns netbox step? There might be some follow ups to do to avoid an inconsistent state.. [06:51:00] (03CR) 10Elukey: [C: 04-1] "Thanks a lot for the code change! 
It is a little bit more complicated than find/replace sadly, for the following reasons:" [puppet] - 10https://gerrit.wikimedia.org/r/644353 (https://phabricator.wikimedia.org/T268028) (owner: 10Razzi) [06:51:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P13489 and previous config saved to /var/cache/conftool/dbconfig/20201201-065125-marostegui.json [06:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:13] (03PS1) 10Marostegui: es1017: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/644408 (https://phabricator.wikimedia.org/T268825) [06:53:47] (03CR) 10Marostegui: [C: 03+2] es1017: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/644408 (https://phabricator.wikimedia.org/T268825) (owner: 10Marostegui) [06:54:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es1017 from dbctl T268825', diff saved to https://phabricator.wikimedia.org/P13490 and previous config saved to /var/cache/conftool/dbconfig/20201201-065419-marostegui.json [06:54:23] (03CR) 10Elukey: "only one nit, the rest looks good!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [06:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:27] T268825: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 [07:00:33] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:23] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:43] !log Deploy labsdb role on all clouddb instances (except clouddb1020*) T268312 [07:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:51] T268312: Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 [07:24:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13491 and previous config saved to /var/cache/conftool/dbconfig/20201201-072451-root.json [07:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:53] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:31:01] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:31:11] !log Deploy "_p" databases to all clouddb hosts (except clouddb1020*) T268312 [07:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:20] T268312: Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 [07:35:36] the link between cr1 eqiad and codfw is down, probably maintenance [07:37:03] mmmm I don't see maintenance for the Telia link [07:37:59] the link is not down but BFD detected a problem [07:39:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13492 and previous config saved to /var/cache/conftool/dbconfig/20201201-073955-root.json [07:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13493 and previous config saved to 
/var/cache/conftool/dbconfig/20201201-075458-root.json [07:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:03] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:53] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) > So, please let us know if you're OK with reducing to 60 or you'd rather keep the 90. OK! > Would you guys be wil... [08:04:13] (03PS1) 10Marostegui: valid_section: Add x2 [puppet] - 10https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) [08:05:09] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:20] !log Create database mwaddlink on m2 - T267214 [08:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:27] T267214: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 [08:05:59] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:09] (03PS3) 10Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype in spark2 [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) [08:07:53] (03CR) 10Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype in spark2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:10:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13494 and previous config saved to /var/cache/conftool/dbconfig/20201201-081002-root.json [08:10:03] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269075 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [08:10:07] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269075 (10ops-monitoring-bot) [08:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:15] !log upgrading spicerack to 0.0.45 on cumin2001 [08:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:26] (03CR) 10Elukey: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:11:04] (03CR) 10Elukey: [C: 03+2] hadoop: Migrate hiera() to lookup() and setting datatype in spark2 [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:11:07] !log volans@cumin2001 START - Cookbook sre.dns.netbox 
[08:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:19] (03CR) 10Ladsgroup: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:13:40] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:13:57] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:18:30] !log volans@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:37] (03PS2) 10Muehlenhoff: Add Andrew Otto as approval contact for Hadoop and analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/643875 [08:20:55] RECOVERY - snapshot of x1 in eqiad on alert1001 is OK: Last snapshot for x1 at eqiad (db1102.eqiad.wmnet:3320) taken on 2020-12-01 07:55:30 (207 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:21:39] (03CR) 10Muehlenhoff: [C: 03+1] "Agreed, this seems obsolete." [puppet] - 10https://gerrit.wikimedia.org/r/644364 (owner: 10Dzahn) [08:21:52] (03PS1) 10Marostegui: production-m2: Add grants for mwaddlink new database [puppet] - 10https://gerrit.wikimedia.org/r/644456 (https://phabricator.wikimedia.org/T267214) [08:26:14] (03CR) 10Muehlenhoff: [C: 04-1] "Better depend on default-mysql-client, this will do the right thing also on Stretch, i.e. 
you don't even need the OS conditional" [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn) [08:27:08] (PS2) Marostegui: production-m2: Add grants for mwaddlink new database [puppet] - https://gerrit.wikimedia.org/r/644456 (https://phabricator.wikimedia.org/T267214) [08:28:26] Operations, vm-requests, Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (Volans) @razzi in general on FAIL always better to investigate what happens. In this case it left some changes in Netbox not propagated to the DNS (see https://icinga.... [08:31:36] (PS3) Marostegui: production-m2: Add grants for mwaddlink new database [puppet] - https://gerrit.wikimedia.org/r/644456 (https://phabricator.wikimedia.org/T267214) [08:34:43] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [08:39:15] PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% [08:43:33] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:44:03] PROBLEM - Host ms-be2059 is DOWN: PING CRITICAL - Packet loss = 100% [08:44:52] (PS1) Elukey: oozie: improve log retention and cleanup [puppet] - https://gerrit.wikimedia.org/r/644457 [08:45:37] RECOVERY - Host ms-be2059 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [08:46:30] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (fgiunchedi) >>! In T265419#6657950, @Papaul wrote: > @fgiunchedi I re-imaged ms-be2059 and ms-be2058, puppet is not happy > ` > WARNING: Puppet has 1 failures. Last run 42 seco...
[08:49:27] (PS1) Volans: Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) [08:49:50] (PS2) Volans: Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) [08:50:48] (CR) Volans: "Hi all, I've added you to the review because at least one of your cookbooks is affected by this change." [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans) [08:51:19] (PS1) Marostegui: mariadb: Decommission es1018 [puppet] - https://gerrit.wikimedia.org/r/644459 (https://phabricator.wikimedia.org/T269069) [08:52:06] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26803/console" [puppet] - https://gerrit.wikimedia.org/r/644457 (owner: Elukey) [08:52:55] (CR) Elukey: [C: +1] Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans) [08:53:50] (CR) Elukey: [V: +1 C: +2] oozie: improve log retention and cleanup [puppet] - https://gerrit.wikimedia.org/r/644457 (owner: Elukey) [08:58:17] (PS2) Alexandros Kosiaris: bacula: Move logs to /var/log/bacula [puppet] - https://gerrit.wikimedia.org/r/546972 (https://phabricator.wikimedia.org/T236406) [08:58:21] (CR) ArielGlenn: "Added Brooke as reviewer for whenever this is ready, since the affected servers are WMCS ones" [puppet] - https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220) (owner: Elukey) [08:58:50] (CR) Alexandros Kosiaris: "Resurrecting an old change.
Let me know if you think it's useful, otherwise feel free to abandon it" [puppet] - https://gerrit.wikimedia.org/r/546972 (https://phabricator.wikimedia.org/T236406) (owner: Alexandros Kosiaris) [08:59:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P13495 and previous config saved to /var/cache/conftool/dbconfig/20201201-085916-marostegui.json [08:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:01] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:04] (CR) Kormat: "It would be good to add this to `profile::mariadb::section_ports` in `hieradata/common/profile/mariadb.yaml`" [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui) [09:00:49] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:43] (CR) Marostegui: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui) [09:03:02] (CR) Jcrespo: [C: +2] bacula: Move logs to /var/log/bacula [puppet] - https://gerrit.wikimedia.org/r/546972 (https://phabricator.wikimedia.org/T236406) (owner: Alexandros Kosiaris) [09:04:57] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:40] !log removing obsolete resources on idp* and idp-test* hosts after going active-active [09:05:45] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:20] (CR) Ayounsi: [C: +1] "I TRUST YOU!" [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans) [09:10:36] (PS1) JMeybohm: Add calico chart [deployment-charts] - https://gerrit.wikimedia.org/r/644462 [09:10:41] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:02] (PS3) Noa wmde: Add log channel Wikibase.IdGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: Lucas Werkmeister (WMDE)) [09:16:29] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:43] (CR) Kormat: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui) [09:17:47] (CR) Volans: [C: +2] Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans) [09:18:55] (PS2) Marostegui: valid_section: Add x2 [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) [09:19:16] (Merged) jenkins-bot: Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans) [09:19:48] (CR) Hashar: [C: +2] Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644387 (https://phabricator.wikimedia.org/T263186) (owner: TrainBranchBot) [09:20:24] (CR) Kormat: [C: +1] "LGTM, WCGW."
[puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui) [09:20:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13496 and previous config saved to /var/cache/conftool/dbconfig/20201201-092030-root.json [09:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:44] (CR) Marostegui: [C: +2] valid_section: Add x2 [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui) [09:21:31] !log volans@cumin2001 START - Cookbook sre.hosts.decommission [09:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:45] (CR) Hashar: "I guess it is a matter of taste? It seems easier to me to handle those settings on a per host basis :]" [puppet] - https://gerrit.wikimedia.org/r/643918 (owner: Hashar) [09:22:35] (Abandoned) Hashar: Gerrit: Setup avatars url in gerrit config [puppet] - https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) (owner: Paladox) [09:24:50] (PS2) ArielGlenn: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 [09:25:14] (CR) jerkins-bot: [V: -1] Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn) [09:25:35] (CR) Hashar: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/643944 (owner: Hashar) [09:26:03] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/643944 (owner: Hashar) [09:26:12] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/644459 (https://phabricator.wikimedia.org/T269069) (owner: Marostegui) [09:27:21] (PS3) ArielGlenn: Add sample code illustrating use of the
commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 [09:27:25] (CR) Marostegui: [C: +2] mariadb: Decommission es1018 [puppet] - https://gerrit.wikimedia.org/r/644459 (https://phabricator.wikimedia.org/T269069) (owner: Marostegui) [09:27:47] (CR) jerkins-bot: [V: -1] Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn) [09:28:55] (PS1) Ema: vcl: remove legacy temporary parameter workaround [puppet] - https://gerrit.wikimedia.org/r/644465 [09:28:57] (PS1) Ema: vcl: move /static Host normalization to cluster_fe_recv_pre_purge [puppet] - https://gerrit.wikimedia.org/r/644466 (https://phabricator.wikimedia.org/T130904) [09:29:36] (PS1) Elukey: sre.hosts.decom: fix kerberos keytabs path [cookbooks] - https://gerrit.wikimedia.org/r/644467 [09:31:20] (CR) Ayounsi: turnilo: add export mappings for network devices via query_resources (3 comments) [puppet] - https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: Jbond) [09:32:37] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [09:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:56] (PS4) ArielGlenn: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 [09:32:58] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/644467 (owner: Elukey) [09:33:19] (CR) jerkins-bot: [V: -1] Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn) [09:34:08] (CR) Elukey: [C: +2] sre.hosts.decom: fix kerberos keytabs path [cookbooks] - https://gerrit.wikimedia.org/r/644467 (owner: Elukey) [09:35:12] (CR) Muehlenhoff: [C: +2] Add Andrew Otto as approval contact for
Hadoop and analytics groups [puppet] - https://gerrit.wikimedia.org/r/643875 (owner: Muehlenhoff) [09:35:29] !log upgrading spicerack to 0.0.45 on cumin1001 [09:35:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13497 and previous config saved to /var/cache/conftool/dbconfig/20201201-093534-root.json [09:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:16] (PS5) Alexandros Kosiaris: k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - https://gerrit.wikimedia.org/r/644234 [09:36:18] (PS5) Alexandros Kosiaris: k8s::master: Remove redundant has_lvs hiera check [puppet] - https://gerrit.wikimedia.org/r/644237 [09:36:20] (PS5) Alexandros Kosiaris: k8s: Allow using cergen [puppet] - https://gerrit.wikimedia.org/r/644238 [09:36:22] (PS6) Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 [09:36:24] (PS6) Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - https://gerrit.wikimedia.org/r/644262 [09:36:26] (PS1) Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - https://gerrit.wikimedia.org/r/644469 [09:37:22] (CR) jerkins-bot: [V: -1] k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 (owner: Alexandros Kosiaris) [09:40:50] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui) a:Marostegui→wiki_willy [09:41:25] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui) Ready for DC Ops!
[09:43:44] (PS5) ArielGlenn: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 [09:44:33] (CR) Hashar: "https://puppet-compiler.wmflabs.org/compiler1002/642/" [puppet] - https://gerrit.wikimedia.org/r/643944 (owner: Hashar) [09:46:45] (CR) ArielGlenn: [C: +2] Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn) [09:47:12] (Merged) jenkins-bot: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn) [09:49:34] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [09:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13498 and previous config saved to /var/cache/conftool/dbconfig/20201201-095037-root.json [09:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:06] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644387 (https://phabricator.wikimedia.org/T263186) (owner: TrainBranchBot) [09:52:19] (CR) Jbond: "updated" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: Jbond) [09:55:40] (PS2) Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - https://gerrit.wikimedia.org/r/644469 [09:57:42] (CR) jerkins-bot: [V: -1] Add {calico, kubernetes}-future components to buster [puppet] - https://gerrit.wikimedia.org/r/644469 (owner: Alexandros Kosiaris) [09:59:51] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [09:59:56] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:49] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [10:02:09] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [10:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:13] (PS2) Aklapper: Phabricator monthly email: [Hopefully] fix priority median calculation [puppet] - https://gerrit.wikimedia.org/r/644383 (https://phabricator.wikimedia.org/T269076) [10:05:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13499 and previous config saved to /var/cache/conftool/dbconfig/20201201-100541-root.json [10:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [10:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:52] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [10:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:15] (PS1) Volans: sre.hosts.decommission: fix kerberos check [cookbooks] - https://gerrit.wikimedia.org/r/644471 [10:12:47] (CR) Elukey: [C: +1] sre.hosts.decommission: fix kerberos check [cookbooks] - https://gerrit.wikimedia.org/r/644471 (owner: Volans) [10:13:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P13500 and previous config saved to /var/cache/conftool/dbconfig/20201201-101346-marostegui.json [10:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:19] (PS2) Ema: vcl: move /static Host normalization to cluster_fe_recv_pre_purge [puppet] - https://gerrit.wikimedia.org/r/644466
(https://phabricator.wikimedia.org/T130904) [10:14:23] (CR) Volans: [C: +2] sre.hosts.decommission: fix kerberos check [cookbooks] - https://gerrit.wikimedia.org/r/644471 (owner: Volans) [10:15:02] (CR) Filippo Giunchedi: [C: +2] alertmanager: fix cluster config out of sync alert [puppet] - https://gerrit.wikimedia.org/r/644184 (https://phabricator.wikimedia.org/T266017) (owner: Filippo Giunchedi) [10:15:36] (Merged) jenkins-bot: sre.hosts.decommission: fix kerberos check [cookbooks] - https://gerrit.wikimedia.org/r/644471 (owner: Volans) [10:16:57] (CR) Jbond: Port elasticsearch/es-tool.py to Python3 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [10:17:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1078 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13501 and previous config saved to /var/cache/conftool/dbconfig/20201201-101703-root.json [10:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:22] (PS1) Volans: sre.hosts.reboot-cluster: remove executable bit [cookbooks] - https://gerrit.wikimedia.org/r/644473 [10:21:46] (CR) Volans: [C: +2] sre.hosts.reboot-cluster: remove executable bit [cookbooks] - https://gerrit.wikimedia.org/r/644473 (owner: Volans) [10:22:56] (Merged) jenkins-bot: sre.hosts.reboot-cluster: remove executable bit [cookbooks] - https://gerrit.wikimedia.org/r/644473 (owner: Volans) [10:23:36] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [10:24:37] (PS1) JMeybohm: Run a helm repo update before linting [deployment-charts] - https://gerrit.wikimedia.org/r/644474 [10:24:54] Operations, Performance-Team, serviceops, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 -
https://phabricator.wikimedia.org/T267327 (jijiki) [10:25:41] (PS2) JMeybohm: Run a helm repo update before linting [deployment-charts] - https://gerrit.wikimedia.org/r/644474 [10:26:28] (Abandoned) Ema: [untested] Rewrite /static/ also for PURGE requests [puppet] - https://gerrit.wikimedia.org/r/279564 (https://phabricator.wikimedia.org/T130904) (owner: GWicke) [10:27:13] (CR) JMeybohm: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm) [10:30:28] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [10:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:36] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:31:16] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [10:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1078 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13503 and previous config saved to /var/cache/conftool/dbconfig/20201201-103207-root.json [10:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:38] Operations, observability, CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (jbond) I have not been able to recreate this, is this still causing an issue?
[10:35:43] (PS1) ArielGlenn: add the ability to skip a job via configuration [dumps] - https://gerrit.wikimedia.org/r/644476 [10:35:44] (CR) Filippo Giunchedi: "I like the general idea, I'm wondering how to make it more explicit that each new cluster will also require setting up a new prometheus in" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644262 (owner: Alexandros Kosiaris) [10:35:48] (CR) jerkins-bot: [V: -1] add the ability to skip a job via configuration [dumps] - https://gerrit.wikimedia.org/r/644476 (owner: ArielGlenn) [10:36:58] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [10:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:42] Operations, observability, CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (Kormat) I'm still getting failures, but it's not clear where the issue is. Firefox: {F33929783} Chrome: {F33929785} [10:38:18] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [10:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:04] (PS1) Volans: sre.hosts.reboot_group: move functionalities [cookbooks] - https://gerrit.wikimedia.org/r/644477 [10:44:35] (PS2) ArielGlenn: add the ability to skip a job via configuration [dumps] - https://gerrit.wikimedia.org/r/644476 [10:45:37] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [10:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:57] (PS6) Kormat: test: Start implementation of integration-env [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266) [10:46:27] (PS1) Elukey: install_server: remove lefovers of analytics105[1-7] [puppet] - https://gerrit.wikimedia.org/r/644478 (https://phabricator.wikimedia.org/T267932) [10:47:11] !log
marostegui@cumin1001 dbctl commit (dc=all): 'db1078 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13505 and previous config saved to /var/cache/conftool/dbconfig/20201201-104710-root.json [10:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:29] (CR) Elukey: [C: +2] install_server: remove lefovers of analytics105[1-7] [puppet] - https://gerrit.wikimedia.org/r/644478 (https://phabricator.wikimedia.org/T267932) (owner: Elukey) [11:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1078 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13506 and previous config saved to /var/cache/conftool/dbconfig/20201201-110214-root.json [11:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:27] (PS1) MSantos: WIP: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [11:13:56] (CR) jerkins-bot: [V: -1] WIP: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos) [11:15:00] PROBLEM - tileratorui on maps1006 is CRITICAL: connect to address 10.64.0.18 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:15:04] PROBLEM - tileratorui on maps1007 is CRITICAL: connect to address 10.64.16.6 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:15:08] PROBLEM - tilerator on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:15:16] expiring downtimes ^ [11:15:20] PROBLEM - tileratorui on maps1008 is CRITICAL: connect to address 10.64.16.27 and port 6535: Connection refused
https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:15:28] PROBLEM - cassandra CQL 10.64.48.6:9042 on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [11:15:36] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:52] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:58] PROBLEM - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:16:08] PROBLEM - cassandra CQL 10.64.32.8:9042 on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [11:16:10] PROBLEM - tileratorui on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:16:38] PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:16:42] PROBLEM - cassandra service on maps1010 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:16:44] PROBLEM - tileratorui on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:19:12] (CR) Muehlenhoff: [C: +1] "Thanks, looks good" [cookbooks] - https://gerrit.wikimedia.org/r/644477 (owner: Volans) [11:19:58] RECOVERY - tileratorui on maps1006
is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:20:02] RECOVERY - tileratorui on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:20:53] ACKNOWLEDGEMENT - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:53] ACKNOWLEDGEMENT - cassandra CQL 10.64.32.8:9042 on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 9042: Connection refused Hnowlan new hosts, not pooled. https://phabricator.wikimedia.org/T93886 [11:20:53] ACKNOWLEDGEMENT - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:20:53] ACKNOWLEDGEMENT - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:20:53] ACKNOWLEDGEMENT - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:54] ACKNOWLEDGEMENT - cassandra CQL 10.64.48.6:9042 on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 9042: Connection refused Hnowlan new hosts, not pooled. https://phabricator.wikimedia.org/T93886 [11:20:54] ACKNOWLEDGEMENT - cassandra service on maps1010 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running Hnowlan new hosts, not pooled. 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:20:55] ACKNOWLEDGEMENT - tilerator on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6534: Connection refused Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:20:55] ACKNOWLEDGEMENT - tileratorui on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6535: Connection refused Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:26:14] (PS3) ArielGlenn: add the ability to skip a job via configuration [dumps] - https://gerrit.wikimedia.org/r/644476 [11:33:12] RECOVERY - cassandra service on maps1009 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:38:10] PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:42:51] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: dmz_cidr: detail the list of private production addresses [puppet] - https://gerrit.wikimedia.org/r/641977 (owner: Arturo Borrero Gonzalez) [11:46:56] Operations, observability, CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (jbond) Do you get this error on all expressions, a specific expression or spasmodically? have also tagged observability in case there is something other then CORS in...
[11:48:53] !log Install bsd-mailx on the new clouddb hosts (needed for the check private data) T267090 T268725 [11:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:01] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [11:49:01] T268725: Include mail on standard_packages.pp - https://phabricator.wikimedia.org/T268725 [11:49:19] Operations, observability, CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (jbond) in fact observability is already tagged, @fgiunchedi wodner if this could be a more general issue? [11:56:22] stashbot should be reconnecting soon [11:57:00] (CR) Volans: [C: +2] sre.hosts.reboot_group: move functionalities [cookbooks] - https://gerrit.wikimedia.org/r/644477 (owner: Volans) [11:58:40] (Merged) jenkins-bot: sre.hosts.reboot_group: move functionalities [cookbooks] - https://gerrit.wikimedia.org/r/644477 (owner: Volans) [12:00:07] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1200). [12:00:07] No GERRIT patches in the queue for this window AFAICS.
[12:00:19] (PS1) Muehlenhoff: Extend access for aarora [puppet] - https://gerrit.wikimedia.org/r/644507 [12:00:26] looks like nothing to deploy indeed [12:01:16] (CR) Muehlenhoff: [C: +2] Extend access for aarora [puppet] - https://gerrit.wikimedia.org/r/644507 (owner: Muehlenhoff) [12:03:00] Operations, Data-Persistence-Backup, SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (jcrespo) [12:03:08] RECOVERY - cassandra service on maps1009 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:04:04] (PS4) Volans: sre.hosts.downtime: convert to class API [cookbooks] - https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) [12:04:31] Operations, Data-Persistence-Backup, SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (jcrespo) ^@fgiunchedi this is the task I told you about (pinging on comment because sometimes notifications cannot be seen on creation).
[12:05:29] !log [11:53 moritzm] uploaded lxml 3.4.0-1+deb8u1+wmf1 to apt.wikimedia.org/jessie-wikimedia [12:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:06] PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:12:31] (03PS6) 10Alexandros Kosiaris: k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - 10https://gerrit.wikimedia.org/r/644234 [12:12:33] (03PS6) 10Alexandros Kosiaris: k8s::master: Remove redundant has_lvs hiera check [puppet] - 10https://gerrit.wikimedia.org/r/644237 [12:12:35] (03PS6) 10Alexandros Kosiaris: k8s: Allow using cergen [puppet] - 10https://gerrit.wikimedia.org/r/644238 [12:12:37] (03PS7) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 [12:12:39] (03PS7) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [12:12:41] (03PS3) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 [12:13:21] (03CR) 10jerkins-bot: [V: 04-1] k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 (owner: 10Alexandros Kosiaris) [12:15:02] (03CR) 10jerkins-bot: [V: 04-1] Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris) [12:24:17] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 672400568 and 123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:24:17] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 291221480 and 123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[12:24:27] (03PS4) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 [12:24:29] (03PS1) 10Alexandros Kosiaris: package_from_component: Move to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/644509 [12:24:51] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 200908968 and 156 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:24:51] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 999694520 and 156 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:21] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2056472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:21] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1803936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:21] (03CR) 10jerkins-bot: [V: 04-1] Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris) [12:26:57] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 59602192 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:27:57] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1921160 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:28:59] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1934488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:30:33] PROBLEM - Postgres Replication 
Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 224382384 and 180 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:09] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 33456544 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:19] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29449272 and 225 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:27] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 76425320 and 234 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:29] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 896010928 and 236 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:02] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [12:33:02] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:33:03] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26807/console" [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris) [12:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:15] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [12:33:16] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:33:20] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [12:33:20] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:33:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:29] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20784352 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:31] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20433400 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:49] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35508808 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:49] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:57] RECOVERY - cassandra service on maps1009 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:34:39] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 65480 and 55 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:43] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 49200 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:45] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1917224 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:35:45] (03PS3) 10Hnowlan: postgres: increase number of WAL files retained by master 
[puppet] - 10https://gerrit.wikimedia.org/r/643717 [12:38:21] ACKNOWLEDGEMENT - MD RAID on ms-be1022 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269125 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:38:26] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269125 (10ops-monitoring-bot) [12:38:28] !log uploaded libonig 5.9.5-3.2+deb8u4+wmf1 to apt.wikimedia.org/jessie-wikimedia [12:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:41] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 552544 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:05] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 574096 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:22] (03CR) 10Hnowlan: postgres: increase number of WAL files retained by master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643717 (owner: 10Hnowlan) [12:41:27] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5000 and 51 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:33] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 106896 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:46:32] (03PS8) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 [12:46:34] (03PS8) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [12:46:36] (03PS2) 10Alexandros Kosiaris: 
package_from_component: Move to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/644509 [12:46:38] (03PS5) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 [12:55:23] 10Operations, 10observability, 10CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10Kormat) >>! In T268233#6659067, @jbond wrote: > Do you get this error on all expressions, a specific expression or spasmodically? have also tagged observability in ca... [12:56:00] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Ladsgroup) My 2c. From what I learned from docker books and such, containers and k8s are not recommended for two types of applications: 1- statef... [12:56:11] jouncebot: now [12:56:11] For the next 0 hour(s) and 3 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1200) [12:57:56] !log Preparing deployment of 1.36.0-wmf.20 # T263186 [12:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:03] T263186: 1.36.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T263186 [12:58:26] (03PS1) 10Marostegui: tables_to_check: Add pagelinks,templatelinks and categorylinks [software] - 10https://gerrit.wikimedia.org/r/644515 [12:59:38] (03PS2) 10Marostegui: tables_to_check: Add pagelinks,templatelinks and categorylinks [software] - 10https://gerrit.wikimedia.org/r/644515 [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1300) [13:01:36] (03CR) 10Jgiannelos: "Hey @MSantos, i did a first pass and added some (mostly) nit comments." 
(038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [13:03:13] (03CR) 10Jgiannelos: [C: 04-1] WIP: start using imposm as OSM sync tool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [13:08:56] (03CR) 10Jcrespo: [C: 03+1] "I am afraid they will be huge on some wikis." [software] - 10https://gerrit.wikimedia.org/r/644515 (owner: 10Marostegui) [13:08:58] (03PS1) 10Hashar: testwikis wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644516 [13:09:00] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644516 (owner: 10Hashar) [13:09:26] (03CR) 10Marostegui: [C: 03+2] tables_to_check: Add pagelinks,templatelinks and categorylinks [software] - 10https://gerrit.wikimedia.org/r/644515 (owner: 10Marostegui) [13:09:37] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644516 (owner: 10Hashar) [13:10:08] !log hashar@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.20 [13:10:11] (03Merged) 10jenkins-bot: tables_to_check: Add pagelinks,templatelinks and categorylinks [software] - 10https://gerrit.wikimedia.org/r/644515 (owner: 10Marostegui) [13:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:31] (03PS1) 10Marostegui: Revert "tables_to_check: Add pagelinks,templatelinks and categorylinks" [software] - 10https://gerrit.wikimedia.org/r/644489 [13:13:15] (03CR) 10Marostegui: [C: 03+2] Revert "tables_to_check: Add pagelinks,templatelinks and categorylinks" [software] - 10https://gerrit.wikimedia.org/r/644489 (owner: 10Marostegui) [13:13:49] (03Merged) 10jenkins-bot: Revert "tables_to_check: Add pagelinks,templatelinks and categorylinks" [software] - 10https://gerrit.wikimedia.org/r/644489 (owner: 10Marostegui) [13:19:33] PROBLEM - 
Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:19:51] mmmmm [13:20:54] * elukey runs sudo cookbook -d sre.dns.netbox "test" [13:26:33] (03PS1) 10Filippo Giunchedi: alertmanager: use o11y address as from [puppet] - 10https://gerrit.wikimedia.org/r/644517 (https://phabricator.wikimedia.org/T268995) [13:27:14] so it is a lot of cloudvirt instances [13:28:44] ah ok Chris is working on them (TIL Netbox changelog) [13:30:28] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Gilles) Thumbor is neither stateful nor high I/O. [13:33:24] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Patch-For-Review: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) [13:33:35] 10Operations, 10Maps, 10Wikimedia-Logstash, 10observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10fgiunchedi) 05Open→03Resolved I can indeed confirm all gelf traffic from maps has stopped, thank you @MSantos and @hnowla... [13:37:20] 10Operations, 10ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T268773 (10fgiunchedi) 05Open→03Resolved Thanks @papaul, disk is back now. 
re: spares, we should get some from decom of ms-be hosts (if that's a thing) [13:39:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 to clone clouddb hosts', diff saved to https://phabricator.wikimedia.org/P13507 and previous config saved to /var/cache/conftool/dbconfig/20201201-133917-marostegui.json [13:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:22] (03CR) 10Subramanya Sastry: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) (owner: 10C. Scott Ananian) [13:40:57] (03PS1) 10Marostegui: db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/644519 (https://phabricator.wikimedia.org/T267090) [13:41:38] (03PS2) 10Bearloga: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan) [13:41:42] (03PS3) 10Bearloga: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan) [13:42:54] (03CR) 10Bearloga: [C: 03+1] "Updated stream name per I809afc34c878ed6bdcbf0d3f5cc6a4c9990ef845" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan) [13:43:49] (03CR) 10Marostegui: [C: 03+2] db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/644519 (https://phabricator.wikimedia.org/T267090) (owner: 10Marostegui) [13:44:41] hey hashar we got late getting vendor patch merged before the branch was cut. i just +2ed the cherry-pick. ( https://gerrit.wikimedia.org/r/644221 ) can that be synced as well once that merges? 
[13:47:11] https://phabricator.wikimedia.org/T268804 https://netbox.wikimedia.org/dcim/cables/3395/ Cleaning fiber in C4 & C7 in eqiad [13:48:51] but i am going to -2 it to block the merge in case you aren't ready to scap it. [13:48:55] subbu: oh parsoid is shipped as a composer dependency [13:49:01] (03CR) 10Subramanya Sastry: [C: 04-2] Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) (owner: 10C. Scott Ananian) [13:49:05] I thought it was deployed as an extension .. bah ;) [13:49:06] 10Operations, 10Discovery-Search: Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453 (10mpopov) 05Open→03Invalid [13:49:33] subbu: I am running the deploy promote step right now, so in half an hour or so mediawiki will have been deployed [13:49:34] hashar, yes. we changed that in feb. :) [13:49:57] I guess once the vendor change has been merged, it is all a matter of deploying it the usual way [13:50:11] I can handle it ;) [13:50:14] ok, should i +2 that patch ? ok. [13:50:23] (03CR) 10Subramanya Sastry: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) (owner: 10C. Scott Ananian) [13:50:50] that ^ is the cherry-pick to wmf.20 from master. 
[13:53:03] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [13:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:15] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005703 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:55:52] subbu: great thx sync-apaches: 30% (ok: 106; fail: 0; left: 239) [13:55:59] still some more apaches to complete [13:56:20] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269075 (10fgiunchedi) [13:56:21] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) [13:56:36] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268997 (10fgiunchedi) [13:56:38] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) [13:57:28] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269125 (10fgiunchedi) [13:57:32] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi) [14:00:04] hashar and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1400). 
[14:00:28] !log asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 0 member 2 port 53 - T268808 [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:35] T268808: Replace asw2-d-eqiad VC cable - https://phabricator.wikimedia.org/T268808 [14:04:00] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) (owner: 10C. Scott Ananian) [14:05:07] hashar, merged ^ .. do verify that the submodule in core is updated to reflect that before syncing .. i believe it should happen automatically, but just in case. [14:05:21] !log asw2-d-eqiad> request virtual-chassis vc-port set pic-slot 0 member 2 port 53 - T268808 [14:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:49] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10Reedy) 05Open→03Stalled Marking stalled as it's unclear what needs fixing (if anything) [14:08:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:52] subbu: yeah I will once the current deployment has completed [14:10:05] k [14:12:12] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10MoritzMuehlenhoff) 05Stalled→03Open Well, at minimum the shebang needs to be switched to #!/usr/bin/python3. If that's all that's needed, even better. [14:14:21] RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms [14:14:23] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:53] 10Operations, 10ops-eqiad: Replace asw2-c-eqiad VC cable - https://phabricator.wikimedia.org/T268804 (10Jclark-ctr) Cleaned both fiber ends will leave ticket open for now while monitoring. [14:15:20] 10Operations, 10ops-eqiad: Replace asw2-d-eqiad VC cable - https://phabricator.wikimedia.org/T268808 (10Jclark-ctr) replaced qsfp+ on D2, Cleaned both fiber ends will leave ticket open while monitoring [14:15:53] 10Operations, 10ops-eqiad: Replace asw2-d-eqiad VC cable - https://phabricator.wikimedia.org/T268808 (10Jclark-ctr) updated netbox with cable info. [14:15:57] 10Operations, 10ops-eqiad: Replace asw2-c-eqiad VC cable - https://phabricator.wikimedia.org/T268804 (10Jclark-ctr) updated netbox with cable info. [14:16:32] (03PS4) 10ArielGlenn: add the ability to skip a job via configuration [dumps] - 10https://gerrit.wikimedia.org/r/644476 [14:16:34] (03PS2) 10ArielGlenn: add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930 [14:17:40] (03CR) 10JMeybohm: [C: 03+1] k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - 10https://gerrit.wikimedia.org/r/644234 (owner: 10Alexandros Kosiaris) [14:20:51] PROBLEM - Ensure local MW versions match expected deployment on wtp1035 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:20:51] PROBLEM - Ensure local MW versions match expected deployment on mw2247 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:20:57] PROBLEM - Ensure local MW versions match expected deployment on mw1327 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:20:59] PROBLEM - Ensure local MW versions match expected deployment on mw2331 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:20:59] 
PROBLEM - Ensure local MW versions match expected deployment on mw2325 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:01] PROBLEM - Ensure local MW versions match expected deployment on mw2219 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:07] PROBLEM - Ensure local MW versions match expected deployment on mw2266 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:07] PROBLEM - Ensure local MW versions match expected deployment on mw2283 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:11] PROBLEM - Ensure local MW versions match expected deployment on mw1354 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:19] ^^^ no idea [14:21:19] PROBLEM - Ensure local MW versions match expected deployment on mw1322 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:21] PROBLEM - Ensure local MW versions match expected deployment on mw1290 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:21] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10Papaul) @fgiunchedi thanks . ms-be2058 has memory error the same DIMM we were having problem with on msbe2057 was used in ms-be2058 so I will go ahead and ask for replacement .... 
[14:21:25] PROBLEM - Ensure local MW versions match expected deployment on mw1338 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:25] PROBLEM - Ensure local MW versions match expected deployment on mw2273 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:31] PROBLEM - Ensure local MW versions match expected deployment on mw2311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:31] PROBLEM - Ensure local MW versions match expected deployment on parse2013 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:31] PROBLEM - Ensure local MW versions match expected deployment on mw1362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:34] but scap deploy promote is being run so that is surely related [14:21:35] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:21:35] PROBLEM - Ensure local MW versions match expected deployment on mw1296 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:47] PROBLEM - Ensure local MW versions match expected deployment on mw2263 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:51] PROBLEM - Ensure local MW versions match expected deployment on mw2216 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:21:59] PROBLEM - Ensure local MW versions match expected deployment on mw1392 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:01] PROBLEM - Ensure local MW 
versions match expected deployment on mw1300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:05] PROBLEM - Ensure local MW versions match expected deployment on wtp1036 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:05] PROBLEM - Ensure local MW versions match expected deployment on deploy1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:07] PROBLEM - Ensure local MW versions match expected deployment on mw2330 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:07] PROBLEM - Ensure local MW versions match expected deployment on mw2274 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:11] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:13] PROBLEM - Ensure local MW versions match expected deployment on mw1383 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:13] PROBLEM - Ensure local MW versions match expected deployment on mw1349 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:17] PROBLEM - Ensure local MW versions match expected deployment on mw2292 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:17] PROBLEM - Ensure local MW versions match expected deployment on mw2372 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:17] PROBLEM - Ensure local MW versions match expected deployment on mwdebug2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [14:22:33] PROBLEM - Ensure local MW versions match expected deployment on mw2258 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:43] PROBLEM - Ensure local MW versions match expected deployment on snapshot1008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:45] PROBLEM - Ensure local MW versions match expected deployment on mw2221 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:47] PROBLEM - Ensure local MW versions match expected deployment on mw1385 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:22:47] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... 
[14:23:03] PROBLEM - Ensure local MW versions match expected deployment on mw1328 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:03] PROBLEM - Ensure local MW versions match expected deployment on mw2296 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:17] PROBLEM - Ensure local MW versions match expected deployment on mw1351 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:25] PROBLEM - Ensure local MW versions match expected deployment on wtp1027 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:29] PROBLEM - Ensure local MW versions match expected deployment on mw2310 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:29] PROBLEM - Ensure local MW versions match expected deployment on mw2324 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:33] PROBLEM - Ensure local MW versions match expected deployment on mw1275 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:41] PROBLEM - Ensure local MW versions match expected deployment on wtp1040 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:45] PROBLEM - Ensure local MW versions match expected deployment on mw1368 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:45] PROBLEM - Ensure local MW versions match expected deployment on snapshot1005 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:45] PROBLEM - Ensure local MW versions match expected deployment on parse2007 is CRITICAL: CRITICAL: 3 mismatched 
wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:47] PROBLEM - Ensure local MW versions match expected deployment on mw2271 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:23:53] PROBLEM - Ensure local MW versions match expected deployment on mw2329 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:24:01] PROBLEM - Ensure local MW versions match expected deployment on mw2317 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:24:05] PROBLEM - Ensure local MW versions match expected deployment on wtp1041 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:24:16] (CR) JMeybohm: [C: +1] k8s::master: Remove redundant has_lvs hiera check [puppet] - https://gerrit.wikimedia.org/r/644237 (owner: Alexandros Kosiaris) [14:24:23] PROBLEM - Ensure local MW versions match expected deployment on mw1363 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:24:34] !log hashar@deploy1001 sync-world aborted: testwikis wikis to 1.36.0-wmf.20 (duration: 74m 55s) [14:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:15] grrr wrong window [14:25:48] fwiw the wikis with mismatched versions are testwiki, labtestwiki and testwikidatawiki but I assume that's expected [14:25:51] RECOVERY - Ensure local MW versions match expected deployment on wtp1035 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:25:51] RECOVERY - Ensure local MW versions match expected deployment on mw2247 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:00] (CR) JMeybohm: [C: +1] k8s: Allow using cergen [puppet] - https://gerrit.wikimedia.org/r/644238 (owner: 
Alexandros Kosiaris) [14:26:01] RECOVERY - Ensure local MW versions match expected deployment on mw2331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:01] RECOVERY - Ensure local MW versions match expected deployment on mw2325 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:03] RECOVERY - Ensure local MW versions match expected deployment on mw2219 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:03] !log hashar@deploy1001 Started scap: (no justification provided) [14:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:09] RECOVERY - Ensure local MW versions match expected deployment on mw2266 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:09] RECOVERY - Ensure local MW versions match expected deployment on mw2283 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:13] RECOVERY - Ensure local MW versions match expected deployment on mw1354 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:25] RECOVERY - Ensure local MW versions match expected deployment on mw1322 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:25] RECOVERY - Ensure local MW versions match expected deployment on mw1290 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:29] RECOVERY - Ensure local MW versions match expected deployment on mw1338 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:31] RECOVERY - Ensure local MW versions match expected deployment on mw2273 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:35] RECOVERY - Ensure local MW versions match expected deployment on mw2311 
is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:35] RECOVERY - Ensure local MW versions match expected deployment on parse2013 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:37] RECOVERY - Ensure local MW versions match expected deployment on mw1362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:41] RECOVERY - Ensure local MW versions match expected deployment on mw1296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:49] RECOVERY - Ensure local MW versions match expected deployment on mw2263 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:53] RECOVERY - Ensure local MW versions match expected deployment on mw2216 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:26:59] RECOVERY - Ensure local MW versions match expected deployment on wtp1036 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:03] RECOVERY - Ensure local MW versions match expected deployment on mw1392 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:03] RECOVERY - Ensure local MW versions match expected deployment on mw1300 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:11] RECOVERY - Ensure local MW versions match expected deployment on mw2330 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:11] RECOVERY - Ensure local MW versions match expected deployment on mw2274 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:15] RECOVERY - Ensure local MW versions match expected deployment on mwdebug1002 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [14:27:19] RECOVERY - Ensure local MW versions match expected deployment on mw1383 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:19] RECOVERY - Ensure local MW versions match expected deployment on mw1349 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:23] RECOVERY - Ensure local MW versions match expected deployment on mw2292 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:23] RECOVERY - Ensure local MW versions match expected deployment on mw2372 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:23] RECOVERY - Ensure local MW versions match expected deployment on mwdebug2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:23] (PS1) Alexandros Kosiaris: Add apertium namespace [deployment-charts] - https://gerrit.wikimedia.org/r/644530 (https://phabricator.wikimedia.org/T255672) [14:27:29] (CR) JMeybohm: [C: +1] k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 (owner: Alexandros Kosiaris) [14:27:37] RECOVERY - Ensure local MW versions match expected deployment on mw2258 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:43] RECOVERY - Ensure local MW versions match expected deployment on snapshot1008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:45] RECOVERY - Ensure local MW versions match expected deployment on mw2221 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:27:47] RECOVERY - Ensure local MW versions match expected deployment on mw1385 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:01] Operations, ConfirmEdit 
(CAPTCHA extension), Patch-For-Review, Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (Reedy) I think that's possibly all that is needed at this point. 2to3 changes aren't necessary, I think? I have tested (not exten... [14:28:05] RECOVERY - Ensure local MW versions match expected deployment on mw1328 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:05] RECOVERY - Ensure local MW versions match expected deployment on mw2296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:19] RECOVERY - Ensure local MW versions match expected deployment on mw1351 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:24] Operations, Performance-Team, serviceops, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (Ladsgroup) oh it's not stateful but I think it's high I/O compared to other applications (maybe not as high as jitsi but higher than other apps i... 
[14:28:25] RECOVERY - Ensure local MW versions match expected deployment on wtp1027 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:29] RECOVERY - Ensure local MW versions match expected deployment on mw2310 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:29] RECOVERY - Ensure local MW versions match expected deployment on mw2324 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:35] RECOVERY - Ensure local MW versions match expected deployment on mw1275 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:45] RECOVERY - Ensure local MW versions match expected deployment on wtp1040 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:47] RECOVERY - Ensure local MW versions match expected deployment on mw1368 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:47] RECOVERY - Ensure local MW versions match expected deployment on snapshot1005 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:49] RECOVERY - Ensure local MW versions match expected deployment on parse2007 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:49] RECOVERY - Ensure local MW versions match expected deployment on mw2271 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:28:55] RECOVERY - Ensure local MW versions match expected deployment on mw2329 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:29:05] RECOVERY - Ensure local MW versions match expected deployment on mw2317 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:29:13] RECOVERY - Ensure local MW versions match expected deployment on wtp1041 
is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:29:31] RECOVERY - Ensure local MW versions match expected deployment on mw1363 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:29:41] (03PS1) 10KartikMistry: WIP: Add apertium helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672) [14:31:05] RECOVERY - Ensure local MW versions match expected deployment on mw1327 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:31:08] (03PS1) 10Muehlenhoff: Extend PIL Python package with Python 3 counterparts [puppet] - 10https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) [14:31:17] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] ` [14:34:01] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [14:34:24] (03PS1) 10Andrew Bogott: Nova config upgrades for Stein [puppet] - 10https://gerrit.wikimedia.org/r/644534 (https://phabricator.wikimedia.org/T261134) [14:36:08] (03CR) 10JMeybohm: [C: 04-1] "Cool, thanks! 
(Just a nit on aptrepo)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris) [14:36:55] PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:37:27] (03CR) 10JMeybohm: [C: 03+1] package_from_component: Move to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/644509 (owner: 10Alexandros Kosiaris) [14:38:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) (owner: 10Muehlenhoff) [14:39:13] (03PS1) 10Ottomata: Set oozie.service.coord.default.max.timeout to 13 months [puppet] - 10https://gerrit.wikimedia.org/r/644535 (https://phabricator.wikimedia.org/T264358) [14:40:55] RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:44:22] subbu: i got the vendor patch for parsoid in and ran a whole sync again [14:44:39] thanks. [14:44:50] (03CR) 10Andrew Bogott: [C: 03+2] Nova config upgrades for Stein [puppet] - 10https://gerrit.wikimedia.org/r/644534 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [14:51:20] !log instal lxml updates [14:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:47] (03CR) 10JMeybohm: Add calico helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [14:55:09] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:47] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:19] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] ` [14:59:34] (03CR) 10Bstorm: [C: 03+2] "I need to start merging the precursor patches for this so that I can focus on changes to the meatier one." [puppet] - 10https://gerrit.wikimedia.org/r/643356 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [14:59:36] !log install libonig updates to scp [14:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (10wiki_willy) a:05wiki_willy→03Cmjohnson [15:01:44] (03PS5) 10Bstorm: wikireplicas: Upgrade maintain-meta_p.py to python 3 [puppet] - 10https://gerrit.wikimedia.org/r/643363 (https://phabricator.wikimedia.org/T268312) [15:03:11] PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:05:17] (03PS1) 10Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644537 [15:05:19] (03PS12) 10Jbond: turnilo: add export mappings for network devices via query_resources [puppet] - 10https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) [15:06:21] RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:06:57] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install 
ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [15:07:11] (03CR) 10jerkins-bot: [V: 04-1] test: Standardise integration_env output [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644537 (owner: 10Kormat) [15:07:14] (03CR) 10Bstorm: [C: 03+2] wikireplicas: Upgrade maintain-meta_p.py to python 3 [puppet] - 10https://gerrit.wikimedia.org/r/643363 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [15:10:14] (03PS2) 10Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644537 [15:10:19] !log hashar@deploy1001 Finished scap: (no justification provided) (duration: 44m 20s) [15:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:29] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269143 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [15:10:32] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269143 (10ops-monitoring-bot) [15:10:53] subbu: testwiki has parsoid 0.13.0-a18 https://test.wikipedia.org/wiki/Special:Version [15:11:20] nice! 
[15:14:40] (03PS2) 10Bstorm: wikireplicas: add maintain-meta_p only to s7 and legacy replicas [puppet] - 10https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312) [15:18:23] RECOVERY - tileratorui on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [15:18:28] (03PS3) 10Bstorm: wikireplicas: add maintain-meta_p only to s7 and legacy replicas [puppet] - 10https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312) [15:18:45] RECOVERY - tileratorui on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [15:18:47] (03PS3) 10ArielGlenn: add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930 [15:19:16] (03Abandoned) 10Razzi: analytics: Replace an-coord1001 with analytics-hive [puppet] - 10https://gerrit.wikimedia.org/r/644353 (https://phabricator.wikimedia.org/T268028) (owner: 10Razzi) [15:20:14] 1.36.0-wmf.20 is on testwikis [15:20:27] I am off for roughly an hour [15:20:52] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] ` [15:22:26] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:20] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:16] (03PS4) 10ArielGlenn: add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930 [15:31:02] (03CR) 10RLazarus: "Thanks for the patch! 
Terrible news: These are being migrated from an old format (cron) to a new format (profile::mediawiki::periodic_job)" [puppet] - 10https://gerrit.wikimedia.org/r/643917 (https://phabricator.wikimedia.org/T262857) (owner: 10Cparle) [15:31:19] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10akosiaris) >>! In T267327#6659813, @Ladsgroup wrote: > oh it's not stateful but I think it's high I/O compared to other applications (maybe not a... [15:33:53] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10razzi) Here's the cumin output for the kafka-test1001 decomission: ` razzi@cumin1001:~$ sudo cookbook sre.hosts.decommission kafka-test1001.eqiad.wmnet -t T268202 STA... [15:34:37] PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:37:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, couple of inline comments" (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936 (owner: 10JMeybohm) [15:39:00] (03PS3) 10Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644537 [15:42:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, got a small nitpick in there, but +1 otherwise" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643974 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [15:43:21] (03CR) 10ArielGlenn: [C: 03+2] add the ability to skip a job via configuration [dumps] - 10https://gerrit.wikimedia.org/r/644476 (owner: 10ArielGlenn) [15:43:48] (03Merged) 10jenkins-bot: add the ability to skip a job via configuration [dumps] - 10https://gerrit.wikimedia.org/r/644476 (owner: 10ArielGlenn) [15:44:38] (03CR) 10ArielGlenn: [C: 03+2] add a sample job for illustration purposes [dumps] - 
10https://gerrit.wikimedia.org/r/625930 (owner: 10ArielGlenn) [15:45:51] (03Merged) 10jenkins-bot: add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930 (owner: 10ArielGlenn) [15:45:57] RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:46:54] (03CR) 10Herron: [C: 03+1] alertmanager: use o11y address as from [puppet] - 10https://gerrit.wikimedia.org/r/644517 (https://phabricator.wikimedia.org/T268995) (owner: 10Filippo Giunchedi) [15:47:13] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 80.68, 150.80, 115.20 https://wikitech.wikimedia.org/wiki/Swift [15:47:32] (03CR) 10Bstorm: "I'm removing the dependency on simplejson because it really serves no purpose except in possibly ancient python that may have existed when" [puppet] - 10https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [15:48:05] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: use o11y address as from [puppet] - 10https://gerrit.wikimedia.org/r/644517 (https://phabricator.wikimedia.org/T268995) (owner: 10Filippo Giunchedi) [15:48:23] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:53] (03PS2) 10Razzi: Ensure /tmp/sqoop-jars/ is present for role::analytics_cluster::launcher [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) [15:50:15] (03CR) 10Razzi: "A quick ops week patch :)" [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [15:50:56] (03CR) 10DannyS712: [C: 03+1] Add log channel Wikibase.IdGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: 10Lucas Werkmeister (WMDE)) [15:52:32] 
(03PS1) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 [15:53:41] (03PS1) 10Jbond: raktables: hand off authentication to httpd [puppet] - 10https://gerrit.wikimedia.org/r/644543 [15:53:43] (03PS1) 10Jbond: racktables: Make everyone admin [puppet] - 10https://gerrit.wikimedia.org/r/644544 [15:55:12] (03CR) 10Elukey: Ensure /tmp/sqoop-jars/ is present for role::analytics_cluster::launcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [15:57:05] RECOVERY - very high load average likely xfs on ms-be2031 is OK: OK - load average: 37.97, 55.26, 79.71 https://wikitech.wikimedia.org/wiki/Swift [15:57:32] (03PS1) 10Hnowlan: maps: fix typo in postgres command, retry 5 times before alerting [puppet] - 10https://gerrit.wikimedia.org/r/644545 [16:01:18] (03PS1) 10Mholloway: sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644492 [16:07:06] (03CR) 10Ppchelko: [C: 04-1] "Additionally, need to bump the patch version of the chart." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [16:08:24] 10Operations, 10MediaWiki-General, 10Platform Engineering: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Ladsgroup) It might sound like a promotion but I think before getting this done (in any way,... 
[16:08:37] (03PS2) 10CRusnov: icinga/check_legal_html.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) [16:09:18] (03PS1) 10Tchanders: extension-list: Add IPInfo extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644548 (https://phabricator.wikimedia.org/T260599) [16:09:20] (03PS1) 10Tchanders: Add IPInfo config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644549 (https://phabricator.wikimedia.org/T260599) [16:09:23] (03PS1) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) [16:09:25] (03PS1) 10Tchanders: Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) [16:09:33] (03CR) 10jerkins-bot: [V: 04-1] extension-list: Add IPInfo extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644548 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [16:09:33] !log installing vips security updates [16:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:43] (03CR) 10jerkins-bot: [V: 04-1] Add IPInfo config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644549 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [16:09:45] (03CR) 10jerkins-bot: [V: 04-1] Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [16:09:52] (03CR) 10jerkins-bot: [V: 04-1] Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders) [16:10:31] (03PS1) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts 
[cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) [16:11:47] (03PS2) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) [16:11:55] (03CR) 10CRusnov: [C: 03+2] icinga/check_legal_html.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:12:42] (03CR) 10Ebernhardson: [C: 03+1] cirrus: alert on pool counter reject spike [puppet] - 10https://gerrit.wikimedia.org/r/643362 (https://phabricator.wikimedia.org/T262694) (owner: 10Ryan Kemper) [16:15:18] (03PS1) 10Clarakosi: Remove OAuth experimental routes from beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644553 (https://phabricator.wikimedia.org/T262495) [16:16:50] (03PS1) 10Muehlenhoff: Update vips library hints [puppet] - 10https://gerrit.wikimedia.org/r/644555 [16:17:20] (03PS2) 10Clarakosi: Remove OAuth experimental routes from beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644553 (https://phabricator.wikimedia.org/T262495) [16:17:40] (03CR) 10Ppchelko: [C: 04-2] "Looks good, need to wait for the dependency to land on all groups first, so -2 until that happens." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644553 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi) [16:18:05] (03PS1) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) [16:18:23] 10Operations, 10ops-eqiad: Replace asw2-c-eqiad VC cable - https://phabricator.wikimedia.org/T268804 (10ayounsi) 05Open→03Resolved No more issues, thanks! [16:19:28] 10Operations, 10ops-eqiad: Replace asw2-d-eqiad VC cable - https://phabricator.wikimedia.org/T268808 (10ayounsi) 05Open→03Resolved a:03Jclark-ctr No more errors, thanks! 
[16:19:36] (03CR) 10jerkins-bot: [V: 04-1] cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [16:23:51] 10Operations, 10MediaWiki-General, 10Platform Engineering Roadmap Decision Making: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10daniel) [16:27:18] (03PS3) 10CRusnov: Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) [16:28:01] (03CR) 10jerkins-bot: [V: 04-1] Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:31:44] (03PS4) 10Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644537 [16:31:46] (03PS1) 10Kormat: intpy2 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644561 [16:32:46] (03PS1) 10Ayounsi: Speed up Homer by fixing fetch_device_circuits() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/644563 [16:34:59] (03PS4) 10Lucas Werkmeister (WMDE): Add log channel Wikibase.IdGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) [16:35:01] (03PS1) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator logging on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644564 (https://phabricator.wikimedia.org/T268625) [16:35:21] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/644563 (owner: 10Ayounsi) [16:35:27] (03PS4) 10CRusnov: Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) [16:35:29] (03CR) 10CRusnov: Port elasticsearch/es-tool.py to 
Python3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:38:09] (03CR) 10CRusnov: "thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:39:19] PROBLEM - Device not healthy -SMART- on ms-be1030 is CRITICAL: cluster=swift device=None instance=ms-be1030 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1030&var-datasource=eqiad+prometheus/ops [16:39:44] (03CR) 10Jbond: [C: 03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:40:01] (03CR) 10Ebernhardson: [C: 03+1] Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:41:11] (03CR) 10Volans: "Pure cookbook-foo nit inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [16:46:03] (03CR) 10CRusnov: [C: 03+2] Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:47:45] RECOVERY - Device not healthy -SMART- on ms-be1022 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops [16:48:26] (03CR) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [16:50:10] (03PS4) 10Bstorm: wikireplicas: add maintain-meta_p only to s7 and legacy replicas [puppet] - 10https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312) [16:52:32] (03CR) 10Volans: "reply inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [16:56:13] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Cool! Thanks :] Will do. I'll let @JAllemandou coordinate with you on a good date and time for the team presentation! [16:56:55] 10Operations, 10fundraising-tech-ops, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10Dwisehaupt) @ayounsi LLDP should be possible for us. We are at the start of the busy time for fundraising so it'll probably be a few weeks or most likely January when we would... [16:56:59] (03PS3) 10Razzi: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) [16:57:42] (03PS1) 10Mforns: analytics::refinery::job::druid_load.pp: reduce netflow retention [puppet] - 10https://gerrit.wikimedia.org/r/644569 (https://phabricator.wikimedia.org/T254332) [16:58:21] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:03] (03PS2) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 [17:00:05] jbond42 and cdanis: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1700) [17:06:39] (03PS3) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 [17:08:49] (03CR) 10JMeybohm: prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [17:09:27] (03PS2) 10Mholloway: Add event stream config for android.user_contributions_screen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179) [17:09:59] RECOVERY - Device not healthy -SMART- on ms-be1030 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1030&var-datasource=eqiad+prometheus/ops [17:17:14] (03CR) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [17:18:45] (03CR) 10Effie Mouzeli: [C: 03+1] Extend PIL Python package with Python 3 counterparts [puppet] - 10https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) (owner: 10Muehlenhoff) [17:19:24] (03PS4) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 [17:19:31] !log Sanitize s1 on clouddb1013 and clouddb1017 - T267090 [17:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:40] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [17:20:37] (03CR) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [17:20:57] (03CR) 10Joal: [C: 03+1] "Ideas - looking for me without them" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [17:21:57] (03PS1) 10Mholloway: sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/644493 [17:28:31] (03PS2) 10JMeybohm: coredns: Create a wmfcoredns copy in charts dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936 [17:28:32] PROBLEM - Device not healthy -SMART- on ms-be1022 is CRITICAL: cluster=swift device=None instance=ms-be1022 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops [17:31:38] (03CR) 10Mholloway: [C: 03+2] Add event stream config for android.user_contributions_screen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179) (owner: 10Mholloway) [17:32:27] (03Merged) 10jenkins-bot: Add event stream config for android.user_contributions_screen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179) (owner: 10Mholloway) [17:34:22] PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:34:54] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add event stream config for android.user_contributions_screen T228179 (duration: 01m 07s) [17:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:02] T228179: Event Platform Client — Android - https://phabricator.wikimedia.org/T228179 [17:38:04] RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:46:21] (03CR) 10Mholloway: [C: 03+2] sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644492 (owner: 10Mholloway) [17:47:28] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:09] (03CR) 10Mholloway: [C: 03+2] sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/644493 (owner: 10Mholloway) [17:50:59] (03Merged) 10jenkins-bot: sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644492 
(owner: 10Mholloway) [17:52:45] (03Merged) 10jenkins-bot: sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/644493 (owner: 10Mholloway) [17:54:27] !log mholloway-shell@deploy1001 Synchronized php-1.36.0-wmf.20/extensions/WikimediaEvents: Backport: sessionTick: Update stream name to mw_session_tick (duration: 01m 07s) [17:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:23] (03PS2) 10BPirkle: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi) [17:57:26] !log mholloway-shell@deploy1001 Synchronized php-1.36.0-wmf.18/extensions/WikimediaEvents: Backport: sessionTick: Update stream name to mw_session_tick (duration: 01m 04s) [17:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:58] (03PS4) 10Mholloway: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan) [17:59:47] (03CR) 10Mholloway: [C: 03+2] sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan) [18:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1800). Please do the needful. [18:00:17] (03PS2) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [18:00:27] Anyone object to us deploying a small config change in a few minutes? This is related to the not-yet-officially-released API Portal. 
Here's the change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/644305 [18:00:34] (03CR) 10jerkins-bot: [V: 04-1] similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [18:00:39] (03Merged) 10jenkins-bot: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan) [18:02:02] (03PS3) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [18:02:26] (03CR) 10jerkins-bot: [V: 04-1] similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [18:02:45] PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:03:03] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable session length instrument on officewiki T267494 (duration: 01m 06s) [18:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:11] T267494: [SessionLength] View how long users interact with our products - https://phabricator.wikimedia.org/T267494 [18:03:59] RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:04:21] (03PS4) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [18:04:40] (03CR) 10jerkins-bot: [V: 04-1] similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 
(https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [18:05:40] (03PS5) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [18:08:37] (03PS1) 10Mholloway: [BETA] Enable session length instrument on all Beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644575 [18:09:30] (03CR) 10BPirkle: [C: 03+2] Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi) [18:10:33] (03PS1) 10Elukey: admin: remove users already in 'researchers' from 'analytics-users' [puppet] - 10https://gerrit.wikimedia.org/r/644576 (https://phabricator.wikimedia.org/T269150) [18:10:52] (03PS3) 10BPirkle: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi) [18:11:34] (03CR) 10BPirkle: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi) [18:11:47] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26808/console" [puppet] - 10https://gerrit.wikimedia.org/r/644576 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey) [18:11:49] (03CR) 10BPirkle: [C: 03+2] Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi) [18:12:37] (03Merged) 10jenkins-bot: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi) [18:13:56] (03CR) 10Elukey: [V: 03+1 C: 03+2] admin: remove users already in 'researchers' from 
'analytics-users' [puppet] - 10https://gerrit.wikimedia.org/r/644576 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey) [18:14:40] chaomodus: o/ [18:14:57] I see a change from you in puppet-merge, should I proceed? [18:15:07] "Port elasticsearch/es-tool.py to Python3" [18:15:59] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Patch-For-Review, 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10Reedy) >>! In T268468#6659719, @MoritzMuehlenhoff wrote: > Well, at minimum the shebang needs to be switched to #!/usr/bin/python3... [18:18:21] judging from https://gerrit.wikimedia.org/r/c/operations/puppet/+/644365/ it seems that ebernhardson +1ed so the folks in discovery are aware [18:19:03] PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:19:05] (03CR) 10CDanis: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/643945 (owner: 10Jbond) [18:19:19] all right merging [18:19:32] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... 
[18:20:26] (03PS1) 10RobH: swapping new cloudcephmon eqiad hosts to partition same as existing [puppet] - 10https://gerrit.wikimedia.org/r/644578 (https://phabricator.wikimedia.org/T268746) [18:22:38] (03CR) 10RobH: [C: 03+2] swapping new cloudcephmon eqiad hosts to partition same as existing [puppet] - 10https://gerrit.wikimedia.org/r/644578 (https://phabricator.wikimedia.org/T268746) (owner: 10RobH) [18:24:11] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01395 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:25:14] puppet is newly failing on logstash/elasticsearch hosts [18:25:34] E: Unable to locate package python3-ipaddr [18:25:49] RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:26:09] cdanis: ah nice it is the change that I merged with mine, lovely [18:26:19] (03CR) 10CDanis: "This looks good, but I think you will also need to update the prometheus/ops configuration to scrape the exporter and get the data in Prom" [puppet] - 10https://gerrit.wikimedia.org/r/644245 (owner: 10Hashar) [18:26:28] (03CR) 10CDanis: [C: 03+2] ci: add prometheus exporter for Apache [puppet] - 10https://gerrit.wikimedia.org/r/644245 (owner: 10Hashar) [18:28:55] cdanis: ah lovely python3-ipaddr is only in bullseye [18:29:01] chaomodus: --^ [18:29:05] shall we revert? 
[18:29:11] boh [18:29:13] thanks [18:29:16] i should've checked that one [18:29:23] hopefully not hard to backport :/ [18:29:42] (03PS1) 10Elukey: Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644495 [18:29:50] ah there you are :) [18:29:55] (03PS1) 10Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role [puppet] - 10https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965) [18:30:10] (03PS1) 10CRusnov: Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644496 [18:30:20] (03PS2) 10CRusnov: Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644496 [18:30:20] ok abandoning mine [18:30:38] (03Abandoned) 10Elukey: Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644495 (owner: 10Elukey) [18:30:57] (03PS1) 10Ottomata: Keep EventLogging SpecialMuteSubmit on old system for now [puppet] - 10https://gerrit.wikimedia.org/r/644585 (https://phabricator.wikimedia.org/T268517) [18:31:23] (03CR) 10CRusnov: [C: 03+2] Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644496 (owner: 10CRusnov) [18:31:49] (03CR) 10Ppchelko: [C: 04-1] Configuration chage to allow custom comment reverts on Wikidata (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [18:33:36] (03CR) 10Ottomata: [C: 03+2] Keep EventLogging SpecialMuteSubmit on old system for now [puppet] - 10https://gerrit.wikimedia.org/r/644585 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [18:33:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:09] probably a better solution would be to make it use ipaddress [18:34:38] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Patch-For-Review, 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10Reedy) I'm guessing there's no reason to actually make the python version (ie whether to use `python` or `python3` configurable wi... [18:35:38] bpirkle: o/ Do you still plan to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/644305/? I've got one more config change to go out, but was waiting until you're finished. [18:35:47] Yep, almost done. [18:35:53] OK, cool, thanks! [18:38:46] !log bpirkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WikimediaApiPortalOAuth on apiportalwiki gerrit:644305 (duration: 01m 06s) [18:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:33] (03PS5) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 [18:40:11] !log bpirkle@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable WikimediaApiPortalOAuth on apiportalwiki gerrit:644305 (duration: 01m 06s) [18:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:57] hm, about three weeks until the day is 5 hours 49 minutes 4 seconds long, and then the next day is a little longer [18:42:01] mholloway: All done, confirmed that changes had the intended effect and no explosions seen in logstash. [18:42:12] bpirkle: Thanks! 
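Editor's note on the 18:34:09 suggestion to "make it use ipaddress": the log does not show how es-tool.py uses the ipaddr package, so the following is only a minimal sketch of what a switch to the Python 3 standard-library ipaddress module might look like. The helper name host_in_network and the sample addresses are hypothetical and are not taken from the actual script.

```python
import ipaddress


def host_in_network(host: str, cidr: str) -> bool:
    """Hypothetical helper: report whether `host` lies inside `cidr`.

    Uses only the standard-library `ipaddress` module, so no distro
    package such as python3-ipaddr needs to be installed.
    """
    address = ipaddress.ip_address(host)
    # strict=False tolerates a CIDR whose host bits are set, e.g. 10.64.0.5/22.
    network = ipaddress.ip_network(cidr, strict=False)
    return address in network


if __name__ == "__main__":
    # Sample values are made up for illustration only.
    print(host_in_network("10.64.0.5", "10.64.0.0/22"))
    print(host_in_network("10.192.0.5", "10.64.0.0/22"))
```

Because ipaddress ships with Python 3.3 and later, a port along these lines would not depend on which Debian release carries python3-ipaddr, which is the problem reported at 18:28:55.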
[18:42:17] wrong window for me sorry [18:42:19] (03PS2) 10Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965) [18:43:19] (03PS2) 10Mholloway: [BETA] Enable session length instrument on all Beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644575 [18:43:27] (03CR) 10CRusnov: "This is a redo of https://gerrit.wikimedia.org/r/c/operations/puppet/+/644365 after it was discovered that python3-ipaddr is not available" [puppet] - 10https://gerrit.wikimedia.org/r/644591 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [18:44:31] (03PS1) 10Elukey: admin: add comments to analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/644592 (https://phabricator.wikimedia.org/T269150) [18:45:24] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission [18:45:24] (03CR) 10Mholloway: [C: 03+2] [BETA] Enable session length instrument on all Beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644575 (owner: 10Mholloway) [18:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:32] (03CR) 10Elukey: [C: 03+2] admin: add comments to analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/644592 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey) [18:45:41] (03PS2) 10Razzi: Configure zookeeper-test1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) [18:45:48] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::druid_load.pp: reduce netflow retention [puppet] - 10https://gerrit.wikimedia.org/r/644569 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [18:46:07] (03PS3) 10Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965) [18:46:24] (03Merged) 
10jenkins-bot: [BETA] Enable session length instrument on all Beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644575 (owner: 10Mholloway) [18:48:31] (03CR) 10Elukey: [C: 03+1] Configure zookeeper-test1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) (owner: 10Razzi) [18:49:49] (03PS4) 10Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965) [18:52:50] (03PS5) 10Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965) [18:54:26] (03PS1) 10RobH: cloudcephosd update was not correct [puppet] - 10https://gerrit.wikimedia.org/r/644593 (https://phabricator.wikimedia.org/T268746) [18:54:31] (03PS6) 10Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965) [18:55:35] (03CR) 10RobH: [C: 03+2] cloudcephosd update was not correct [puppet] - 10https://gerrit.wikimedia.org/r/644593 (https://phabricator.wikimedia.org/T268746) (owner: 10RobH) [18:56:08] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [19:00:05] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. 
[19:01:11] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Patch-For-Review, 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10MoritzMuehlenhoff) >>! In T268468#6660710, @Reedy wrote: > I'm guessing there's no reason to actually make the python version (ie... [19:07:34] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:40] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `kafka-test1003.eqiad.wmnet` - kafka-test1003.eqiad.wmnet (**WARN**) - **... [19:07:43] (03PS1) 10Ottomata: Bump eventstreams image version to 2020-12-01-181032-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644595 [19:09:11] (03CR) 10Ottomata: [C: 03+2] Bump eventstreams image version to 2020-12-01-181032-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644595 (owner: 10Ottomata) [19:10:55] (03CR) 10Dzahn: [C: 03+2] Extend PIL Python package with Python 3 counterparts [puppet] - 10https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) (owner: 10Muehlenhoff) [19:12:36] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] ` [19:14:20] (03CR) 10Dzahn: "on random host mw1401: Notice: /Stage[main]/Mediawiki::Packages/Package[python3-pil]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) (owner: 10Muehlenhoff) [19:15:47] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] 
- https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [19:19:11] (03CR) 10Papaul: [C: 03+1] delete the drac module [puppet] - 10https://gerrit.wikimedia.org/r/644364 (owner: 10Dzahn) [19:20:29] (03CR) 10Dzahn: [C: 03+2] delete the drac module [puppet] - 10https://gerrit.wikimedia.org/r/644364 (owner: 10Dzahn) [19:25:44] (03PS2) 10Dzahn: deployment::server: buster support, use default-mysql-client package [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) [19:26:07] 10Operations, 10puppet-compiler: String vs Binary issues while running the puppet compiler - https://phabricator.wikimedia.org/T268978 (10Legoktm) [19:26:47] (03CR) 10Dzahn: "thanks! though using "default-mysql-client" pulls mariadb packages on stretch as well, including mariadb-common. I would slightly prefer n" [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [19:26:57] PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:26:59] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[19:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:06] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] ` [19:30:56] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [19:30:58] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] ` [19:31:41] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:58] 10Operations, 10Android-app-Bugs, 10Fundraising-Backlog, 10Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10CDanis) FWIW, having looked at the past week of webrequest data, I've started to wonder as to whether or not the file... 
[19:32:47] (03PS1) 10CDanis: admin: add cdanis to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/644600 [19:33:11] (03CR) 10Ottomata: [C: 03+1] admin: add cdanis to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/644600 (owner: 10CDanis) [19:33:19] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 180.91, 163.20, 107.82 https://wikitech.wikimedia.org/wiki/Swift [19:33:24] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [19:33:24] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [19:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:33] RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:33:33] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:36] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:07] 10Operations, 10Android-app-Bugs, 10Fundraising-Backlog, 10Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10MattCleinman) I believe I read something that they're now caching the app site association file on the Apple CDN, so... 
[19:34:17] (03PS3) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) [19:34:21] (03CR) 10CDanis: [C: 03+2] admin: add cdanis to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/644600 (owner: 10CDanis) [19:35:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:54] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [19:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:59] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [19:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:57] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:31] 10Operations, 10Android-app-Bugs, 10Fundraising-Backlog, 10Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10CDanis) >>! In T259312#6660843, @MattCleinman wrote: > I believe I read something that they're now caching the app si... [19:38:51] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/26816/deploy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [19:38:56] (03CR) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [19:40:46] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[19:40:46] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [19:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:36] RECOVERY - very high load average likely xfs on ms-be2031 is OK: OK - load average: 26.48, 49.15, 72.82 https://wikitech.wikimedia.org/wiki/Swift [19:44:00] !log deploy refinery with refinery-source v0.0.140 [19:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:19] !log razzi@deploy1001 Started deploy [analytics/refinery@41c60d9]: Regular analytics weekly train [analytics/refinery@3e42f46c62722256a1678809097114740806a184] [19:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:14] (03CR) 10Dzahn: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) (owner: 10Ryan Kemper) [19:49:06] (03PS1) 10Ryan Kemper: maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644603 (https://phabricator.wikimedia.org/T260269) [19:50:11] 10Operations, 10ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (10Papaul) UPS Ship Notification, Tracking Number 1ZA19A021298055451 [19:51:39] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:29] (03PS1) 10Bstorm: wikireplicas: add cumin aliases that include multiinstance servers [puppet] - 10https://gerrit.wikimedia.org/r/644606 (https://phabricator.wikimedia.org/T268312) [19:52:40] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch-cluster: support for cloudelastic [software/spicerack] - 10https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) (owner: 10Ryan Kemper) [19:53:00] (03CR) 10Ryan Kemper: 
[V: 03+2 C: 03+2] elasticsearch-cluster: support for cloudelastic [software/spicerack] - 10https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) (owner: 10Ryan Kemper) [19:54:04] !log razzi@deploy1001 Finished deploy [analytics/refinery@41c60d9]: Regular analytics weekly train [analytics/refinery@3e42f46c62722256a1678809097114740806a184] (duration: 08m 45s) [19:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:47] (03CR) 10Bstorm: "If there's no objections soon, I'll merge this. Just curious if there were any." [puppet] - 10https://gerrit.wikimedia.org/r/643337 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [19:56:06] !log razzi@deploy1001 Started deploy [analytics/refinery@41c60d9] (thin): Regular analytics weekly train [analytics/refinery@3e42f46c62722256a1678809097114740806a184] [19:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:13] !log razzi@deploy1001 Finished deploy [analytics/refinery@41c60d9] (thin): Regular analytics weekly train [analytics/refinery@3e42f46c62722256a1678809097114740806a184] (duration: 00m 07s) [19:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:58] (03CR) 10Dzahn: [C: 03+2] deployment::server: buster support, use default-mysql-client package [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [19:59:04] (03PS3) 10Dzahn: deployment::server: buster support, use default-mysql-client package [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) [19:59:42] (03CR) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [20:00:05] hashar and twentyafterfour: How many deployers does it take to do Mediawiki train - European+American Version (secondary timeslot) 
deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T2000). [20:00:27] (03CR) 10BryanDavis: [C: 03+1] "The only issue I have with this is the "where does it end" question. Is it ok to idle in vim? emacs? redis-cli?" [puppet] - 10https://gerrit.wikimedia.org/r/643337 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [20:01:17] (03PS1) 10Razzi: Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202) [20:01:28] (03CR) 10Ppchelko: Configuration chage to allow custom comment reverts on Wikidata (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [20:02:16] 10Operations, 10ops-eqiad: eqiad: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268750 (10wiki_willy) a:03Jclark-ctr [20:04:02] (03CR) 10Ppchelko: [C: 04-1] Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [20:04:22] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:33] (03CR) 10Dzahn: "ACK, it's noop on stretch like this, confirmed on deploy2001 which already had the mariadb packages as well" [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:09:31] (03CR) 10Dzahn: "noop on prod deployment servers, fixed issue on new buster servers (where unrelated issues remain)" [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:16:08] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:29] (03PS3) 10Dzahn: site: add deploy2002 and unify deployment server role regex [puppet] - 
10https://gerrit.wikimedia.org/r/644333 (https://phabricator.wikimedia.org/T265963) [20:21:48] (03CR) 10Ryan Kemper: "I think given these were added to `role(maps::replica)`" [puppet] - 10https://gerrit.wikimedia.org/r/644603 (https://phabricator.wikimedia.org/T260269) (owner: 10Ryan Kemper) [20:23:35] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269143 (10Peachey88) [20:23:38] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10Peachey88) [20:26:43] (03CR) 10Dzahn: [C: 03+2] site: add deploy2002 and unify deployment server role regex [puppet] - 10https://gerrit.wikimedia.org/r/644333 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:27:55] (03PS1) 10Ryan Kemper: maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644611 (https://phabricator.wikimedia.org/T260271) [20:29:14] (03PS1) 10Ottomata: Add new service eventstreams-internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) [20:30:44] (03PS2) 10Ottomata: Add new service eventstreams-internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) [20:31:20] 10Operations, 10Analytics, 10Event-Platform, 10EventStreams, and 4 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) [20:33:48] (03CR) 10Dzahn: "noop on deploy1001/2001, on 2002 mcrouter cert is needed" [puppet] - 10https://gerrit.wikimedia.org/r/644333 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:35:16] mutante: are you creating a new deployment host? [20:35:46] yes, hashar. 
T265963 [20:35:46] T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 [20:36:01] mutante: there were a couple of times in which we had a new deploy server added and it came with no git repositories cloned at all [20:36:13] and eventually puppet or some cron kicks in and rsyncs all the repositories [20:36:18] with --delete [20:36:21] yes, i know. described in https://phabricator.wikimedia.org/T265963#6660917 [20:36:38] resulting in the actual primary ones being wiped out entirely (including the local only repos having private settings :\ ) [20:37:31] ah yeah there is that [20:37:43] that's probably the reason that scap is now blocked with that lock file [20:37:52] maybe [20:38:03] but I think there is another mechanism that keeps the deployment hosts in sync [20:38:30] yea, we have a checkbox for that " sync repo data over from old servers to new servers" [20:39:01] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:21] if something bad had happened then it would have happened before when deploy1002 was added [20:39:32] (03PS2) 10Ryan Kemper: maps: remove no-longer-accurate insetup role [puppet] - 10https://gerrit.wikimedia.org/r/644611 (https://phabricator.wikimedia.org/T260271) [20:39:43] (03CR) 10CDanis: [C: 03+1] "sounds good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/643503 (https://phabricator.wikimedia.org/T266016) (owner: 10Filippo Giunchedi) [20:40:13] mutante: sounds good :] [20:40:32] also puppet doesn't run anyways because there is no mcrouter cert [20:40:59] but I am also happy to revert back to "insetup" [20:42:07] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] cirrus: alert on pool counter reject spike [puppet] - 10https://gerrit.wikimedia.org/r/643362 (https://phabricator.wikimedia.org/T262694) (owner: 10Ryan Kemper) [20:42:36] mutante: well I can't find the task or the incident report (if we had any) [20:43:04] I think it was some kind of cronjob kicking in on the to be new primary server but since it had empty repo that wiped the other masters [20:43:06] something like that [20:43:11] guess it got addressed [20:44:15] (03PS6) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 [20:44:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:56] (03CR) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [20:47:04] hashar: thanks for the warning, i am double checking the cron right now. [20:47:46] or maybe that was when scap init ran [20:48:06] well, there is actually a cron and it has a variable $ensure [20:48:19] and default absent, but also on a new host [20:51:04] ok, but what this does is pull FROM the old host to the new host [20:51:17] with --delete but from deploy1001 to ...local [20:51:49] it filled /srv/deployment up in the right direction [20:52:36] sounds good [20:52:46] so maybe the issue was years and years ago [20:53:11] and that got addressed [20:53:39] hashar: yea, so I checked the code from top to bottom. 
it looks up what the current deployment_server is in Hiera in common.yaml [20:54:02] then there is some code that goes "if this is the active server then do NOT have the cron, otherwise do have it" [20:54:17] so that cron gets applied on all (currently 3) non-active hosts [20:54:24] but they are all the same, pulling from the 1 active host [20:54:32] but good to double check this stuff [20:54:35] yeah [20:54:42] the interesting part is when we switch in hiera [20:54:44] and not doing that yet [20:55:01] notably we want /srv/mediawiki-staging to be around I guess. Notably the private settings [20:55:09] my next question is different [20:55:17] and that is "how do you properly bootstrap scap" [20:55:22] oh [20:55:24] you can't run scap deploy --init [20:55:27] which puppet tries to do [20:55:34] but can't because .. this is not the active server [20:56:02] so either it has to be a scheduled maintenance window where we switch it and someone runs all the scap init stuff [20:56:12] or we have to allow doing it before it is an active server [20:56:24] but making REALLY sure it only inits and doesn't mess with actual deployments [20:56:43] I had the issue when provisioning a new repo [20:56:49] either way it should be fixed because right now that is just failed puppet [20:56:55] https://phabricator.wikimedia.org/T257317 [20:57:13] heh, yes, that :) [20:57:25] and Jayme pointed out the same thing you did [20:57:36] the lock prevents scap deploy --init from running [20:57:47] ok, all I have to do is quote Jaime [20:57:49] "scap syncronization (all methods) should be disabled because of the lock, but probably --init should be allowed " [20:57:53] that was what I wanted to say [20:58:14] let me link that, thx [20:58:24] oh [20:58:26] and https://phabricator.wikimedia.org/T257319 [20:58:32] 14:25:38 deploy failed: Failed to acquire lock "/var/lock/scap-global-lock"; owner is "root"; reason is "Not the active deployment server, use deploy1001.eqiad.wmnet [20:59:15] (03CR) 
10Muehlenhoff: "Yeah, mysql-client is just a transitional package to default-mysql-client in Stretch: https://packages.debian.org/stretch/mysql-client" [puppet] - 10https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:59:18] but I can't remember how I fixed it. Maybe the first deploy synced stuff to the secondary deploy server and that unbroke puppet [20:59:28] or it is a race condition in puppet that eventually self fixes [20:59:54] hashar: I think what happened is we switched what the active server is and then ran puppet and that fixes it [21:00:11] but would be nicer to be able to do this earlier [21:00:17] and confirm everything is ok before switching [21:01:57] (03CR) 10Ottomata: [C: 03+1] Configure zookeeper-test1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) (owner: 10Razzi) [21:02:33] I will add the mcrouter cert for deploy2002, then puppet is not failed but just back to those warnings like on deploy1002. [21:06:26] (03CR) 10Ppchelko: [C: 03+1] "@Clara will you lead the way deploying this? 
There's also https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/644261 that you " [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust) [21:06:41] (03PS1) 10Dzahn: add fake mcrouter certs for deploy2002 [labs/private] - 10https://gerrit.wikimedia.org/r/644616 (https://phabricator.wikimedia.org/T265963) [21:07:11] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269166 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [21:07:11] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for deploy2002 [labs/private] - 10https://gerrit.wikimedia.org/r/644616 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [21:07:14] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269166 (10ops-monitoring-bot) [21:09:21] (03PS3) 10Dzahn: gerrit: daemon option in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/643944 (owner: 10Hashar) [21:10:19] mutante: yeah that gerrit config change is all fine. The other one that mess up with the host config I am not 100% sure :\ [21:12:41] (03PS3) 10Razzi: Configure zookeeper-test1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) [21:14:27] yes, I see it the same way. Not planning to merge that other one yet. 
[21:14:49] has open comments [21:14:51] but the daemon option you can merge it, I will look at the replica [21:15:45] (03CR) 10Dzahn: [C: 03+2] "confirmed this is not changing the active server, just the replica https://puppet-compiler.wmflabs.org/compiler1001/26817/" [puppet] - 10https://gerrit.wikimedia.org/r/643944 (owner: 10Hashar) [21:17:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:17:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:51] !log applied deployment_server role on deploy2002, added mcrouter cert, initial puppet run pulls mediawiki-config and other repos, downtimed in Icinga for 40 days (T265963) [21:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:58] T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 [21:19:34] (03CR) 10Dzahn: "noop on gerrit1001 - changed config on gerrit2001" [puppet] - 10https://gerrit.wikimedia.org/r/643944 (owner: 10Hashar) [21:19:37] hashar: it's been applied ^ [21:20:32] puppet changed the config and triggered systemd daemon-reload [21:20:38] but for this kind of change it needs an actual restart [21:20:45] to change the command line [21:20:52] ahh [21:20:55] that is what I was wondering [21:21:12] on the prod gerrit server nothing changed at all, fwiw [21:21:23] !log gerrit2001: restarting Gerrit to take in account a config change in the daemon ( --replica moved to daemonOpt config file) [21:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:49] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on 
cumin2001.codfw.wmnet for hosts: ` ms-be2061.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [21:22:23] (03PS2) 10Razzi: Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202) [21:24:46] mutante: looks good. Thank you! [21:25:35] (03CR) 10Razzi: [C: 03+1] Set oozie.service.coord.default.max.timeout to 13 months [puppet] - 10https://gerrit.wikimedia.org/r/644535 (https://phabricator.wikimedia.org/T264358) (owner: 10Ottomata) [21:27:56] hashar: cool, thanks for checking. then let's leave it at that for now for both gerrit and deployment servers [21:28:16] yeah [21:28:21] next we need to figure out how we want to properly scap --init 2 new servers, one eqiad and one codfw [21:28:50] that I am afraid I don't quite know :\ [21:28:57] i'll go take a break for now. puppet still running on deploy2002 because it pulls all the things.. but that takes a while and it's downtimed [21:29:09] and obviously not considered an active server [21:30:02] hashar: yea, if we can be really sure that "scap deploy --init" never deploys TO anything else then we can reduce it to how to allow that while keeping scap locked for "sync" actions [21:30:12] mutante: great. I am going to bed myself, but others in releng should be able to assist I guess [21:30:17] and surely have more knowledge than me [21:30:30] but in short scap deploy --init does the cloning and a few other actions to set up the git repos [21:30:35] (03PS1) 10Razzi: Add kafka-test1006 as start of test kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/644620 (https://phabricator.wikimedia.org/T268202) [21:30:39] we will ping releng from team to team [21:30:56] (03PS1) 10Andrew Bogott: Make clouddvirt1030 a ceph-enabled hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/644621 (https://phabricator.wikimedia.org/T261132) [21:31:16] ack, cu later hashar! 
[21:31:18] mutante: cool :]] [21:31:27] (03CR) 10Razzi: [C: 03+2] Configure zookeeper-test1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) (owner: 10Razzi) [21:31:27] can't wait for the new servers hehe [21:31:35] have a good lunch / break [21:31:42] originally it was only because the hardware is old [21:31:52] the part that OS is upgraded too was added on to it later [21:31:56] bye [21:32:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10Platform Team Workboards (Green): eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Clarakosi) [21:36:04] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2061.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2061.codfw.wmnet'] ` [21:37:40] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:46] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:24] (03PS3) 10Razzi: Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202) [21:53:58] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:59] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:40] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 
(10JMinor) I'm taking this off our active release board for now. We're discussi... [22:11:50] testing out the MW warmup script in codfw -- it'll produce some nonpaging alerts about appserver latency, those are safe to ignore [22:13:12] !log rzl@cumin2001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [22:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:54] !log rzl@cumin2001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [22:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:54] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:17:31] (03PS1) 10Florianschmidtwelzow: Move disabling sitenotice on wikimedia wikis to mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173) [22:18:30] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:21:31] done [22:26:56] (03PS7) 10Dzahn: thumbor: move thumbor mediawiki role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) [22:28:34] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26818/thumbor2004.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:31:07] (03PS2) 10Dzahn: conftool: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) [22:31:31] (03CR) 10Dzahn: "noop on thumbor1004/2004" [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:32:32] (03CR) 10jerkins-bot: [V: 04-1] conftool: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:36:34] (03PS3) 10Dzahn: conftool: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) [22:37:15] (03CR) 10Dzahn: "yep, module is already deleted now" [puppet] - 10https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:37:39] (03CR) 10CRusnov: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:37:49] (03CR) 10Dzahn: [C: 04-1] "do deploy2002 as well right away" [puppet] - 10https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [22:38:54] 10Puppet, 10DBA, 10SRE-tools, 10conftool, and 2 others: Alerting spam and wrong state of 
primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (10RLazarus) 05Open→03Resolved a:03RLazarus >>! In T261767#6450152, @Marostegui wrote: > @RLa... [22:38:59] (03PS2) 10Dzahn: k8s: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/644363 (https://phabricator.wikimedia.org/T209953) [22:39:57] (03PS2) 10Dzahn: ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117 [22:41:23] (03CR) 10jerkins-bot: [V: 04-1] ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn) [22:41:27] !log Start of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=arwiki; T246539) [22:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:34] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [22:43:03] ottomata: check out the "standard analytics fields" tab https://mep-index.wmflabs.org/ :D [22:54:47] (03CR) 10Ottomata: [C: 03+1] Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [22:57:19] 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work): Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10RKemper) Following deploy, the alert shows up in icinga: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=alert1001&service=Mediaw... 
[23:03:21] (03PS8) 10Razzi: zookeeper: configure test-eqiad single-node cluster [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) [23:07:16] (03CR) 10Razzi: [C: 03+2] Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [23:12:42] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [23:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:49] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [23:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:55] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [23:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:06] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [23:13:11] 10Operations, 10serviceops, 10Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (10RLazarus) [23:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:13] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [23:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:20] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [23:13:21] !log razzi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [23:13:21] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [23:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:30] !log T259588 Beginning wdqs categories data-reload on the following instances (one each from `[public, internal] x [eqiad, codfw]`): `wdqs1005`, `wdqs2002`, 
`wdqs1008`, `wdqs2005` [23:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:37] T259588: Reload categories once 1.36.0-wmf.3 is running on all groups - https://phabricator.wikimedia.org/T259588 [23:16:21] (03PS1) 10Papaul: DHCP: Add MAC address for db214[234] [puppet] - 10https://gerrit.wikimedia.org/r/644632 (https://phabricator.wikimedia.org/T267041) [23:21:22] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for db214[234] [puppet] - 10https://gerrit.wikimedia.org/r/644632 (https://phabricator.wikimedia.org/T267041) (owner: 10Papaul) [23:22:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10Andrew) I'm unable to pxe boot this host. It doesn't display much of anything, just hangs for a while and then fails over to hdd. The sa... [23:29:54] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2142.codfw.wmnet ` The log can be f... [23:29:59] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2142.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2142.codfw.wmnet'] ` [23:30:09] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2142.codfw.wmnet ` The log can be f... 
[23:32:44] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269181 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [23:32:49] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269181 (10ops-monitoring-bot) [23:44:06] (03PS9) 10Razzi: zookeeper: configure test-eqiad single-node cluster [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202) [23:49:09] (03PS4) 10Razzi: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) [23:49:15] (03CR) 10Jdlrobson: [C: 03+1] Move disabling sitenotice on wikimedia wikis to mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173) (owner: 10Florianschmidtwelzow) [23:50:12] (03CR) 10Jdlrobson: [C: 04-1] Move disabling sitenotice on wikimedia wikis to mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173) (owner: 10Florianschmidtwelzow)