[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T0000).
[00:00:04] <jouncebot>	 hmonroy: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:10] <icinga-wm>	 RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:45] <hmonroy>	 RoanKattouw: sorry can we stop that
[00:02:00] <hmonroy>	 I meant to schedule for tomorrow
[00:02:26] <Urbanecm>	 hmonroy: no problem, I'll remove it from the schedule :)
[00:02:46] <hmonroy>	 sorry about that :)
[00:02:58] <Urbanecm>	 can happen :)
[00:03:02] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0101 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:03:28] <icinga-wm>	 PROBLEM - snapshot of x1 in eqiad on alert1001 is CRITICAL: snapshot for x1 at eqiad taken more than 3 days ago: Most recent backup 2020-11-27 23:31:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:04:52] <Urbanecm>	 removed
[00:05:16] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:28] <hmonroy>	 urbanecm: thank you!
[00:05:36] <Urbanecm>	 np
[00:08:48] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp1083 is CRITICAL: cluster=cache_text instance=cp1083 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083
[00:08:52] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp3064 is CRITICAL: cluster=cache_text instance=cp3064 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064
[00:08:56] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp1075 is CRITICAL: cluster=cache_text instance=cp1075 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075
[00:09:32] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077
[00:09:44] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp1089 is CRITICAL: cluster=cache_text instance=cp1089 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089
[00:09:58] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087
[00:10:43] <wikibugs>	 (CR) CRusnov: "I'm not sure who to get to review this. Also, to make it pass tox necessitated catching actual exceptions, but I am not sure of the implic" [puppet] - https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:11:50] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp2027 is CRITICAL: cluster=cache_text instance=cp2027 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027
[00:13:08] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp1089 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089
[00:13:19] <wikibugs>	 (PS2) CRusnov: Port elasticsearch/es-tool.py to Python3 [puppet] - https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364)
[00:13:24] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087
[00:13:32] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp2027 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027
[00:13:56] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp1083 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083
[00:14:00] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp3064 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064
[00:14:28] <icinga-wm>	 PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[00:14:42] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp1077 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077
[00:15:46] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp1075 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075
[00:16:14] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[00:16:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:22] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[00:17:22] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[00:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:36] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[00:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:13] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[00:20:18] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[00:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:36] <ryankemper>	 !log T259588 Beginning wdqs categories data-reload on the following instances (one each from `[public, internal] x [eqiad, codfw]`): `wdqs1004`, `wdqs2001`, `wdqs1003`, `wdqs2004`
[00:22:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:42] <stashbot>	 T259588: Reload categories once 1.36.0-wmf.3 is running on all groups - https://phabricator.wikimedia.org/T259588
[00:29:38] <icinga-wm>	 PROBLEM - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[00:29:42] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1641139888 and 272 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:33:04] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2058.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[00:33:08] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2058.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2058.codfw.wmnet'] `
[00:33:28] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2058.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[00:34:08] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 512519248 and 413 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:30] <wikibugs>	 (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:46:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41759016 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:30] <wikibugs>	 (CR) CRusnov: "I'll also note that I've tested this and it appears to work as expected against en.wikipedia.org / en.m.wikipedia.org" [puppet] - https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:49:00] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2012128 and 144 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:38] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4120 and 183 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 46866784 and 613 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:38] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1268056 and 624 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 216864 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:42] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17279032 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 23515016 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:00] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:04] <icinga-wm>	 RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1302816 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:14] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1513368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:14] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:54] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on deploy1002 is CRITICAL: Host deploy1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[01:14:08] <wikibugs>	 (PS1) Bstorm: wikireplicas: make meta_p work on multiinstance and automatically [puppet] - https://gerrit.wikimedia.org/r/644375
[01:15:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] wikireplicas: make meta_p work on multiinstance and automatically [puppet] - https://gerrit.wikimedia.org/r/644375 (owner: Bstorm)
[01:16:41] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[01:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:38] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:05] <wikibugs>	 (Abandoned) CRusnov: Port drac.py to Python3 [puppet] - https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[01:34:56] <wikibugs>	 (PS1) Aklapper: Phabricator monthly email: [Hopefully] fix priority median calculation [puppet] - https://gerrit.wikimedia.org/r/644383
[01:42:00] <wikibugs>	 Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (AntiCompositeNumber) The OSM tile servers are designed to support osm.org only, and do not support all features. The Si...
[01:46:54] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 66 probes of 574 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:48:05] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0)
[01:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:38] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:02] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0)
[01:49:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:41] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (Papaul)
[01:51:55] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 49 probes of 574 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:55:01] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0)
[01:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:58:25] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0)
[01:58:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:35] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2058.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2058.codfw.wmnet'] `
[02:00:37] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (Papaul) @fgiunchedi I re-imaged ms-be2059 and ms-be2058, puppet is not happy ` WARNING: Puppet has 1 failures. Last run 42 seconds ago with 1 failures. Failed resources (up to...
[02:07:08] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644387
[02:17:49] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:49] <wikibugs>	 (PS1) Papaul: Add db214[234] and logstash203[345] to site.pp [puppet] - https://gerrit.wikimedia.org/r/644391 (https://phabricator.wikimedia.org/T267041)
[02:24:51] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[02:26:39] <wikibugs>	 (CR) Papaul: [C: +2] Add db214[234] and logstash203[345] to site.pp [puppet] - https://gerrit.wikimedia.org/r/644391 (https://phabricator.wikimedia.org/T267041) (owner: Papaul)
[02:26:57] <icinga-wm>	 RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[02:27:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644387 (owner: TrainBranchBot)
[02:31:10] <wikibugs>	 (CR) Ladsgroup: [C: +1] thumbor: move thumbor mediawiki role to profile [puppet] - https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[02:43:21] <wikibugs>	 (PS1) Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype in monitoring [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953)
[02:48:13] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:49:06] <wikibugs>	 (PS2) Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype in spark2 [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953)
[02:50:32] <wikibugs>	 (CR) Ladsgroup: "So far it looks good: https://puppet-compiler.wmflabs.org/compiler1001/26802/ just another host that's failing completely:" [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[03:03:31] <wikibugs>	 (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644387 (https://phabricator.wikimedia.org/T263186) (owner: TrainBranchBot)
[03:32:55] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[03:43:37] <wikibugs>	 (PS3) Gergő Tisza: Add EventStream config for link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/643230 (https://phabricator.wikimedia.org/T261407)
[04:00:59] <icinga-wm>	 RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:07] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:23] <legoktm>	 !log resetting elukey's jenkins API token (T268978)
[04:13:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:13:36] <wikibugs>	 (CR) C. Scott Ananian: "Parsoid was a little late for the train.  We'll need to cherry-pick I55e467133763345203c5f7083c999762e9203206 to the wmf.20 branch of medi" [core] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644387 (https://phabricator.wikimedia.org/T263186) (owner: TrainBranchBot)
[04:14:35] <wikibugs>	 (PS1) C. Scott Ananian: Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538)
[05:28:57] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:00:17] <icinga-wm>	 RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1017 and es1018 for reboot', diff saved to https://phabricator.wikimedia.org/P13487 and previous config saved to /var/cache/conftool/dbconfig/20201201-060313-marostegui.json
[06:03:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:39] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:05:16] <wikibugs>	 (PS1) Marostegui: es1017,es1018: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/644400 (https://phabricator.wikimedia.org/T264154)
[06:05:27] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:01] <wikibugs>	 (CR) Marostegui: [C: +2] es1017,es1018: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/644400 (https://phabricator.wikimedia.org/T264154) (owner: Marostegui)
[06:08:41] <wikibugs>	 (PS1) Marostegui: es1018: Remove it from dbctl [puppet] - https://gerrit.wikimedia.org/r/644401 (https://phabricator.wikimedia.org/T269069)
[06:12:03] <wikibugs>	 (CR) Marostegui: [C: +2] es1018: Remove it from dbctl [puppet] - https://gerrit.wikimedia.org/r/644401 (https://phabricator.wikimedia.org/T269069) (owner: Marostegui)
[06:13:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es1018 from dbctl T269069', diff saved to https://phabricator.wikimedia.org/P13488 and previous config saved to /var/cache/conftool/dbconfig/20201201-061321-marostegui.json
[06:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:30] <stashbot>	 T269069: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069
[06:15:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:19] <wikibugs>	 Operations, vm-requests, Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (elukey) @razzi can you copy/paste in here what failed for the dns netbox step? There might be some follow ups to do to avoid an inconsistent state..
[06:51:00] <wikibugs>	 (CR) Elukey: [C: -1] "Thanks a lot for the code change! It is a little bit more complicated than find/replace sadly, for the following reasons:" [puppet] - https://gerrit.wikimedia.org/r/644353 (https://phabricator.wikimedia.org/T268028) (owner: Razzi)
[06:51:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P13489 and previous config saved to /var/cache/conftool/dbconfig/20201201-065125-marostegui.json
[06:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:13] <wikibugs>	 (PS1) Marostegui: es1017: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/644408 (https://phabricator.wikimedia.org/T268825)
[06:53:47] <wikibugs>	 (CR) Marostegui: [C: +2] es1017: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/644408 (https://phabricator.wikimedia.org/T268825) (owner: Marostegui)
[06:54:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es1017 from dbctl T268825', diff saved to https://phabricator.wikimedia.org/P13490 and previous config saved to /var/cache/conftool/dbconfig/20201201-065419-marostegui.json
[06:54:23] <wikibugs>	 (CR) Elukey: "only one nit, the rest looks good!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[06:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:27] <stashbot>	 T268825: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825
[07:00:33] <icinga-wm>	 RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:05:23] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:43] <marostegui>	 !log Deploy labsdb role on all clouddb instances (except clouddb1020*) T268312
[07:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:51] <stashbot>	 T268312: Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312
[07:24:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13491 and previous config saved to /var/cache/conftool/dbconfig/20201201-072451-root.json
[07:24:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:53] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:31:01] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:31:11] <marostegui>	 !log Deploy "_p" databases to all clouddb hosts (except clouddb1020*) T268312
[07:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:20] <stashbot>	 T268312: Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312
[07:35:36] <elukey>	 the link between cr1 eqiad and codfw is down, probably maintenance
[07:37:03] <elukey>	 mmmm I don't see maintenance for the Telia link
[07:37:59] <elukey>	 the link is not down but BFD detected a problem
[07:39:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13492 and previous config saved to /var/cache/conftool/dbconfig/20201201-073955-root.json
[07:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13493 and previous config saved to /var/cache/conftool/dbconfig/20201201-075458-root.json
[07:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:03] <icinga-wm>	 RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:53] <icinga-wm>	 RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:41] <wikibugs>	 Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) > So, please let us know if you're OK with reducing to 60 or you'd rather keep the 90. OK!  > Would you guys be wil...
[08:04:13] <wikibugs>	 (PS1) Marostegui: valid_section: Add x2 [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584)
[08:05:09] <icinga-wm>	 PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:20] <marostegui>	 !log Create database mwaddlink on m2 - T267214
[08:05:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:27] <stashbot>	 T267214: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214
[08:05:59] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:09] <wikibugs>	 (PS3) Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype in spark2 [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953)
[08:07:53] <wikibugs>	 (CR) Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype in spark2 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:10:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13494 and previous config saved to /var/cache/conftool/dbconfig/20201201-081002-root.json
[08:10:03] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269075 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[08:10:07] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269075 (ops-monitoring-bot)
[08:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:15] <volans>	 !log upgrading spicerack to 0.0.45 on cumin2001
[08:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:26] <wikibugs>	 (CR) Elukey: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:11:04] <wikibugs>	 (CR) Elukey: [C: +2] hadoop: Migrate hiera() to lookup() and setting datatype in spark2 [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:11:07] <logmsgbot>	 !log volans@cumin2001 START - Cookbook sre.dns.netbox
[08:11:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:19] <wikibugs>	 (CR) Ladsgroup: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/644392 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:13:40] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:13:57] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:18:30] <logmsgbot>	 !log volans@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:37] <wikibugs>	 (PS2) Muehlenhoff: Add Andrew Otto as approval contact for Hadoop and analytics groups [puppet] - https://gerrit.wikimedia.org/r/643875
[08:20:55] <icinga-wm>	 RECOVERY - snapshot of x1 in eqiad on alert1001 is OK: Last snapshot for x1 at eqiad (db1102.eqiad.wmnet:3320) taken on 2020-12-01 07:55:30 (207 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[08:21:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Agreed, this seems obsolete." [puppet] - https://gerrit.wikimedia.org/r/644364 (owner: Dzahn)
[08:21:52] <wikibugs>	 (PS1) Marostegui: production-m2: Add grants for mwaddlink new database [puppet] - https://gerrit.wikimedia.org/r/644456 (https://phabricator.wikimedia.org/T267214)
[08:26:14] <wikibugs>	 (CR) Muehlenhoff: [C: -1] "Better depend on default-mysql-client, this will do the right thing also on Stretch, i.e. you don't even need the OS conditional" [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[08:27:08] <wikibugs>	 (PS2) Marostegui: production-m2: Add grants for mwaddlink new database [puppet] - https://gerrit.wikimedia.org/r/644456 (https://phabricator.wikimedia.org/T267214)
[08:28:26] <wikibugs>	 Operations, vm-requests, Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (Volans) @razzi in general on FAIL always better to investigate what happens. In this case it left some changes in Netbox not propagated to the DNS (see https://icinga....
[08:31:36] <wikibugs>	 (PS3) Marostegui: production-m2: Add grants for mwaddlink new database [puppet] - https://gerrit.wikimedia.org/r/644456 (https://phabricator.wikimedia.org/T267214)
[08:34:43] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:39:15] <icinga-wm>	 PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:33] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:44:03] <icinga-wm>	 PROBLEM - Host ms-be2059 is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:52] <wikibugs>	 (PS1) Elukey: oozie: improve log retention and cleanup [puppet] - https://gerrit.wikimedia.org/r/644457
[08:45:37] <icinga-wm>	 RECOVERY - Host ms-be2059 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms
[08:46:30] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (fgiunchedi) >>! In T265419#6657950, @Papaul wrote: > @fgiunchedi I re-imaged ms-be2059 and ms-be2058, puppet is not happy > ` > WARNING: Puppet has 1 failures. Last run 42 seco...
[08:49:27] <wikibugs>	 (PS1) Volans: Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783)
[08:49:50] <wikibugs>	 (PS2) Volans: Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783)
[08:50:48] <wikibugs>	 (CR) Volans: "Hi all, I've added you to the review because at least one of your cookbooks is affected by this change." [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[08:51:19] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission es1018 [puppet] - https://gerrit.wikimedia.org/r/644459 (https://phabricator.wikimedia.org/T269069)
[08:52:06] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26803/console" [puppet] - https://gerrit.wikimedia.org/r/644457 (owner: Elukey)
[08:52:55] <wikibugs>	 (CR) Elukey: [C: +1] Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[08:53:50] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] oozie: improve log retention and cleanup [puppet] - https://gerrit.wikimedia.org/r/644457 (owner: Elukey)
[08:58:17] <wikibugs>	 (PS2) Alexandros Kosiaris: bacula: Move logs to /var/log/bacula [puppet] - https://gerrit.wikimedia.org/r/546972 (https://phabricator.wikimedia.org/T236406)
[08:58:21] <wikibugs>	 (CR) ArielGlenn: "Added Brooke as reviewer for whenever this is ready, since the affected servers are WMCS ones" [puppet] - https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220) (owner: Elukey)
[08:58:50] <wikibugs>	 (CR) Alexandros Kosiaris: "Resurrecting an old change. Let me know if you think it's useful, otherwise feel free to abandon it" [puppet] - https://gerrit.wikimedia.org/r/546972 (https://phabricator.wikimedia.org/T236406) (owner: Alexandros Kosiaris)
[08:59:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P13495 and previous config saved to /var/cache/conftool/dbconfig/20201201-085916-marostegui.json
[08:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:01] <icinga-wm>	 RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:04] <wikibugs>	 (CR) Kormat: "It would be good to add this to `profile::mariadb::section_ports` in `hieradata/common/profile/mariadb.yaml`" [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui)
[09:00:49] <icinga-wm>	 RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:43] <wikibugs>	 (CR) Marostegui: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui)
[09:03:02] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula: Move logs to /var/log/bacula [puppet] - https://gerrit.wikimedia.org/r/546972 (https://phabricator.wikimedia.org/T236406) (owner: Alexandros Kosiaris)
[09:04:57] <icinga-wm>	 PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:40] <moritzm>	 !log removing obsolete resources on idp* and idp-test* hosts after going active-active
[09:05:45] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:20] <wikibugs>	 (CR) Ayounsi: [C: +1] "I TRUST YOU!" [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[09:10:36] <wikibugs>	 (PS1) JMeybohm: Add calico chart [deployment-charts] - https://gerrit.wikimedia.org/r/644462
[09:10:41] <icinga-wm>	 RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:02] <wikibugs>	 (PS3) Noa wmde: Add log channel Wikibase.IdGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: Lucas Werkmeister (WMDE))
[09:16:29] <icinga-wm>	 RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:43] <wikibugs>	 (CR) Kormat: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui)
[09:17:47] <wikibugs>	 (CR) Volans: [C: +2] Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[09:18:55] <wikibugs>	 (PS2) Marostegui: valid_section: Add x2 [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584)
[09:19:16] <wikibugs>	 (Merged) jenkins-bot: Avoid double output when running Cumin commands [cookbooks] - https://gerrit.wikimedia.org/r/644458 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[09:19:48] <wikibugs>	 (CR) Hashar: [C: +2] Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644387 (https://phabricator.wikimedia.org/T263186) (owner: TrainBranchBot)
[09:20:24] <wikibugs>	 (CR) Kormat: [C: +1] "LGTM, WCGW." [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui)
[09:20:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13496 and previous config saved to /var/cache/conftool/dbconfig/20201201-092030-root.json
[09:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:44] <wikibugs>	 (CR) Marostegui: [C: +2] valid_section: Add x2 [puppet] - https://gerrit.wikimedia.org/r/644453 (https://phabricator.wikimedia.org/T264584) (owner: Marostegui)
[09:21:31] <logmsgbot>	 !log volans@cumin2001 START - Cookbook sre.hosts.decommission
[09:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:45] <wikibugs>	 (CR) Hashar: "I guess it is a matter of taste?  It seems easier to me to handle those settings on a per host basis :]" [puppet] - https://gerrit.wikimedia.org/r/643918 (owner: Hashar)
[09:22:35] <wikibugs>	 (Abandoned) Hashar: Gerrit: Setup avatars url in gerrit config [puppet] - https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) (owner: Paladox)
[09:24:50] <wikibugs>	 (PS2) ArielGlenn: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475
[09:25:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn)
[09:25:35] <wikibugs>	 (CR) Hashar: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/643944 (owner: Hashar)
[09:26:03] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/643944 (owner: Hashar)
[09:26:12] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/644459 (https://phabricator.wikimedia.org/T269069) (owner: Marostegui)
[09:27:21] <wikibugs>	 (PS3) ArielGlenn: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475
[09:27:25] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission es1018 [puppet] - https://gerrit.wikimedia.org/r/644459 (https://phabricator.wikimedia.org/T269069) (owner: Marostegui)
[09:27:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn)
[09:28:55] <wikibugs>	 (PS1) Ema: vcl: remove legacy temporary parameter workaround [puppet] - https://gerrit.wikimedia.org/r/644465
[09:28:57] <wikibugs>	 (PS1) Ema: vcl: move /static Host normalization to cluster_fe_recv_pre_purge [puppet] - https://gerrit.wikimedia.org/r/644466 (https://phabricator.wikimedia.org/T130904)
[09:29:36] <wikibugs>	 (PS1) Elukey: sre.hosts.decom: fix kerberos keytabs path [cookbooks] - https://gerrit.wikimedia.org/r/644467
[09:31:20] <wikibugs>	 (CR) Ayounsi: turnilo: add export mappings for network devices via query_resources (3 comments) [puppet] - https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: Jbond)
[09:32:37] <logmsgbot>	 !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:32:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:56] <wikibugs>	 (PS4) ArielGlenn: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475
[09:32:58] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/644467 (owner: Elukey)
[09:33:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn)
[09:34:08] <wikibugs>	 (CR) Elukey: [C: +2] sre.hosts.decom: fix kerberos keytabs path [cookbooks] - https://gerrit.wikimedia.org/r/644467 (owner: Elukey)
[09:35:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add Andrew Otto as approval contact for Hadoop and analytics groups [puppet] - https://gerrit.wikimedia.org/r/643875 (owner: Muehlenhoff)
[09:35:29] <volans>	 !log upgrading spicerack to 0.0.45 on cumin1001
[09:35:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13497 and previous config saved to /var/cache/conftool/dbconfig/20201201-093534-root.json
[09:35:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:16] <wikibugs>	 (PS5) Alexandros Kosiaris: k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - https://gerrit.wikimedia.org/r/644234
[09:36:18] <wikibugs>	 (PS5) Alexandros Kosiaris: k8s::master: Remove redundant has_lvs hiera check [puppet] - https://gerrit.wikimedia.org/r/644237
[09:36:20] <wikibugs>	 (PS5) Alexandros Kosiaris: k8s: Allow using cergen [puppet] - https://gerrit.wikimedia.org/r/644238
[09:36:22] <wikibugs>	 (PS6) Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235
[09:36:24] <wikibugs>	 (PS6) Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - https://gerrit.wikimedia.org/r/644262
[09:36:26] <wikibugs>	 (PS1) Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - https://gerrit.wikimedia.org/r/644469
[09:37:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 (owner: Alexandros Kosiaris)
[09:40:50] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui) a:Marostegui→wiki_willy
[09:41:25] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui) Ready for DC Ops!
[09:43:44] <wikibugs>	 (PS5) ArielGlenn: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475
[09:44:33] <wikibugs>	 (CR) Hashar: "https://puppet-compiler.wmflabs.org/compiler1002/642/" [puppet] - https://gerrit.wikimedia.org/r/643944 (owner: Hashar)
[09:46:45] <wikibugs>	 (CR) ArielGlenn: [C: +2] Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn)
[09:47:12] <wikibugs>	 (Merged) jenkins-bot: Add sample code illustrating use of the commandmanagement module classes [dumps] - https://gerrit.wikimedia.org/r/627475 (owner: ArielGlenn)
[09:49:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[09:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13498 and previous config saved to /var/cache/conftool/dbconfig/20201201-095037-root.json
[09:50:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:06] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.20 [core] (wmf/1.36.0-wmf.20) - https://gerrit.wikimedia.org/r/644387 (https://phabricator.wikimedia.org/T263186) (owner: TrainBranchBot)
[09:52:19] <wikibugs>	 (CR) Jbond: "updated" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332) (owner: Jbond)
[09:55:40] <wikibugs>	 (PS2) Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - https://gerrit.wikimedia.org/r/644469
[09:57:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add {calico, kubernetes}-future components to buster [puppet] - https://gerrit.wikimedia.org/r/644469 (owner: Alexandros Kosiaris)
[09:59:51] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[09:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:49] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[10:02:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[10:02:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:13] <wikibugs>	 (PS2) Aklapper: Phabricator monthly email: [Hopefully] fix priority median calculation [puppet] - https://gerrit.wikimedia.org/r/644383 (https://phabricator.wikimedia.org/T269076)
[10:05:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13499 and previous config saved to /var/cache/conftool/dbconfig/20201201-100541-root.json
[10:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[10:08:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[10:08:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:15] <wikibugs>	 (PS1) Volans: sre.hosts.decommission: fix kerberos check [cookbooks] - https://gerrit.wikimedia.org/r/644471
[10:12:47] <wikibugs>	 (CR) Elukey: [C: +1] sre.hosts.decommission: fix kerberos check [cookbooks] - https://gerrit.wikimedia.org/r/644471 (owner: Volans)
[10:13:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P13500 and previous config saved to /var/cache/conftool/dbconfig/20201201-101346-marostegui.json
[10:13:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:19] <wikibugs>	 (PS2) Ema: vcl: move /static Host normalization to cluster_fe_recv_pre_purge [puppet] - https://gerrit.wikimedia.org/r/644466 (https://phabricator.wikimedia.org/T130904)
[10:14:23] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.decommission: fix kerberos check [cookbooks] - https://gerrit.wikimedia.org/r/644471 (owner: Volans)
[10:15:02] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: fix cluster config out of sync alert [puppet] - https://gerrit.wikimedia.org/r/644184 (https://phabricator.wikimedia.org/T266017) (owner: Filippo Giunchedi)
[10:15:36] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.decommission: fix kerberos check [cookbooks] - https://gerrit.wikimedia.org/r/644471 (owner: Volans)
[10:16:57] <wikibugs>	 (CR) Jbond: Port elasticsearch/es-tool.py to Python3 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[10:17:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1078 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13501 and previous config saved to /var/cache/conftool/dbconfig/20201201-101703-root.json
[10:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:22] <wikibugs>	 (PS1) Volans: sre.hosts.reboot-cluster: remove executable bit [cookbooks] - https://gerrit.wikimedia.org/r/644473
[10:21:46] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.reboot-cluster: remove executable bit [cookbooks] - https://gerrit.wikimedia.org/r/644473 (owner: Volans)
[10:22:56] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.reboot-cluster: remove executable bit [cookbooks] - https://gerrit.wikimedia.org/r/644473 (owner: Volans)
[10:23:36] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[10:24:37] <wikibugs>	 (PS1) JMeybohm: Run a helm repo update before linting [deployment-charts] - https://gerrit.wikimedia.org/r/644474
[10:24:54] <wikibugs>	 Operations, Performance-Team, serviceops, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (jijiki)
[10:25:41] <wikibugs>	 (PS2) JMeybohm: Run a helm repo update before linting [deployment-charts] - https://gerrit.wikimedia.org/r/644474
[10:26:28] <wikibugs>	 (Abandoned) Ema: [untested] Rewrite /static/ also for PURGE requests [puppet] - https://gerrit.wikimedia.org/r/279564 (https://phabricator.wikimedia.org/T130904) (owner: GWicke)
[10:27:13] <wikibugs>	 (CR) JMeybohm: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[10:30:28] <logmsgbot>	 !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
[10:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:36] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:31:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[10:31:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1078 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13503 and previous config saved to /var/cache/conftool/dbconfig/20201201-103207-root.json
[10:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:38] <wikibugs>	 Operations, observability, CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (jbond) I have not been able to recreate this, is this still causing an issue?
[10:35:43] <wikibugs>	 (PS1) ArielGlenn: add the ability to skip a job via configuration [dumps] - https://gerrit.wikimedia.org/r/644476
[10:35:44] <wikibugs>	 (CR) Filippo Giunchedi: "I like the general idea, I'm wondering how to make it more explicit that each new cluster will also require setting up a new prometheus in" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644262 (owner: Alexandros Kosiaris)
[10:35:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] add the ability to skip a job via configuration [dumps] - https://gerrit.wikimedia.org/r/644476 (owner: ArielGlenn)
[10:36:58] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[10:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:42] <wikibugs>	 Operations, observability, CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (Kormat) I'm still getting failures, but it's not clear where the issue is.  Firefox: {F33929783}  Chrome: {F33929785}
[10:38:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[10:38:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:04] <wikibugs>	 (PS1) Volans: sre.hosts.reboot_group: move functionalities [cookbooks] - https://gerrit.wikimedia.org/r/644477
[10:44:35] <wikibugs>	 (PS2) ArielGlenn: add the ability to skip a job via configuration [dumps] - https://gerrit.wikimedia.org/r/644476
[10:45:37] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[10:45:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:57] <wikibugs>	 (PS6) Kormat: test: Start implementation of integration-env [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/644231 (https://phabricator.wikimedia.org/T265266)
[10:46:27] <wikibugs>	 (PS1) Elukey: install_server: remove lefovers of analytics105[1-7] [puppet] - https://gerrit.wikimedia.org/r/644478 (https://phabricator.wikimedia.org/T267932)
[10:47:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1078 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13505 and previous config saved to /var/cache/conftool/dbconfig/20201201-104710-root.json
[10:47:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:29] <wikibugs>	 (CR) Elukey: [C: +2] install_server: remove lefovers of analytics105[1-7] [puppet] - https://gerrit.wikimedia.org/r/644478 (https://phabricator.wikimedia.org/T267932) (owner: Elukey)
[11:02:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1078 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13506 and previous config saved to /var/cache/conftool/dbconfig/20201201-110214-root.json
[11:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:27] <wikibugs>	 (PS1) MSantos: WIP: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949)
[11:13:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[11:15:00] <icinga-wm>	 PROBLEM - tileratorui on maps1006 is CRITICAL: connect to address 10.64.0.18 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:15:04] <icinga-wm>	 PROBLEM - tileratorui on maps1007 is CRITICAL: connect to address 10.64.16.6 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:15:08] <icinga-wm>	 PROBLEM - tilerator on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[11:15:16] <hnowlan>	 expiring downtimes ^
[11:15:20] <icinga-wm>	 PROBLEM - tileratorui on maps1008 is CRITICAL: connect to address 10.64.16.27 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:15:28] <icinga-wm>	 PROBLEM - cassandra CQL 10.64.48.6:9042 on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[11:15:36] <icinga-wm>	 PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:52] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:58] <icinga-wm>	 PROBLEM - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:16:08] <icinga-wm>	 PROBLEM - cassandra CQL 10.64.32.8:9042 on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[11:16:10] <icinga-wm>	 PROBLEM - tileratorui on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:16:38] <icinga-wm>	 PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:16:42] <icinga-wm>	 PROBLEM - cassandra service on maps1010 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:16:44] <icinga-wm>	 PROBLEM - tileratorui on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:19:12] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Thanks, looks good" [cookbooks] - https://gerrit.wikimedia.org/r/644477 (owner: Volans)
[11:19:58] <icinga-wm>	 RECOVERY - tileratorui on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:20:02] <icinga-wm>	 RECOVERY - tileratorui on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:20:53] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:20:53] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra CQL 10.64.32.8:9042 on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 9042: Connection refused Hnowlan new hosts, not pooled. https://phabricator.wikimedia.org/T93886
[11:20:53] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:20:53] <icinga-wm>	 ACKNOWLEDGEMENT - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:20:53] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:20:54] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra CQL 10.64.48.6:9042 on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 9042: Connection refused Hnowlan new hosts, not pooled. https://phabricator.wikimedia.org/T93886
[11:20:54] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra service on maps1010 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:20:55] <icinga-wm>	 ACKNOWLEDGEMENT - tilerator on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6534: Connection refused Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[11:20:55] <icinga-wm>	 ACKNOWLEDGEMENT - tileratorui on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6535: Connection refused Hnowlan new hosts, not pooled. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:26:14] <wikibugs>	 (03PS3) 10ArielGlenn: add the ability to skip a job via configuration [dumps] - 10https://gerrit.wikimedia.org/r/644476
[11:33:12] <icinga-wm>	 RECOVERY - cassandra service on maps1009 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:38:10] <icinga-wm>	 PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:42:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: dmz_cidr: detail the list of private production addresses [puppet] - 10https://gerrit.wikimedia.org/r/641977 (owner: 10Arturo Borrero Gonzalez)
[11:46:56] <wikibugs>	 10Operations, 10observability, 10CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10jbond) Do you get this error on all expressions, a specific expression or spasmodically?  have also tagged observability in case there is something other than CORS in...
[11:48:53] <marostegui>	 !log Install bsd-mailx on the new clouddb hosts (needed for the check private data) T267090 T268725
[11:49:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:01] <stashbot>	 T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090
[11:49:01] <stashbot>	 T268725: Include mail on standard_packages.pp - https://phabricator.wikimedia.org/T268725
[11:49:19] <wikibugs>	 10Operations, 10observability, 10CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10jbond) in fact observability is already tagged, @fgiunchedi wonder if this could be a more general issue?
[11:56:22] <arturo>	 stashbot should be reconnecting soon
[11:57:00] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reboot_group: move functionalities [cookbooks] - 10https://gerrit.wikimedia.org/r/644477 (owner: 10Volans)
[11:58:40] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reboot_group: move functionalities [cookbooks] - 10https://gerrit.wikimedia.org/r/644477 (owner: 10Volans)
[12:00:07] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1200).
[12:00:07] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[12:00:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend access for aarora [puppet] - 10https://gerrit.wikimedia.org/r/644507
[12:00:26] <Lucas_WMDE>	 looks like nothing to deploy indeed
[12:01:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend access for aarora [puppet] - 10https://gerrit.wikimedia.org/r/644507 (owner: 10Muehlenhoff)
[12:03:00] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo)
[12:03:08] <icinga-wm>	 RECOVERY - cassandra service on maps1009 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:04:04] <wikibugs>	 (03PS4) 10Volans: sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212)
[12:04:31] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) ^@fgiunchedi this is the task I told you about (pinging on comment because sometimes notifications cannot be seen on creation).
[12:05:29] <arturo>	 !log [11:53 moritzm] uploaded lxml 3.4.0-1+deb8u1+wmf1 to apt.wikimedia.org/jessie-wikimedia
[12:05:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:06] <icinga-wm>	 PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:12:31] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - 10https://gerrit.wikimedia.org/r/644234
[12:12:33] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: k8s::master: Remove redundant has_lvs hiera check [puppet] - 10https://gerrit.wikimedia.org/r/644237
[12:12:35] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: k8s: Allow using cergen [puppet] - 10https://gerrit.wikimedia.org/r/644238
[12:12:37] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235
[12:12:39] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262
[12:12:41] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469
[12:13:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235 (owner: 10Alexandros Kosiaris)
[12:15:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris)
[12:24:17] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 672400568 and 123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:24:17] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 291221480 and 123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:24:27] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469
[12:24:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: package_from_component: Move to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/644509
[12:24:51] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 200908968 and 156 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:24:51] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 999694520 and 156 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2056472 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1803936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris)
[12:26:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 59602192 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:27:57] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1921160 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:28:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1934488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:30:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 224382384 and 180 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:09] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 33456544 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29449272 and 225 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 76425320 and 234 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 896010928 and 236 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:33:02] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[12:33:02] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:33:03] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26807/console" [puppet] - 10https://gerrit.wikimedia.org/r/644469 (owner: 10Alexandros Kosiaris)
[12:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:15] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[12:33:16] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:33:20] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[12:33:20] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[12:33:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20784352 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:33:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20433400 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:33:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:49] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35508808 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:33:49] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:57] <icinga-wm>	 RECOVERY - cassandra service on maps1009 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:34:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 65480 and 55 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:34:43] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 49200 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:34:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1917224 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:35:45] <wikibugs>	 (03PS3) 10Hnowlan: postgres: increase number of WAL files retained by master [puppet] - 10https://gerrit.wikimedia.org/r/643717
[12:38:21] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be1022 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269125 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:38:26] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269125 (10ops-monitoring-bot)
[12:38:28] <moritzm>	 !log uploaded libonig 5.9.5-3.2+deb8u4+wmf1 to apt.wikimedia.org/jessie-wikimedia
[12:38:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:41] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 552544 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:39:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 574096 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:40:22] <wikibugs>	 (03CR) 10Hnowlan: postgres: increase number of WAL files retained by master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/643717 (owner: 10Hnowlan)
[12:41:27] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5000 and 51 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:41:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 106896 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:46:32] <wikibugs>	 (03PS8) 10Alexandros Kosiaris: k8s codfw staging: Assign role to node [puppet] - 10https://gerrit.wikimedia.org/r/644235
[12:46:34] <wikibugs>	 (03PS8) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262
[12:46:36] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: package_from_component: Move to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/644509
[12:46:38] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Add {calico, kubernetes}-future components to buster [puppet] - 10https://gerrit.wikimedia.org/r/644469
[12:55:23] <wikibugs>	 10Operations, 10observability, 10CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10Kormat) >>! In T268233#6659067, @jbond wrote: > Do you get this error on all expressions, a specific expression or spasmodically?  have also tagged observability in ca...
[12:56:00] <wikibugs>	 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Ladsgroup) My 2c. From what I learned from docker books and such, containers and k8s are not recommended for two types of applications: 1- statef...
[12:56:11] <hashar>	 jouncebot: now
[12:56:11] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1200)
[12:57:56] <hashar>	 !log Preparing deployment of 1.36.0-wmf.20 # T263186
[12:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:03] <stashbot>	 T263186: 1.36.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T263186
[12:58:26] <wikibugs>	 (03PS1) 10Marostegui: tables_to_check: Add pagelinks,templatelinks and categorylinks [software] - 10https://gerrit.wikimedia.org/r/644515
[12:59:38] <wikibugs>	 (03PS2) 10Marostegui: tables_to_check: Add pagelinks,templatelinks and categorylinks [software] - 10https://gerrit.wikimedia.org/r/644515
[13:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1300)
[13:01:36] <wikibugs>	 (03CR) 10Jgiannelos: "Hey @MSantos, i did a first pass and added some (mostly) nit comments." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[13:03:13] <wikibugs>	 (03CR) 10Jgiannelos: [C: 04-1] WIP: start using imposm as OSM sync tool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[13:08:56] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I am afraid they will be huge on some wikis." [software] - 10https://gerrit.wikimedia.org/r/644515 (owner: 10Marostegui)
[13:08:58] <wikibugs>	 (03PS1) 10Hashar: testwikis wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644516
[13:09:00] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644516 (owner: 10Hashar)
[13:09:26] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] tables_to_check: Add pagelinks,templatelinks and categorylinks [software] - 10https://gerrit.wikimedia.org/r/644515 (owner: 10Marostegui)
[13:09:37] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644516 (owner: 10Hashar)
[13:10:08] <logmsgbot>	 !log hashar@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.20
[13:10:11] <wikibugs>	 (03Merged) 10jenkins-bot: tables_to_check: Add pagelinks,templatelinks and categorylinks [software] - 10https://gerrit.wikimedia.org/r/644515 (owner: 10Marostegui)
[13:10:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:31] <wikibugs>	 (03PS1) 10Marostegui: Revert "tables_to_check: Add pagelinks,templatelinks and categorylinks" [software] - 10https://gerrit.wikimedia.org/r/644489
[13:13:15] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "tables_to_check: Add pagelinks,templatelinks and categorylinks" [software] - 10https://gerrit.wikimedia.org/r/644489 (owner: 10Marostegui)
[13:13:49] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "tables_to_check: Add pagelinks,templatelinks and categorylinks" [software] - 10https://gerrit.wikimedia.org/r/644489 (owner: 10Marostegui)
[13:19:33] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[13:19:51] <elukey>	 mmmmm
[13:20:54] * elukey runs sudo cookbook -d sre.dns.netbox "test"
[13:26:33] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: use o11y address as from [puppet] - 10https://gerrit.wikimedia.org/r/644517 (https://phabricator.wikimedia.org/T268995)
[13:27:14] <elukey>	 so it is a lot of cloudvirt instances
[13:28:44] <elukey>	 ah ok Chris is working on them (TIL Netbox changelog)
[13:30:28] <wikibugs>	 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Gilles) Thumbor is neither stateful nor high I/O.
[13:33:24] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Patch-For-Review: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi)
[13:33:35] <wikibugs>	 10Operations, 10Maps, 10Wikimedia-Logstash, 10observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10fgiunchedi) 05Open→03Resolved I can indeed confirm all gelf traffic from maps has stopped, thank you @MSantos and @hnowla...
[13:37:20] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T268773 (10fgiunchedi) 05Open→03Resolved Thanks @papaul, disk is back now. re: spares, we should get some from decom of ms-be hosts (if that's a thing)
[13:39:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 to clone clouddb hosts', diff saved to https://phabricator.wikimedia.org/P13507 and previous config saved to /var/cache/conftool/dbconfig/20201201-133917-marostegui.json
[13:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:22] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) (owner: 10C. Scott Ananian)
[13:40:57] <wikibugs>	 (03PS1) 10Marostegui: db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/644519 (https://phabricator.wikimedia.org/T267090)
[13:41:38] <wikibugs>	 (03PS2) 10Bearloga: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan)
[13:41:42] <wikibugs>	 (03PS3) 10Bearloga: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan)
[13:42:54] <wikibugs>	 (03CR) 10Bearloga: [C: 03+1] "Updated stream name per I809afc34c878ed6bdcbf0d3f5cc6a4c9990ef845" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan)
[13:43:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/644519 (https://phabricator.wikimedia.org/T267090) (owner: 10Marostegui)
[13:44:41] <subbu>	 hey hashar, we were late getting the vendor patch merged before the branch was cut. i just +2ed the cherry-pick ( https://gerrit.wikimedia.org/r/644221 ). can that be synced as well once that merges?
[13:47:11] <jclark-ctr>	 https://phabricator.wikimedia.org/T268804  https://netbox.wikimedia.org/dcim/cables/3395/ Cleaning fiber in C4 & C7 in eqiad 
[13:48:51] <subbu>	 but i am going to -2 it to block the merge in case you aren't ready to scap it.
[13:48:55] <hashar>	 subbu: oh parsoid is shipped as a composer dependency
[13:49:01] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 04-2] Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) (owner: 10C. Scott Ananian)
[13:49:05] <hashar>	 I thought it was deployed as an extension .. bah ;)
[13:49:06] <wikibugs>	 10Operations, 10Discovery-Search: Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453 (10mpopov) 05Open→03Invalid
[13:49:33] <hashar>	 subbu: I am running the deploy promote step right now, so in half an hour or so mediawiki will have been deployed
[13:49:34] <subbu>	 hashar, yes. we changed that in feb. :)
[13:49:57] <hashar>	 I guess once the vendor change has been merged, it is just a matter of deploying it the usual way
[13:50:11] <hashar>	 I can handle it ;)
[13:50:14] <subbu>	 ok, should i +2 that patch ? ok.
[13:50:23] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) (owner: 10C. Scott Ananian)
[13:50:50] <subbu>	 that ^ is the cherry-pick to wmf.20 from master.
[13:53:03] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[13:53:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:15] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005703 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:55:52] <hashar>	 subbu: great thx sync-apaches:  30% (ok: 106; fail: 0; left: 239)
[13:55:59] <hashar>	 still some more apaches to complete 
[13:56:20] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269075 (10fgiunchedi)
[13:56:21] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi)
[13:56:36] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268997 (10fgiunchedi)
[13:56:38] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi)
[13:57:28] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269125 (10fgiunchedi)
[13:57:32] <wikibugs>	 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi)
[14:00:04] <jouncebot>	 hashar and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1400).
[14:00:28] <XioNoX>	 !log asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 0 member 2 port 53 - T268808
[14:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:35] <stashbot>	 T268808: Replace asw2-d-eqiad VC cable - https://phabricator.wikimedia.org/T268808
[14:04:00] <wikibugs>	 (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a18 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644221 (https://phabricator.wikimedia.org/T51538) (owner: 10C. Scott Ananian)
[14:05:07] <subbu>	 hashar, merged ^ .. do verify that the submodule in core is updated to reflect that before syncing .. i believe it should happen automatically, but just in case.
[14:05:21] <XioNoX>	 !log asw2-d-eqiad> request virtual-chassis vc-port set pic-slot 0 member 2 port 53 - T268808
[14:05:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:49] <wikibugs>	 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10Reedy) 05Open→03Stalled Marking stalled as it's unclear what needs fixing (if anything)
[14:08:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:52] <hashar>	 subbu: yeah I will once the current deployment has completed
[14:10:05] <subbu>	 k
[14:12:12] <wikibugs>	 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10MoritzMuehlenhoff) 05Stalled→03Open Well, at minimum the shebang needs to be switched to #!/usr/bin/python3. If that's all that's needed, even better.
[14:14:21] <icinga-wm>	 RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms
[14:14:23] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:53] <wikibugs>	 10Operations, 10ops-eqiad: Replace asw2-c-eqiad VC cable - https://phabricator.wikimedia.org/T268804 (10Jclark-ctr) Cleaned both fiber ends will leave ticket open for now while monitoring.
[14:15:20] <wikibugs>	 10Operations, 10ops-eqiad: Replace asw2-d-eqiad VC cable - https://phabricator.wikimedia.org/T268808 (10Jclark-ctr) replaced qsfp+ on D2,  Cleaned both fiber ends will leave ticket open while monitoring
[14:15:53] <wikibugs>	 10Operations, 10ops-eqiad: Replace asw2-d-eqiad VC cable - https://phabricator.wikimedia.org/T268808 (10Jclark-ctr) updated netbox with cable info.
[14:15:57] <wikibugs>	 10Operations, 10ops-eqiad: Replace asw2-c-eqiad VC cable - https://phabricator.wikimedia.org/T268804 (10Jclark-ctr) updated netbox with cable info.
[14:16:32] <wikibugs>	 (03PS4) 10ArielGlenn: add the ability to skip a job via configuration [dumps] - 10https://gerrit.wikimedia.org/r/644476
[14:16:34] <wikibugs>	 (03PS2) 10ArielGlenn: add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930
[14:17:40] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] k8s: Remove profile::kubernetes::master::storage_backend fully [puppet] - 10https://gerrit.wikimedia.org/r/644234 (owner: 10Alexandros Kosiaris)
[14:20:51] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1035 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:20:51] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2247 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:20:57] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1327 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:20:59] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2331 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:20:59] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2325 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:01] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2219 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:07] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2266 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:07] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2283 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:11] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1354 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:19] <hashar>	 ^^^ no idea
[14:21:19] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1322 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:21] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1290 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:21] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (Papaul) @fgiunchedi thanks . ms-be2058 has memory error the same DIMM we were having problem with on msbe2057 was used in ms-be2058 so I will go ahead and ask for replacement ....
[14:21:25] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1338 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:25] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2273 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:31] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:31] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on parse2013 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:31] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:34] <hashar>	 but scap deploy promote is being run so that is surely related
[14:21:35] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:21:35] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1296 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:47] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2263 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:51] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2216 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:59] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1392 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:01] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:05] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1036 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:05] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on deploy1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:07] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2330 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:07] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2274 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:11] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mwdebug1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:13] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1383 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:13] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1349 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:17] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2292 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:17] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2372 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:17] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mwdebug2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:33] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2258 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:43] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on snapshot1008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:45] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2221 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:47] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1385 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:22:47] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[14:23:03] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1328 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:03] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2296 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:17] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1351 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:25] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1027 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:29] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2310 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:29] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2324 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:33] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1275 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:41] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1040 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:45] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1368 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:45] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on snapshot1005 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:45] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on parse2007 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:47] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2271 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:23:53] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2329 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:24:01] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2317 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:24:05] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1041 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:24:16] <wikibugs>	 (CR) JMeybohm: [C: +1] k8s::master: Remove redundant has_lvs hiera check [puppet] - https://gerrit.wikimedia.org/r/644237 (owner: Alexandros Kosiaris)
[14:24:23] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1363 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:24:34] <logmsgbot>	 !log hashar@deploy1001 sync-world aborted: testwikis wikis to 1.36.0-wmf.20 (duration: 74m 55s)
[14:24:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:15] <hashar>	 grrr wrong window
[14:25:48] <hnowlan>	 fwiw the wikis with mismatched versions are testwiki, labtestwiki and testwikidatawiki but I assume that's expected
[14:25:51] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1035 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:25:51] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2247 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:00] <wikibugs>	 (CR) JMeybohm: [C: +1] k8s: Allow using cergen [puppet] - https://gerrit.wikimedia.org/r/644238 (owner: Alexandros Kosiaris)
[14:26:01] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:01] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2325 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:03] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2219 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:03] <logmsgbot>	 !log hashar@deploy1001 Started scap: (no justification provided)
[14:26:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:09] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2266 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:09] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2283 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:13] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1354 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:25] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1322 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:25] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1290 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:29] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1338 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:31] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2273 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:35] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:35] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on parse2013 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:37] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:41] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:49] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2263 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:53] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2216 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:26:59] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1036 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:03] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1392 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:03] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1300 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:11] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2330 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:11] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2274 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:15] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mwdebug1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:19] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1383 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:19] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1349 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:23] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2292 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:23] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2372 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:23] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mwdebug2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:23] <wikibugs>	 (PS1) Alexandros Kosiaris: Add apertium namespace [deployment-charts] - https://gerrit.wikimedia.org/r/644530 (https://phabricator.wikimedia.org/T255672)
[14:27:29] <wikibugs>	 (CR) JMeybohm: [C: +1] k8s codfw staging: Assign role to node [puppet] - https://gerrit.wikimedia.org/r/644235 (owner: Alexandros Kosiaris)
[14:27:37] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2258 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:43] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on snapshot1008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:45] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2221 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:27:47] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1385 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:01] <wikibugs>	 Operations, ConfirmEdit (CAPTCHA extension), Patch-For-Review, Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (Reedy) I think that's possibly all that is needed at this point. 2to3 changes aren't necessary, I think?  I have tested (not exten...
[14:28:05] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1328 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:05] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:19] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1351 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:24] <wikibugs>	 Operations, Performance-Team, serviceops, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (Ladsgroup) oh it's not stateful but I think it's high I/O compared to other applications (maybe not as high as jitsi but higher than other apps i...
[14:28:25] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1027 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:29] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2310 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:29] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2324 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:35] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1275 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:45] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1040 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:47] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1368 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:47] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on snapshot1005 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:49] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on parse2007 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:49] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2271 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:28:55] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2329 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:05] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2317 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:13] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1041 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:31] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1363 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:29:41] <wikibugs>	 (PS1) KartikMistry: WIP: Add apertium helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/644531 (https://phabricator.wikimedia.org/T255672)
[14:31:05] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1327 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:31:08] <wikibugs>	 (PS1) Muehlenhoff: Extend PIL Python package with Python 3 counterparts [puppet] - https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468)
[14:31:17] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] `
[14:34:01] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[14:34:24] <wikibugs>	 (PS1) Andrew Bogott: Nova config upgrades for Stein [puppet] - https://gerrit.wikimedia.org/r/644534 (https://phabricator.wikimedia.org/T261134)
[14:36:08] <wikibugs>	 (CR) JMeybohm: [C: -1] "Cool, thanks! (Just a nit on aptrepo)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644469 (owner: Alexandros Kosiaris)
[14:36:55] <icinga-wm>	 PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:37:27] <wikibugs>	 (CR) JMeybohm: [C: +1] package_from_component: Move to ensure_packages [puppet] - https://gerrit.wikimedia.org/r/644509 (owner: Alexandros Kosiaris)
[14:38:29] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) (owner: Muehlenhoff)
[14:39:13] <wikibugs>	 (PS1) Ottomata: Set oozie.service.coord.default.max.timeout to 13 months [puppet] - https://gerrit.wikimedia.org/r/644535 (https://phabricator.wikimedia.org/T264358)
[14:40:55] <icinga-wm>	 RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:44:22] <hashar>	 subbu: i got the vendor patch for parsoid in and ran a whole sync again
[14:44:39] <subbu>	 thanks.
[14:44:50] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Nova config upgrades for Stein [puppet] - https://gerrit.wikimedia.org/r/644534 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[14:51:20] <jbond42>	 !log install lxml updates
[14:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:47] <wikibugs>	 (CR) JMeybohm: Add calico helm chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: JMeybohm)
[14:55:09] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:47] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:19] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] `
[14:59:34] <wikibugs>	 (CR) Bstorm: [C: +2] "I need to start merging the precursor patches for this so that I can focus on changes to the meatier one." [puppet] - https://gerrit.wikimedia.org/r/643356 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm)
[14:59:36] <jbond42>	 !log install libonig updates to scp
[14:59:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:05] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (wiki_willy) a:wiki_willy→Cmjohnson
[15:01:44] <wikibugs>	 (PS5) Bstorm: wikireplicas: Upgrade maintain-meta_p.py to python 3 [puppet] - https://gerrit.wikimedia.org/r/643363 (https://phabricator.wikimedia.org/T268312)
[15:03:11] <icinga-wm>	 PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:05:17] <wikibugs>	 (PS1) Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/644537
[15:05:19] <wikibugs>	 (PS12) Jbond: turnilo: add export mappings for network devices via query_resources [puppet] - https://gerrit.wikimedia.org/r/643703 (https://phabricator.wikimedia.org/T254332)
[15:06:21] <icinga-wm>	 RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:06:57] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[15:07:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] test: Standardise integration_env output [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/644537 (owner: Kormat)
[15:07:14] <wikibugs>	 (CR) Bstorm: [C: +2] wikireplicas: Upgrade maintain-meta_p.py to python 3 [puppet] - https://gerrit.wikimedia.org/r/643363 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm)
[15:10:14] <wikibugs>	 (PS2) Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/644537
[15:10:19] <logmsgbot>	 !log hashar@deploy1001 Finished scap: (no justification provided) (duration: 44m 20s)
[15:10:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:29] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269143 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[15:10:32] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269143 (ops-monitoring-bot)
[15:10:53] <hashar>	 subbu: testwiki has parsoid 0.13.0-a18  https://test.wikipedia.org/wiki/Special:Version
[15:11:20] <subbu>	 nice! 
[15:14:40] <wikibugs>	 (PS2) Bstorm: wikireplicas: add maintain-meta_p only to s7 and legacy replicas [puppet] - https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312)
[15:18:23] <icinga-wm>	 RECOVERY - tileratorui on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[15:18:28] <wikibugs>	 (PS3) Bstorm: wikireplicas: add maintain-meta_p only to s7 and legacy replicas [puppet] - https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312)
[15:18:45] <icinga-wm>	 RECOVERY - tileratorui on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[15:18:47] <wikibugs>	 (PS3) ArielGlenn: add a sample job for illustration purposes [dumps] - https://gerrit.wikimedia.org/r/625930
[15:19:16] <wikibugs>	 (Abandoned) Razzi: analytics: Replace an-coord1001 with analytics-hive [puppet] - https://gerrit.wikimedia.org/r/644353 (https://phabricator.wikimedia.org/T268028) (owner: Razzi)
[15:20:14] <hashar>	 1.36.0-wmf.20 is on testwikis
[15:20:27] <hashar>	 I am off for roughly an hour 
[15:20:52] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] `
[15:22:26] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[15:22:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:20] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[15:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:16] <wikibugs>	 (PS4) ArielGlenn: add a sample job for illustration purposes [dumps] - https://gerrit.wikimedia.org/r/625930
[15:31:02] <wikibugs>	 (CR) RLazarus: "Thanks for the patch! Terrible news: These are being migrated from an old format (cron) to a new format (profile::mediawiki::periodic_job)" [puppet] - https://gerrit.wikimedia.org/r/643917 (https://phabricator.wikimedia.org/T262857) (owner: Cparle)
[15:31:19] <wikibugs>	 Operations, Performance-Team, serviceops, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (akosiaris) >>! In T267327#6659813, @Ladsgroup wrote: > oh it's not stateful but I think it's high I/O compared to other applications (maybe not a...
[15:33:53] <wikibugs>	 Operations, vm-requests, Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (razzi) Here's the cumin output for the kafka-test1001 decommission:  ` razzi@cumin1001:~$ sudo cookbook sre.hosts.decommission kafka-test1001.eqiad.wmnet -t T268202 STA...
[15:34:37] <icinga-wm>	 PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:37:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, couple of inline comments" (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936 (owner: 10JMeybohm)
[15:39:00] <wikibugs>	 (03PS3) 10Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644537
[15:42:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, got a small nitpick in there, but +1 otherwise" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643974 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[15:43:21] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] add the ability to skip a job via configuration [dumps] - 10https://gerrit.wikimedia.org/r/644476 (owner: 10ArielGlenn)
[15:43:48] <wikibugs>	 (03Merged) 10jenkins-bot: add the ability to skip a job via configuration [dumps] - 10https://gerrit.wikimedia.org/r/644476 (owner: 10ArielGlenn)
[15:44:38] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930 (owner: 10ArielGlenn)
[15:45:51] <wikibugs>	 (03Merged) 10jenkins-bot: add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930 (owner: 10ArielGlenn)
[15:45:57] <icinga-wm>	 RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:46:54] <wikibugs>	 (03CR) 10Herron: [C: 03+1] alertmanager: use o11y address as from [puppet] - 10https://gerrit.wikimedia.org/r/644517 (https://phabricator.wikimedia.org/T268995) (owner: 10Filippo Giunchedi)
[15:47:13] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 80.68, 150.80, 115.20 https://wikitech.wikimedia.org/wiki/Swift
[15:47:32] <wikibugs>	 (03CR) 10Bstorm: "I'm removing the dependency on simplejson because it really serves no purpose except in possibly ancient python that may have existed when" [puppet] - 10https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm)
[15:48:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: use o11y address as from [puppet] - 10https://gerrit.wikimedia.org/r/644517 (https://phabricator.wikimedia.org/T268995) (owner: 10Filippo Giunchedi)
[15:48:23] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:53] <wikibugs>	 (03PS2) 10Razzi: Ensure /tmp/sqoop-jars/ is present for role::analytics_cluster::launcher [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788)
[15:50:15] <wikibugs>	 (03CR) 10Razzi: "A quick ops week patch :)" [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi)
[15:50:56] <wikibugs>	 (03CR) 10DannyS712: [C: 03+1] Add log channel Wikibase.IdGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: 10Lucas Werkmeister (WMDE))
[15:52:32] <wikibugs>	 (03PS1) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542
[15:53:41] <wikibugs>	 (03PS1) 10Jbond: raktables: hand off authentication to httpd [puppet] - 10https://gerrit.wikimedia.org/r/644543
[15:53:43] <wikibugs>	 (03PS1) 10Jbond: racktables: Make everyone admin [puppet] - 10https://gerrit.wikimedia.org/r/644544
[15:55:12] <wikibugs>	 (03CR) 10Elukey: Ensure /tmp/sqoop-jars/ is present for role::analytics_cluster::launcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi)
[15:57:05] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2031 is OK: OK - load average: 37.97, 55.26, 79.71 https://wikitech.wikimedia.org/wiki/Swift
[15:57:32] <wikibugs>	 (03PS1) 10Hnowlan: maps: fix typo in postgres command, retry 5 times before alerting [puppet] - 10https://gerrit.wikimedia.org/r/644545
[16:01:18] <wikibugs>	 (03PS1) 10Mholloway: sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644492
[16:07:06] <wikibugs>	 (03CR) 10Ppchelko: [C: 04-1] "Additionally, need to bump the patch version of the chart." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust)
[16:08:24] <wikibugs>	 10Operations, 10MediaWiki-General, 10Platform Engineering: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Ladsgroup) It might sound like a promotion but I think before getting this done (in any way,...
[16:08:37] <wikibugs>	 (03PS2) 10CRusnov: icinga/check_legal_html.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364)
[16:09:18] <wikibugs>	 (03PS1) 10Tchanders: extension-list: Add IPInfo extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644548 (https://phabricator.wikimedia.org/T260599)
[16:09:20] <wikibugs>	 (03PS1) 10Tchanders: Add IPInfo config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644549 (https://phabricator.wikimedia.org/T260599)
[16:09:23] <wikibugs>	 (03PS1) 10Tchanders: Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599)
[16:09:25] <wikibugs>	 (03PS1) 10Tchanders: Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599)
[16:09:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] extension-list: Add IPInfo extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644548 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders)
[16:09:33] <moritzm>	 !log installing vips security updates
[16:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add IPInfo config to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644549 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders)
[16:09:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add IPInfo extension config to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644550 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders)
[16:09:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Load IPInfo extension in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644551 (https://phabricator.wikimedia.org/T260599) (owner: 10Tchanders)
[16:10:31] <wikibugs>	 (03PS1) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312)
[16:11:47] <wikibugs>	 (03PS2) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312)
[16:11:55] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] icinga/check_legal_html.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644372 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:12:42] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] cirrus: alert on pool counter reject spike [puppet] - 10https://gerrit.wikimedia.org/r/643362 (https://phabricator.wikimedia.org/T262694) (owner: 10Ryan Kemper)
[16:15:18] <wikibugs>	 (03PS1) 10Clarakosi: Remove OAuth experimental routes from beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644553 (https://phabricator.wikimedia.org/T262495)
[16:16:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Update vips library hints [puppet] - 10https://gerrit.wikimedia.org/r/644555
[16:17:20] <wikibugs>	 (03PS2) 10Clarakosi: Remove OAuth experimental routes from beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644553 (https://phabricator.wikimedia.org/T262495)
[16:17:40] <wikibugs>	 (03CR) 10Ppchelko: [C: 04-2] "Looks good, need to wait for the dependency to land on all groups first, so -2 until that happens." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644553 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi)
[16:18:05] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335)
[16:18:23] <wikibugs>	 10Operations, 10ops-eqiad: Replace asw2-c-eqiad VC cable - https://phabricator.wikimedia.org/T268804 (10ayounsi) 05Open→03Resolved No more issues, thanks!
[16:19:28] <wikibugs>	 10Operations, 10ops-eqiad: Replace asw2-d-eqiad VC cable - https://phabricator.wikimedia.org/T268808 (10ayounsi) 05Open→03Resolved a:03Jclark-ctr No more errors, thanks!
[16:19:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud: add conntrackd for better neutron l3 agent failover [puppet] - 10https://gerrit.wikimedia.org/r/644556 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez)
[16:23:51] <wikibugs>	 10Operations, 10MediaWiki-General, 10Platform Engineering Roadmap Decision Making: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10daniel)
[16:27:18] <wikibugs>	 (03PS3) 10CRusnov: Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364)
[16:28:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:31:44] <wikibugs>	 (03PS4) 10Kormat: test: Standardise integration_env output [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644537
[16:31:46] <wikibugs>	 (03PS1) 10Kormat: intpy2 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644561
[16:32:46] <wikibugs>	 (03PS1) 10Ayounsi: Speed up Homer by fixing fetch_device_circuits() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/644563
[16:34:59] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Add log channel Wikibase.IdGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625)
[16:35:01] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator logging on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644564 (https://phabricator.wikimedia.org/T268625)
[16:35:21] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/644563 (owner: 10Ayounsi)
[16:35:27] <wikibugs>	 (03PS4) 10CRusnov: Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364)
[16:35:29] <wikibugs>	 (03CR) 10CRusnov: Port elasticsearch/es-tool.py to Python3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:38:09] <wikibugs>	 (03CR) 10CRusnov: "thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:39:19] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on ms-be1030 is CRITICAL: cluster=swift device=None instance=ms-be1030 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1030&var-datasource=eqiad+prometheus/ops
[16:39:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:40:01] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:41:11] <wikibugs>	 (03CR) 10Volans: "Pure cookbook-foo nit inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm)
[16:46:03] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644365 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:47:45] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be1022 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops
[16:48:26] <wikibugs>	 (03CR) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm)
[16:50:10] <wikibugs>	 (03PS4) 10Bstorm: wikireplicas: add maintain-meta_p only to s7 and legacy replicas [puppet] - 10https://gerrit.wikimedia.org/r/643578 (https://phabricator.wikimedia.org/T268312)
[16:52:32] <wikibugs>	 (03CR) 10Volans: "reply inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm)
[16:56:13] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Cool! Thanks :] Will do. I'll let @JAllemandou coordinate with you on a good date and time for the team presentation!
[16:56:55] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10Dwisehaupt) @ayounsi LLDP should be possible for us. We are at the start of the busy time for fundraising so it will probably be a few weeks or most likely January when we would...
[16:56:59] <wikibugs>	 (03PS3) 10Razzi: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788)
[16:57:42] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::druid_load.pp: reduce netflow retention [puppet] - 10https://gerrit.wikimedia.org/r/644569 (https://phabricator.wikimedia.org/T254332)
[16:58:21] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:03] <wikibugs>	 (03PS2) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542
[17:00:05] <jouncebot>	 jbond42 and cdanis: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1700)
[17:06:39] <wikibugs>	 (03PS3) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542
[17:08:49] <wikibugs>	 (03CR) 10JMeybohm: prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[17:09:27] <wikibugs>	 (03PS2) 10Mholloway: Add event stream config for android.user_contributions_screen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179)
[17:09:59] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be1030 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1030&var-datasource=eqiad+prometheus/ops
[17:17:14] <wikibugs>	 (03CR) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm)
[17:18:45] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] Extend PIL Python package with Python 3 counterparts [puppet] - 10https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) (owner: 10Muehlenhoff)
[17:19:24] <wikibugs>	 (03PS4) 10Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542
[17:19:31] <marostegui>	 !log Sanitize s1 on clouddb1013 and clouddb1017 - T267090
[17:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:40] <stashbot>	 T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090
[17:20:37] <wikibugs>	 (03CR) 10Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm)
[17:20:57] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Ideas - looking for me without them" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi)
[17:21:57] <wikibugs>	 (03PS1) 10Mholloway: sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/644493
[17:28:31] <wikibugs>	 (03PS2) 10JMeybohm: coredns: Create a wmfcoredns copy in charts dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/643936
[17:28:32] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on ms-be1022 is CRITICAL: cluster=swift device=None instance=ms-be1022 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops
[17:31:38] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Add event stream config for android.user_contributions_screen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179) (owner: 10Mholloway)
[17:32:27] <wikibugs>	 (03Merged) 10jenkins-bot: Add event stream config for android.user_contributions_screen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179) (owner: 10Mholloway)
[17:34:22] <icinga-wm>	 PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:34:54] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add event stream config for android.user_contributions_screen T228179 (duration: 01m 07s)
[17:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:02] <stashbot>	 T228179: Event Platform Client — Android - https://phabricator.wikimedia.org/T228179
[17:38:04] <icinga-wm>	 RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:46:21] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644492 (owner: 10Mholloway)
[17:47:28] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:09] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/644493 (owner: 10Mholloway)
[17:50:59] <wikibugs>	 (03Merged) 10jenkins-bot: sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/644492 (owner: 10Mholloway)
[17:52:45] <wikibugs>	 (03Merged) 10jenkins-bot: sessionTick: Changes stream name to 'mw_session_tick' [extensions/WikimediaEvents] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/644493 (owner: 10Mholloway)
[17:54:27] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized php-1.36.0-wmf.20/extensions/WikimediaEvents: Backport: sessionTick: Update stream name to mw_session_tick (duration: 01m 07s)
[17:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:23] <wikibugs>	 (03PS2) 10BPirkle: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi)
[17:57:26] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized php-1.36.0-wmf.18/extensions/WikimediaEvents: Backport: sessionTick: Update stream name to mw_session_tick (duration: 01m 04s)
[17:57:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:58] <wikibugs>	 (03PS4) 10Mholloway: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan)
[17:59:47] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan)
[18:00:05] <jouncebot>	 chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1800). Please do the needful.
[18:00:17] <wikibugs>	 (03PS2) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[18:00:27] <bpirkle>	 Anyone object to us deploying a small config change in a few minutes? This is related to the not-yet-officially-released API Portal. Here's the change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/644305
[18:00:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[18:00:39] <wikibugs>	 (03Merged) 10jenkins-bot: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) (owner: 10Jason Linehan)
[18:02:02] <wikibugs>	 (03PS3) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[18:02:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[18:02:45] <icinga-wm>	 PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:03:03] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable session length instrument on officewiki T267494 (duration: 01m 06s)
[18:03:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:11] <stashbot>	 T267494: [SessionLength] View how long users interact with our products - https://phabricator.wikimedia.org/T267494
[18:03:59] <icinga-wm>	 RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:04:21] <wikibugs>	 (03PS4) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[18:04:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[18:05:40] <wikibugs>	 (03PS5) 10Hnowlan: similarusers: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[18:08:37] <wikibugs>	 (03PS1) 10Mholloway: [BETA] Enable session length instrument on all Beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644575
[18:09:30] <wikibugs>	 (03CR) 10BPirkle: [C: 03+2] Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi)
[18:10:33] <wikibugs>	 (03PS1) 10Elukey: admin: remove users already in 'researchers' from 'analytics-users' [puppet] - 10https://gerrit.wikimedia.org/r/644576 (https://phabricator.wikimedia.org/T269150)
[18:10:52] <wikibugs>	 (03PS3) 10BPirkle: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi)
[18:11:34] <wikibugs>	 (03CR) 10BPirkle: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi)
[18:11:47] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26808/console" [puppet] - 10https://gerrit.wikimedia.org/r/644576 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey)
[18:11:49] <wikibugs>	 (03CR) 10BPirkle: [C: 03+2] Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi)
[18:12:37] <wikibugs>	 (03Merged) 10jenkins-bot: Enable WikimediaApiPortalOAuth on apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644305 (https://phabricator.wikimedia.org/T262495) (owner: 10Clarakosi)
[18:13:56] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] admin: remove users already in 'researchers' from 'analytics-users' [puppet] - 10https://gerrit.wikimedia.org/r/644576 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey)
[18:14:40] <elukey>	 chaomodus: o/
[18:14:57] <elukey>	 I see a change from you in puppet-merge, should I proceed?
[18:15:07] <elukey>	 "Port elasticsearch/es-tool.py to Python3"
[18:15:59] <wikibugs>	 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Patch-For-Review, 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10Reedy) >>! In T268468#6659719, @MoritzMuehlenhoff wrote: > Well, at minimum the shebang needs to be switched to #!/usr/bin/python3...
[18:18:21] <elukey>	 judging from https://gerrit.wikimedia.org/r/c/operations/puppet/+/644365/ it seems that ebernhardson +1ed so the folks in discovery are aware
[18:19:03] <icinga-wm>	 PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:19:05] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/643945 (owner: 10Jbond)
[18:19:19] <elukey>	 all right merging
[18:19:32] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[18:20:26] <wikibugs>	 (03PS1) 10RobH: swapping new cloudcephmon eqiad hosts to partition same as existing [puppet] - 10https://gerrit.wikimedia.org/r/644578 (https://phabricator.wikimedia.org/T268746)
[18:22:38] <wikibugs>	 (03CR) 10RobH: [C: 03+2] swapping new cloudcephmon eqiad hosts to partition same as existing [puppet] - 10https://gerrit.wikimedia.org/r/644578 (https://phabricator.wikimedia.org/T268746) (owner: 10RobH)
[18:24:11] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01395 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:25:14] <cdanis>	 puppet is newly failing on logstash/elasticsearch hosts
[18:25:34] <cdanis>	 E: Unable to locate package python3-ipaddr
[18:25:49] <icinga-wm>	 RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:26:09] <elukey>	 cdanis: ah nice it is the change that I merged with mine, lovely
[18:26:19] <wikibugs>	 (03CR) 10CDanis: "This looks good, but I think you will also need to update the prometheus/ops configuration to scrape the exporter and get the data in Prom" [puppet] - 10https://gerrit.wikimedia.org/r/644245 (owner: 10Hashar)
[18:26:28] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] ci: add prometheus exporter for Apache [puppet] - 10https://gerrit.wikimedia.org/r/644245 (owner: 10Hashar)
[18:28:55] <elukey>	 cdanis: ah lovely python3-ipaddr is only in bullseye
[18:29:01] <elukey>	 chaomodus: --^
[18:29:05] <elukey>	 shall we revert?
[18:29:11] <chaomodus>	 boh
[18:29:13] <chaomodus>	 thanks
[18:29:16] <chaomodus>	 i should've checked that one
[18:29:23] <cdanis>	 hopefully not hard to backport :/
[18:29:42] <wikibugs>	 (03PS1) 10Elukey: Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644495
[18:29:50] <elukey>	 ah there you are :)
[18:29:55] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role [puppet] - 10https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965)
[18:30:10] <wikibugs>	 (03PS1) 10CRusnov: Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644496
[18:30:20] <wikibugs>	 (03PS2) 10CRusnov: Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644496
[18:30:20] <elukey>	 ok abandoning mine
[18:30:38] <wikibugs>	 (03Abandoned) 10Elukey: Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644495 (owner: 10Elukey)
[18:30:57] <wikibugs>	 (03PS1) 10Ottomata: Keep EventLogging SpecialMuteSubmit on old system for now [puppet] - 10https://gerrit.wikimedia.org/r/644585 (https://phabricator.wikimedia.org/T268517)
[18:31:23] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] Revert "Port elasticsearch/es-tool.py to Python3" [puppet] - 10https://gerrit.wikimedia.org/r/644496 (owner: 10CRusnov)
[18:31:49] <wikibugs>	 (03CR) 10Ppchelko: [C: 04-1] Configuration chage to allow custom comment reverts on Wikidata (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644542 (owner: 10Holger Knust)
[18:33:36] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Keep EventLogging SpecialMuteSubmit on old system for now [puppet] - 10https://gerrit.wikimedia.org/r/644585 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata)
[18:33:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:34:09] <chaomodus>	 probably a better solution would be to make it use ipaddress
[18:34:38] <wikibugs>	 Operations, ConfirmEdit (CAPTCHA extension), Patch-For-Review, Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (Reedy) I'm guessing there's no reason to actually make the python version (ie whether to use `python` or `python3` configurable wi...
[18:35:38] <mholloway>	 bpirkle: o/ Do you still plan to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/644305/? I've got one more config change to go out, but was waiting until you're finished.
[18:35:47] <bpirkle>	 Yep, almost done.
[18:35:53] <mholloway>	 OK, cool, thanks!
[18:38:46] <logmsgbot>	 !log bpirkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WikimediaApiPortalOAuth on apiportalwiki gerrit:644305 (duration: 01m 06s)
[18:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:33] <wikibugs>	 (PS5) Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - https://gerrit.wikimedia.org/r/644542
[18:40:11] <logmsgbot>	 !log bpirkle@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable WikimediaApiPortalOAuth on apiportalwiki gerrit:644305 (duration: 01m 06s)
[18:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:57] <liw>	 hm, about three weeks until the day is 5 hours 49 minutes 4 seconds long, and then the next day is a little longer
[18:42:01] <bpirkle>	 mholloway: All done, confirmed that changes had the intended effect and no explosions seen in logstash.
[18:42:12] <mholloway>	 bpirkle: Thanks!
[18:42:17] <liw>	 wrong window for me sorry
[18:42:19] <wikibugs>	 (PS2) Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965)
[18:43:19] <wikibugs>	 (PS2) Mholloway: [BETA] Enable session length instrument on all Beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/644575
[18:43:27] <wikibugs>	 (CR) CRusnov: "This is a redo of https://gerrit.wikimedia.org/r/c/operations/puppet/+/644365 after it was discovered that python3-ipaddr is not available" [puppet] - https://gerrit.wikimedia.org/r/644591 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[18:44:31] <wikibugs>	 (PS1) Elukey: admin: add comments to analytics posix groups [puppet] - https://gerrit.wikimedia.org/r/644592 (https://phabricator.wikimedia.org/T269150)
[18:45:24] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.decommission
[18:45:24] <wikibugs>	 (CR) Mholloway: [C: +2] [BETA] Enable session length instrument on all Beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/644575 (owner: Mholloway)
[18:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:32] <wikibugs>	 (CR) Elukey: [C: +2] admin: add comments to analytics posix groups [puppet] - https://gerrit.wikimedia.org/r/644592 (https://phabricator.wikimedia.org/T269150) (owner: Elukey)
[18:45:41] <wikibugs>	 (PS2) Razzi: Configure zookeeper-test1002.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074)
[18:45:48] <wikibugs>	 (CR) Elukey: [C: +2] analytics::refinery::job::druid_load.pp: reduce netflow retention [puppet] - https://gerrit.wikimedia.org/r/644569 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[18:46:07] <wikibugs>	 (PS3) Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965)
[18:46:24] <wikibugs>	 (Merged) jenkins-bot: [BETA] Enable session length instrument on all Beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/644575 (owner: Mholloway)
[18:48:31] <wikibugs>	 (CR) Elukey: [C: +1] Configure zookeeper-test1002.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) (owner: Razzi)
[18:49:49] <wikibugs>	 (PS4) Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965)
[18:52:50] <wikibugs>	 (PS5) Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965)
[18:54:26] <wikibugs>	 (PS1) RobH: cloudcephosd update was not correct [puppet] - https://gerrit.wikimedia.org/r/644593 (https://phabricator.wikimedia.org/T268746)
[18:54:31] <wikibugs>	 (PS6) Andrew Bogott: cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965)
[18:55:35] <wikibugs>	 (CR) RobH: [C: +2] cloudcephosd update was not correct [puppet] - https://gerrit.wikimedia.org/r/644593 (https://phabricator.wikimedia.org/T268746) (owner: RobH)
[18:56:08] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloudvirt2003-dev: move to ceph-enabled virt role, add ceph hiera elsewhere [puppet] - https://gerrit.wikimedia.org/r/644584 (https://phabricator.wikimedia.org/T265965) (owner: Andrew Bogott)
[19:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T1900).
[19:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:01:11] <wikibugs>	 Operations, ConfirmEdit (CAPTCHA extension), Patch-For-Review, Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (MoritzMuehlenhoff) >>! In T268468#6660710, @Reedy wrote: > I'm guessing there's no reason to actually make the python version (ie...
[19:07:34] <logmsgbot>	 !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[19:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:40] <wikibugs>	 Operations, vm-requests, Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `kafka-test1003.eqiad.wmnet` - kafka-test1003.eqiad.wmnet (**WARN**)   - **...
[19:07:43] <wikibugs>	 (PS1) Ottomata: Bump eventstreams image version to 2020-12-01-181032-production [deployment-charts] - https://gerrit.wikimedia.org/r/644595
[19:09:11] <wikibugs>	 (CR) Ottomata: [C: +2] Bump eventstreams image version to 2020-12-01-181032-production [deployment-charts] - https://gerrit.wikimedia.org/r/644595 (owner: Ottomata)
[19:10:55] <wikibugs>	 (CR) Dzahn: [C: +2] Extend PIL Python package with Python 3 counterparts [puppet] - https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) (owner: Muehlenhoff)
[19:12:36] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] `
[19:14:20] <wikibugs>	 (CR) Dzahn: "on random host mw1401: Notice: /Stage[main]/Mediawiki::Packages/Package[python3-pil]/ensure: created" [puppet] - https://gerrit.wikimedia.org/r/644532 (https://phabricator.wikimedia.org/T268468) (owner: Muehlenhoff)
[19:15:47] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[19:19:11] <wikibugs>	 (CR) Papaul: [C: +1] delete the drac module [puppet] - https://gerrit.wikimedia.org/r/644364 (owner: Dzahn)
[19:20:29] <wikibugs>	 (CR) Dzahn: [C: +2] delete the drac module [puppet] - https://gerrit.wikimedia.org/r/644364 (owner: Dzahn)
[19:25:44] <wikibugs>	 (PS2) Dzahn: deployment::server: buster support, use default-mysql-client package [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963)
[19:26:07] <wikibugs>	 Operations, puppet-compiler: String vs Binary issues while running the puppet compiler - https://phabricator.wikimedia.org/T268978 (Legoktm)
[19:26:47] <wikibugs>	 (CR) Dzahn: "thanks! though using "default-mysql-client" pulls mariadb packages on stretch as well, including mariadb-common. I would slightly prefer n" [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[19:26:57] <icinga-wm>	 PROBLEM - SSH on ms-be2031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:26:59] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[19:27:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:30:06] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] `
[19:30:56] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2060.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[19:30:58] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2060.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2060.codfw.wmnet'] `
[19:31:41] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[19:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:58] <wikibugs>	 Operations, Android-app-Bugs, Fundraising-Backlog, Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (CDanis) FWIW, having looked at the past week of webrequest data, I've started to wonder as to whether or not the file...
[19:32:47] <wikibugs>	 (PS1) CDanis: admin: add cdanis to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/644600
[19:33:11] <wikibugs>	 (CR) Ottomata: [C: +1] admin: add cdanis to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/644600 (owner: CDanis)
[19:33:19] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 180.91, 163.20, 107.82 https://wikitech.wikimedia.org/wiki/Swift
[19:33:24] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[19:33:24] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[19:33:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:33] <icinga-wm>	 RECOVERY - SSH on ms-be2031 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:33:33] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:36] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:07] <wikibugs>	 Operations, Android-app-Bugs, Fundraising-Backlog, Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (MattCleinman) I believe I read something that they're now caching the app site association file on the Apple CDN, so...
[19:34:17] <wikibugs>	 (PS3) Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts [cookbooks] - https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312)
[19:34:21] <wikibugs>	 (CR) CDanis: [C: +2] admin: add cdanis to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/644600 (owner: CDanis)
[19:35:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:54] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[19:36:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:59] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[19:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:57] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:31] <wikibugs>	 Operations, Android-app-Bugs, Fundraising-Backlog, Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (CDanis) >>! In T259312#6660843, @MattCleinman wrote: > I believe I read something that they're now caching the app si...
[19:38:51] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/26816/deploy1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[19:38:56] <wikibugs>	 (CR) Bstorm: wmcs: set the add_wiki cookbook to only run meta_p on some hosts (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/644552 (https://phabricator.wikimedia.org/T268312) (owner: Bstorm)
[19:40:46] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[19:40:46] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[19:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:36] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2031 is OK: OK - load average: 26.48, 49.15, 72.82 https://wikitech.wikimedia.org/wiki/Swift
[19:44:00] <razzi>	 !log deploy refinery with refinery-source v0.0.140
[19:44:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:19] <logmsgbot>	 !log razzi@deploy1001 Started deploy [analytics/refinery@41c60d9]: Regular analytics weekly train [analytics/refinery@3e42f46c62722256a1678809097114740806a184]
[19:45:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:14] <wikibugs>	 (CR) Dzahn: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) (owner: Ryan Kemper)
[19:49:06] <wikibugs>	 (PS1) Ryan Kemper: maps: remove no-longer-accurate insetup role [puppet] - https://gerrit.wikimedia.org/r/644603 (https://phabricator.wikimedia.org/T260269)
[19:50:11] <wikibugs>	 Operations, ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (Papaul) UPS Ship Notification, Tracking Number 1ZA19A021298055451
[19:51:39] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:29] <wikibugs>	 (PS1) Bstorm: wikireplicas: add cumin aliases that include multiinstance servers [puppet] - https://gerrit.wikimedia.org/r/644606 (https://phabricator.wikimedia.org/T268312)
[19:52:40] <wikibugs>	 (CR) Ryan Kemper: [C: +2] elasticsearch-cluster: support for cloudelastic [software/spicerack] - https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) (owner: Ryan Kemper)
[19:53:00] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] elasticsearch-cluster: support for cloudelastic [software/spicerack] - https://gerrit.wikimedia.org/r/643532 (https://phabricator.wikimedia.org/T268779) (owner: Ryan Kemper)
[19:54:04] <logmsgbot>	 !log razzi@deploy1001 Finished deploy [analytics/refinery@41c60d9]: Regular analytics weekly train [analytics/refinery@3e42f46c62722256a1678809097114740806a184] (duration: 08m 45s)
[19:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:47] <wikibugs>	 (CR) Bstorm: "If there's no objections soon, I'll merge this. Just curious if there were any." [puppet] - https://gerrit.wikimedia.org/r/643337 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[19:56:06] <logmsgbot>	 !log razzi@deploy1001 Started deploy [analytics/refinery@41c60d9] (thin): Regular analytics weekly train [analytics/refinery@3e42f46c62722256a1678809097114740806a184]
[19:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:13] <logmsgbot>	 !log razzi@deploy1001 Finished deploy [analytics/refinery@41c60d9] (thin): Regular analytics weekly train [analytics/refinery@3e42f46c62722256a1678809097114740806a184] (duration: 00m 07s)
[19:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:58] <wikibugs>	 (CR) Dzahn: [C: +2] deployment::server: buster support, use default-mysql-client package [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[19:59:04] <wikibugs>	 (PS3) Dzahn: deployment::server: buster support, use default-mysql-client package [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963)
[19:59:42] <wikibugs>	 (CR) Holger Knust: Configuration chage to allow custom comment reverts on Wikidata (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/644542 (owner: Holger Knust)
[20:00:05] <jouncebot>	 hashar and twentyafterfour: How many deployers does it take to do Mediawiki train - European+American Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201201T2000).
[20:00:27] <wikibugs>	 (CR) BryanDavis: [C: +1] "The only issue I have with this is the "where does it end" question. Is it ok to idle in vim? emacs? redis-cli?" [puppet] - https://gerrit.wikimedia.org/r/643337 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[20:01:17] <wikibugs>	 (PS1) Razzi: Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202)
[20:01:28] <wikibugs>	 (CR) Ppchelko: Configuration chage to allow custom comment reverts on Wikidata (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/644542 (owner: Holger Knust)
[20:02:16] <wikibugs>	 Operations, ops-eqiad: eqiad: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268750 (wiki_willy) a:Jclark-ctr
[20:04:02] <wikibugs>	 (CR) Ppchelko: [C: -1] Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - https://gerrit.wikimedia.org/r/644542 (owner: Holger Knust)
[20:04:22] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:06:33] <wikibugs>	 (CR) Dzahn: "ACK, it's noop on stretch like this, confirmed on deploy2001 which already had the mariadb packages as well" [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[20:09:31] <wikibugs>	 (CR) Dzahn: "noop on prod deployment servers, fixed issue on new buster servers (where unrelated issues remain)" [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[20:16:08] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:29] <wikibugs>	 (PS3) Dzahn: site: add deploy2002 and unify deployment server role regex [puppet] - https://gerrit.wikimedia.org/r/644333 (https://phabricator.wikimedia.org/T265963)
[20:21:48] <wikibugs>	 (CR) Ryan Kemper: "I think given these were added to `role(maps::replica)`" [puppet] - https://gerrit.wikimedia.org/r/644603 (https://phabricator.wikimedia.org/T260269) (owner: Ryan Kemper)
[20:23:35] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269143 (Peachey88)
[20:23:38] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (Peachey88)
[20:26:43] <wikibugs>	 (CR) Dzahn: [C: +2] site: add deploy2002 and unify deployment server role regex [puppet] - https://gerrit.wikimedia.org/r/644333 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[20:27:55] <wikibugs>	 (PS1) Ryan Kemper: maps: remove no-longer-accurate insetup role [puppet] - https://gerrit.wikimedia.org/r/644611 (https://phabricator.wikimedia.org/T260271)
[20:29:14] <wikibugs>	 (PS1) Ottomata: Add new service eventstreams-internal [deployment-charts] - https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160)
[20:30:44] <wikibugs>	 (PS2) Ottomata: Add new service eventstreams-internal [deployment-charts] - https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160)
[20:31:20] <wikibugs>	 Operations, Analytics, Event-Platform, EventStreams, and 4 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (Ottomata)
[20:33:48] <wikibugs>	 (CR) Dzahn: "noop on deploy1001/2001, on 2002 mcrouter cert is needed" [puppet] - https://gerrit.wikimedia.org/r/644333 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[20:35:16] <hashar>	 mutante: are you creating a new deployment host?
[20:35:46] <mutante>	 yes, hashar. T265963
[20:35:46] <stashbot>	 T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963
[20:36:01] <hashar>	 mutante: there were a couple of times when we had a new deploy server added and it came with no git repositories cloned at all
[20:36:13] <hashar>	 and eventually puppet or some cron kicks in and rsync all the repositories
[20:36:18] <hashar>	 with --delete
[20:36:21] <mutante>	 yes, i know. described in https://phabricator.wikimedia.org/T265963#6660917
[20:36:38] <hashar>	 resulting in the actual primary ones being wiped out entirely (including the local-only repos holding private settings :\ )
[20:37:31] <hashar>	 ah yeah there is that
[20:37:43] <mutante>	 that's probably the reason that scap is now blocked with that lock file
[20:37:52] <hashar>	 maybe
[20:38:03] <hashar>	 but I think there is another mechanism that keeps the deployment hosts in sync
[20:38:30] <mutante>	 yea, we have a checkbox for that " sync repo data over from old servers to new servers"
[20:39:01] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:21] <mutante>	 if something bad had happened then it would have happened before when deploy1002 was added
[20:39:32] <wikibugs>	 (PS2) Ryan Kemper: maps: remove no-longer-accurate insetup role [puppet] - https://gerrit.wikimedia.org/r/644611 (https://phabricator.wikimedia.org/T260271)
[20:39:43] <wikibugs>	 (CR) CDanis: [C: +1] "sounds good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/643503 (https://phabricator.wikimedia.org/T266016) (owner: Filippo Giunchedi)
[20:40:13] <hashar>	 mutante: sounds good :]
[20:40:32] <mutante>	 also puppet doesn't run anyways because there is no mcrouter cert
[20:40:59] <mutante>	 but I am also happy to revert back to "insetup" 
[20:42:07] <wikibugs>	 (CR) Ryan Kemper: [V: +1 C: +2] cirrus: alert on pool counter reject spike [puppet] - https://gerrit.wikimedia.org/r/643362 (https://phabricator.wikimedia.org/T262694) (owner: Ryan Kemper)
[20:42:36] <hashar>	 mutante: well I can't find the task or the incident report (if we had any)
[20:43:04] <hashar>	 I think it was some kind of cronjob kicking in on the to-be-new primary server, but since it had empty repos that wiped the other masters
[20:43:06] <hashar>	 something like that
[20:43:11] <hashar>	 guess it got addressed
[20:44:15] <wikibugs>	 (PS6) Holger Knust: Configuration chage to allow custom comment reverts on Wikidata [deployment-charts] - https://gerrit.wikimedia.org/r/644542
[20:44:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:44:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:56] <wikibugs>	 (CR) Holger Knust: Configuration chage to allow custom comment reverts on Wikidata (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/644542 (owner: Holger Knust)
[20:47:04] <mutante>	 hashar: thanks for the warning, i am double checking the cron right now.
[20:47:46] <hashar>	 or maybe that was when scap init ran
[20:48:06] <mutante>	 well, there is actually a cron and it has a variable $ensure
[20:48:19] <mutante>	 and default absent, but also on a new host
[20:51:04] <mutante>	 ok, but what this does is pull FROM the old host to the new host
[20:51:17] <mutante>	 with --delete but from deploy1001 to ...local
[20:51:49] <mutante>	 it filled /srv/deployment up in the right direction
[20:52:36] <hashar>	 sounds good
[20:52:46] <hashar>	 so maybe the issue was years and years ago
[20:53:11] <hashar>	 and that got addressed
[20:53:39] <mutante>	 hashar: yea, so I checked the code from top to bottom. it looks up what is current deployment_server in Hiera in common.yaml
[20:54:02] <mutante>	 then there is some code that goes "if this is the active server then do NOT have the cron, otherwise do have it"
[20:54:17] <mutante>	 so that cron gets applied on all (currently 3) non-active hosts
[20:54:24] <mutante>	 but they are all the same, pulling from the 1 active host
[20:54:32] <mutante>	 but good to double check this stuff
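The cron rule mutante describes above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Puppet code: the function names, the short host names, and the exact rsync invocation are assumptions. Only the rule itself (no cron on the active server, every other host pulls from the active one with --delete) and the /srv/deployment path come from the discussion.

```python
from typing import List, Optional

DEPLOY_PATH = "/srv/deployment"  # path mentioned in the discussion


def cron_ensure(hostname: str, active_server: str) -> str:
    """Hypothetical helper: the sync cron is absent on the active server, present elsewhere."""
    return "absent" if hostname == active_server else "present"


def sync_command(hostname: str, active_server: str) -> Optional[List[str]]:
    """Hypothetical helper: build the pull command for a non-active host.

    The source is always the active server and the destination is always
    local, so --delete can only remove files on the host running the cron.
    """
    if cron_ensure(hostname, active_server) == "absent":
        return None  # the active server never syncs from anyone
    return [
        "rsync", "-a", "--delete",
        f"{active_server}:{DEPLOY_PATH}/",  # remote source (assumed transport)
        f"{DEPLOY_PATH}/",                  # local destination
    ]


if __name__ == "__main__":
    active = "deploy1001"
    for host in ("deploy1001", "deploy1002", "deploy2001", "deploy2002"):
        print(host, cron_ensure(host, active), sync_command(host, active))
```

Under these assumptions a freshly installed, empty host can only overwrite its own copy. The risky moment is the one mutante points to next: changing which host is marked active flips the direction of every pull.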
[20:54:35] <hashar>	 yeah
[20:54:42] <mutante>	 the interesting part is when we switch in hiera
[20:54:44] <mutante>	 and not doing that yet
[20:55:01] <hashar>	 we want /srv/mediawiki-staging to be around I guess, notably the private settings
[20:55:09] <mutante>	 my next question is different
[20:55:17] <mutante>	 and that is "how do you properly bootstrap scap"
[20:55:22] <hashar>	 oh
[20:55:24] <mutante>	 you can't run scap deploy --init 
[20:55:27] <mutante>	 which puppet tries to do
[20:55:34] <mutante>	 but can't because .. this is not the active server
[20:56:02] <mutante>	 so either it has to be a scheduled maintenance window where we switch it and someone runs all the scap init stuff
[20:56:12] <mutante>	 or we have to allow doing it before it is an active server
[20:56:24] <mutante>	 but making REALLY sure it only inits and doesn't mess with actual deployments
[20:56:43] <hashar>	 I had the issue when provisioning a new repo
[20:56:49] <mutante>	 either way it should be fixed because right now that is just failed puppet
[20:56:55] <hashar>	 https://phabricator.wikimedia.org/T257317 
[20:57:13] <mutante>	 heh, yes, that :)
[20:57:25] <hashar>	 and Jayme pointed out the same thing you did
[20:57:36] <hashar>	 the lock prevents scap deploy --init from running
[20:57:47] <mutante>	 ok, all I have to do is quote Jaime
[20:57:49] <mutante>	 "scap syncronization (all methods) should be disabled because of the lock, but probably --init should be allowed "
[20:57:53] <mutante>	 that was what I wanted to say 
[20:58:14] <mutante>	 let me link that, thx
[20:58:24] <hashar>	 oh
[20:58:26] <hashar>	 and https://phabricator.wikimedia.org/T257319
[20:58:32] <hashar>	  14:25:38 deploy failed: <LockFailedError> Failed to acquire lock "/var/lock/scap-global-lock"; owner is "root"; reason is "Not the active deployment server, use deploy1001.eqiad.wmnet
[20:59:15] <wikibugs>	 (CR) Muehlenhoff: "Yeah, mysql-client is just a transitional package to default-mysql-client in Stretch: https://packages.debian.org/stretch/mysql-client" [puppet] - https://gerrit.wikimedia.org/r/644350 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[20:59:18] <hashar>	 but I can't remember how I fixed it. Maybe the first deploy synced stuff to the secondary deploy server and that unbroke puppet
[20:59:28] <hashar>	 or it is a race condition in puppet that eventually fixes itself
[20:59:54] <mutante>	 hashar: I think what happened is we switched what the active server is and then ran puppet and that fixes it
[21:00:11] <mutante>	 but would be nicer to be able to do this earlier
[21:00:17] <mutante>	 and confirm everything is ok before switching
[21:01:57] <wikibugs>	 (CR) Ottomata: [C: +1] Configure zookeeper-test1002.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) (owner: Razzi)
[21:02:33] <mutante>	 I will add the mcrouter cert for deploy2002, then puppet is not failed but just back to those warnings like on deploy1002.
[21:06:26] <wikibugs>	 (CR) Ppchelko: [C: +1] "@Clara will you lead the way deploying this? There's also https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/644261 that you " [deployment-charts] - https://gerrit.wikimedia.org/r/644542 (owner: Holger Knust)
[21:06:41] <wikibugs>	 (PS1) Dzahn: add fake mcrouter certs for deploy2002 [labs/private] - https://gerrit.wikimedia.org/r/644616 (https://phabricator.wikimedia.org/T265963)
[21:07:11] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269166 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[21:07:11] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for deploy2002 [labs/private] - 10https://gerrit.wikimedia.org/r/644616 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn)
[21:07:14] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269166 (10ops-monitoring-bot)
[21:09:21] <wikibugs>	 (03PS3) 10Dzahn: gerrit: daemon option in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/643944 (owner: 10Hashar)
[21:10:19] <hashar>	 mutante: yeah that gerrit config change is all fine. The other one that messes with the host config, I am not 100% sure about :\
[21:12:41] <wikibugs>	 (03PS3) 10Razzi: Configure zookeeper-test1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074)
[21:14:27] <mutante>	 yes, I see it the same way. Not planning to merge that other one yet.
[21:14:49] <mutante>	 has open comments
[21:14:51] <hashar>	 but the daemon option you can merge it, I will look at the replica
[21:15:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed this is not changing the active server, just the replica https://puppet-compiler.wmflabs.org/compiler1001/26817/" [puppet] - 10https://gerrit.wikimedia.org/r/643944 (owner: 10Hashar)
[21:17:10] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[21:17:12] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:51] <mutante>	 !log applied deployment_server role on deploy2002, added mcrouter cert, initial puppet run pulls mediawiki-config and other repos, downtimed in Icinga for 40 days (T265963)
[21:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:58] <stashbot>	 T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963
[21:19:34] <wikibugs>	 (03CR) 10Dzahn: "noop on gerrit1001 - changed config on gerrit2001" [puppet] - 10https://gerrit.wikimedia.org/r/643944 (owner: 10Hashar)
[21:19:37] <mutante>	 hashar: it's been applied ^
[21:20:32] <mutante>	 puppet changed the config and triggered systemd daemon-reload
[21:20:38] <mutante>	 but for this kind of change it needs an actual restart
[21:20:45] <mutante>	 to change the command line
[21:20:52] <hashar>	 ahh 
[21:20:55] <hashar>	 that is what I was wondering
[21:21:12] <mutante>	 on the prod gerrit server nothing changed at all, fwiw
[21:21:23] <hashar>	 !log gerrit2001: restarting Gerrit to take into account a config change in the daemon (--replica moved to daemonOpt config file)
[21:21:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:49] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-be2061.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[21:22:23] <wikibugs>	 (03PS2) 10Razzi: Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202)
[21:24:46] <hashar>	 mutante: looks good. Thank you!
[21:25:35] <wikibugs>	 (03CR) 10Razzi: [C: 03+1] Set oozie.service.coord.default.max.timeout to 13 months [puppet] - 10https://gerrit.wikimedia.org/r/644535 (https://phabricator.wikimedia.org/T264358) (owner: 10Ottomata)
[21:27:56] <mutante>	 hashar: cool, thanks for checking. then let's leave it at that for now for both gerrit and deployment servers
[21:28:16] <hashar>	 yeah 
[21:28:21] <mutante>	 next we need to figure out how we want to properly scap --init 2 new servers, one eqiad and one codfw
[21:28:50] <hashar>	 that I am afraid I don't quite know :\
[21:28:57] <mutante>	 i'll go take a break for now. puppet still running on deploy2002 because it pulls all the things.. but that takes a while and it's downtimed 
[21:29:09] <mutante>	 and obviously not considered an active server
[21:30:02] <mutante>	 hashar: yea, if we can be really sure that "scap deploy --init" never deploys TO anything else, then we can reduce it to how to allow that while keeping scap locked for "sync" actions
[21:30:12] <hashar>	 mutante: great.  I am going to bed myself, but others in releng should be able to assist I guess
[21:30:17] <hashar>	 and surely have more knowledge than me
[21:30:30] <hashar>	 but in short, scap deploy --init does the cloning and a few other actions to set up the git repos
[21:30:35] <wikibugs>	 (03PS1) 10Razzi: Add kafka-test1006 as start of test kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/644620 (https://phabricator.wikimedia.org/T268202)
[21:30:39] <mutante>	 we will ping releng from team to team 
[21:30:56] <wikibugs>	 (03PS1) 10Andrew Bogott: Make clouddvirt1030 a ceph-enabled hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/644621 (https://phabricator.wikimedia.org/T261132)
[21:31:16] <mutante>	 ack, cu later hashar!
[21:31:18] <hashar>	 mutante: cool :]]
[21:31:27] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] Configure zookeeper-test1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/644344 (https://phabricator.wikimedia.org/T268074) (owner: 10Razzi)
[21:31:27] <hashar>	 can't wait for the new servers hehe
[21:31:35] <hashar>	 have a good lunch / break
[21:31:42] <mutante>	 originally it was only because the hardware is old
[21:31:52] <mutante>	 the part where the OS is upgraded too was added on to it later
[21:31:56] <mutante>	 bye
[21:32:08] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Platform Team Workboards (Green): eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Clarakosi)
[21:36:04] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2061.codfw.wmnet'] `  Of which those **FAILED**: ` ['ms-be2061.codfw.wmnet'] `
[21:37:40] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[21:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:46] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[21:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:24] <wikibugs>	 (03PS3) 10Razzi: Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202)
[21:53:58] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[21:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:59] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:56:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:40] <wikibugs>	 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10JMinor) I'm taking this off our active release board for now. We're discussi...
[22:11:50] <rzl>	 testing out the MW warmup script in codfw -- it'll produce some nonpaging alerts about appserver latency, those are safe to ignore
[22:13:12] <logmsgbot>	 !log rzl@cumin2001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
[22:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:54] <logmsgbot>	 !log rzl@cumin2001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0)
[22:16:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:54] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[22:17:31] <wikibugs>	 (03PS1) 10Florianschmidtwelzow: Move disabling sitenotice on wikimedia wikis to mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173)
[22:18:30] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[22:21:31] <rzl>	 done
[22:26:56] <wikibugs>	 (03PS7) 10Dzahn: thumbor: move thumbor mediawiki role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953)
[22:28:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26818/thumbor2004.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[22:31:07] <wikibugs>	 (03PS2) 10Dzahn: conftool: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953)
[22:31:31] <wikibugs>	 (03CR) 10Dzahn: "noop on thumbor1004/2004" [puppet] - 10https://gerrit.wikimedia.org/r/643112 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[22:32:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] conftool: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[22:36:34] <wikibugs>	 (03PS3) 10Dzahn: conftool: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/644357 (https://phabricator.wikimedia.org/T209953)
[22:37:15] <wikibugs>	 (03CR) 10Dzahn: "yep, module is already deleted now" [puppet] - 10https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[22:37:39] <wikibugs>	 (03CR) 10CRusnov: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/644358 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[22:37:49] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "do deploy2002 as well right away" [puppet] - 10https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn)
[22:38:54] <wikibugs>	 10Puppet, 10DBA, 10SRE-tools, 10conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (10RLazarus) 05Open→03Resolved a:03RLazarus >>! In T261767#6450152, @Marostegui wrote: > @RLa...
[22:38:59] <wikibugs>	 (03PS2) 10Dzahn: k8s: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/644363 (https://phabricator.wikimedia.org/T209953)
[22:39:57] <wikibugs>	 (03PS2) 10Dzahn: ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117
[22:41:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ores: move LB setup for cloud from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/643117 (owner: 10Dzahn)
[22:41:27] <Urbanecm>	 !log Start of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=arwiki; T246539)
[22:41:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:34] <stashbot>	 T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539
[22:43:03] <bearloga>	 ottomata: check out the "standard analytics fields" tab https://mep-index.wmflabs.org/ :D
[22:54:47] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi)
[22:57:19] <wikibugs>	 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work): Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10RKemper) Following deploy, the alert shows up in icinga: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=alert1001&service=Mediaw...
[23:03:21] <wikibugs>	 (03PS8) 10Razzi: zookeeper: configure test-eqiad single-node cluster [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202)
[23:07:16] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] Add kafka-test1006.eqiad.wmnet virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/644607 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi)
[23:12:42] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[23:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:49] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[23:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:55] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[23:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:06] <logmsgbot>	 !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[23:13:11] <wikibugs>	 10Operations, 10serviceops, 10Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (10RLazarus)
[23:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:13] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[23:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:20] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[23:13:21] <logmsgbot>	 !log razzi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[23:13:21] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[23:13:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:30] <ryankemper>	 !log T259588 Beginning wdqs categories data-reload on the following instances (one each from `[public, internal] x [eqiad, codfw]`): `wdqs1005`, `wdqs2002`, `wdqs1008`, `wdqs2005`
[23:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:37] <stashbot>	 T259588: Reload categories once 1.36.0-wmf.3 is running on all groups - https://phabricator.wikimedia.org/T259588
[23:16:21] <wikibugs>	 (03PS1) 10Papaul: DHCP: Add MAC address for db214[234] [puppet] - 10https://gerrit.wikimedia.org/r/644632 (https://phabricator.wikimedia.org/T267041)
[23:21:22] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for db214[234] [puppet] - 10https://gerrit.wikimedia.org/r/644632 (https://phabricator.wikimedia.org/T267041) (owner: 10Papaul)
[23:22:36] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10Andrew) I'm unable to pxe boot this host.  It doesn't display much of anything, just hangs for a while and then fails over to hdd.  The sa...
[23:29:54] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2142.codfw.wmnet ` The log can be f...
[23:29:59] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2142.codfw.wmnet'] `  Of which those **FAILED**: ` ['db2142.codfw.wmnet'] `
[23:30:09] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2142.codfw.wmnet ` The log can be f...
[23:32:44] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269181 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[23:32:49] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T269181 (10ops-monitoring-bot)
[23:44:06] <wikibugs>	 (03PS9) 10Razzi: zookeeper: configure test-eqiad single-node cluster [puppet] - 10https://gerrit.wikimedia.org/r/642497 (https://phabricator.wikimedia.org/T268202)
[23:49:09] <wikibugs>	 (03PS4) 10Razzi: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788)
[23:49:15] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Move disabling sitenotice on wikimedia wikis to mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173) (owner: 10Florianschmidtwelzow)
[23:50:12] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] Move disabling sitenotice on wikimedia wikis to mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/644627 (https://phabricator.wikimedia.org/T269173) (owner: 10Florianschmidtwelzow)