[00:14:44] <wikibugs>	 (CR) CRusnov: [V: +2 C: +2] Rebuild Netbox 2.9.10 dependencies [software/netbox-deploy] - https://gerrit.wikimedia.org/r/651280 (owner: CRusnov)
[00:19:55] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@b17db99]: Redeploy of 2.9.10 to netbox-dev for dep test
[00:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:49] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@b17db99]: Redeploy of 2.9.10 to netbox-dev for dep test (duration: 00m 54s)
[00:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:17] <wikibugs>	 (CR) Ori.livneh: "> Patch Set 1: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/651267 (owner: Ori.livneh)
[00:32:50] <wikibugs>	 (PS1) Legoktm: aptrepo: Pull pyall component over HTTPS [puppet] - https://gerrit.wikimedia.org/r/651300
[00:41:17] <wikibugs>	 (PS2) Legoktm: aptrepo: Pull pyall component over HTTPS [puppet] - https://gerrit.wikimedia.org/r/651300
[00:48:57] <wikibugs>	 (PS1) Bstorm: wikireplicas: set up VM haproxy layer [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376)
[00:50:03] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 10923952192 and 592 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:03] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8314045280 and 433 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2451684784 and 154 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6385288736 and 362 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:13] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3135964968 and 295 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2444305600 and 271 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:41] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1554093408 and 238 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:43] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4012208296 and 368 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:47] <wikibugs>	 (CR) Bstorm: wikireplicas: set up VM haproxy layer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[00:53:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 133408 and 237 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5880 and 343 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:49] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 149224 and 359 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:01] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 67840 and 370 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:41] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 12704 and 471 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 32216 and 510 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4376 and 535 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 22376 and 608 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:13:24] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T270663 (ops-monitoring-bot)
[01:21:07] <wikibugs>	 Operations, Traffic, netops: Was unable to connect (esams) for about 20 minutes - https://phabricator.wikimedia.org/T270664 (AlexisJazz)
[02:07:38] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.23 [core] (wmf/1.36.0-wmf.23) - https://gerrit.wikimedia.org/r/651306
[02:11:22] <wikibugs>	 (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.23 [core] (wmf/1.36.0-wmf.23) - https://gerrit.wikimedia.org/r/651306 (https://phabricator.wikimedia.org/T267416) (owner: TrainBranchBot)
[02:11:32] <wikibugs>	 (Abandoned) DannyS712: Branch commit for wmf/1.36.0-wmf.23 [core] (wmf/1.36.0-wmf.23) - https://gerrit.wikimedia.org/r/651306 (https://phabricator.wikimedia.org/T267416) (owner: TrainBranchBot)
[02:32:57] <icinga-wm>	 PROBLEM - Check systemd state on dbprov2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:27] <icinga-wm>	 RECOVERY - Check systemd state on dbprov2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:53:24] <wikibugs>	 (CR) Marostegui: [C: +1] "Thanks!" [software/conftool] - https://gerrit.wikimedia.org/r/651269 (https://phabricator.wikimedia.org/T269324) (owner: CDanis)
[05:54:45] <wikibugs>	 Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) Thank you @krinkle One further question, we are still thinking whether to configure ROW or STA...
[05:55:18] <wikibugs>	 (CR) Marostegui: "This won't be needed, quoting Timo:" [puppet] - https://gerrit.wikimedia.org/r/649820 (https://phabricator.wikimedia.org/T269324) (owner: Jcrespo)
[06:00:48] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage x2 eqiad hosts [puppet] - https://gerrit.wikimedia.org/r/651313
[06:01:33] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage x2 eqiad hosts [puppet] - https://gerrit.wikimedia.org/r/651313 (owner: Marostegui)
[06:04:33] <wikibugs>	 (PS3) Legoktm: aptrepo: Pull pyall component over HTTPS [puppet] - https://gerrit.wikimedia.org/r/651300
[06:59:17] <wikibugs>	 (PS2) Elukey: hive: remove analytics-replicated-hive config [puppet] - https://gerrit.wikimedia.org/r/650077 (https://phabricator.wikimedia.org/T268028)
[07:01:35] <wikibugs>	 (CR) Elukey: [C: +2] hive: remove analytics-replicated-hive config [puppet] - https://gerrit.wikimedia.org/r/650077 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[07:08:46] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:11:04] <elukey>	 ah this seems to be mr1
[07:11:13] <elukey>	 the alarm is missing something
[07:11:55] <elukey>	 yep indeed https://librenms.wikimedia.org/device/53
[07:13:46] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:20:51] <elukey>	 I checked and the cause seems to be flowd_octeon_hm, but IIUC from reading it seems to be a system daemon that sometimes processes packets at a higher rate (consuming more cpu)
[07:27:43] <elukey>	 !log reboot stat100[4-8] (analytics hadoop clients) for kernel upgrades
[07:27:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:45] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:01] <elukey>	 ah lovely
[07:28:21] <icinga-wm>	 PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[07:30:28] <elukey>	 it seems to be the high cpu usage
[07:32:45] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:15] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:51] <icinga-wm>	 RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[07:40:05] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:21] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:47] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:47:47] <jinxer-wm>	 (Processor usage over 85%) resolved: (2) Processor usage over 85% - https://alerts.wikimedia.org
[08:29:49] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team-TODO, observability, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar)
[08:36:01] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:11] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:03] <wikibugs>	 (PS1) Elukey: admin: deprecate the analytics-users posix group [puppet] - https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150)
[08:48:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] admin: deprecate the analytics-users posix group [puppet] - https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150) (owner: Elukey)
[08:49:43] <wikibugs>	 (PS2) Elukey: admin: deprecate the analytics-users posix group [puppet] - https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150)
[08:52:02] <hashar>	 !log gerrit: running jhat heap analyzer on gerrit2001  # T263008
[08:52:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:06] <stashbot>	 T263008: Gerrit out of heap - https://phabricator.wikimedia.org/T263008
[08:55:20] <hashar>	 a 34GB dump :-\
[09:18:06] <wikibugs>	 (PS1) Giuseppe Lavagetto: Remove user/group settings from php-fpm's configuration [docker-images/production-images] - https://gerrit.wikimedia.org/r/651450
[09:20:13] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T270663 (Peachey88)
[09:21:12] <wikibugs>	 Operations, Traffic, netops: Was unable to connect (esams) for about 20 minutes - https://phabricator.wikimedia.org/T270664 (Peachey88)
[09:42:13] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: nftables: introduce nft-check exec [puppet] - https://gerrit.wikimedia.org/r/651453
[09:44:56] <wikibugs>	 (PS1) Jbond: enable-puppet: ensure we make sure puppet is not running before enableing [puppet] - https://gerrit.wikimedia.org/r/651454
[09:45:20] <jbond42>	 volans: ^^^ could be the fix
[09:48:37] <wikibugs>	 (CR) Volans: [C: +1] "It shouldn't hurt and makes sense if we run the enable on a host with puppet already enabled and hence potentially running" [puppet] - https://gerrit.wikimedia.org/r/651454 (owner: Jbond)
[09:50:08] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "yes @dcaro I did some tests in toolsbeta. I'm convinced the patch works." [puppet] - https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: Arturo Borrero Gonzalez)
[09:50:27] <wikibugs>	 (Abandoned) Arturo Borrero Gonzalez: kubeadm: etcd: introduce systemd-based higher priority scheduling policy [puppet] - https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: Arturo Borrero Gonzalez)
[10:00:07] <icinga-wm>	 PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[10:01:03] <elukey>	 this is me --^
[10:02:43] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T270663 (Volans) Open→Resolved a:Volans It was a transient degradation due to a rebuild, it's all good now, resolving.  ` $ sudo /usr/local/lib/nagios/plugins/get-raid-s...
[10:06:53] <volans>	 !log upgraded python3-wmflib to 0.0.5 on cumin2001
[10:06:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:06] <wikibugs>	 (PS1) Elukey: cdh::hive: don't use TLS for local db connections [puppet] - https://gerrit.wikimedia.org/r/651455
[10:08:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] cdh::hive: don't use TLS for local db connections [puppet] - https://gerrit.wikimedia.org/r/651455 (owner: Elukey)
[10:08:39] <elukey>	 iff
[10:09:28] <wikibugs>	 (PS2) Elukey: cdh::hive: don't use TLS for local db connections [puppet] - https://gerrit.wikimedia.org/r/651455
[10:11:30] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27229/console" [puppet] - https://gerrit.wikimedia.org/r/651455 (owner: Elukey)
[10:14:08] <wikibugs>	 (PS3) Elukey: cdh::hive: don't use TLS for local db connections [puppet] - https://gerrit.wikimedia.org/r/651455
[10:14:13] <icinga-wm>	 RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[10:16:40] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27230/console" [puppet] - https://gerrit.wikimedia.org/r/651455 (owner: Elukey)
[10:17:16] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] cdh::hive: don't use TLS for local db connections [puppet] - https://gerrit.wikimedia.org/r/651455 (owner: Elukey)
[10:20:24] <wikibugs>	 (CR) Jbond: [C: +2] enable-puppet: ensure we make sure puppet is not running before enableing [puppet] - https://gerrit.wikimedia.org/r/651454 (owner: Jbond)
[10:20:26] <wikibugs>	 (PS1) Elukey: Failover analytics-hive.eqiad.wmnet to an-coord1001 [dns] - https://gerrit.wikimedia.org/r/651456 (https://phabricator.wikimedia.org/T268028)
[10:21:43] <wikibugs>	 Operations, Cloud-VPS, cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (aborrero)
[10:23:36] <wikibugs>	 (PS1) Jbond: enable-puppet: initialise attempts [puppet] - https://gerrit.wikimedia.org/r/651457
[10:24:16] <wikibugs>	 (PS2) Jbond: enable-puppet: initialise attempts [puppet] - https://gerrit.wikimedia.org/r/651457
[10:24:23] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] enable-puppet: initialise attempts [puppet] - https://gerrit.wikimedia.org/r/651457 (owner: Jbond)
[10:24:44] <wikibugs>	 (CR) Elukey: [C: +2] Failover analytics-hive.eqiad.wmnet to an-coord1001 [dns] - https://gerrit.wikimedia.org/r/651456 (https://phabricator.wikimedia.org/T268028) (owner: Elukey)
[10:32:25] <wikibugs>	 Operations, observability, CAS-SSO: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (Volans) This is happening to me even with an SSO session already active fwiw.
[10:34:34] <wikibugs>	 (PS1) Elukey: role::analytics_clustr::coordinator: move Presto to analytics-hive [puppet] - https://gerrit.wikimedia.org/r/651458 (https://phabricator.wikimedia.org/T257412)
[10:34:51] <wikibugs>	 Operations, observability, CAS-SSO: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (Volans) And to be clear, even if I just paste into the browser the URL: https://grafana-rw.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId...
[10:35:00] <wikibugs>	 (PS2) Elukey: role::analytics_cluster::coordinator: move Presto to analytics-hive [puppet] - https://gerrit.wikimedia.org/r/651458 (https://phabricator.wikimedia.org/T257412)
[10:36:04] <wikibugs>	 (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: move Presto to analytics-hive [puppet] - https://gerrit.wikimedia.org/r/651458 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[10:39:09] <wikibugs>	 (PS1) Elukey: role::analytics_cluster::coordinator: set new hive metastore for presto [puppet] - https://gerrit.wikimedia.org/r/651459
[10:39:40] <wikibugs>	 (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: set new hive metastore for presto [puppet] - https://gerrit.wikimedia.org/r/651459 (owner: Elukey)
[10:49:29] <logmsgbot>	 volans@cumin2001 test message
[10:49:33] <volans>	 elukey: ^^
[10:49:37] <volans>	 \o/
[10:49:37] <wikibugs>	 (PS1) Odder: Add localised logos for the Madurese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/651462 (https://phabricator.wikimedia.org/T270693)
[10:49:40] <elukey>	 \o/
[10:49:42] <elukey>	 niceeeee
[10:51:51] <logmsgbot>	 !log volans@cumin2001 test SAL message from wmflib, please ignore
[10:51:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:57] <volans>	 elukey: ^^^ :D
[10:56:13] <wikibugs>	 (PS1) Volans: doc: improve documentation [software/pywmflib] - https://gerrit.wikimedia.org/r/651463
[10:57:49] <wikibugs>	 (CR) Elukey: [C: +1] doc: improve documentation [software/pywmflib] - https://gerrit.wikimedia.org/r/651463 (owner: Volans)
[10:58:15] <wikibugs>	 (CR) Volans: [C: +2] doc: improve documentation [software/pywmflib] - https://gerrit.wikimedia.org/r/651463 (owner: Volans)
[10:59:33] <wikibugs>	 (Merged) jenkins-bot: doc: improve documentation [software/pywmflib] - https://gerrit.wikimedia.org/r/651463 (owner: Volans)
[10:59:58] <wikibugs>	 (PS1) Odder: Add localised logos for the Madurese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/651464 (https://phabricator.wikimedia.org/T270693)
[11:00:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add localised logos for the Madurese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/651464 (https://phabricator.wikimedia.org/T270693) (owner: Odder)
[11:02:24] <jbond42>	 !log upload puppet 5.5.22 to stretch-wikimedia
[11:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:51] <wikibugs>	 (Abandoned) Odder: Add localised logos for the Madurese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/651464 (https://phabricator.wikimedia.org/T270693) (owner: Odder)
[11:05:39] <wikibugs>	 (PS1) Odder: Add localised logos for the Madurese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/651465 (https://phabricator.wikimedia.org/T270693)
[11:10:55] <jbond42>	 !log upload puppet 5.5.22 to jessie-wikimedia
[11:10:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:56] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] "I just captured what is not being NAT'ed at this moment:" [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez)
[11:19:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] "That list is:" [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez)
[11:42:23] <wikibugs>	 (CR) ArielGlenn: [V: +2 C: +2] make sure bz2 header is read when reading blocks backwards [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/645340 (https://phabricator.wikimedia.org/T269225) (owner: ArielGlenn)
[11:43:07] <wikibugs>	 (CR) ArielGlenn: [C: +2] version 0.1.2 [debs/mwbzutils] - https://gerrit.wikimedia.org/r/651162 (owner: ArielGlenn)
[11:52:30] <marostegui>	 !log Set db1151 to writable T269324
[11:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:34] <stashbot>	 T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[11:56:00] <wikibugs>	 Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[11:56:03] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (Marostegui)
[12:08:27] <icinga-wm>	 PROBLEM - DPKG on db2113 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:10:11] <wikibugs>	 Operations, Puppet, User-jbond: Update puppet infrastructure latest 5.5 version - https://phabricator.wikimedia.org/T265139 (jbond) puppet 5.5.22 has now been deployed fleet wide.
[12:10:23] <icinga-wm>	 PROBLEM - DPKG on elastic2050 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:10:38] <jbond42>	 looking ^^
[12:10:43] <icinga-wm>	 PROBLEM - DPKG on ms-be1028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:11:52] <marostegui>	 jbond42: A puppet run on db2113 seems to have fixed it
[12:11:53] <icinga-wm>	 PROBLEM - DPKG on elastic2054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:12:37] <jbond42>	 marostegui: thanks looks like the puppet update had a few issues on some machines
[12:12:39] <icinga-wm>	 PROBLEM - DPKG on kubernetes2012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:12:43] <icinga-wm>	 RECOVERY - DPKG on db2113 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:13:01] <icinga-wm>	 PROBLEM - DPKG on an-worker1096 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:13:05] <icinga-wm>	 PROBLEM - DPKG on db1082 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:15:57] <icinga-wm>	 RECOVERY - DPKG on db1082 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:15:57] <icinga-wm>	 RECOVERY - DPKG on an-worker1096 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:15:57] <icinga-wm>	 RECOVERY - DPKG on ms-be1028 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:15:57] <icinga-wm>	 RECOVERY - DPKG on elastic2050 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:15:57] <icinga-wm>	 RECOVERY - DPKG on elastic2054 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:15:57] <icinga-wm>	 RECOVERY - DPKG on kubernetes2012 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:34:10] <wikibugs>	 (PS2) David Caro: wmcs.backups: move wikidumpparse to cloudvirt1025 [puppet] - https://gerrit.wikimedia.org/r/651175
[12:38:26] <wikibugs>	 (PS1) David Caro: wmcs.backup: Add a method to create a vm backup [puppet] - https://gerrit.wikimedia.org/r/651507
[12:38:37] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "Great work! some comments inline." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[13:45:15] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.backups: move wikidumpparse to cloudvirt1025 [puppet] - https://gerrit.wikimedia.org/r/651175 (owner: David Caro)
[14:11:28] <wikibugs>	 (CR) CDanis: [C: +2] "thanks! dbctl wikitech page also updated" [software/conftool] - https://gerrit.wikimedia.org/r/651269 (https://phabricator.wikimedia.org/T269324) (owner: CDanis)
[14:13:45] <wikibugs>	 (Merged) jenkins-bot: dbctl: README: document section 'flavor' [software/conftool] - https://gerrit.wikimedia.org/r/651269 (https://phabricator.wikimedia.org/T269324) (owner: CDanis)
[14:23:03] <wikibugs>	 Operations, Epic, cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (aborrero)
[14:41:25] <wikibugs>	 (PS1) David Caro: wmcs.backup: Remove all dangling snapshots [puppet] - https://gerrit.wikimedia.org/r/651537
[15:01:08] <wikibugs>	 (CR) Razzi: [C: +2] role::analytics_cluster::ui::dashboards: Add superset to an-tool1010 [puppet] - https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) (owner: Razzi)
[15:05:19] <wikibugs>	 (CR) Razzi: [C: +2] superset: Switch traffic from analytics-tool1004 to an-tool1010 [puppet] - https://gerrit.wikimedia.org/r/650522 (owner: Razzi)
[15:05:20] <nemo-yiannis>	 There is a small change we would like to push before the holidays: https://gerrit.wikimedia.org/r/c/mediawiki/services/wikifeeds/+/650185. We have already reached out to thcipriani and it looks like the change is small enough to be pushed over the holiday deployment freeze.
[15:06:27] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:06:45] <wikibugs>	 Operations, Traffic, netops: Was unable to connect (esams) for about 20 minutes - https://phabricator.wikimedia.org/T270664 (CDanis) Open→Resolved a:CDanis Thanks for the report.  I took a look and found no evidence of larger issues at this time: * no increase in automated Network Error L...
[15:08:32] <logmsgbot>	 !log jgiannelos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[15:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:02] <logmsgbot>	 !log jgiannelos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[15:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:20] <thcipriani>	 nemo-yiannis: thanks :)
[15:23:37] <logmsgbot>	 !log jgiannelos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[15:23:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:58] <wikibugs>	 (CR) Bstorm: wikireplicas: set up VM haproxy layer (3 comments) [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[15:48:34] <wikibugs>	 (PS10) CDanis: Pre-review: initial commit [software/klaxon] - https://gerrit.wikimedia.org/r/651238
[15:49:22] <wikibugs>	 (PS1) Bstorm: cloud nfs: switch to the 10G interface for booting labstore1004 [puppet] - https://gerrit.wikimedia.org/r/651545 (https://phabricator.wikimedia.org/T266202)
[16:01:06] <icinga-wm>	 PROBLEM - Check systemd state on analytics-tool1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:08] <icinga-wm>	 PROBLEM - superset on analytics-tool1004 is CRITICAL: connect to address 10.64.36.116 and port 9080: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset
[16:04:08] <elukey>	 this is expected --^
[16:04:11] <elukey>	 razzi: ---^
[16:04:28] <elukey>	 (so you can test if the icinga ack works now :)
[16:15:08] <wikibugs>	 (PS2) David Caro: wmcs.backup: Remove all dangling snapshots [puppet] - https://gerrit.wikimedia.org/r/651537
[16:15:10] <wikibugs>	 (PS1) David Caro: wmcs.backup: Add a way to remove old backups and snapshots [puppet] - https://gerrit.wikimedia.org/r/651550
[16:15:16] <bstorm>	 !log downtimed and stopped puppet on labstore1004 and labstore1005 for failover T266202
[16:15:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:21] <stashbot>	 T266202: Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202
[16:20:54] <razzi>	 elukey: ah when I saw that I quickly downtimed it for 24h, will have to try icinga ack on something else
[16:27:26] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:17] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@21c0c89]: Regular analytics weekly train [analytics/refinery@Ie7bce02179547ee4c6756d52f9956f492c5b4df6]
[16:39:38] <RhinosF1>	 mforns: stashbot is dead
[16:39:52] <RhinosF1>	 I'm guessing it's known as nfs maint is ongoing
[16:40:15] <mforns>	 RhinosF1: aha, thanks for the heads up
[16:41:32] <RhinosF1>	 mforns: it just joined back if you wanna log again
[16:42:24] <mforns>	 ok!
[16:44:23] <mforns>	 RhinosF1: I tried to !log from the wikimedia-analytics channel, and it didn't work
[16:44:36] <RhinosF1>	 mforns: must be just not talking but online now
[16:44:49] <mforns>	 ok! thanks
[16:44:51] <RhinosF1>	 Will have to wait until b_storm finishes
[16:45:08] <RhinosF1>	 "Unfortunately, the failover has not gone well due to some upgrades to the secondary it seems. I apologize for the service disruptions, and we are trying to bring everything back as fast as possible."
[16:45:08] <mforns>	 np!
[16:45:15] <RhinosF1>	 That just went to the cloud list
[16:47:32] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:48:29] <volans>	 !log restarted ferm on ms-be1026 (failed with DNS query for 'ms-be1055.eqiad.wmnet' failed: query timed out )
[16:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:42] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:49:00] <icinga-wm>	 PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[16:50:28] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@21c0c89]: Regular analytics weekly train [analytics/refinery@Ie7bce02179547ee4c6756d52f9956f492c5b4df6] (duration: 12m 11s)
[16:50:40] <icinga-wm>	 RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[16:51:05] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@21c0c89] (thin): Regular analytics weekly train THIN [analytics/refinery@Ie7bce02179547ee4c6756d52f9956f492c5b4df6]
[16:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:13] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@21c0c89] (thin): Regular analytics weekly train THIN [analytics/refinery@Ie7bce02179547ee4c6756d52f9956f492c5b4df6] (duration: 00m 08s)
[16:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:20] <wikibugs>	 Operations, ops-eqiad, Data-Services, Patch-For-Review, cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (Bstorm) Ok, that has gone very badly. Working to fix.
[16:56:59] <wikibugs>	 (PS1) Mforns: analytics::refinery::job::refine: Bump up refinery_version to 142 [puppet] - https://gerrit.wikimedia.org/r/651559
[17:18:00] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:24:45] <wikibugs>	 Operations, Traffic, netops: Was unable to connect (esams) for about 20 minutes - https://phabricator.wikimedia.org/T270664 (AlexisJazz) >>! In T270664#6708236, @CDanis wrote: > Thanks for the report. >  > I took a look and found no evidence of larger issues at this time: > * no increase in automated...
[17:27:43] <andrewbogott>	 !log shutting down labstore1004 in preparation for move and reimage
[17:27:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:52] <icinga-wm>	 PROBLEM - Host mw1274.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:10] <icinga-wm>	 PROBLEM - Host mw1272.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:16] <icinga-wm>	 PROBLEM - Host mw1273.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:31:13] <wikibugs>	 (PS1) Andrew Bogott: labstore1004: move to 'insetup' role [puppet] - https://gerrit.wikimedia.org/r/651564 (https://phabricator.wikimedia.org/T266202)
[17:31:28] <icinga-wm>	 PROBLEM - Host mw1279.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:31:32] <icinga-wm>	 PROBLEM - Host mw1275.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:31:59] <wikibugs>	 (CR) Bstorm: [C: +2] labstore1004: move to 'insetup' role [puppet] - https://gerrit.wikimedia.org/r/651564 (https://phabricator.wikimedia.org/T266202) (owner: Andrew Bogott)
[17:32:01] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] labstore1004: move to 'insetup' role [puppet] - https://gerrit.wikimedia.org/r/651564 (https://phabricator.wikimedia.org/T266202) (owner: Andrew Bogott)
[17:33:04] <wikibugs>	 (CR) Bstorm: [C: +2] cloud nfs: switch to the 10G interface for booting labstore1004 [puppet] - https://gerrit.wikimedia.org/r/651545 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[17:33:51] <wikibugs>	 (PS1) Elukey: Force the Hive Servers to use their local Metastores [puppet] - https://gerrit.wikimedia.org/r/651565
[17:35:26] <icinga-wm>	 RECOVERY - Host mw1274.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[17:35:44] <icinga-wm>	 RECOVERY - Host mw1272.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[17:35:48] <icinga-wm>	 RECOVERY - Host mw1273.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[17:37:00] <icinga-wm>	 RECOVERY - Host mw1279.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[17:37:04] <icinga-wm>	 RECOVERY - Host mw1275.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[17:37:11] <wikibugs>	 (PS2) Elukey: Force the Hive Servers to use their local Metastores [puppet] - https://gerrit.wikimedia.org/r/651565
[17:39:08] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27232/console" [puppet] - https://gerrit.wikimedia.org/r/651565 (owner: Elukey)
[17:51:56] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] Force the Hive Servers to use their local Metastores [puppet] - https://gerrit.wikimedia.org/r/651565 (owner: Elukey)
[17:56:54] <wikibugs>	 (PS1) Bstorm: cloud nfs: get primary interface from puppet and create mountpoints [puppet] - https://gerrit.wikimedia.org/r/651568 (https://phabricator.wikimedia.org/T266202)
[18:06:16] <icinga-wm>	 RECOVERY - MegaRAID on db1101 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:07:29] <wikibugs>	 (CR) Elukey: [C: +2] analytics::refinery::job::refine: Bump up refinery_version to 142 [puppet] - https://gerrit.wikimedia.org/r/651559 (owner: Mforns)
[18:11:28] <wikibugs>	 (PS1) Elukey: Point analytics-hive.eqiad.wmnet to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/651569
[18:13:13] <wikibugs>	 (CR) Elukey: [C: +2] Point analytics-hive.eqiad.wmnet to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/651569 (owner: Elukey)
[18:25:28] <wikibugs>	 Operations, Graphoid, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Iniquity)
[18:27:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on labstore1004.eqiad.wmnet with reason: REIMAGE
[18:27:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:08] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on labstore1004.eqiad.wmnet with reason: REIMAGE
[18:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:43] <wikibugs>	 (PS1) Andrew Bogott: Revert "labstore1004: move to 'insetup' role" [puppet] - https://gerrit.wikimedia.org/r/651496
[18:31:45] <wikibugs>	 Operations, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (Legoktm)
[18:34:12] <wikibugs>	 (CR) Bstorm: [C: +1] "May want to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/651568 first. Puppet is disabled on 1005, so it would be safe enou" [puppet] - https://gerrit.wikimedia.org/r/651496 (owner: Andrew Bogott)
[18:37:22] <wikibugs>	 (CR) Andrew Bogott: [C: +1] cloud nfs: get primary interface from puppet and create mountpoints [puppet] - https://gerrit.wikimedia.org/r/651568 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[18:43:29] <wikibugs>	 (PS2) Bstorm: cloud nfs: get primary interface from puppet and create mountpoints [puppet] - https://gerrit.wikimedia.org/r/651568 (https://phabricator.wikimedia.org/T266202)
[18:48:50] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloud nfs: get primary interface from puppet and create mountpoints [puppet] - https://gerrit.wikimedia.org/r/651568 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[18:49:03] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "labstore1004: move to 'insetup' role" [puppet] - https://gerrit.wikimedia.org/r/651496 (owner: Andrew Bogott)
[18:49:53] <wikibugs>	 (PS1) Ahmon Dancy: 0.3.1: Use securityContext instead of main_app.rootImage [deployment-charts] - https://gerrit.wikimedia.org/r/651571
[18:56:47] <wikibugs>	 (CR) RLazarus: "nitpicks ahoy!" (14 comments) [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (owner: CDanis)
[18:57:22] <wikibugs>	 (PS1) Bstorm: labstore: remove excess variable from hiera [puppet] - https://gerrit.wikimedia.org/r/651575 (https://phabricator.wikimedia.org/T266202)
[18:59:32] <wikibugs>	 (CR) Bstorm: [C: +2] labstore: remove excess variable from hiera [puppet] - https://gerrit.wikimedia.org/r/651575 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[19:01:23] <wikibugs>	 (PS11) CDanis: Pre-review: initial commit [software/klaxon] - https://gerrit.wikimedia.org/r/651238
[19:11:25] <wikibugs>	 Operations, ops-esams, DC-Ops, Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (wiki_willy)
[19:32:15] <mutante>	 !log restarting gerrit to pick up config change in gitiles for T269300
[19:34:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:22] <stashbot>	 T269300: Modify readme.md for gerrit markdown - https://phabricator.wikimedia.org/T269300
[19:39:52] <wikibugs>	 (PS12) CDanis: Pre-review: initial commit [software/klaxon] - https://gerrit.wikimedia.org/r/651238
[19:40:11] <wikibugs>	 (CR) CDanis: "Thanks for the helpful comments!  Most have been resolved." (14 comments) [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (owner: CDanis)
[19:41:07] <wikibugs>	 Operations, ops-eqiad, DBA: Degraded RAID on db1101 - https://phabricator.wikimedia.org/T270571 (Marostegui) Open→Resolved Not sure who changed the disk, but thank you either John or Chris! ` 19:06:25 <+icinga-wm> RECOVERY - MegaRAID on db1101 is OK: OK: optimal, 1 logical, 10 physical, Write...
[19:45:47] <wikibugs>	 Operations, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) >>! In T245757#6695685, @hnowlan wrote: > mw1265 is now reimaged and pooled with weight 5 (as opposed...
[19:48:02] <wikibugs>	 Operations, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[19:50:30] <wikibugs>	 (PS1) Bstorm: cloud nfs: make sure the /srv/scratch dir is there and another fix [puppet] - https://gerrit.wikimedia.org/r/651608 (https://phabricator.wikimedia.org/T266202)
[19:51:17] <wikibugs>	 Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle) @Marostegui Regarding schema changes, I'd say very unlikely. It is a key-value store that hasn't...
[19:52:16] <wikibugs>	 Operations, ops-eqiad, Data-Services, Patch-For-Review, cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (Bstorm)
[19:58:46] <wikibugs>	 Operations, Graphoid, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF) Yup, this is an inescapable consequence of client-side-only rendering. For now we should switch off the graphs...
[19:59:03] <wikibugs>	 Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF)
[20:00:20] <wikibugs>	 (PS1) Nikki Nikkhoui: Enable safehtml in gitiles [puppet] - https://gerrit.wikimedia.org/r/651614 (https://phabricator.wikimedia.org/T269300)
[20:06:31] <wikibugs>	 Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF) Hmm, no, it's rendering for me? It's all client-side already.
[20:07:54] <wikibugs>	 Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF) {F33970350} and {F33970348} for example.
[20:21:58] <wikibugs>	 Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Iniquity) >>! In T270720#6708984, @Jdforrester-WMF wrote: > Hmm, no, it's rendering for me? It's all client-s...
[20:23:56] <wikibugs>	 (PS13) CDanis: Initial commit of Klaxon, a webapp for trusted users to page SRE [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (https://phabricator.wikimedia.org/T270324)
[20:23:56] <wikibugs>	 Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Iniquity)
[20:35:03] <wikibugs>	 Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF) It looks like `Шаблон:График просмотров` is inserting a graph callback, which isn't supporte...
[20:39:18] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is CRITICAL: 238.9 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[20:45:27] <wikibugs>	 Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle)
[20:45:52] <wikibugs>	 Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle)
[20:46:10] <wikibugs>	 Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle)
[20:55:30] <wikibugs>	 (CR) RLazarus: [C: +1] "LGTM! 🚀" (4 comments) [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (https://phabricator.wikimedia.org/T270324) (owner: CDanis)
[21:00:57] <wikibugs>	 (Abandoned) Nikki Nikkhoui: Enable safehtml in gitiles [puppet] - https://gerrit.wikimedia.org/r/651614 (https://phabricator.wikimedia.org/T269300) (owner: Nikki Nikkhoui)
[21:03:49] <wikibugs>	 Operations, ops-codfw, DC-Ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (Papaul)
[21:04:17] <wikibugs>	 (CR) CDanis: [V: +2 C: +2] "thanks again!" (3 comments) [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (https://phabricator.wikimedia.org/T270324) (owner: CDanis)
[21:08:42] <wikibugs>	 (PS1) CDanis: klaxon: address comments from ba13da2 [software/klaxon] - https://gerrit.wikimedia.org/r/651625
[21:09:10] <wikibugs>	 (CR) CDanis: [V: +2 C: +2] klaxon: address comments from ba13da2 [software/klaxon] - https://gerrit.wikimedia.org/r/651625 (owner: CDanis)
[21:10:07] <wikibugs>	 Operations, ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (Papaul) Received the new switch  {F33970453}
[21:23:56] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/27237/" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[21:24:48] <wikibugs>	 (CR) Dzahn: [C: +2] "Thanks! it's also installed on mwmaint by the way: https://debmonitor.wikimedia.org/packages/php-readline" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[21:26:48] <andrewbogott>	 !log upgrading wikitech-static: mediawiki to 1.35.1 and general apt upgrade
[21:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:15] <mutante>	 !log deploy1002/deploy2002 - apt-get remove --purge php-readline and let puppet reinstall it (7.2 vs 7.3 after gerrit 651158) T265963
[21:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:18] <stashbot>	 T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963
[21:32:32] <wikibugs>	 (CR) Dzahn: "on deploy1001 the repo gets added but 2:7.2+69+0~20190215163918.14+stretch~1.gbpfa617b+wmf1 is installed before and after.  on deploy1002/" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[21:34:08] <wikibugs>	 (CR) Dzahn: "2:7.2+69+0~20190215163918.14+stretch~1.gbpfa617b+wmf1   is installed on all 4 deployment servers now" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[21:45:19] <wikibugs>	 (PS3) Dzahn: aptrepo: replace cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138)
[21:47:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:50:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 30 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:53:01] <wikibugs>	 (CR) Dzahn: [C: +2] aptrepo: replace cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[21:56:08] <wikibugs>	 (CR) Dzahn: "apt2001 - nothing because the sync is only on primary servers - on apt1001 cron removed, timer added" [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[21:57:50] <mutante>	 !log apt1001 - sudo systemctl status rsync-aptrepo-apt2001.wikimedia.org.service - confirmed timer job is working like the cron before
[21:57:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:48] <wikibugs>	 (CR) Dzahn: "Also manually ran the service and confirmed it ran and finished fine." [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[21:59:53] <wikibugs>	 (CR) Dzahn: [C: +2] "only moves an existing line around and adds a comment (for now)" [dns] - https://gerrit.wikimedia.org/r/650628 (https://phabricator.wikimedia.org/T247653) (owner: Dzahn)
[22:00:23] <wikibugs>	 (PS2) Dzahn: add doc to misc services with multiple backends [dns] - https://gerrit.wikimedia.org/r/650628 (https://phabricator.wikimedia.org/T247653)
[22:04:58] <wikibugs>	 (PS1) Razzi: Add cookbook for rebooting druid nodes [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596)
[22:05:51] <wikibugs>	 (CR) Razzi: "First pass at a cookbook - I'm sure there are many issues - let me know what you think, and if / how this can be tested!" [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[22:06:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add cookbook for rebooting druid nodes [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[22:15:38] <wikibugs>	 (CR) Bstorm: [C: +2] cloud nfs: make sure the /srv/scratch dir is there and another fix [puppet] - https://gerrit.wikimedia.org/r/651608 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[22:18:44] <wikibugs>	 (PS2) Razzi: Add cookbook for rebooting druid nodes [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596)
[22:38:26] <wikibugs>	 (PS1) Ladsgroup: cache: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/651640 (https://phabricator.wikimedia.org/T209953)
[22:41:08] <wikibugs>	 (PS2) Ladsgroup: cache: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/651640 (https://phabricator.wikimedia.org/T209953)
[22:51:10] <wikibugs>	 (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27238/" [puppet] - https://gerrit.wikimedia.org/r/651640 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[22:54:04] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (jbond)
[22:56:41] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (jbond)
[22:57:19] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (jbond)
[23:03:36] <wikibugs>	 (PS1) Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/651643
[23:03:38] <wikibugs>	 (PS1) Dzahn: aptrepo: remove absented cron code, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/651644
[23:04:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/651643 (owner: Dzahn)
[23:04:26] <wikibugs>	 (Abandoned) Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/651643 (owner: Dzahn)
[23:04:34] <wikibugs>	 (PS2) Dzahn: aptrepo: remove absented cron code, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/651644
[23:05:11] <wikibugs>	 (PS3) Dzahn: aptrepo: remove absented cron code, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/651644 (https://phabricator.wikimedia.org/T265138)
[23:05:47] <wikibugs>	 Operations, ops-eqiad, Data-Services, Patch-For-Review, cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (Andrew)
[23:08:12] <wikibugs>	 (CR) Dzahn: [C: +2] aptrepo: remove absented cron code, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/651644 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[23:10:32] <wikibugs>	 Operations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) Stalled→Open Will be resolved Jan 1st
[23:58:51] <wikibugs>	 Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, observability, User-DannyS712: Beta cluster logstash down - https://phabricator.wikimedia.org/T268200 (DannyS712) >>! In T268200#6708351, @colewhite wrote: > Hi @DannyS712  >  > As of this writing, logs are flowing in deployme...