[00:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T0000).
[00:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:09:20] <wikibugs>	 SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn)
[00:11:54] <wikibugs>	 (PS2) Ryan Kemper: wdqs: use envoy for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/657913 (https://phabricator.wikimedia.org/T272713)
[00:14:53] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1003.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:14:54] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1003.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:00] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1008.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:01] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1008.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:01] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1011.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:02] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1011.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:03] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2004.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:04] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2004.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:04] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2005.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:05] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2005.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:06] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2006.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:07] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2006.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:26] <ryankemper>	 !log T272713 [Deploy envoy for `wdqs-internal`] Downtimed all `wdqs-internal` hosts on icinga
[00:15:27] <wikibugs>	 (PS1) Dzahn: parsoid/testreduce: add envoy on testreduce1001 for TLS termination [puppet] - https://gerrit.wikimedia.org/r/658706 (https://phabricator.wikimedia.org/T266509)
[00:15:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:29] <stashbot>	 T272713: Failing HTTP check on WDQS servers after latest deployment - https://phabricator.wikimedia.org/T272713
[00:15:37] <wikibugs>	 (CR) Ryan Kemper: [C: +2] wdqs: use envoy for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/657913 (https://phabricator.wikimedia.org/T272713) (owner: Ryan Kemper)
[00:16:24] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2008.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:16:25] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2008.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:53] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/27683/testreduce1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/658706 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:20:00] <ryankemper>	 !log T272713 [Deploy envoy for `wdqs-internal`] Disabled puppet on all `wdqs-internal` hosts; merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/657913
[00:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:36] <icinga-wm>	 RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:13] <wikibugs>	 (PS1) Dzahn: trafficserver/parsoid: switch TLS termination to 443, upstream port 8001 [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509)
[00:28:02] <icinga-wm>	 PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:16] <wikibugs>	 (CR) Dzahn: [C: -1] "wrong hiera key!" [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:30:49] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq...
[00:31:52] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:29] <wikibugs>	 (CR) CDanis: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/657889 (https://phabricator.wikimedia.org/T272539) (owner: Jbond)
[00:35:40] <wikibugs>	 (PS2) Dzahn: trafficserver/parsoid: switch TLS termination to 443, upstream port 8001 [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509)
[00:36:18] <ryankemper>	 !log [Deploy envoy for `wdqs-internal`] `...Error while evaluating a Function Call, secret(): invalid secret ssl/wdqs-internal.discovery.wmnet.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/profile/manifests/tlsproxy/envoy.pp, line: 129) on node wdqs1003.eqiad.wmnet`
[00:36:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:21] <wikibugs>	 (PS3) Dzahn: trafficserver/parsoid: switch TLS termination to 443, upstream port 8001 [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509)
[00:38:18] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27684/testreduce1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:38:44] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:49] <ryankemper>	 (Forgot to prepend ticket number to previous SAL log message, so sending it again):
[00:44:57] <ryankemper>	 !log T272713 [Deploy envoy for `wdqs-internal`] `...Error while evaluating a Function Call, secret(): invalid secret ssl/wdqs-internal.discovery.wmnet.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/profile/manifests/tlsproxy/envoy.pp, line: 129) on node wdqs1003.eqiad.wmnet`
[00:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:07] <stashbot>	 T272713: Failing HTTP check on WDQS servers after latest deployment - https://phabricator.wikimedia.org/T272713
[00:45:30] <ryankemper>	 !log T272713 [Deploy envoy for `wdqs-internal`] Discovered source of the above failure; the secret key in the puppetmaster `/srv/private` repo has a typo in its name (my error): it had `wqds` instead of `wdqs`. Opening up a patch now
[00:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:47] <mutante>	 ryankemper: you are not the only one. "wqds" even made it into the "typos" file in operations/puppet because I did that in the past. of course it won't help private repo
[00:47:24] <ryankemper>	 Heh
[00:47:40] <ryankemper>	 Something about the upside-down symmetry of the d and the q makes it extra easy to not spot
[00:47:50] <mutante>	 haha yea, i think that's true
[00:49:50] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2296.codfw.wmnet with reason: REIMAGE
[00:49:53] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2295.codfw.wmnet with reason: REIMAGE
[00:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:16] <wikibugs>	 (PS7) Jeena Huneidi: Helmfile for continuous deployment [deployment-charts] - https://gerrit.wikimedia.org/r/634354 (https://phabricator.wikimedia.org/T214158)
[00:51:39] <ryankemper>	 !log T272713 [Deploy envoy for `wdqs-internal`] Fixed typo in private key in commit `ea152df802b55e939d34494a4965ed83a80a24f2`. Puppet run on `wdqs1003` was successful as a result. Monitoring...
[00:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:43] <stashbot>	 T272713: Failing HTTP check on WDQS servers after latest deployment - https://phabricator.wikimedia.org/T272713
[00:52:13] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2295.codfw.wmnet with reason: REIMAGE
[00:52:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:13] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2296.codfw.wmnet with reason: REIMAGE
[00:54:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 919918056 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:09] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11328 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 997463256 and 124 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:25] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 109056 and 151 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:41] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:10:00] <mutante>	 ryankemper: SPARQL monitoring fixed?:) nice!
[01:10:45] <mutante>	 ryankemper: fyi, because we are doing almost the same thing for different services.. i just ran into https://phabricator.wikimedia.org/T255568
[01:10:51] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@246b640]: remove link recommendations from hourly transfer deps
[01:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:06] <mutante>	 i was like "why.. is this not working" and it's that envoy is not listening on v6
[01:11:14] <mutante>	 while ATS tries to use it where it exists
[01:11:32] <mutante>	 at least I saw that ticket earlier, and that's the issue here as well
[01:12:25] <ryankemper>	 Yeah SPARQL monitoring should be fixed now that there's actually a TLS port for the check to hit (well, about to re-enable puppet on the rest of the fleet but after that)
[01:14:06] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2295.codfw.wmnet', 'mw22...
[01:14:22] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@246b640]: remove link recommendations from hourly transfer deps (duration: 03m 31s)
[01:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:37] <ryankemper>	 note BTW that the alert wasn't working at all previously (previously meaning before we broke it further thus necessitating this envoy rollout), so now we will see some `WDQS SPARQL` flapping whenever an instance's blazegraph locks up...still thinking about the best way to cut down on that noise, I think I have an idea though
[01:15:04] <ryankemper>	 mutante: interesting with the ipv4 vs ipv6 stuff
[01:16:28] <ryankemper>	 Looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/629343 is in the works to add an option to fix that, so I'll keep an eye on that patch
[01:17:16] <mutante>	 ryankemper: aha, good to know. one way to influence Icinga checks is to adjust the number of times it should retry before it calls a SOFT state a HARD state (only HARD state sends notifications), default is 3
[01:17:36] <mutante>	 yea, same here. I will continue with the v6 issue on this tomorrow.
[01:17:48] <mutante>	 the fix is for services_proxy but we are outside that
[01:17:51] <mutante>	 so gotta look more
[01:17:54] <ryankemper>	 Great, I was imagining a threshold bump and that retry option would do the same
[01:19:29] <ryankemper>	 (Well, threshold if it were like a "fire off alert if this is broken for 5 minutes", but from what you said it sounds like it's just running the check and firing based off the exit code)
[01:19:42] <ryankemper>	 The second part (that requires a bit of thinking) is because these wdqs instances can lock up indefinitely, we need something like a crappy cronjob/systemd timer that can probe its instance's blazegraph to detect deadlock and then restart the process
[01:20:18] <ryankemper>	 I remember working with k8s at my last job it had a hacky `docker-healthcheck` container that just ran a command like `docker ps` and restarted docker if it took >60s to respond, so I'm envisioning that general concept
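A minimal sketch of the watchdog idea described above, meant to be run once per tick from a cronjob or systemd timer. The probe URL and the 60-second timeout are assumptions for illustration (the timeout mirrors the `docker-healthcheck` anecdote); the unit name `wdqs-blazegraph` is the one restarted by hand later in this log. This is not an existing WDQS tool.

```python
import http.client
import subprocess
import urllib.request

# Hypothetical values for illustration only.
PROBE_URL = "http://localhost:9999/"   # assumed local Blazegraph endpoint
UNIT = "wdqs-blazegraph"               # unit name as restarted later in this log
TIMEOUT_S = 60                         # assumed; treat slower responses as a lock-up


def http_probe(url=PROBE_URL, timeout_s=TIMEOUT_S):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, and HTTP error responses.
        return False


def restart_unit(unit=UNIT):
    """Restart the systemd unit; raises if systemctl exits non-zero."""
    subprocess.run(["systemctl", "restart", unit], check=True)


def check_and_restart(probe=http_probe, restart=restart_unit):
    """Run one probe; restart only if it fails. Return True if a restart happened."""
    if probe():
        return False
    restart()
    return True
```

The probe and restart functions are passed in as parameters so the decision logic can be exercised without touching a real service.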
[01:20:29] <mutante>	 ryankemper: modules/monitoring/manifests/service.pp:            max_check_attempts     => $retries,
[01:20:37] <mutante>	 When a service or host check results in a non-OK or non-UP state and the service check has not yet been (re)checked the number of times specified by the max_check_attempts directive in the service or host definition. This is called a soft error. 
[01:20:55] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2295.codfw.wmnet
[01:20:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:57] <mutante>	 soft errors are displayed in web UI but don't notify IRC 
[01:21:05] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2296.codfw.wmnet
[01:21:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:13] <ryankemper>	 awesome yeah that's exactly what I want
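The SOFT/HARD behaviour mutante describes can be sketched as a small state machine. This is a simplified model, not Icinga's implementation: it ignores retry intervals and repeat notifications, and the class and method names are made up for illustration. Only `max_check_attempts` and its default of 3 come from the discussion above.

```python
class ServiceCheck:
    """Simplified model: notify only once a problem becomes HARD."""

    def __init__(self, max_check_attempts=3):
        self.max_check_attempts = max_check_attempts
        self.failures = 0          # consecutive non-OK results
        self.hard_problem = False  # True once the problem has gone HARD

    def record(self, ok):
        """Feed one check result; return 'PROBLEM', 'RECOVERY', or None."""
        if ok:
            recovered = self.hard_problem
            self.failures = 0
            self.hard_problem = False
            # A recovery is only announced if a HARD problem was announced.
            return "RECOVERY" if recovered else None
        self.failures += 1
        if not self.hard_problem and self.failures >= self.max_check_attempts:
            self.hard_problem = True
            return "PROBLEM"
        # Still SOFT (visible in the web UI, no notification), or already HARD.
        return None


check = ServiceCheck(max_check_attempts=3)
events = [check.record(r) for r in
          [False, False, True, False, False, False, False, True]]
```

In this run the first two failures stay SOFT and clear silently; only the later run of three consecutive failures produces a `PROBLEM`, followed by a `RECOVERY`. Raising `max_check_attempts` is what would suppress short flaps.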
[01:21:23] <ryankemper>	 !log T272713 [Deploy envoy for `wdqs-internal`] Test queries to `wdqs1003.eqiad.wmnet` passed, and metrics in Grafana (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs-internal&from=1611706751381&to=1611710190405) look good. Rolling out to rest of fleet
[01:21:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:27] <stashbot>	 T272713: Failing HTTP check on WDQS servers after latest deployment - https://phabricator.wikimedia.org/T272713
[01:22:32] <mutante>	 ryankemper: you can use "event_handler" in Icinga itself. basically that is "if alert X triggers then run command Y"
[01:22:38] <mutante>	 example is class { 'icinga::event_handlers::raid':
[01:22:49] <mutante>	 that automatically creates tickets when there is a RAID alert
[01:22:57] <mutante>	 the command can also be automatic restart
[01:23:03] <mutante>	 so it will be auto-fixing based on alerts
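As a rough illustration of the "if alert X triggers then run command Y" idea: Nagios-style event handlers are commonly passed the service state, state type, and attempt number as arguments, and the handler decides whether to act. The function name and threshold below are hypothetical, and how such a handler would be wired into the puppet classes mentioned above is not shown in this log.

```python
def should_restart(state, state_type, attempt, soft_attempt_threshold=3):
    """Decide whether an event handler should restart the service.

    state:      "OK", "WARNING", "UNKNOWN" or "CRITICAL"
    state_type: "SOFT" or "HARD"
    attempt:    number of the current check attempt
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    # Optionally act on a late SOFT attempt, before notifications go out.
    return state_type == "SOFT" and attempt >= soft_attempt_threshold
```

Acting on the last SOFT attempt gives the restart a chance to fix the service before the alert goes HARD and notifies the channel.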
[01:23:04] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1008 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.086 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:23:56] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2005 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:23:56] <ryankemper>	 !log T272713 [Deploy envoy for `wdqs-internal`] Roll-out complete. Will monitor `wdqs-internal` for any issues. All the remaining `WDQS SPARQL` alerts should clear shortly
[01:24:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:11] <mutante>	 so .. that is all included in nagios/icinga already, glad to talk more about it later
[01:25:26] <ryankemper>	 great, I'll probably reach out to ya later this week
[01:25:44] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2295.codfw.wmnet
[01:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:50] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2296.codfw.wmnet
[01:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:16] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:48] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:26] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:37:26] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.291 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:38:45] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ee948e0]: transfer_to_es: Enable catchup
[01:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:57] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ee948e0]: transfer_to_es: Enable catchup (duration: 01m 11s)
[01:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:06] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:48:11] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.61`. Pre-deploy tests passing on canary `wdqs1003`
[01:48:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:33] <logmsgbot>	 !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@6c6b2cb]: 0.3.61
[01:48:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:38] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2008 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:50:18] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.61` on canary `wdqs1003`; proceeding to rest of fleet
[01:50:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:23] <logmsgbot>	 !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@6c6b2cb]: 0.3.61 (duration: 07m 50s)
[01:56:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:40] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[01:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:57] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[01:58:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:58:06] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[01:58:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:23] <icinga-wm>	 PROBLEM - Host analytics1073 is DOWN: PING CRITICAL - Packet loss = 100%
[02:21:17] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@9c85a21]: transfer_to_es: start date 2020 -> 2021
[02:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:40] <wikibugs>	 SRE, CAS-SSO: idp.wikimedia.org asking twice for YubiKey - https://phabricator.wikimedia.org/T258029 (Krinkle)
[02:24:16] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@9c85a21]: transfer_to_es: start date 2020 -> 2021 (duration: 02m 59s)
[02:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:53] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:37] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:01] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:32:05] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:15] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 7.929 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:39:13] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:23] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:46:23] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:48:55] <ryankemper>	 !log (Restarted `wdqs-blazegraph` on `wdqs1012`)
[03:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:35] <wikibugs>	 (PS1) Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158)
[04:18:03] <wikibugs>	 (PS2) Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158)
[04:19:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: Jeena Huneidi)
[04:27:55] <wikibugs>	 (PS3) Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158)
[04:29:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: Jeena Huneidi)
[04:30:07] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:36] <wikibugs>	 (PS4) Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158)
[04:37:17] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:36] <wikibugs>	 (CR) Jeena Huneidi: "I am trying to find a way for us to do continuous deployment. I thought it might be good to do it from the deployment server directly sinc" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: Jeena Huneidi)
[04:52:29] <wikibugs>	 SRE, CAS-SSO: SSO Portal: Fix "Remember me" checkbox alignment - https://phabricator.wikimedia.org/T273023 (Krinkle)
[04:53:28] <wikibugs>	 (PS1) Krinkle: apereo_cas: Remove inlined duplicate copy of bootstrap.css [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658759
[04:53:30] <wikibugs>	 (PS1) Krinkle: apereo_cas: Improve loginform design [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658760 (https://phabricator.wikimedia.org/T273023)
[04:58:33] <wikibugs>	 SRE, CAS-SSO, Patch-For-Review: SSO Portal: Fix "Remember me" checkbox alignment - https://phabricator.wikimedia.org/T273023 (Krinkle) ##### Before  {F34035759 height=300}  ##### After  {F34035761 height=300}
[05:17:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:43] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:05] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:25] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:55] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:03] <marostegui>	 !log Deploy schema change on s3 T270055
[05:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:07] <stashbot>	 T270055: Schema change for timestamp field of uploadstash - https://phabricator.wikimedia.org/T270055
[05:51:55] <marostegui>	 twentyafterfour: hey! ready to start in 10 minutes?
[06:00:04] <jouncebot>	 marostegui and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for m3 (phabricator) database master restart. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T0600).
[06:00:38] <twentyafterfour>	 marostegui: ready when you are
[06:00:45] <marostegui>	 o/
[06:00:54] <marostegui>	 !log m3 master restart, phabricator will go on read only - T272596
[06:00:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:58] <stashbot>	 T272596: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596
[06:00:58] <marostegui>	 twentyafterfour: ready!
[06:01:53] <twentyafterfour>	 !log phabricator is read-only
[06:01:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:57] <twentyafterfour>	 marostegui: go :)
[06:02:21] <marostegui>	 restarting!
[06:02:44] <marostegui>	 twentyafterfour: done
[06:03:13] <twentyafterfour>	 !log phabricator is read-write
[06:03:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:22] <twentyafterfour>	 everything looks good
[06:03:23] <marostegui>	 I can edit fine
[06:03:26] <marostegui>	 https://phabricator.wikimedia.org/P13964
[06:03:39] <twentyafterfour>	 !log phabricator appears to be up and running fine
[06:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:55] <marostegui>	 twentyafterfour: thank you very much!
[06:04:02] <twentyafterfour>	 🖒
[06:04:12] <twentyafterfour>	 marostegui: you're welcome! no problem at all :)
[06:04:16] <marostegui>	 <3
[06:13:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P13965 and previous config saved to /var/cache/conftool/dbconfig/20210127-061336-marostegui.json
[06:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:42] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:15:09] <wikibugs>	 (PS1) Marostegui: db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/658782 (https://phabricator.wikimedia.org/T258361)
[06:15:58] <wikibugs>	 (CR) Marostegui: [C: +2] db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/658782 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:17:40] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1160 [puppet] - https://gerrit.wikimedia.org/r/658784 (https://phabricator.wikimedia.org/T258361)
[06:18:23] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1160 [puppet] - https://gerrit.wikimedia.org/r/658784 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:21:57] <icinga-wm>	 RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:07] <icinga-wm>	 PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:06] <wikibugs>	 SRE, Traffic, serviceops, Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Joe) >>! In T273003#6778171, @CDanis wrote: > It seems the User-Agent being used is `Peachy MediaWiki Bot API Versio...
[06:30:59] <wikibugs>	 (CR) Ema: [C: +1] varnish: Set debug=1 in X-Analytics header [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: Effie Mouzeli)
[06:31:08] <wikibugs>	 (CR) Ema: [C: +1] varnish: include X-Client-Port in X-Analytics [puppet] - https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) (owner: Effie Mouzeli)
[06:31:37] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:43] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13966 and previous config saved to /var/cache/conftool/dbconfig/20210127-063930-marostegui.json
[06:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:35] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:48:46] <wikibugs>	 SRE, DBA, Platform Engineering Roadmap Decision Making, Performance-Team (Radar), User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Marostegui) Thanks @Krinkle - I will probably start first with s6 codfw (frwiki,jawiki,ruwiki), and using wikimediadebug to...
[06:57:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13967 and previous config saved to /var/cache/conftool/dbconfig/20210127-065715-marostegui.json
[06:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:19] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:05:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 T272008', diff saved to https://phabricator.wikimedia.org/P13968 and previous config saved to /var/cache/conftool/dbconfig/20210127-070502-marostegui.json
[07:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:07] <stashbot>	 T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008
[07:05:09] <wikibugs>	 SRE, Traffic, Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Joe) p:Medium→Low I also checked the logs from yesterday, and there was no error reported by the backend servers (in envoy o...
[07:09:27] <wikibugs>	 SRE, Traffic, Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Joe) I'm not even sure this qualifies for the "production error" tags. We're talking about 50 events over the last week, that's way...
[07:19:12] <wikibugs>	 SRE, Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Krinkle)
[07:21:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13969 and previous config saved to /var/cache/conftool/dbconfig/20210127-072135-root.json
[07:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:56] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM, we can merge anytime, just ping Analytics when doing it so we are aware :)" [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: Effie Mouzeli)
[07:22:13] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM, we can merge anytime, just ping Analytics when doing it so we are aware :)" [puppet] - https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) (owner: Effie Mouzeli)
[07:24:48] <wikibugs>	 (PS7) Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568)
[07:26:29] <wikibugs>	 (CR) Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli)
[07:26:49] <elukey>	 !log powercycle analytics1073 - kernel soft lock up bug registered, os needs a reboot
[07:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:22] <icinga-wm>	 RECOVERY - Host analytics1073 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[07:29:47] <wikibugs>	 (CR) Effie Mouzeli: [V: +1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/27685/" [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli)
[07:30:32] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:49] <wikibugs>	 (CR) Effie Mouzeli: [V: +1 C: +2] service_proxy: add ipv6 config option on services_proxy config [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli)
[07:30:57] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Enable bracket matching on the first wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) (owner: WMDE-Fisch)
[07:31:10] <wikibugs>	 (PS8) Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568)
[07:33:16] <wikibugs>	 SRE, Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (elukey)
[07:33:47] <wikibugs>	 SRE, Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (elukey)
[07:34:03] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:36:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13970 and previous config saved to /var/cache/conftool/dbconfig/20210127-073638-root.json
[07:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:49] <wikibugs>	 SRE, Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Cyberpower678) Joe, not all requests are 502s.  They are insignificant compared to the amount of requests returning null responses.  This is a very problematic occu...
[07:36:58] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13971 and previous config saved to /var/cache/conftool/dbconfig/20210127-075142-root.json
[07:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:58] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658455 (owner: Legoktm)
[07:57:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13972 and previous config saved to /var/cache/conftool/dbconfig/20210127-075715-marostegui.json
[07:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:23] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[08:00:57] <wikibugs>	 (CR) Effie Mouzeli: [V: +1] "-o modern includes slab_reassign, PCC https://puppet-compiler.wmflabs.org/compiler1003/27688/mc1025.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/656385 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli)
[08:01:19] <wikibugs>	 (CR) Effie Mouzeli: [V: +1 C: +2] hiera: clean up memcached configuration [puppet] - https://gerrit.wikimedia.org/r/656385 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli)
[08:05:51] <wikibugs>	 (PS4) Effie Mouzeli: profile::memcached::instance: simplify handling of extendend_options [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey)
[08:06:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13973 and previous config saved to /var/cache/conftool/dbconfig/20210127-080645-root.json
[08:06:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:06] <wikibugs>	 (PS1) Ladsgroup: Add WMCS to the exception of ratelimit [mediawiki-config] - https://gerrit.wikimedia.org/r/658890 (https://phabricator.wikimedia.org/T209011)
[08:07:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P13974 and previous config saved to /var/cache/conftool/dbconfig/20210127-080753-marostegui.json
[08:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121', diff saved to https://phabricator.wikimedia.org/P13975 and previous config saved to /var/cache/conftool/dbconfig/20210127-081150-marostegui.json
[08:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:44] <wikibugs>	 (PS1) Elukey: Remove duplicate in Hadoop Analytics' HDFS topology [puppet] - https://gerrit.wikimedia.org/r/658895
[08:27:08] <wikibugs>	 (CR) Elukey: [C: +2] Remove duplicate in Hadoop Analytics' HDFS topology [puppet] - https://gerrit.wikimedia.org/r/658895 (owner: Elukey)
[08:27:10] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1169 in s1 [puppet] - https://gerrit.wikimedia.org/r/658896 (https://phabricator.wikimedia.org/T258361)
[08:28:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1089 to clone db1169 T258361', diff saved to https://phabricator.wikimedia.org/P13976 and previous config saved to /var/cache/conftool/dbconfig/20210127-082826-marostegui.json
[08:28:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:32] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[08:28:56] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 40.88 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[08:29:18] <wikibugs>	 (PS2) Marostegui: mariadb: Productionize db1169 in s1 [puppet] - https://gerrit.wikimedia.org/r/658896 (https://phabricator.wikimedia.org/T258361)
[08:29:26] <marostegui>	 !log Stop mysql on db1089 to clone db1169 T258361
[08:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:14] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13978 and previous config saved to /var/cache/conftool/dbconfig/20210127-083618-marostegui.json
[08:36:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:23] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[08:38:25] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) Hi @wiki_willy thanks a lot for following up!   I re-done the calculations of the workers' distribution after the last racking and this is what I g...
[08:39:00] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:08] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[08:53:58] <wikibugs>	 (PS1) Thiemo Kreuz (WMDE): Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317)
[09:00:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: Thiemo Kreuz (WMDE))
[09:03:33] <godog>	 !log swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837
[09:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:37] <stashbot>	 T272837:  Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[09:04:10] <jbond42>	 !log deploy fix to enable-puppet
[09:04:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:14] <wikibugs>	 (CR) Jbond: [C: +2] enable-puppet: allow fall back to enable puppet disabled by root [puppet] - https://gerrit.wikimedia.org/r/657889 (https://phabricator.wikimedia.org/T272539) (owner: Jbond)
[09:11:18] <wikibugs>	 Puppet, SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (jbond) Seems to be working  ` lang=console $ sudo disable-puppet 'test disable puppet with sudo'                                                                $ sudo puppet agent -t...
[09:11:30] <wikibugs>	 Puppet, SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (jbond) Open→Resolved
[09:15:42] <wikibugs>	 (PS5) Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey)
[09:16:10] <wikibugs>	 (CR) Effie Mouzeli: [C: -2] WIP: profile::memcached::instance: remove "default_values" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey)
[09:17:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: profile::memcached::instance: remove "default_values" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey)
[09:19:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13979 and previous config saved to /var/cache/conftool/dbconfig/20210127-091909-marostegui.json
[09:19:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:14] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[09:20:15] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1169 in s1 [puppet] - https://gerrit.wikimedia.org/r/658896 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[09:22:16] <wikibugs>	 (PS1) Gilles: Add snappy dependency to coal [puppet] - https://gerrit.wikimedia.org/r/658918 (https://phabricator.wikimedia.org/T273033)
[09:31:08] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "thanks for the patch, comment inline." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/658890 (https://phabricator.wikimedia.org/T209011) (owner: Ladsgroup)
[09:33:06] <wikibugs>	 (CR) Vgutierrez: [C: +2] Migrate hiera() to lookup() and set datatypes in purge.pp [puppet] - https://gerrit.wikimedia.org/r/658503 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[09:35:22] <wikibugs>	 (PS1) Filippo Giunchedi: alertmanager: route Icinga compat alerts to sink IRC channel [puppet] - https://gerrit.wikimedia.org/r/658919 (https://phabricator.wikimedia.org/T272453)
[09:36:36] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "is getting harder for me to track this change with each openstack version." [puppet] - https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[09:36:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1099 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:36:50] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1099 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T273034 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:36:54] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (ops-monitoring-bot)
[09:37:32] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13980 and previous config saved to /var/cache/conftool/dbconfig/20210127-093802-marostegui.json
[09:38:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:06] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[09:38:09] <wikibugs>	 (CR) Alexandros Kosiaris: service_proxy: add ipv6 config option on services_proxy config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli)
[09:40:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[09:44:01] <wikibugs>	 SRE, ops-eqiad, Analytics: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (elukey)
[09:59:08] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish: Set debug=1 in X-Analytics header [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: Effie Mouzeli)
[10:00:26] <wikibugs>	 (PS2) Thiemo Kreuz (WMDE): Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317)
[10:00:55] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] apereo_cas: Improve loginform design [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658760 (https://phabricator.wikimedia.org/T273023) (owner: Krinkle)
[10:02:15] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] apereo_cas: Remove inlined duplicate copy of bootstrap.css [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658759 (owner: Krinkle)
[10:02:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13981 and previous config saved to /var/cache/conftool/dbconfig/20210127-100220-marostegui.json
[10:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:26] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[10:03:43] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1169 [puppet] - https://gerrit.wikimedia.org/r/658920 (https://phabricator.wikimedia.org/T258361)
[10:03:47] <wikibugs>	 (PS1) DCausse: [cirrus] Swith to perfield builder for spaceless languages [mediawiki-config] - https://gerrit.wikimedia.org/r/658921 (https://phabricator.wikimedia.org/T266027)
[10:05:06] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1169 [puppet] - https://gerrit.wikimedia.org/r/658920 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[10:05:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] [cirrus] Swith to perfield builder for spaceless languages [mediawiki-config] - https://gerrit.wikimedia.org/r/658921 (https://phabricator.wikimedia.org/T266027) (owner: DCausse)
[10:05:52] <elukey>	 !log reboot matomo1002 for kernel upgrades
[10:05:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:29] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): "This backport was a little more complicated because of multiple merge conflicts. A good way to review the result is to compare it with the" [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) (owner: Thiemo Kreuz (WMDE))
[10:11:49] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): "This backport was clean with no conflict. The only additional change is the tiny fix from Id923398." [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: Thiemo Kreuz (WMDE))
[10:12:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet
[10:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:40] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet
[10:14:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet
[10:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:43] <wikibugs>	 (PS1) ArielGlenn: build for buster [debs/mwbzutils] - https://gerrit.wikimedia.org/r/658923
[10:17:38] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet
[10:17:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:55] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:18:38] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet
[10:18:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with final weight T258361', diff saved to https://phabricator.wikimedia.org/P13982 and previous config saved to /var/cache/conftool/dbconfig/20210127-102042-marostegui.json
[10:20:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:47] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[10:22:02] <wikibugs>	 (PS2) DCausse: [cirrus] Swith to perfield builder for spaceless languages [mediawiki-config] - https://gerrit.wikimedia.org/r/658921 (https://phabricator.wikimedia.org/T266027)
[10:23:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet
[10:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:27] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet
[10:23:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:26] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:24:38] <wikibugs>	 (PS1) Jbond: add debian build instructions [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658924
[10:24:42] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:25:02] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] add debian build instructions [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658924 (owner: Jbond)
[10:27:21] <wikibugs>	 (PS1) Jbond: create debian release [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658947
[10:28:06] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] create debian release [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658947 (owner: Jbond)
[10:29:18] <wikibugs>	 (PS1) Alexandros Kosiaris: termbox: Stop referencing deprecated service-proxy.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/658948
[10:31:00] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:39] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "Double checked on the deployment host, no diff, merging" [deployment-charts] - https://gerrit.wikimedia.org/r/658948 (owner: Alexandros Kosiaris)
[10:32:50] <wikibugs>	 (PS5) JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967)
[10:33:07] <wikibugs>	 (Merged) jenkins-bot: termbox: Stop referencing deprecated service-proxy.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/658948 (owner: Alexandros Kosiaris)
[10:33:13] <wikibugs>	 (PS3) Thiemo Kreuz (WMDE): Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317)
[10:33:27] <wikibugs>	 (PS6) JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967)
[10:33:43] <wikibugs>	 (CR) JMeybohm: Allow the kube-controller-manager to run without superuser permissions (4 comments) [puppet] - https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) (owner: JMeybohm)
[10:36:29] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet
[10:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:02] <wikibugs>	 (CR) Awight: [V: +1 C: +1] "Works for me locally." [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) (owner: Thiemo Kreuz (WMDE))
[10:37:28] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:31] <wikibugs>	 (CR) Awight: [V: +1 C: +1] Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: Thiemo Kreuz (WMDE))
[10:38:32] <icinga-wm>	 PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:40:59] <wikibugs>	 (PS4) Jcrespo: mariadb: Remove mariadb module mysqld_safe [puppet] - https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559)
[10:42:52] <icinga-wm>	 RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:43:52] <wikibugs>	 SRE, envoy, serviceops, Service-Architecture: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855 (Joe)
[10:44:03] <wikibugs>	 SRE, envoy, serviceops, Kubernetes, Service-Architecture: Allow canarying new envoy configurations in kubernetes - https://phabricator.wikimedia.org/T265882 (Joe)
[10:44:14] <wikibugs>	 SRE, envoy, serviceops, Kubernetes, Service-Architecture: Improve envoy configuration CI checks - https://phabricator.wikimedia.org/T265881 (Joe)
[10:44:23] <wikibugs>	 SRE, envoy, serviceops, Kubernetes, Service-Architecture: Upgrade envoy configuration to use the v3 API - https://phabricator.wikimedia.org/T265880 (Joe)
[10:44:40] <wikibugs>	 SRE, envoy, serviceops, Kubernetes, Service-Architecture: Consider using a file-based xDS system for envoy in k8s - https://phabricator.wikimedia.org/T265879 (Joe)
[10:44:56] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish: include X-Client-Port in X-Analytics [puppet] - https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) (owner: Effie Mouzeli)
[10:45:12] <wikibugs>	 (PS2) Vgutierrez: varnish: include X-Client-Port in X-Analytics [puppet] - https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) (owner: Effie Mouzeli)
[10:45:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "> Patch Set 2: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/658628 (owner: Alexandros Kosiaris)
[10:47:58] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/658572 (owner: Hnowlan)
[10:48:33] <wikibugs>	 (CR) JMeybohm: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/658628 (owner: Alexandros Kosiaris)
[10:50:37] <wikibugs>	 (CR) Kormat: [C: +1] "Nuke away." [puppet] - https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559) (owner: Jcrespo)
[10:52:35] <wikibugs>	 (CR) Kormat: [C: +1] "Nuke away" [puppet] - https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559) (owner: Jcrespo)
[10:57:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143 for kernel upgrade and enablement of report_host', diff saved to https://phabricator.wikimedia.org/P13984 and previous config saved to /var/cache/conftool/dbconfig/20210127-105735-marostegui.json
[10:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable ProxyPreserveHost for the debmonitor Apache site [puppet] - 10https://gerrit.wikimedia.org/r/658952
[10:58:57] <wikibugs>	 (03CR) 10Muehlenhoff: debmonitor: Also allow localhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657793 (owner: 10Muehlenhoff)
[10:59:04] <wikibugs>	 (03Abandoned) 10Muehlenhoff: debmonitor: Also allow localhost [puppet] - 10https://gerrit.wikimedia.org/r/657793 (owner: 10Muehlenhoff)
[10:59:47] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove mariadb module mysqld_safe [puppet] - 10https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo)
[11:00:07] <wikibugs>	 (03PS7) 10Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - 10https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929)
[11:00:09] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Remove obsolete mariadb.server init script [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559)
[11:03:56] <wikibugs>	 (03CR) 10Jcrespo: "Followup/related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/657820/" [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo)
[11:06:00] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove mariadb module mylvmbackup [puppet] - 10https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo)
[11:06:07] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Remove mariadb module mylvmbackup [puppet] - 10https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559)
[11:08:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for netbox/apache [puppet] - 10https://gerrit.wikimedia.org/r/656187 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[11:12:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:12:11] <wikibugs>	 10SRE, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo) a:03jcrespo
[11:14:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:30:32] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: admin: also remove the old ed25519 key for the time being [puppet] - 10https://gerrit.wikimedia.org/r/635497
[11:32:09] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: add phalerts webhook receiver [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453)
[11:32:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: add receivers to create tasks from alerts [puppet] - 10https://gerrit.wikimedia.org/r/658957 (https://phabricator.wikimedia.org/T272453)
[11:32:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add job for alertmanager::phab [puppet] - 10https://gerrit.wikimedia.org/r/658958 (https://phabricator.wikimedia.org/T272453)
[11:32:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13985 and previous config saved to /var/cache/conftool/dbconfig/20210127-113245-root.json
[11:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/658952 (owner: 10Muehlenhoff)
[11:47:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13986 and previous config saved to /var/cache/conftool/dbconfig/20210127-114749-root.json
[11:47:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/658964 (https://phabricator.wikimedia.org/T264362)
[11:48:52] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Please note; this is a cherry-pick of the already merged change I177ebb8521d87f2fdbc64550645e490c143a3a2c that was erroneously submitted t" [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/658964 (https://phabricator.wikimedia.org/T264362) (owner: 10Giuseppe Lavagetto)
[11:53:42] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/658964 (https://phabricator.wikimedia.org/T264362) (owner: 10Giuseppe Lavagetto)
[11:59:18] <wikibugs>	 10SRE, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo)
[12:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1200).
[12:00:05] <jouncebot>	 awight and CFisch_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable ProxyPreserveHost for the debmonitor Apache site [puppet] - 10https://gerrit.wikimedia.org/r/658952 (owner: 10Muehlenhoff)
[12:02:18] <CFisch_WMDE>	 o/
[12:02:23] <wikibugs>	 (03PS1) 10ArielGlenn: include correct mysql client package for dumps on buster [puppet] - 10https://gerrit.wikimedia.org/r/658966
[12:02:48] <CFisch_WMDE>	 awight: wanna do both? I might get interrupted here.
[12:02:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13987 and previous config saved to /var/cache/conftool/dbconfig/20210127-120253-root.json
[12:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:00] <awight>	 CFisch_WMDE: can do!
[12:03:15] <CFisch_WMDE>	 great! thanks
[12:03:36] <wikibugs>	 (03CR) 10Awight: [V: 03+1 C: 03+2] "backport window" [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE))
[12:03:43] <wikibugs>	 (03CR) 10Awight: [V: 03+1 C: 03+2] "backport window" [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE))
[12:03:51] <wikibugs>	 (03PS2) 10Awight: Enable bracket matching on the first wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch)
[12:05:23] <wikibugs>	 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) p:05Low→03Medium
[12:05:49] <wikibugs>	 (03PS1) 10Jbond: 6.2.7: release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658967
[12:05:56] <wikibugs>	 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) Got another surge of bad responses from production.
[12:06:12] <wikibugs>	 10SRE, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo)
[12:06:14] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] 6.2.7: release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658967 (owner: 10Jbond)
[12:07:16] <wikibugs>	 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jcrespo)
[12:08:04] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] include correct mysql client package for dumps on buster [puppet] - 10https://gerrit.wikimedia.org/r/658966 (owner: 10ArielGlenn)
[12:09:24] <wikibugs>	 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jcrespo) I have started the decommissioning process of helium and heze and marked the tracking tasks on the description. It may take a bit to process them as we need to do lots of cleanup (beyond the previously pro...
[12:09:59] <wikibugs>	 (03Merged) 10jenkins-bot: Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE))
[12:10:22] <wikibugs>	 (03PS1) 10Jbond: gradle.properties: add proxy settings back [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658968
[12:10:25] <wikibugs>	 (03Merged) 10jenkins-bot: Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE))
[12:10:28] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] mariadb: Remove obsolete mariadb.server init script [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo)
[12:11:38] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27690/console" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan)
[12:11:55] <wikibugs>	 (03PS1) 10Jcrespo: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049)
[12:11:57] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] gradle.properties: add proxy settings back [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658968 (owner: 10Jbond)
[12:13:25] <awight>	 Wikibase has rogue changes in the wmf.27 directory.
[12:14:11] <CFisch_WMDE>	 o.O
[12:14:24] <awight>	 Maybe just something about how the local patches were applied.
[12:15:13] <awight>	 CFisch_WMDE: both of our backports should be live on mwdebug1001
[12:15:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:15:49] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049)
[12:16:16] <CFisch_WMDE>	 awight: +1 Hard to test with the features still disabled :-D
[12:16:57] <wikibugs>	 (03CR) 10Jcrespo: "This is basically a rebase of https://gerrit.wikimedia.org/r/c/operations/puppet/+/621038 But I would like to cleanup db grants/old bacula" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:17:08] <awight>	 CFisch_WMDE: Yeah all I can prove to myself so far is that we haven't broken anything obvious.  Next time we should enable on testwiki or something...  Anyway, going ahead with deployment now.
[12:17:35] <CFisch_WMDE>	 good point
[12:17:35] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:17:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13988 and previous config saved to /var/cache/conftool/dbconfig/20210127-121756-root.json
[12:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:31] <CFisch_WMDE>	 And at least it seems nothing is broken so far.
[12:18:37] <awight>	 :-)
[12:19:11] <logmsgbot>	 !log awight@deploy1001 Synchronized php-1.36.0-wmf.28/extensions/CodeMirror: Backport: [[gerrit:658815|Improve matchbrackets performance when moving the cursor (T270317)]] (duration: 01m 14s)
[12:19:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:14] <wikibugs>	 (03CR) 10Jcrespo: "Moritz: do we keep the denylist, empty, or do we remove it in its entirety?" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:19:15] <stashbot>	 T270317: Optimization and limits for bracket matching - https://phabricator.wikimedia.org/T270317
[12:19:42] <CFisch_WMDE>	 brb
[12:20:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:20:12] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:42] <logmsgbot>	 !log awight@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/CodeMirror: Backport: [[gerrit:658814|Improve matchbrackets performance when moving the cursor (T270317)]] (duration: 01m 06s)
[12:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:50] <wikibugs>	 (03PS23) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:20:56] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "Config window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch)
[12:21:58] <wikibugs>	 (03Merged) 10jenkins-bot: Enable bracket matching on the first wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch)
[12:22:09] <wikibugs>	 (03CR) 10Muehlenhoff: bacula: Remove helium and heze references from puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:22:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:22:22] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27691/console" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan)
[12:22:32] <awight>	 CFisch_WMDE: Bracket matching enabled on mwdebug1001, for e.g. dewiki
[12:23:15] <awight>	 works!
[12:24:43] <CFisch_WMDE>	 like a charm
[12:25:09] <wikibugs>	 (03CR) 10Jcrespo: "Thanks for the comments, on some part of the docs it recommended using spare, will do as suggested." [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:25:22] <logmsgbot>	 !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:658594|Enable bracket matching on the first wikis (T270238)]] (duration: 01m 07s)
[12:25:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:26] <stashbot>	 T270238: Enable bracket matching on the first wikis - https://phabricator.wikimedia.org/T270238
[12:25:56] <awight>	 !log EU bacon done
[12:25:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:31:02] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:32:37] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Remove m1 references to old database bacula, leave only bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717)
[12:33:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13989 and previous config saved to /var/cache/conftool/dbconfig/20210127-123300-root.json
[12:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:54] <wikibugs>	 (03PS3) 10Jcrespo: bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049)
[12:38:00] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:39:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:39:40] <wikibugs>	 (03PS24) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:41:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:41:47] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27692/console" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan)
[12:42:28] <wikibugs>	 (03PS4) 10Jcrespo: bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049)
[12:43:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm)
[12:44:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/658628 (owner: 10Alexandros Kosiaris)
[12:44:27] <wikibugs>	 (03CR) 10Jcrespo: "Not super-urgent, but I likely will need coordination with you, Manuel, to drop bacula db (deploy, documentation check, patch review, etc." [puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717) (owner: 10Jcrespo)
[12:44:33] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Absent /etc/helmfile-defaults/service-proxy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/658628
[12:44:59] <wikibugs>	 (03CR) 10VolkerE: [C: 03+1] "This is great, do we mind additionally changing the primary button to Accent50 blue background https://design.wikimedia.org/style-guide/co" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658760 (https://phabricator.wikimedia.org/T273023) (owner: 10Krinkle)
[12:45:10] <wikibugs>	 (03CR) 10Marostegui: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717) (owner: 10Jcrespo)
[12:46:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:46:12] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:39] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "I will deploy this as is unless someone has improvements, but will block on the other maintenance needed as blocker." [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[12:50:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: service proxy: Add apertium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658629 (owner: 10Alexandros Kosiaris)
[12:50:32] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: similar-users, linkrecommendation: Switch to production [puppet] - 10https://gerrit.wikimedia.org/r/658630 (https://phabricator.wikimedia.org/T265603)
[12:50:34] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: service proxy: Add apertium [puppet] - 10https://gerrit.wikimedia.org/r/658629
[13:02:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972
[13:05:22] <wikibugs>	 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Dreamy_Jazz) Just to note that cyberbot I has been blocked because of the blanking issues on enwiki. The bot has also been blocked on two other wikis for the blanki...
[13:05:59] <wikibugs>	 10SRE, 10DBA, 10User-Kormat: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822 (10jcrespo) This looks very related to T242571, but not merging because it is a topic very likely to evolve.
[13:06:07] <wikibugs>	 (03PS1) 10Marostegui: db1089: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658973
[13:06:48] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1089: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658973 (owner: 10Marostegui)
[13:07:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff)
[13:08:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable CAS for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/658974
[13:13:46] <wikibugs>	 10SRE, 10DBA, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo)
[13:18:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] similar-users, linkrecommendation: Switch to production [puppet] - 10https://gerrit.wikimedia.org/r/658630 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris)
[13:18:56] <wikibugs>	 10SRE, 10DBA, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo)
[13:19:38] <wikibugs>	 (03PS1) 10Hnowlan: Add two LVS IPs for similar-users. [dns] - 10https://gerrit.wikimedia.org/r/658976 (https://phabricator.wikimedia.org/T268837)
[13:19:59] <godog>	 !log swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837
[13:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:03] <stashbot>	 T272837:  Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[13:20:20] <wikibugs>	 10SRE, 10DBA, 10User-Kormat: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822 (10Kormat) 05Open→03Resolved a:03Kormat Resolving this as tendril is going away.
[13:20:25] <wikibugs>	 10SRE, 10DBA, 10Epic, 10User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (10Kormat)
[13:20:28] <wikibugs>	 (03PS1) 10ArielGlenn: add snapshot03 to deployment-prep mw installation targets [puppet] - 10https://gerrit.wikimedia.org/r/658977
[13:24:43] <wikibugs>	 10SRE, 10DBA, 10Epic, 10User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (10Kormat)
[13:24:51] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] add snapshot03 to deployment-prep mw installation targets [puppet] - 10https://gerrit.wikimedia.org/r/658977 (owner: 10ArielGlenn)
[13:25:24] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: similar-users, linkrecommendation: Add discovery [dns] - 10https://gerrit.wikimedia.org/r/658980 (https://phabricator.wikimedia.org/T265603)
[13:25:44] <godog>	 !log swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837
[13:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:48] <stashbot>	 T272837:  Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[13:26:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[13:28:14] <wikibugs>	 (03PS1) 10ArielGlenn: add new instance snapshot03 to dumps scap targets in deployment-prep [dumps/scap] - 10https://gerrit.wikimedia.org/r/658981
[13:30:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] mariadb: Remove m1 references to old database bacula, leave only bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717) (owner: 10Jcrespo)
[13:31:47] <wikibugs>	 (03PS1) 10Jbond: 6.3.0: updated ready for 6.3.0 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983
[13:31:54] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:48] <wikibugs>	 (03PS1) 10Jbond: gradle: remove -tomcat (used for local testing) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658984
[13:33:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] gradle: remove -tomcat (used for local testing) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658984 (owner: 10Jbond)
[13:35:00] <wikibugs>	 (03PS1) 10Kosta Harlan: linkrecommendation: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/658985
[13:35:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:36:22] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 287817984 and 158 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:36:36] <wikibugs>	 (03PS2) 10Kosta Harlan: linkrecommendation: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/658985
[13:36:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/658974 (owner: 10Muehlenhoff)
[13:37:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:38:43] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/658985 (owner: 10Kosta Harlan)
[13:38:44] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:40:09] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/658985 (owner: 10Kosta Harlan)
[13:40:15] <wikibugs>	 (03PS2) 10Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972
[13:42:47] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Primary outbound port utilisation over 80%  #page - https://alerts.wikimedia.org
[13:43:12] <vgutierrez>	 uh :)
[13:43:44] <logmsgbot>	 !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[13:43:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:39] <wikibugs>	 (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] add new instance snapshot03 to dumps scap targets in deployment-prep [dumps/scap] - 10https://gerrit.wikimedia.org/r/658981 (owner: 10ArielGlenn)
[13:44:45] <godog>	 mmhh interesting, icinga should be paging shortly too I think ?
[13:44:53] <godog>	 re: the librenms alert
[13:46:29] <volans>	 there is a spike in both tx and rx in the last 20m
[13:46:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:47:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972 (owner: Muehlenhoff)
[13:47:41] <marostegui>	 parsercache is lagging again in codfw
[13:47:47] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Primary outbound port utilisation over 80%  #page - https://alerts.wikimedia.org
[13:48:12] <wikibugs>	 SRE, SRE-Access-Requests: Add user to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (elukey) @gmodena hi! This should follow https://wikitech.wikimedia.org/wiki/Production_access, and also https://wikitech.wikimedia.org/wiki/Analytics/Data_access  IIUC ssh access is not ne...
[13:48:53] <logmsgbot>	 !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[13:48:53] <logmsgbot>	 !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[13:48:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:55] <elukey>	 volans: do you have the link handy for librenms?
[13:51:06] <jbond42>	 https://librenms.wikimedia.org/graphs/device=162/type=device_bits/from=1611751500/legend=yes/popup_title=Device+Traffic/to=1611755100/?_token=jRhvEa5o1DIrGuxZrUmNTzkcO7mmyqeEGev27n0O
[13:51:09] <elukey>	 <3
[13:52:06] <elukey>	 ok so we (analytics) are transferring data to the backup cluster, so I am wondering if it is cross row traffic between worker ndoes
[13:52:08] <wikibugs>	 (PS3) Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972
[13:52:09] <elukey>	 *nodes
[13:52:31] <wikibugs>	 (PS1) Ottomata: Add ppel to analytics-privatedata-users with no ssh access [puppet] - https://gerrit.wikimedia.org/r/658992 (https://phabricator.wikimedia.org/T271602)
[13:52:47] <logmsgbot>	 !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[13:52:47] <logmsgbot>	 !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[13:52:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:14] <ottomata>	 elukey:  does this look right?
[13:53:14] <ottomata>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/658992/
[13:53:54] <elukey>	 checking
[13:54:11] <wikibugs>	 SRE, SRE-Access-Requests: Add user to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (gmodena) Hi @elukey!  ack: ssh access is not required, only superset. @sdkim was onboarded in Hadoop, and can already login in Superset and access a subset of the data exposed. However, he...
[13:55:19] <wikibugs>	 (CR) Elukey: [C: +1] Add ppel to analytics-privatedata-users with no ssh access [puppet] - https://gerrit.wikimedia.org/r/658992 (https://phabricator.wikimedia.org/T271602) (owner: Ottomata)
[13:55:40] <volans>	 godog: would be nice if the email sent by alerts could retain/format the description having one item per line (Device Name, Severity, etc...)
[13:55:59] <volans>	 #featurerequest
[13:57:28] <godog>	 volans: afaics it depends on librenms and how it formats the "description" annotation specifically
[13:57:42] <elukey>	 did we track down the problem? Sorry didn't follow
[13:57:54] <godog>	 other (shorter) annotations/labels are one line per item
[13:58:42] <godog>	 jumping in a meeting now, but LMK if I can help
[13:58:57] <godog>	 elukey: what you mentioned makes sense to me, I saw also another alert for access port util for dumpsdata
[13:58:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:00:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972 (owner: Muehlenhoff)
[14:01:50] <elukey>	 I don't recall if there is a way to see bw usage for single nodes
[14:02:03] <elukey>	 err single ports on a router
[14:02:16] <elukey>	 s/router/switch
[14:02:21] <elukey>	 today it is not a good day :D
[14:02:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:03:06] <volans>	 elukey: from what host to what other are you transferring?
[14:03:24] <volans>	 or is the cluster rebalancing itself
[14:03:33] <elukey>	 volans: it is difficult to say, it is between two clusters, the mappers can be anywhere on analytics10* or an-worker* nodes
[14:03:46] <elukey>	 I see that some of them in asw2-c are using close to 1Gbps
[14:03:57] <elukey>	 we are copying data for backup
[14:04:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:04:09] <elukey>	 I am checking https://librenms.wikimedia.org/device/162/ports
[14:04:14] <elukey>	 and grepping for an-worker
[14:06:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:07:29] <elukey>	 volans: I killed the data transfer, wasn't able to reach Joseph
[14:07:35] <elukey>	 so in theory bw usage should decrease
[14:07:39] <elukey>	 let's see if it was us
[14:08:02] <wikibugs>	 (PS4) Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972
[14:08:12] <volans>	 ok
[14:08:28] <elukey>	 no bueno, 10g nodes can push a lot of data if they can
[14:08:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:08:43] <elukey>	 for this data transfer we can control bw
[14:08:59] <elukey>	 but in general we can't for cross hadoop worker node tcp conns
[14:09:07] <elukey>	 at least, I don't know if it is possible :D
[14:10:21] <elukey>	 but I am a bit confused, we should have 40G (aggregated) between asw2-c and cr2-eqiad
[14:10:21] <klausman>	 I don't think it can be done in Hadoop/at the app level. About the only way I can think of is OS-level tc, and even that would be hard to implement.
[14:10:43] <elukey>	 and from the alerts I see that the port to cr2 was using close to 10G
[14:10:46] <elukey>	 or am I misreading?
[14:12:06] <elukey>	 volans: am I right or do we also have asw2-a alarming?
[14:12:38] <elukey>	 this is not great
[14:12:54] <volans>	 elukey: we have a and c flapping between warning and critical
[14:13:17] <volans>	 if you click the alert you get the ports
[14:13:34] <volans>	 xe-2/0/45;  xe-7/0/44;  on asw2-a-eqiad
[14:13:51] <elukey>	 yes yes
[14:14:00] <elukey>	 those are the links to the crX routers
[14:14:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972 (owner: Muehlenhoff)
[14:14:41] <volans>	 but we're doing just a few Gbps
[14:15:05] <wikibugs>	 (PS1) Urbanecm: arwiki: Configure wgGEHomepageManualAssignmentMentorsList [mediawiki-config] - https://gerrit.wikimedia.org/r/658996 (https://phabricator.wikimedia.org/T273060)
[14:15:32] <volans>	 see https://netbox.wikimedia.org/dcim/devices/1112/#interface_ae1
[14:15:34] <elukey>	 volans: I recall that we should have aggregate eth (4x10g links) between asw-x and cr1/cr2, but I am not sure what the severity of the alert is
[14:16:00] <elukey>	 yes yes, but can a single 10g of the group alert?
[14:16:05] <volans>	 indeed, it seems as if the aggregation is not distributing evenly among them
[14:16:31] <volans>	 we have 2 alarming
[14:17:46] <volans>	 elukey: they're all 4 very high
[14:17:50] <elukey>	 should we call in people? This has the potential for a big issue
[14:18:57] <elukey>	 ah snap I forgot to do one thing to kill the jobs
[14:18:59] <elukey>	 lemme try
[14:19:09] <wikibugs>	 (CR) Andrew Bogott: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[14:19:10] <volans>	 ack
[14:19:11] <elukey>	 but https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=50&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-backup-hadoop&var-worker=All
[14:19:15] <elukey>	 seems to match
[14:19:21] <elukey>	 it is the usage of the backup cluster
[14:19:33] <elukey>	 volans: can you check timings and see if it matches?
[14:19:37] <elukey>	 I'll kill the job in the meantime
[14:19:53] <volans>	 elukey: ack
[14:20:17] <volans>	 elukey: yes it matches with a few minutes of delay
[14:20:22] <volans>	 the time to ramp-up the copy
[14:21:22] <elukey>	 sigh
[14:21:41] <elukey>	 ok I found a way to kill the job, but it is taking a bit
[14:21:50] <volans>	 try with -9 :D
[14:21:52] <elukey>	 the usage should decrease
[14:21:53] <elukey>	 ahahha yes
[14:22:08] <volans>	 I see prometheus graph going down
[14:22:11] <volans>	 so you did something
[14:22:17] <volans>	 let's see the network usage
[14:22:20] <elukey>	 so before I killed only the client app that Joseph was using, not the map-reduce job on the cluster
[14:22:23] <elukey>	 my bad
[14:22:37] <wikibugs>	 SRE, SRE-Access-Requests: Add sdkim to analytics-privatedata-users group  - https://phabricator.wikimedia.org/T273058 (Aklapper)
[14:22:46] <wikibugs>	 (PS3) Andrew Bogott: Neutron: forward our dmz hacks from version Stein to Train [puppet] - https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135)
[14:22:48] <wikibugs>	 (PS1) Andrew Bogott: Neutron: update the l3 files with Neutron Train upstream [puppet] - https://gerrit.wikimedia.org/r/658999 (https://phabricator.wikimedia.org/T261135)
[14:23:46] <wikibugs>	 (CR) Andrew Bogott: [C: -1] "Hm, something went wrong, let me try again" [puppet] - https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[14:24:17] <wikibugs>	 SRE, Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Aklapper) p:Medium→Low Please don't change the priority value if you don't plan to work on fixing this - thanks a lot! :)
[14:25:20] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 40486352 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:26:28] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 740768 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:26:39] <wikibugs>	 (CR) JMeybohm: [C: +1] service proxy: Add apertium [puppet] - https://gerrit.wikimedia.org/r/658629 (owner: Alexandros Kosiaris)
[14:26:53] <wikibugs>	 (PS5) Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972
[14:27:15] <elukey>	 in theory we should see recovery
[14:27:30] <volans>	 elukey: confirmed, graph going down
[14:27:31] <volans>	 thanks
[14:27:57] <ottomata>	 elukey:  FYI i added some extra stuff to your really great data access docs
[14:27:57] <wikibugs>	 (PS4) Andrew Bogott: Neutron: forward our dmz hacks from version Stein to Train [puppet] - https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135)
[14:28:14] <elukey>	 ottomata: <3
[14:28:15] <ottomata>	 to hopefully make it easier for requesting users to figure out what to ask for
[14:28:16] <ottomata>	 https://wikitech.wikimedia.org/wiki/Analytics/Data_access
[14:28:31] <elukey>	 yes thanks a lot, I forgot to send the last version to the team for review :(
[14:28:39] <wikibugs>	 (CR) Andrew Bogott: "ok, there, does that make more sense?" [puppet] - https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[14:28:49] <wikibugs>	 (PS1) Jayprakash12345: Add accountcreator user rights group [mediawiki-config] - https://gerrit.wikimedia.org/r/659000 (https://phabricator.wikimedia.org/T269067)
[14:29:37] <elukey>	 ottomata: this is great! https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request?
[14:29:50] <ottomata>	 ya lemme know if i got that right!
[14:30:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:30:34] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:42] <elukey>	 ottomata: yep all good!
[14:33:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972 (owner: Muehlenhoff)
[14:34:43] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (Ottomata)
[14:34:46] <wikibugs>	 (CR) Ottomata: [C: +2] Add ppel to analytics-privatedata-users with no ssh access [puppet] - https://gerrit.wikimedia.org/r/658992 (https://phabricator.wikimedia.org/T271602) (owner: Ottomata)
[14:36:26] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:50] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (Ottomata) @ppelberg, I've applied your access patch, please try again!  Also, it seems that in {T223351} you were given a shell user name of `ppel`, not `ppelberg` (as originally requ...
[14:43:20] <wikibugs>	 (CR) Jbond: "saw this fly past so did a quick pass" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/658972 (owner: Muehlenhoff)
[14:44:53] <wikibugs>	 (CR) Hnowlan: [C: +1] similar-users, linkrecommendation: Add discovery [dns] - https://gerrit.wikimedia.org/r/658980 (https://phabricator.wikimedia.org/T265603) (owner: Alexandros Kosiaris)
[14:50:30] <cdanis>	 elukey: re: earlier -- was the data transfer one single TCP flow?
[14:51:06] <elukey>	 cdanis: o/ it was a map-reduce job from the backup cluster, pulling in parallel from the "analytics" one
[14:51:15] <cdanis>	 ah okay
[14:51:22] <cdanis>	 if it was a job with many workers it is odd it didn't spread itself out
[14:53:05] <volans>	 cdanis: it did very well
[14:53:07] <wikibugs>	 SRE, Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (akosiaris) This is weird. I don't think we have encountered this before.   ExecStop in the systemd unit file runs `ifdown ens5` but running that on the host returns  ` root@kafka-test1006:...
[14:53:07] <volans>	 saturating all of them
[14:53:12] <volans>	 some more some less
[14:53:43] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/658629 (owner: Alexandros Kosiaris)
[14:54:08] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [dns] - https://gerrit.wikimedia.org/r/658980 (https://phabricator.wikimedia.org/T265603) (owner: Alexandros Kosiaris)
[14:54:15] <wikibugs>	 (PS2) Jbond: 6.3.0: updated ready for 6.3.0 release [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658983
[14:54:52] <cdanis>	 volans: ah okay, I thought I read only one link was saturated, got it
[14:55:09] * cdanis still on first ☕
[14:55:14] <volans>	 ttyl :D
[14:55:49] <elukey>	 cdanis: my bad, I initially thought that one port was the problem, but then more alarmed etc..
[14:56:03] <elukey>	 some of the links were worse than others though
[14:56:04] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Allow talking to the registry over HTTP [docker-images/docker-report] - https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[14:56:09] <jbond42>	 cdanis: some links in the lacp were saturated but not all; however, some of the an-workers were pushing much more traffic than others, so i think that this is just an artifact of the hashing algo
[14:56:28] <cdanis>	 yah makes sense, pretty typical
[14:56:38] <wikibugs>	 SRE, SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (Ottomata) Approved!
[14:58:35] <elukey>	 yep I am very ignorant on the lacp part :(
[14:58:48] <wikibugs>	 (PS1) Jbond: apereo: update tomcat proxy setting post 6.3 upgrade [puppet] - https://gerrit.wikimedia.org/r/659004
[15:00:47] <wikibugs>	 (CR) Ottomata: [C: +2] Add snappy dependency to coal [puppet] - https://gerrit.wikimedia.org/r/658918 (https://phabricator.wikimedia.org/T273033) (owner: Gilles)
[15:00:58] <jbond42>	 elukey: i would need to dig into the exact details but i think it's similar to ecmp in that it hashes (source, destination) tuples.  i took a quick look and couldn't see anything related to port so it probably is just src, dst.  and without digging further i'm not sure if src/dst would be layer-2 or layer-3 (although it makes little difference in reality)
[15:01:02] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "IMHO, this line from the manual isn't very clear. The Debian Mentors FAQ[1] elaborates on this more and makes an important recommendation" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/657218 (owner: Legoktm)
[15:03:00] <jbond42>	 elukey: https://www.juniper.net/documentation/en_US/junos/topics/topic-map/load-balancing-aggregated-ethernet-interfaces.html
[15:03:04] <elukey>	 jbond42: yes yes got it, I also asked Faidon for some explanation, now I feel less ignorant, but I was not taking into consideration that part of the network config
[15:03:16] <elukey>	 my mental model was 40g router<->switches
[15:03:26] <elukey>	 without really considering leaf/spine, lacp, etc..
[15:03:27] <akosiaris>	 elukey: avoid it for servers :P. That's my recommendation 
[15:03:42] <jbond42>	 ack
[15:04:12] <elukey>	 akosiaris: I am thinking about forming a leaf/spine set up with hadoop worker nodes, I'll send you the details
[15:04:13] <akosiaris>	 but otherwise, yeah it splits traffic according to a hashing algorithm (configurable) across the 2 interfaces :-)
[15:04:28] <akosiaris>	 2 or more to be pedantic
[15:04:34] * elukey nods
[15:04:40] <akosiaris>	 elukey: please do. You got me interested
[15:05:07] <wikibugs>	 (PS2) Jbond: apereo: update tomcat proxy setting post 6.3 upgrade [puppet] - https://gerrit.wikimedia.org/r/659004
[15:06:16] <elukey>	 akosiaris: ahahahah
[15:06:47] <wikibugs>	 (CR) Jbond: "I have tested this locally and all seems fine, will plan to install on idp-test once the CSS changes have been migrated to production" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658983 (owner: Jbond)
[15:10:58] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Logs
[15:12:58] <wikibugs>	 SRE, Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (CDanis) Can you please provide a complete dump of a "null response", with both the complete response headers and the raw response body?  What is the HTTP status cod...
[15:13:47] <godog>	 oof, I thought we'd fixed the tls listener of rsyslog, clearly not
[15:15:12] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1245 days) https://wikitech.wikimedia.org/wiki/Logs
[15:15:20] <godog>	 !log bounce rsyslog on centrallog1001
[15:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:44] <godog>	 if it happens again I'll reintroduce the former remedy :(
[15:15:48] <wikibugs>	 (PS1) David Caro: puppet: add puppetmaster retrieval [software/spicerack] - https://gerrit.wikimedia.org/r/659008
[15:15:50] <wikibugs>	 (PS1) David Caro: remote: allow prepending every command with sudo [software/spicerack] - https://gerrit.wikimedia.org/r/659009
[15:17:34] <wikibugs>	 SRE, Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (elukey) @akosiaris not reliably, but today I rebooted the 4 schema VMs and one of them got back with the same issue..
[15:19:00] <wikibugs>	 (PS1) Ottomata: Migrat 5 NavigationTiming eventlogging streams on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/659010 (https://phabricator.wikimedia.org/T271208)
[15:22:35] <wikibugs>	 (PS2) Ottomata: Migrate 5 NavigationTiming eventlogging streams on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/659010 (https://phabricator.wikimedia.org/T271208)
[15:22:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] remote: allow prepending every command with sudo [software/spicerack] - https://gerrit.wikimedia.org/r/659009 (owner: David Caro)
[15:22:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppet: add puppetmaster retrieval [software/spicerack] - https://gerrit.wikimedia.org/r/659008 (owner: David Caro)
[15:24:52] <wikibugs>	 SRE, Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (MoritzMuehlenhoff) >>! In T273026#6780528, @akosiaris wrote: > This is weird. I don't think we have encountered this before.  >  > ExecStop in the systemd unit file runs `ifdown ens5` but...
[15:25:08] <wikibugs>	 (PS3) Ottomata: Migrate 5 NavigationTiming eventlogging streams on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/659010 (https://phabricator.wikimedia.org/T271208)
[15:26:22] <wikibugs>	 (CR) Alexandros Kosiaris: Add support for php deployments (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/651757 (owner: Giuseppe Lavagetto)
[15:29:54] <wikibugs>	 (CR) Muehlenhoff: Add a ferm module to Spicerack (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/658972 (owner: Muehlenhoff)
[15:29:58] <wikibugs>	 (CR) Ottomata: [C: +2] Migrate 5 NavigationTiming eventlogging streams on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/659010 (https://phabricator.wikimedia.org/T271208) (owner: Ottomata)
[15:31:42] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate 5 NavigationTiming schemas to Event Platform on group0 and group1 - T271208 (duration: 01m 07s)
[15:31:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:46] <stashbot>	 T271208: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208
[15:31:56] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:25] <wikibugs>	 (CR) JMeybohm: Add support for php deployments (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/651757 (owner: Giuseppe Lavagetto)
[15:38:13] <wikibugs>	 SRE, Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (akosiaris) >>! In T273026#6780640, @MoritzMuehlenhoff wrote: >>>! In T273026#6780528, @akosiaris wrote: >> This is weird. I don't think we have encountered this before.  >>  >> ExecStop in...
[15:38:34] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:49] <wikibugs>	 (CR) Hnowlan: [V: +1 C: +2] cassandra::single_instance: use dedicated hiera key, don't use 'cluster' [puppet] - https://gerrit.wikimedia.org/r/658572 (owner: Hnowlan)
[15:39:10] <wikibugs>	 SRE, Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (elukey) I recall VMs only from my past experience, I encountered this problem a couple of times before this one.
[15:41:05] <wikibugs>	 (CR) Hashar: [C: +1] "I wasn't even aware about that script :)" [puppet] - https://gerrit.wikimedia.org/r/658485 (owner: Legoktm)
[15:42:13] <elukey>	 !log umount /var/hadoop/data/r on an-worker1099 and restart hadoop daemons - T273034
[15:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:17] <stashbot>	 T273034: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034
[15:45:05] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (JMeybohm) >>! In T269160#6777382, @elukey wrote: > Waiting for @JMeybohm's greenli...
[15:46:23] <wikibugs>	 SRE, ops-eqiad, Analytics: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (elukey) @Ottomata @razzi this is the first datanode disk failure after the change that I made to use facter to populate the available partitions that Yarn and HDFS can use on a given worker node. In...
[15:47:19] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) >>! In T269160#6780685, @JMeybohm wrote: >>>! In T269160#6777382, @elukey...
[15:49:36] <wikibugs>	 SRE, Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (MoritzMuehlenhoff) >>! In T273026#6780670, @akosiaris wrote: > Do you by any chance remember if it was on VMs only? Or was it physical hosts too?  From my memory only VMs. I've checked my...
[15:56:53] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) ` Error: pods is forbidden: User "eventstreams-internal" cannot list resou...
[15:58:25] <wikibugs>	 (CR) Herron: alertmanager: add phalerts webhook receiver (1 comment) [puppet] - https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: Filippo Giunchedi)
[15:59:11] <wikibugs>	 (CR) Herron: alertmanager: add phalerts webhook receiver (1 comment) [puppet] - https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: Filippo Giunchedi)
[16:00:45] <wikibugs>	 ops-eqiad, Data-Persistence-Backup, decommission-hardware, Patch-For-Review: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (RobH)
[16:01:23] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (JMeybohm) You probably have not yet depoyed the admin part (the new namespace etc....
[16:02:04] <wikibugs>	 ops-codfw, Data-Persistence-Backup, decommission-hardware, Patch-For-Review: decommission heze and heze-array1 - https://phabricator.wikimedia.org/T273051 (jcrespo)
[16:04:16] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) >>! In T269160#6780761, @JMeybohm wrote: > You probably have not yet deplo...
[16:05:37] <wikibugs>	 (CR) Herron: [C: +1] alertmanager: route Icinga compat alerts to sink IRC channel [puppet] - https://gerrit.wikimedia.org/r/658919 (https://phabricator.wikimedia.org/T272453) (owner: Filippo Giunchedi)
[16:06:45] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (JMeybohm) Apart from you testing my attention again (kube_env admin [codfw|eqiad])...
[16:11:40] <wikibugs>	 (PS1) Jbond: P:idp: Add hiera defaults [puppet] - https://gerrit.wikimedia.org/r/659016
[16:16:16] <wikibugs>	 (CR) Cwhite: [C: +1] alertmanager: route Icinga compat alerts to sink IRC channel [puppet] - https://gerrit.wikimedia.org/r/658919 (https://phabricator.wikimedia.org/T272453) (owner: Filippo Giunchedi)
[16:17:06] <wikibugs>	 (PS2) Jbond: P:idp: Add hiera defaults [puppet] - https://gerrit.wikimedia.org/r/659016
[16:17:24] <wikibugs>	 SRE, vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (klausman)
[16:17:58] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27696/console" [puppet] - https://gerrit.wikimedia.org/r/659016 (owner: Jbond)
[16:18:29] <moritzm>	 !log installing python-bottle security updates
[16:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:33] <wikibugs>	 SRE, vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (klausman)
[16:19:30] <wikibugs>	 (CR) Cwhite: "LGTM caveat the token in the private repo or some default for PCC" [puppet] - https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: Filippo Giunchedi)
[16:20:16] <wikibugs>	 (CR) Cwhite: [C: +1] prometheus: add job for alertmanager::phab [puppet] - https://gerrit.wikimedia.org/r/658958 (https://phabricator.wikimedia.org/T272453) (owner: Filippo Giunchedi)
[16:20:59] <wikibugs>	 SRE, vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (MoritzMuehlenhoff) Please use row B and D and either of A or C for the third instance (the latter is fairly full, while the former have ample space).
[16:21:57] <logmsgbot>	 !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[16:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp: Add hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/659016 (owner: 10Jbond)
[16:28:11] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Alternatively, phid lookup can also be done through conduit: https://phabricator.wikimedia.org/conduit/method/project.query/" [puppet] - 10https://gerrit.wikimedia.org/r/658957 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[16:30:46] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:06] <icinga-wm>	 PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100%
[16:37:20] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:29] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Thanks for the patch, couple of things missing:" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff)
[16:38:53] <logmsgbot>	 !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[16:38:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:07] <logmsgbot>	 !log elukey@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[16:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27697/console" [puppet] - 10https://gerrit.wikimedia.org/r/658958 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[16:50:57] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Did a first pass, see a couple of comments inline." (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro)
[16:51:01] <wikibugs>	 (03PS1) 10Mforns: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659022 (https://phabricator.wikimedia.org/T271208)
[16:51:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thanks for the reviews! The dummy token is now in private.git" [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[16:53:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27698/console" [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[16:54:38] <logmsgbot>	 !log elukey@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[16:54:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:23] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) es-internal deployed in both eqiad and codfw, next steps are:  - test loca...
[16:57:14] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] alertmanager: add phalerts webhook receiver [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[17:00:08] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) @elukey [[ https://logstash.wikimedia.org/goto/b408da9f4b39f66a0d0980062...
[17:01:30] <wikibugs>	 10SRE, 10CAS-SSO: SSO Portal: Fix "Remember me" checkbox alignment - https://phabricator.wikimedia.org/T273023 (10Legoktm) p:05Triage→03Low
[17:03:32] <wikibugs>	 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10Legoktm) p:05Triage→03Low
[17:03:39] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:05:28] <wikibugs>	 (03PS1) 10Jbond: idp: update idp profile to support ldaps or ldap starttls [puppet] - 10https://gerrit.wikimedia.org/r/659024
[17:06:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Add support for php deployments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto)
[17:06:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27699/console" [puppet] - 10https://gerrit.wikimedia.org/r/659024 (owner: 10Jbond)
[17:07:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] idp: update idp profile to support ldaps or ldap starttls [puppet] - 10https://gerrit.wikimedia.org/r/659024 (owner: 10Jbond)
[17:08:18] <wikibugs>	 (03PS2) 10Jbond: idp: update idp profile to support ldaps or ldap starttls [puppet] - 10https://gerrit.wikimedia.org/r/659024
[17:09:05] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27700/console" [puppet] - 10https://gerrit.wikimedia.org/r/659024 (owner: 10Jbond)
[17:10:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] idp: update idp profile to support ldaps or ldap starttls [puppet] - 10https://gerrit.wikimedia.org/r/659024 (owner: 10Jbond)
[17:10:28] <wikibugs>	 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) I'll take your word for it. +1 on the cleanup thing.
[17:10:51] <jbond42>	 godog: fyi merged pruv repo change
[17:11:03] <jbond42>	 godog: fyi i merged your priv repo change
[17:11:19] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:11:38] <wikibugs>	 (03Abandoned) 10Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - 10https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: 10Jeena Huneidi)
[17:12:11] <godog>	 jbond42: cheers! keep forgetting :(
[17:12:25] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:12:34] <jbond42>	 godog: no worries :)
[17:13:41] <wikibugs>	 (03PS1) 10Jbond: hiera: fix lookup [puppet] - 10https://gerrit.wikimedia.org/r/659027
[17:13:52] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:13:55] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Did a first pass, this will need more thoughts and coordination with the current efforts towards non-root cumin." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro)
[17:13:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: fix lookup [puppet] - 10https://gerrit.wikimedia.org/r/659027 (owner: 10Jbond)
[17:16:08] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 254589792 and 150 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:16:09] <mutante>	 jouncebot: next
[17:16:10] <jouncebot>	 In 1 hour(s) and 43 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900)
[17:16:10] <jouncebot>	 In 1 hour(s) and 43 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900)
[17:18:04] <wikibugs>	 10SRE, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) @Pchelolo new instance today, sadly.
[17:18:54] <wikibugs>	 (03PS1) 10Jbond: hiera: use correct prefix [puppet] - 10https://gerrit.wikimedia.org/r/659030
[17:18:55] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:19:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] hiera: use correct prefix [puppet] - 10https://gerrit.wikimedia.org/r/659030 (owner: 10Jbond)
[17:19:24] <wikibugs>	 (03PS4) 10Dzahn: profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953)
[17:19:45] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: use correct prefix [puppet] - 10https://gerrit.wikimedia.org/r/659030 (owner: 10Jbond)
[17:21:12] <wikibugs>	 (03CR) 10Herron: [C: 03+1] alertmanager: add phalerts webhook receiver [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[17:21:24] <wikibugs>	 10SRE: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993 (10hashar) + @elukey because he seems to have made a few changes to our Archiva setup besides @Ottomata  The issue is still present with Archiva 2.2.4 and it also happens for non-snapshot releases ( https://archiva.wikimedia...
[17:21:35] <wikibugs>	 (03CR) 10Herron: [C: 03+1] alertmanager: add receivers to create tasks from alerts [puppet] - 10https://gerrit.wikimedia.org/r/658957 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[17:21:39] <wikibugs>	 (03CR) 10Herron: [C: 03+1] prometheus: add job for alertmanager::phab [puppet] - 10https://gerrit.wikimedia.org/r/658958 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi)
[17:23:00] <wikibugs>	 (03Abandoned) 10Herron: kibana: change backend naming from kibana-next to kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/654294 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[17:24:49] <wikibugs>	 10SRE: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar)
[17:24:54] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "it's fine on most hosts but a few cases have " parameter 'domain_search' expects a String value, got Tuple"" [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[17:25:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1268.eqiad.wmnet with reason: REIMAGE
[17:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:32] <wikibugs>	 (03Abandoned) 10Herron: dns: rename kibana-next.svc to kibana7.svc [dns] - 10https://gerrit.wikimedia.org/r/618140 (owner: 10Herron)
[17:27:07] <wikibugs>	 10SRE, 10Analytics: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993 (10elukey)
[17:27:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 379976 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:27:54] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1268.eqiad.wmnet with reason: REIMAGE
[17:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:02] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1407.eqiad.wmnet with reason: REIMAGE
[17:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:38] <wikibugs>	 (03PS1) 10Jbond: hiera - ldap: add ldap global config for cloud [puppet] - 10https://gerrit.wikimedia.org/r/659033
[17:29:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1406.eqiad.wmnet with reason: REIMAGE
[17:29:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:19] <wikibugs>	 (03PS2) 10Mforns: Migrate WebUIActionsTracking schemas to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658426 (https://phabricator.wikimedia.org/T267347)
[17:30:08] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1407.eqiad.wmnet with reason: REIMAGE
[17:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:08] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1406.eqiad.wmnet with reason: REIMAGE
[17:32:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2301.codfw.wmnet with reason: REIMAGE
[17:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:57] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2301.codfw.wmnet with reason: REIMAGE
[17:34:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:27] <wikibugs>	 (03CR) 10Jbond: profile::base: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[17:35:52] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:36:53] <wikibugs>	 10SRE, 10Dumps-Generation, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn) I've built the package and set up a test instance in deployment-prep, but there are issues with mediawiki scripts there; see T273089 for the details.
[17:37:03] <wikibugs>	 (03CR) 10Dzahn: profile::base: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[17:38:03] <wikibugs>	 (03PS5) 10Dzahn: profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953)
[17:38:05] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] Allow talking to the registry over HTTP [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm)
[17:39:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[17:45:58] <wikibugs>	 (03PS2) 10Mstyles: bump memory for flink processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941
[17:46:29] <wikibugs>	 (03PS6) 10Dzahn: profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953)
[17:46:49] <wikibugs>	 (03PS3) 10Mstyles: bump memory for flink processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941
[17:47:00] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "Looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[17:47:03] <wikibugs>	 (03CR) 10Mstyles: bump memory for flink processes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 (owner: 10Mstyles)
[17:48:09] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2215.codfw.wmnet with reason: REIMAGE
[17:48:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:20] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas: open the proxies for the new ports [puppet] - 10https://gerrit.wikimedia.org/r/659041 (https://phabricator.wikimedia.org/T271476)
[17:50:18] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2215.codfw.wmnet with reason: REIMAGE
[17:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:36] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wmcs: Migrate hiera() to lookup() and set datatypes in nfs primary [puppet] - 10https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[17:51:32] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1407.eqiad.wmnet'] `  an...
[17:52:20] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1406.eqiad.wmnet'] `  an...
[17:53:04] <wikibugs>	 (03PS5) 10Jcrespo: bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049)
[17:53:05] <wikibugs>	 10SRE, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre)
[17:53:06] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Undo conditionals added while transitioning helium->backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048)
[17:53:08] <wikibugs>	 (03CR) 10Bstorm: "confirmed noop on labstore1004" [puppet] - 10https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[17:54:34] <wikibugs>	 (03CR) 10Jcrespo: "This is WIP, a first approach of the most obvious things that will likely fail tests. Big early feedback is welcome of big gaps not mentio" [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo)
[17:54:35] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[17:54:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] bacula: Undo conditionals added while transitioning helium->backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo)
[17:55:16] <wikibugs>	 (03CR) 10Bstorm: "Regardless of the eventual ingress, the firewall rule in this will not need to change, so this will allow testing to begin. I've given pub" [puppet] - 10https://gerrit.wikimedia.org/r/659041 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm)
[17:55:40] <wikibugs>	 (03CR) 10Jcrespo: bacula: Undo conditionals added while transitioning helium->backup1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo)
[17:57:14] <wikibugs>	 (03CR) 10Jcrespo: "My editor apparently lost its setup :-(." [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo)
[17:59:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 241680392 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:00:58] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2301.codfw.wmnet'] `  an...
[18:01:03] <wikibugs>	 (03PS1) 10Dzahn: etcd::replication: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953)
[18:01:17] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 593456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:02:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] etcd::replication: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:03:38] <wikibugs>	 (03PS2) 10Dzahn: etcd::replication: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953)
[18:03:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1406.eqiad.wmnet
[18:03:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:55] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2301.codfw.wmnet
[18:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:11] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1407.eqiad.wmnet
[18:04:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:05:08] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2216.codfw.wmnet with reason: REIMAGE
[18:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1407.eqiad.wmnet
[18:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:07:09] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2216.codfw.wmnet with reason: REIMAGE
[18:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:58] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Undo conditionals added while transitioning helium->backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048)
[18:08:15] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:10:22] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1406.eqiad.wmnet
[18:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:54] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:13:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2301.codfw.wmnet
[18:13:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:06] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:14:48] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27703/" [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:15:08] <logmsgbot>	 !log dpifke@deploy1001 Started deploy [performance/arc-lamp@e24f319]: Re-deploying ArcLamp to webperf1002
[18:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:13] <logmsgbot>	 !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@e24f319]: Re-deploying ArcLamp to webperf1002 (duration: 00m 05s)
[18:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo)
[18:17:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:17:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] scap: add deploy1002 and deploy2002 to mediawiki hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn)
[18:18:55] <wikibugs>	 10SRE, 10Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (10jcrespo) Apparently, there is the following code on backup::set:  ` $motd_content = "#!/bin/sh\necho \"Backed up on this host: ${name}\""         @motd::scrip...
[18:19:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:20:47] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1268.eqiad.wmnet'] `  an...
[18:21:50] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1268.eqiad.wmnet
[18:21:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:46] <wikibugs>	 (03PS1) 10Dzahn: parsoid::testreduce: let envoy listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/659051 (https://phabricator.wikimedia.org/T266509)
[18:23:57] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1268.eqiad.wmnet
[18:23:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:00] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:25:05] <logmsgbot>	 !log hashar@deploy1001 Started deploy [integration/docroot@da43ad4]: Add Shellbox to doc.wm.o , misc build related changes fdf0917..da43ad4
[18:25:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:15] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [integration/docroot@da43ad4]: Add Shellbox to doc.wm.o , misc build related changes fdf0917..da43ad4 (duration: 00m 10s)
[18:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:22] <logmsgbot>	 !log hashar@deploy1001 Started deploy [integration/docroot@da43ad4]: Add Shellbox to doc.wm.o , misc build related changes fdf0917..da43ad4
[18:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:31] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [integration/docroot@da43ad4]: Add Shellbox to doc.wm.o , misc build related changes fdf0917..da43ad4 (duration: 00m 07s)
[18:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:43] <hashar>	 not sure why there are dupes bah
[18:26:01] <wikibugs>	 (03PS1) 10Bstorm: cloud-maps-proxy: small change in the reply messages of maps proxy config [puppet] - 10https://gerrit.wikimedia.org/r/659053
[18:26:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] parsoid::testreduce: let envoy listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/659051 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn)
[18:26:43] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] cloud-maps-proxy: small change in the reply messages of maps proxy config [puppet] - 10https://gerrit.wikimedia.org/r/659053 (owner: 10Bstorm)
[18:26:44] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) Hi @elukey - thanks for the mapping.  What makes it tough is that the remaining 6x hosts need to be on 10g switches, which really limits our op...
[18:27:23] <Majavah>	 Can someone look up the full stack trace for T273094 please?
[18:27:23] <stashbot>	 T273094: Uncaught SyntaxError: Unexpected identifier - CentralAuthLogin throwing SyntaxError - https://phabricator.wikimedia.org/T273094
[18:27:43] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:27:56] <wikibugs>	 (CR) Bstorm: [C: +2] cloud-maps-proxy: small change in the reply messages of maps proxy config [puppet] - https://gerrit.wikimedia.org/r/659053 (owner: Bstorm)
[18:28:15] <wikibugs>	 (CR) Jeena Huneidi: "correction: pipeline meeting, not repo" [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: Jeena Huneidi)
[18:30:05] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Maps/RunBook
[18:30:20] <wikibugs>	 (CR) Jeena Huneidi: [C: +1] bump memory for flink processes [deployment-charts] - https://gerrit.wikimedia.org/r/657941 (owner: Mstyles)
[18:30:48] <Tchanders>	 !log Creating the table securepoll_log in votewiki and testwiki (T271270)
[18:30:51] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:53] <stashbot>	 T271270: Create new logging table in SecurePoll - https://phabricator.wikimedia.org/T271270
[18:31:15] <wikibugs>	 SRE, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Elitre)
[18:32:17] <icinga-wm>	 PROBLEM - cassandra CQL 10.64.32.8:9042 on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[18:34:35] <icinga-wm>	 PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:35:39] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:36:13] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on deploy1002 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[18:36:51] <icinga-wm>	 PROBLEM - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[18:37:32] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2217.codfw.wmnet with reason: REIMAGE
[18:37:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:41] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2217.codfw.wmnet with reason: REIMAGE
[18:39:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:55] <wikibugs>	 (PS1) Jcrespo: bacula: Fix bug on realizing backup sets on motd [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686)
[18:40:06] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2218.codfw.wmnet with reason: REIMAGE
[18:40:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:25] <wikibugs>	 (CR) Jcrespo: "I think this was a bug, but please clarify if it was weirdly disabled for some reason." [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: Jcrespo)
[18:41:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] bacula: Fix bug on realizing backup sets on motd [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: Jcrespo)
[18:42:18] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2218.codfw.wmnet with reason: REIMAGE
[18:42:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:21] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2219.codfw.wmnet with reason: REIMAGE
[18:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:04] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2215.codfw.wmnet'] `  an...
[18:45:22] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2219.codfw.wmnet with reason: REIMAGE
[18:45:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1263.eqiad.wmnet with reason: REIMAGE
[18:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:47:43] <wikibugs>	 (PS2) Jcrespo: bacula: Fix bug on realizing backup sets on motd [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686)
[18:48:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:49:09] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1263.eqiad.wmnet with reason: REIMAGE
[18:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] bacula: Fix bug on realizing backup sets on motd [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: Jcrespo)
[18:50:14] <mutante>	 !log testreduce1001 - making nginx listen on IPv6 and restarting it T266509
[18:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:17] <stashbot>	 T266509: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509
[18:53:05] <wikibugs>	 (CR) Bstorm: "Looks good https://puppet-compiler.wmflabs.org/compiler1003/27704/" [puppet] - https://gerrit.wikimedia.org/r/659041 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[18:53:18] <wikibugs>	 (PS1) Dzahn: parsoid/testing: let nginx also listen on IPv6 [puppet] - https://gerrit.wikimedia.org/r/659058 (https://phabricator.wikimedia.org/T266509)
[18:53:21] <wikibugs>	 (CR) Bstorm: [C: +2] wikireplicas: open the proxies for the new ports [puppet] - https://gerrit.wikimedia.org/r/659041 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[18:53:22] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2215.codfw.wmnet
[18:53:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:27] <wikibugs>	 (PS3) Jcrespo: bacula: Fix bug on realizing backup sets on motd [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686)
[18:55:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] bacula: Fix bug on realizing backup sets on motd [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: Jcrespo)
[18:57:53] <mutante>	 jouncebot: next
[18:57:53] <jouncebot>	 In 0 hour(s) and 2 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900)
[18:57:53] <jouncebot>	 In 0 hour(s) and 2 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900)
[18:58:08] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2216.codfw.wmnet'] `  an...
[19:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900).
[19:00:05] <jouncebot>	 Jayprakash12345 and mforns: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:05] <jouncebot>	 dancy and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900).
[19:00:16] <Urbanecm>	 I can deploy today!
[19:00:22] <mforns>	 heya, I'm here
[19:00:43] <Urbanecm>	 hi mforns 
[19:00:53] <wikibugs>	 (CR) Urbanecm: [C: +2] Migrate WebUIActionsTracking schemas to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/658426 (https://phabricator.wikimedia.org/T267347) (owner: Mforns)
[19:03:31] <wikibugs>	 (Merged) jenkins-bot: Migrate WebUIActionsTracking schemas to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/658426 (https://phabricator.wikimedia.org/T267347) (owner: Mforns)
[19:04:51] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:05:15] <Urbanecm>	 mforns: can you test the first patch at mwdebug1001, please?
[19:05:22] <mforns>	 yes Urbanecm 
[19:05:28] <Urbanecm>	 thanks, let me know how it goes
[19:06:19] <mforns>	 Urbanecm: tested, looks good!
[19:06:24] <Urbanecm>	 thanks, syncing
[19:06:56] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2215.codfw.wmnet
[19:06:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:22] <Urbanecm>	 mutante: ftr, I'm scap'ing now
[19:07:54] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9382a9879bd6823fd664c0d3721fd0a9dc0d56d8: Migrate WebUIActionsTracking schemas to Event Platform on all wikis (T267347,T271164) (duration: 01m 03s)
[19:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:58] <stashbot>	 T267347: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347
[19:07:58] <stashbot>	 T271164: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164
[19:08:35] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:08:49] <wikibugs>	 (PS2) Urbanecm: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/659022 (https://phabricator.wikimedia.org/T271208) (owner: Mforns)
[19:08:59] <wikibugs>	 (CR) Urbanecm: [C: +2] Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/659022 (https://phabricator.wikimedia.org/T271208) (owner: Mforns)
[19:09:00] <mutante>	 Urbanecm: new thing is that deploy1002/deploy2002 should be part of the scap
[19:09:18] <Urbanecm>	 they were not previously?
[19:09:22] <mutante>	 no
[19:09:22] <Urbanecm>	 that sounds weird
[19:09:37] <mutante>	 it's not, they are new
[19:09:43] <Urbanecm>	 aha
[19:10:01] <Urbanecm>	 will we have two deploy servers? or will deploy1001 be eventually removed?
[19:10:18] <mutante>	 deploy1001/deploy2001 will be removed eventually
[19:10:25] <mutante>	 this is just about stretch->buster
[19:10:30] <Urbanecm>	 ah, got it
[19:10:40] <wikibugs>	 SRE, Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Cyberpower678) I've added additional logging data to the framework.   ` Date/Time: Wed, 27 Jan 2021 19:08:52 +0000 Method: GET URL: https://en.wikipedia.org/w/api.p...
[19:10:42] <wikibugs>	 (Merged) jenkins-bot: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/659022 (https://phabricator.wikimedia.org/T271208) (owner: Mforns)
[19:11:02] <Urbanecm>	 mforns: pulled onto mwdebug1001, please test
[19:11:09] <mforns>	 Urbanecm: ack
[19:13:17] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:15:25] <mforns>	 Urbanecm: 4 of the 6 schemas went through, I think I can not see the other ones because of low throughput, but it seems that the patch overall works!
[19:15:26] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:15:33] <Urbanecm>	 good, I'll sync
[19:17:00] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cabb2e2009f97bb86c1b8827c3cc61cc991c41a9: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki (T271208) (duration: 01m 00s)
[19:17:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:04] <stashbot>	 T271208: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208
[19:17:15] <wikibugs>	 (CR) Dzahn: [C: +2] parsoid/testing: let nginx also listen on IPv6 [puppet] - https://gerrit.wikimedia.org/r/659058 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[19:17:19] <Urbanecm>	 mforns: done. Anything else?
[19:17:34] <mforns>	 no :] that's all, thanks Urbanecm 
[19:17:40] <Urbanecm>	 no problem :)
[19:17:58] <wikibugs>	 (PS2) Urbanecm: arwiki: Configure wgGEHomepageManualAssignmentMentorsList [mediawiki-config] - https://gerrit.wikimedia.org/r/658996 (https://phabricator.wikimedia.org/T273060)
[19:18:02] <wikibugs>	 (CR) Urbanecm: [C: +2] arwiki: Configure wgGEHomepageManualAssignmentMentorsList [mediawiki-config] - https://gerrit.wikimedia.org/r/658996 (https://phabricator.wikimedia.org/T273060) (owner: Urbanecm)
[19:18:53] <wikibugs>	 (Merged) jenkins-bot: arwiki: Configure wgGEHomepageManualAssignmentMentorsList [mediawiki-config] - https://gerrit.wikimedia.org/r/658996 (https://phabricator.wikimedia.org/T273060) (owner: Urbanecm)
[19:19:25] <wikibugs>	 (CR) Mstyles: [C: +2] bump memory for flink processes [deployment-charts] - https://gerrit.wikimedia.org/r/657941 (owner: Mstyles)
[19:19:37] <elukey>	 !log reboot an-launcher1002 for kernel upgrades
[19:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:51] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:20:08] <Jayprakash12345>	 Hi, I have a patch for deployment in Morning backport window.
[19:20:18] <Jayprakash12345>	 Is it closed?
[19:20:34] <Urbanecm>	 Jayprakash12345: not yet
[19:21:22] <wikibugs>	 (Merged) jenkins-bot: bump memory for flink processes [deployment-charts] - https://gerrit.wikimedia.org/r/657941 (owner: Mstyles)
[19:21:38] <Urbanecm>	 I'll ping you when ready Jayprakash12345 
[19:21:56] <Jayprakash12345>	 Urbanecm: Okay :)
[19:22:45] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 53419ab6c0f2c306a68edb8979106bd42536211a: arwiki: Configure wgGEHomepageManualAssignmentMentorsList (T273060) (duration: 00m 59s)
[19:22:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:50] <stashbot>	 T273060: Configure wgGEHomepageManualAssignmentMentorsList for ar.wikipedia.org - https://phabricator.wikimedia.org/T273060
[19:23:11] <wikibugs>	 (CR) Urbanecm: [C: -1] Add accountcreator user rights group (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/659000 (https://phabricator.wikimedia.org/T269067) (owner: Jayprakash12345)
[19:23:19] <Urbanecm>	 ^^ Jayprakash12345 ^^
[19:24:57] <icinga-wm>	 RECOVERY - Check no envoy runtime configuration is left persistent on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[19:26:34] <Jayprakash12345>	 Urbanecm: Yes, it already exists. Checked at https://mr.wikisource.org/w/index.php?title=Special:UserList&group=accountcreator.
[19:26:46] <Jayprakash12345>	 Thanks for pointing out.
[19:26:51] <Urbanecm>	 Jayprakash12345: I mean, it doesn't make sense to redefine it :)
[19:28:02] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2217.codfw.wmnet'] `  an...
[19:28:29] <wikibugs>	 (Abandoned) Jayprakash12345: Add accountcreator user rights group [mediawiki-config] - https://gerrit.wikimedia.org/r/659000 (https://phabricator.wikimedia.org/T269067) (owner: Jayprakash12345)
[19:30:03] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2218.codfw.wmnet'] `  an...
[19:30:06] <wikibugs>	 (PS4) Jcrespo: bacula: Fix bug on realizing backup sets on motd [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686)
[19:30:31] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:30:33] <Urbanecm>	 Jayprakash12345: anything else?
[19:31:48] <Jayprakash12345>	 Urbanecm: Not for deployment, just waiting for your comment on the task https://phabricator.wikimedia.org/T269067. I asked a question there.
[19:31:55] <Urbanecm>	 will have a look Jayprakash12345 
[19:32:36] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2219.codfw.wmnet'] `  an...
[19:34:10] <wikibugs>	 (PS1) Krinkle: objectcache: fix broken for loop in RedisBagOStuff::doSetMulti() [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658939 (https://phabricator.wikimedia.org/T273006)
[19:36:23] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:27] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1263.eqiad.wmnet'] `  an...
[19:40:11] <wikibugs>	 (CR) Jcrespo: "I think this was disabled, probably not on purpose, by a refactoring by Faidon 7 years ago (according to blame). CCing him, please shout i" [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: Jcrespo)
[19:40:35] <wikibugs>	 (PS1) Bstorm: wikireplicas: open the firewall for multiinstance databases [homer/public] - https://gerrit.wikimedia.org/r/659070 (https://phabricator.wikimedia.org/T271476)
[19:44:43] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2221.codfw.wmnet with reason: REIMAGE
[19:44:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:49] <wikibugs>	 (CR) Jcrespo: "Seems to work: https://puppet-compiler.wmflabs.org/compiler1001/27705/" [puppet] - https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: Jcrespo)
[19:46:53] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2221.codfw.wmnet with reason: REIMAGE
[19:46:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:30] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "compiled on everything - noop, the special cases are already 404 in prod catalog: https://puppet-compiler.wmflabs.org/compiler1002/27702/" [puppet] - https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[19:50:50] <mutante>	 Amir1: ^ base :p
[19:51:37] <mutante>	 "based"
[19:54:34] <wikibugs>	 (CR) Nskaggs: [C: +1] "While this might not be the long term solution, pending https://phabricator.wikimedia.org/T267376, I'm in support of ensuring we can condu" [homer/public] - https://gerrit.wikimedia.org/r/659070 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[19:54:52] <wikibugs>	 (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1001/27703/" [puppet] - https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[19:54:55] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] etcd::replication: hiera -> lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[19:56:58] <wikibugs>	 (CR) Dzahn: "noop conf1005,conf2003" [puppet] - https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:00:05] <jouncebot>	 dancy and brennen: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T2000).
[20:01:11] <wikibugs>	 (PS1) Dzahn: base::certificates: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659071 (https://phabricator.wikimedia.org/T209953)
[20:01:23] <wikibugs>	 (PS1) Urbanecm: Undeploy cswiki birthday logo [mediawiki-config] - https://gerrit.wikimedia.org/r/659072
[20:01:38] <Urbanecm>	 dancy: brennen: mind me shipping the above?
[20:02:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] base::certificates: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659071 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:02:59] <dancy>	 Urbanecm: go ahead.
[20:03:03] <Urbanecm>	 thank you!
[20:03:11] <wikibugs>	 (CR) Urbanecm: [C: +2] Undeploy cswiki birthday logo [mediawiki-config] - https://gerrit.wikimedia.org/r/659072 (owner: Urbanecm)
[20:03:56] <wikibugs>	 (PS1) Dzahn: tlsproxy::prometheus: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953)
[20:04:08] <wikibugs>	 (Merged) jenkins-bot: Undeploy cswiki birthday logo [mediawiki-config] - https://gerrit.wikimedia.org/r/659072 (owner: Urbanecm)
[20:04:41] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1263.eqiad.wmnet
[20:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:55] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2219.codfw.wmnet
[20:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:11] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2218.codfw.wmnet
[20:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] tlsproxy::prometheus: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:06:21] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: 6c5dd65e6138eb32db8059720a2149d4728763e7: Undeploy cswiki birthday logo (duration: 01m 06s)
[20:06:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:40] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized logos/config.yaml: 6c5dd65e6138eb32db8059720a2149d4728763e7: Undeploy cswiki birthday logo (duration: 01m 05s)
[20:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:47] <Urbanecm>	 dancy: all done, thanks again :)
[20:07:55] <dancy>	 np
[20:09:04] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2216.codfw.wmnet
[20:09:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:23] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1263.eqiad.wmnet
[20:10:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:11] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:13:08] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] objectcache: fix broken for loop in RedisBagOStuff::doSetMulti() [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658939 (https://phabricator.wikimedia.org/T273006) (owner: Krinkle)
[20:13:57] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2219.codfw.wmnet
[20:14:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:23] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:16:33] <icinga-wm>	 PROBLEM - Memcached on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[20:18:41] <icinga-wm>	 RECOVERY - Memcached on mwdebug2001 is OK: TCP OK - 0.032 second response time on 10.192.0.98 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[20:18:55] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2218.codfw.wmnet
[20:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:57] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:25:19] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:25:44] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2216.codfw.wmnet
[20:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:08] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1405.eqiad.wmnet with reason: REIMAGE
[20:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:55] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:59] <brennen>	 !log 1.36.0-wmf.28 (T271342): taking over train while dancy is afk; waiting on [[gerrit:658939]] to merge and will sync for verification on testwikis
[20:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:02] <stashbot>	 T271342: 1.36.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T271342
[20:31:11] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1405.eqiad.wmnet with reason: REIMAGE
[20:31:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:38] <wikibugs>	 (PS2) Dzahn: base::certificates: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659071 (https://phabricator.wikimedia.org/T209953)
[20:31:58] <wikibugs>	 (PS2) Dzahn: tlsproxy::prometheus: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953)
[20:32:59] <wikibugs>	 (CR) Cwhite: profile: update netdev to output ECS-formatted logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[20:33:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] tlsproxy::prometheus: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:34:48] <wikibugs>	 (CR) Cwhite: [C: +2] profile: add ecs pre and post filters to pipeline [puppet] - https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[20:35:14] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2221.codfw.wmnet'] `  an...
[20:35:20] <wikibugs>	 (PS1) Dzahn: monitoring::service: hiera -> lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659075
[20:37:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] monitoring::service: hiera -> lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659075 (owner: Dzahn)
[20:37:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2221.codfw.wmnet
[20:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:32] <wikibugs>	 (PS1) Jdlrobson: Enable language in header on beta cluster for QA purposes [mediawiki-config] - https://gerrit.wikimedia.org/r/659077 (https://phabricator.wikimedia.org/T260738)
[20:40:43] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2221.codfw.wmnet
[20:40:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:14] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2217.codfw.wmnet
[20:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:45] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2217 is CRITICAL: Host mw2217 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[20:43:57] <wikibugs>	 (PS1) Volans: mypy: temporary force upper version [software/spicerack] - https://gerrit.wikimedia.org/r/659078
[20:43:57] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2217.codfw.wmnet
[20:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:03] <wikibugs>	 (Merged) jenkins-bot: objectcache: fix broken for loop in RedisBagOStuff::doSetMulti() [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658939 (https://phabricator.wikimedia.org/T273006) (owner: Krinkle)
[20:44:15] <icinga-wm>	 PROBLEM - Memcached on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[20:44:37] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2299.codfw.wmnet
[20:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:39] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2222.codfw.wmnet with reason: REIMAGE
[20:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:46] <wikibugs>	 (PS1) Jdlrobson: Disable max-width on page namespace for wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/659079 (https://phabricator.wikimedia.org/T260091)
[20:47:44] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2222.codfw.wmnet with reason: REIMAGE
[20:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:36] <wikibugs>	 SRE, vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (Dzahn) @klausman Could you add the new cluster prefixes for ml (ml-etcd and others) to https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers ? that would be nice, thank you!
[20:51:52] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1405.eqiad.wmnet'] `  an...
[20:53:10] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2246.codfw.wmnet with reason: REIMAGE
[20:53:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:31] <wikibugs>	 (CR) Volans: [C: +2] "Self-merging to unblock all pending CRs. I'll revert the temporary fix once upstream has fixed the issue." [software/spicerack] - https://gerrit.wikimedia.org/r/659078 (owner: Volans)
[20:55:11] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2246.codfw.wmnet with reason: REIMAGE
[20:55:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:28] <wikibugs>	 SRE, vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (Dzahn) a:Dzahn
[20:58:49] <logmsgbot>	 !log brennen@deploy1001 Synchronized php-1.36.0-wmf.28/includes/libs/objectcache/RedisBagOStuff.php: Backport: [[gerrit:658780|objectcache: fix broken for loop in RedisBagOStuff::doSetMulti() (T273006)]] (duration: 01m 07s)
[20:58:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:56] <stashbot>	 T273006: MediumSpecificBagOStuff.php Undefined offset errors - https://phabricator.wikimedia.org/T273006
[21:00:01] <brennen>	 Krinkle, AaronSchulz: patch for above is currently on testwikis.
[21:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T2100).
[21:00:10] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[21:00:59] <wikibugs>	 (Merged) jenkins-bot: mypy: temporary force upper version [software/spicerack] - https://gerrit.wikimedia.org/r/659078 (owner: Volans)
[21:05:10] <wikibugs>	 ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (RobH)
[21:05:18] <wikibugs>	 ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (RobH)
[21:05:28] <wikibugs>	 ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (RobH)
[21:05:30] <wikibugs>	 SRE, ops-codfw, netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (RobH)
[21:08:18] <wikibugs>	 (PS6) Volans: icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond)
[21:08:32] <wikibugs>	 (PS6) Volans: Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972 (owner: Muehlenhoff)
[21:08:44] <wikibugs>	 (PS2) Volans: puppet: add puppetmaster retrieval [software/spicerack] - https://gerrit.wikimedia.org/r/659008 (owner: David Caro)
[21:08:58] <wikibugs>	 (PS2) Volans: remote: allow prepending every command with sudo [software/spicerack] - https://gerrit.wikimedia.org/r/659009 (owner: David Caro)
[21:09:18] <wikibugs>	 (PS2) Volans: (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - https://gerrit.wikimedia.org/r/658626 (owner: Jbond)
[21:09:45] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@1c9d487]: airflow: hourly tasks must wait for yesterdays daily tank
[21:09:45] <logmsgbot>	 !log ebernhardson@deploy1001 deploy aborted: airflow: hourly tasks must wait for yesterdays daily tank (duration: 00m 00s)
[21:09:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:47] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@1c9d487]: airflow: hourly tasks must wait for yesterdays daily task
[21:09:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:39] <Urbanecm>	 brennen: hey, what's the status of train? It would be great to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/655148/ into both wmf.28/wmf.27 if possible :)
[21:15:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - https://gerrit.wikimedia.org/r/658626 (owner: Jbond)
[21:15:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add a ferm module to Spicerack [software/spicerack] - https://gerrit.wikimedia.org/r/658972 (owner: Muehlenhoff)
[21:15:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppet: add puppetmaster retrieval [software/spicerack] - https://gerrit.wikimedia.org/r/659008 (owner: David Caro)
[21:15:20] <brennen>	 Urbanecm: awaiting validation of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/658780
[21:15:34] <Urbanecm>	 ack :)
[21:16:36] <wikibugs>	 SRE, SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (Legoktm) I believe we just need approval from @sdkim's manager now.  (Also in the future please use the form at https://wikitech.wikimedia.org/wiki/Production_access#Filing_the_request wh...
[21:16:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] remote: allow prepending every command with sudo [software/spicerack] - https://gerrit.wikimedia.org/r/659009 (owner: David Caro)
[21:17:11] <wikibugs>	 (CR) Urbanecm: "This change is ready for review." [core] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/658940 (https://phabricator.wikimedia.org/T271551) (owner: Urbanecm)
[21:17:36] <wikibugs>	 (PS1) Urbanecm: Fix fetching ipblock-exempt within BlockManager::getUserBlock [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658941 (https://phabricator.wikimedia.org/T271551)
[21:17:41] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@1c9d487]: airflow: hourly tasks must wait for yesterdays daily task (duration: 07m 54s)
[21:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:47] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks for the addition!" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond)
[21:19:49] <wikibugs>	 SRE, SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (Legoktm)
[21:20:54] <wikibugs>	 SRE, SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (Legoktm) p:Triage→Medium
[21:21:03] <wikibugs>	 SRE, SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (Legoktm) a:sdkim @sdkim: You will also need to agree to {L3}.
[21:21:59] <wikibugs>	 SRE, SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (Legoktm) a:ppelberg→Ottomata
[21:23:41] <wikibugs>	 SRE, Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (Legoktm)
[21:23:53] <Urbanecm>	 thanks legoktm 
[21:24:06] <icinga-wm>	 PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:24:28] <legoktm>	 :)
[21:27:43] <wikibugs>	 SRE, vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (Legoktm) p:Triage→Medium
[21:28:49] <wikibugs>	 SRE, Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (Legoktm) p:Triage→Low
[21:29:05] <wikibugs>	 SRE, vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (akosiaris) LGTM. Docs for proceeding with the creation at https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM
[21:30:18] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:34:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:36:25] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2222.codfw.wmnet'] `  an...
[21:36:48] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:37:02] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:15] <wikibugs>	 SRE, SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (Ottomata) @Legoktm FYI, Seve does not need ssh access (but can certainly have it if he wants it!).  The user in data.yaml can have an empty array for ssh_keys, e.g. `ssh_keys: []`.  https...
[21:37:40] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2246.codfw.wmnet'] `  an...
[21:38:16] <legoktm>	 ottomata: so is L3 only needed if they're getting ssh access? I read that wiki page but it wasn't obvious to me
[21:38:22] <wikibugs>	 SRE, vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (Dzahn) a:Dzahn→None
[21:38:54] <wikibugs>	 SRE, DNS, Mail, Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (pkang) @drochford hey david, based on Andre's last comment, would the team be open to have the emails be sent from a subdomain like @zendesk.wikimedia.org?
[21:40:42] <ottomata>	 Hmm, legoktm  good q.  I'd say they probably should read that, for the Handling sensitive data bit.   but do they need to sign it?  hm.  
[21:41:01] <ottomata>	 this is relatively new support for us: being able to grant access to some of this data without ssh login
[21:41:26] <ottomata>	 will send email to you and moritz and luca asking
[21:41:39] <legoktm>	 ok, thanks
[21:41:53] <legoktm>	 since we're still waiting on manager approval I don't think it'll slow anything down in the meantime
[21:44:58] <ottomata>	 k
[21:46:06] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2217 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:48:52] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ae24e12]: repoint ores thresholds to yesterday
[21:48:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:40] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:51:12] <shdubsh>	 ^^ looking
[21:51:15] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ae24e12]: repoint ores thresholds to yesterday (duration: 02m 23s)
[21:51:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:09] <wikibugs>	 (PS1) Dzahn: base: adjust data type for $debdeploy_filter_services [puppet] - https://gerrit.wikimedia.org/r/659084
[21:52:34] <wikibugs>	 (PS1) Effie Mouzeli: WIP: memcached: enable the use of unix socket in memcached [puppet] - https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115)
[21:53:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: memcached: enable the use of unix socket in memcached [puppet] - https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[21:53:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] base: adjust data type for $debdeploy_filter_services [puppet] - https://gerrit.wikimedia.org/r/659084 (owner: Dzahn)
[21:54:17] <wikibugs>	 (PS1) Gergő Tisza: Fix BaseModule::BASE_CSS_CLASS visibility [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658943 (https://phabricator.wikimedia.org/T273099)
[21:55:25] <wikibugs>	 (PS1) Ahmon Dancy: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/659086
[21:55:27] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/659086 (owner: Ahmon Dancy)
[21:56:12] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/659086 (owner: Ahmon Dancy)
[21:57:12] <tgr_>	 brennen / dancy: the patch above fixes a probable cause of logspam / minor breakage on group1. I can deploy it in the backport window if it doesn't fit into the train schedule; it doesn't affect that many users.
[21:57:14] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[21:57:43] <brennen>	 tgr_: we're just now rolling forward to group0
[21:57:48] <logmsgbot>	 !log dancy@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.28
[21:57:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:05] <wikibugs>	 (PS1) Andrew Bogott: base.pp: move value_type and merge behavior into the options hash [puppet] - https://gerrit.wikimedia.org/r/659087
[21:58:08] <dancy>	 tgr_: Thanks for the offer!
[21:58:22] <brennen>	 tgr_, dancy: i do notice a few of those BASE_CSS_CLASS ones - maybe it would be good to have that before we go forward to group1?
[21:58:29] <brennen>	 if it seems likely to blow up...
[21:58:57] <dancy>	 Agreed.  It's unclear what effect it has.  A ticket for that got filed during log triage I think
[21:58:59] <wikibugs>	 (PS2) Andrew Bogott: base.pp: move value_type and merge behavior into the options hash [puppet] - https://gerrit.wikimedia.org/r/659087
[21:59:18] <dancy>	 yeah. https://phabricator.wikimedia.org/T273099
[21:59:20] <wikibugs>	 (PS1) Dzahn: base: remove Hash data type for $debdeploy_filter_services [puppet] - https://gerrit.wikimedia.org/r/659088
[21:59:35] <tgr_>	 It's a trivial fix, should be safe to backport. Most wikis using the feature are on group2 so up to you.
[22:00:12] <dancy>	 OK. I see the backport cherry pick: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/658943/
[22:00:40] <brennen>	 (keeping one eye on things, but i've a meeting for the next ~hour)
[22:01:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] base: remove Hash data type for $debdeploy_filter_services [puppet] - https://gerrit.wikimedia.org/r/659088 (owner: Dzahn)
[22:01:18] <wikibugs>	 (PS2) Effie Mouzeli: WIP: memcached: enable the use of unix socket in memcached [puppet] - https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115)
[22:01:25] <wikibugs>	 (CR) Kosta Harlan: [C: +1] Fix BaseModule::BASE_CSS_CLASS visibility [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658943 (https://phabricator.wikimedia.org/T273099) (owner: Gergő Tisza)
[22:01:54] <dancy>	 tgr_: Go ahead and deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/658943 during your window.  I'll watch logs as well
[22:01:56] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[22:02:08] <tgr_>	 ack
[22:04:08] <tgr_>	 you mean ignore it for the moment, right? The backport window is two hours from now.
[22:05:03] <dancy>	 Yeah.  That should be ok.
[22:06:06] <dancy>	 That said, it doesn't look like it would conflict with anything to do right now.
[22:07:34] <icinga-wm>	 RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:11:24] <wikibugs>	 (CR) Andrew Bogott: [C: +2] base.pp: move value_type and merge behavior into the options hash [puppet] - https://gerrit.wikimedia.org/r/659087 (owner: Andrew Bogott)
[22:13:32] <wikibugs>	 (PS3) Effie Mouzeli: WIP: memcached: enable the use of unix socket in memcached [puppet] - https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115)
[22:14:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: memcached: enable the use of unix socket in memcached [puppet] - https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[22:16:14] <wikibugs>	 (Abandoned) Dzahn: base: remove Hash data type for $debdeploy_filter_services [puppet] - https://gerrit.wikimedia.org/r/659088 (owner: Dzahn)
[22:16:34] <wikibugs>	 (Abandoned) Dzahn: base: adjust data type for $debdeploy_filter_services [puppet] - https://gerrit.wikimedia.org/r/659084 (owner: Dzahn)
[22:16:52] <wikibugs>	 (PS4) Effie Mouzeli: WIP: memcached: enable the use of unix socket in memcached [puppet] - https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115)
[22:19:23] <wikibugs>	 (PS3) Dzahn: tlsproxy::prometheus: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953)
[22:20:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] tlsproxy::prometheus: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[22:23:20] <wikibugs>	 (PS4) Dzahn: tlsproxy::prometheus: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953)
[22:26:49] <wikibugs>	 (PS1) Ahmon Dancy: objectcache: return false during more error cases in RedisBagOStuff::*Multi() methods [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658945
[22:31:26] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:32:01] <wikibugs>	 (PS2) Legoktm: Allow talking to the registry over HTTP [docker-images/docker-report] - https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696)
[22:32:08] <wikibugs>	 (CR) Legoktm: [C: +2] Allow talking to the registry over HTTP [docker-images/docker-report] - https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[22:32:32] <wikibugs>	 (PS2) Dzahn: monitoring::service: hiera -> lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659075
[22:34:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] monitoring::service: hiera -> lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659075 (owner: Dzahn)
[22:34:35] <wikibugs>	 (Merged) jenkins-bot: Allow talking to the registry over HTTP [docker-images/docker-report] - https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[22:37:46] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:20] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1405.eqiad.wmnet
[22:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:25] <wikibugs>	 (PS1) Legoktm: d/changelog: Bump version to 0.0.11 [docker-images/docker-report] - https://gerrit.wikimedia.org/r/659091
[22:40:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] d/changelog: Bump version to 0.0.11 [docker-images/docker-report] - https://gerrit.wikimedia.org/r/659091 (owner: Legoktm)
[22:41:06] <wikibugs>	 (PS3) Dzahn: monitoring::service: hiera -> lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/659075
[22:42:53] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1405.eqiad.wmnet
[22:42:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:27] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2222.codfw.wmnet
[22:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:10] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2222.codfw.wmnet
[22:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:41] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2246.codfw.wmnet
[22:46:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:26] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2246.codfw.wmnet
[22:47:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:52] <wikibugs>	 (PS1) Cwhite: profile: place drop ecs filter in place of ecs pre-filter [puppet] - https://gerrit.wikimedia.org/r/659092
[22:49:47] <wikibugs>	 SRE, Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (Dzahn) running 'scap pull' on all hosts that are being re-pooled
[22:49:54] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:49:57] <wikibugs>	 (CR) Cwhite: [C: +2] profile: place drop ecs filter in place of ecs pre-filter [puppet] - https://gerrit.wikimedia.org/r/659092 (owner: Cwhite)
[22:51:05] <wikibugs>	 (CR) Legoktm: "Hmm, was I not supposed to push the upstream/0.0.11 tag initially?" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/659091 (owner: Legoktm)
[22:53:44] <wikibugs>	 (PS2) Jdlrobson: Disable max-width on page namespace for wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/659079 (https://phabricator.wikimedia.org/T260091)
[22:54:30] <wikibugs>	 (PS1) Cwhite: profile: get the ecs pre-filter filename right [puppet] - https://gerrit.wikimedia.org/r/659093
[22:54:55] <wikibugs>	 (CR) jerkins-bot: [V: -1] Disable max-width on page namespace for wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/659079 (https://phabricator.wikimedia.org/T260091) (owner: Jdlrobson)
[22:55:34] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[22:56:05] <wikibugs>	 (CR) Cwhite: [C: +2] profile: get the ecs pre-filter filename right [puppet] - https://gerrit.wikimedia.org/r/659093 (owner: Cwhite)
[22:57:04] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[22:57:46] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[22:59:16] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[23:06:43] <wikibugs>	 SRE, Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Urbanecm) @Cyberpower678 This doesn't sound to be a complete dump of a raw request. I'm pretty confident the response has at least one line (the one starting with `...
[23:10:44] <wikibugs>	 (PS1) Legoktm: docker_registry_ha: Have build-homepage talk directly to the registry [puppet] - https://gerrit.wikimedia.org/r/659095 (https://phabricator.wikimedia.org/T179696)
[23:11:04] <wikibugs>	 (PS2) Legoktm: docker_registry_ha: Have build-homepage talk directly to the registry [puppet] - https://gerrit.wikimedia.org/r/659095 (https://phabricator.wikimedia.org/T179696)
[23:14:13] <wikibugs>	 SRE, Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Legoktm) >>! In T273003#6778468, @Cyberpower678 wrote: > I believe it only does maxlag on write requests, like when it edits.  These are all read requests.  You sho...
[23:16:09] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27710/console" [puppet] - https://gerrit.wikimedia.org/r/659095 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[23:16:52] <wikibugs>	 (CR) Legoktm: [C: +2] "Also tested on registry2002 manually." [puppet] - https://gerrit.wikimedia.org/r/659095 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[23:17:24] <wikibugs>	 SRE, Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (Cyberpower678) >>! In T273003#6782382, @Urbanecm wrote: > @Cyberpower678 This doesn't sound to be a complete dump of a raw request. I'm pretty confident the respons...
[23:18:36] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:19:18] <icinga-wm>	 PROBLEM - SSH on logstash2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:21:46] <legoktm>	 ok, registry2002 should really stop flapping now
[23:24:15] <wikibugs>	 SRE, vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (Legoktm) p:Triage→Medium
[23:26:53] <wikibugs>	 (PS3) Jdlrobson: Disable max-width on page namespace for wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/659079 (https://phabricator.wikimedia.org/T260091)
[23:30:26] <shdubsh>	 !log reboot logstash2006
[23:30:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:47] <mutante>	 legoktm: cool!
[23:32:52] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[23:33:26] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[23:33:26] <icinga-wm>	 PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:48] <icinga-wm>	 RECOVERY - SSH on logstash2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:35:04] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[23:35:38] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[23:40:58] <icinga-wm>	 RECOVERY - Memcached on mwdebug2001 is OK: TCP OK - 0.032 second response time on 10.192.0.98 port 11210 https://wikitech.wikimedia.org/wiki/Memcached