[00:00:05] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:09:20] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn)
[00:11:54] (PS2) Ryan Kemper: wdqs: use envoy for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/657913 (https://phabricator.wikimedia.org/T272713)
[00:14:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1003.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:14:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1003.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:00] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1008.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:01] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1008.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:01] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1011.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:02] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1011.eqiad.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:03] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2004.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:04] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2004.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:04] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2005.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:05] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2005.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:06] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2006.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:07] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2006.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:26] !log T272713 [Deploy envoy for `wdqs-internal`] Downtimed all `wdqs-internal` hosts on icinga
[00:15:27] (PS1) Dzahn: parsoid/testreduce: add envoy on testreduce1001 for TLS termination [puppet] - https://gerrit.wikimedia.org/r/658706 (https://phabricator.wikimedia.org/T266509)
[00:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:29] T272713: Failing HTTP check on WDQS servers after latest deployment - https://phabricator.wikimedia.org/T272713
[00:15:37] (CR) Ryan Kemper: [C: +2] wdqs: use envoy for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/657913 (https://phabricator.wikimedia.org/T272713) (owner: Ryan Kemper)
[00:16:24] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2008.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:16:25] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2008.codfw.wmnet with reason: Enabling envoy for wdqs-internal
[00:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:53] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/27683/testreduce1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/658706 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:20:00] !log T272713 [Deploy envoy for `wdqs-internal`] Disabled puppet on all `wdqs-internal` hosts; merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/657913
[00:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:36] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:13] (PS1) Dzahn: trafficserver/parsoid: switch TLS termination to 443, upstream port 8001 [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509)
[00:28:02] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:16] (CR) Dzahn: [C: -1] "wrong hiera key!" [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:30:49] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq...
[00:31:52] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:29] (CR) CDanis: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/657889 (https://phabricator.wikimedia.org/T272539) (owner: Jbond)
[00:35:40] (PS2) Dzahn: trafficserver/parsoid: switch TLS termination to 443, upstream port 8001 [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509)
[00:36:18] !log [Deploy envoy for `wdqs-internal`] `...Error while evaluating a Function Call, secret(): invalid secret ssl/wdqs-internal.discovery.wmnet.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/profile/manifests/tlsproxy/envoy.pp, line: 129) on node wdqs1003.eqiad.wmnet`
[00:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:21] (PS3) Dzahn: trafficserver/parsoid: switch TLS termination to 443, upstream port 8001 [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509)
[00:38:18] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27684/testreduce1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/658708 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:38:44] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:49] (Forgot to prepend ticket number to previous SAL log message, so sending it again):
[00:44:57] !log T272713 [Deploy envoy for `wdqs-internal`] `...Error while evaluating a Function Call, secret(): invalid secret ssl/wdqs-internal.discovery.wmnet.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/profile/manifests/tlsproxy/envoy.pp, line: 129) on node wdqs1003.eqiad.wmnet`
[00:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:07] T272713: Failing HTTP check on WDQS servers after latest deployment - https://phabricator.wikimedia.org/T272713
[00:45:30] !log T272713 [Deploy envoy for `wdqs-internal`] Discovered source of the above failure; the secret key in the puppetmaster `/srv/private` repo has a typo in its name (my error): it had `wqds` instead of `wdqs`. Opening up a patch now
[00:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:47] ryankemper: you are not the only one. "wqds" even made it into the "typos" file in operations/puppet because I did that in the past. of course it won't help private repo
[00:47:24] Heh
[00:47:40] Something about the up-side down symmetry of the d and the q makes it extra easy to not spot
[00:47:50] haha yea, i think that's true
[00:49:50] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2296.codfw.wmnet with reason: REIMAGE
[00:49:53] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2295.codfw.wmnet with reason: REIMAGE
[00:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:16] (PS7) Jeena Huneidi: Helmfile for continuous deployment [deployment-charts] - https://gerrit.wikimedia.org/r/634354 (https://phabricator.wikimedia.org/T214158)
[00:51:39] !log T272713 [Deploy envoy for `wdqs-internal`] Fixed typo in private key in commit `ea152df802b55e939d34494a4965ed83a80a24f2`. Puppet run on `wdqs1003` was successful as a result. Monitoring...
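The `wqds`/`wdqs` mix-up discussed above is the kind of error a simple known-typos scan can catch before a puppet run fails. The following is only a minimal sketch of that idea: the `KNOWN_TYPOS` table and the `find_typos` helper are hypothetical names, the misspelled file name is assumed from the description of the incident, and nothing here reproduces the actual "typos" file or tooling in operations/puppet.

```python
# Hypothetical typo table; the real "typos" file in operations/puppet is not reproduced here.
KNOWN_TYPOS = {"wqds": "wdqs"}


def find_typos(names, known_typos=KNOWN_TYPOS):
    """Return (name, typo, suggestion) for every name that contains a known typo."""
    hits = []
    for name in names:
        for typo, fix in known_typos.items():
            if typo in name:
                hits.append((name, typo, fix))
    return hits


# Assumed example names: the first mirrors the misspelling described in the log,
# the second is the name that the failing secret() call asked for.
names = [
    "ssl/wqds-internal.discovery.wmnet.key",
    "ssl/wdqs-internal.discovery.wmnet.key",
]

for name, typo, fix in find_typos(names):
    print(f"{name}: contains '{typo}', did you mean '{fix}'?")
```

As noted in the conversation, a check like this only helps in repositories where it actually runs; a private repo would need its own hook.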
[00:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:43] T272713: Failing HTTP check on WDQS servers after latest deployment - https://phabricator.wikimedia.org/T272713
[00:52:13] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2295.codfw.wmnet with reason: REIMAGE
[00:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:13] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2296.codfw.wmnet with reason: REIMAGE
[00:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:57] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 919918056 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:09] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11328 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:53] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 997463256 and 124 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:25] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 109056 and 151 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:41] RECOVERY - WDQS SPARQL on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:10:00] ryankemper: SPARQL monitoring fixed?:) nice!
[01:10:45] ryankemper: fyi, because we are doing almost the same thing for different services.. i just ran into https://phabricator.wikimedia.org/T255568
[01:10:51] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@246b640]: remove link recommendations from hourly transfer deps
[01:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:06] i was like "why.. is this not working" and it's that envoy is not listening on v6
[01:11:14] while ATS tries to use it where it exists
[01:11:32] at least I saw that ticket earlier and that is it here as well
[01:12:25] Yeah SPARQL monitoring should be fixed now that there's actually a TLS port for the check to hit (well, about to re-enable puppet on the rest of the fleet but after that)
[01:14:06] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2295.codfw.wmnet', 'mw22...
[01:14:22] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@246b640]: remove link recommendations from hourly transfer deps (duration: 03m 31s)
[01:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:37] note BTW that the alert wasn't working at all previously (previously meaning before we broke it further thus necessitating this envoy rollout), so now we will see some `WDQS SPARQL` flapping whenever an instance's blazegraph locks up...still thinking about the best way to cut down on that noise, I think I have an idea though
[01:15:04] mutante: interesting with the ipv4 vs ipv6 stuff
[01:16:28] Looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/629343 is in the works to add an option to fix that, so I'll keep an eye on that patch
[01:17:16] ryankemper: aha, good to know. one way to influence Icinga checks is to adjust the number of times it should retry before it calls a SOFT state a HARD state (only HARD state sends notifications), default is 3
[01:17:36] yea, same here. I will continue with the v6 issue on this tomorrow.
[01:17:48] the fix is for services_proxy but we are outside that
[01:17:51] so gotta look more
[01:17:54] Great, I was imagining a threshold bump and that retry option would do the same
[01:19:29] (Well threshold if it were like a "fire off alert if this is broken for 5 minutes" but it sounds like from what you said it's just running the check and firing based off the exit code)
[01:19:42] The second part (that requires a bit of thinking) is because these wdqs instances can lock up indefinitely, we need something like a crappy cronjob/systemd timer that can probe its instance's blazegraph to detect deadlock and then restart the process
[01:20:18] I remember working with k8s at my last job it had a hacky `docker-healthcheck` container that just ran a command like `docker ps` and restarted docker if it took >60s to respond, so I'm envisioning that general concept
[01:20:29] ryankemper: modules/monitoring/manifests/service.pp: max_check_attempts => $retries,
[01:20:37] When a service or host check results in a non-OK or non-UP state and the service check has not yet been (re)checked the number of times specified by the max_check_attempts directive in the service or host definition. This is called a soft error.
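The SOFT/HARD distinction described above can be made concrete with a toy model. This is a deliberately simplified sketch, not Icinga's implementation: the `CheckState` class and its `record` method are hypothetical names, and the model ignores re-notification intervals, downtimes, and flap detection. It only shows that a problem notification fires once the number of consecutive failures reaches `max_check_attempts`, and that a recovery notifies only after a HARD problem.

```python
class CheckState:
    """Toy model of SOFT vs HARD states for a single service check (assumed semantics)."""

    def __init__(self, max_check_attempts=3):
        # 3 is the default mentioned in the conversation above.
        self.max_check_attempts = max_check_attempts
        self.failures = 0          # consecutive non-OK results so far
        self.hard_problem = False  # True once the problem state has gone HARD

    def record(self, check_ok):
        """Feed one check result; return True if a notification would be sent."""
        if check_ok:
            # A recovery notifies only if the problem had reached HARD state.
            notify = self.hard_problem
            self.failures = 0
            self.hard_problem = False
            return notify
        self.failures += 1
        if not self.hard_problem and self.failures >= self.max_check_attempts:
            self.hard_problem = True  # SOFT -> HARD transition: notify once
            return True
        return False  # still SOFT, or already HARD


# A short blip (two failures, then OK) stays SOFT and never notifies;
# three consecutive failures go HARD and notify, as does the later recovery.
state = CheckState(max_check_attempts=3)
results = [False, False, True, False, False, False, False, True]
notifications = [state.record(ok) for ok in results]
print(notifications)
```

Under this model, raising the retry count trades slower detection for fewer notifications from instances that recover on their own within a few check intervals.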
[01:20:55] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2295.codfw.wmnet
[01:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:57] soft errors are displayed in web UI but don't notify IRC
[01:21:05] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2296.codfw.wmnet
[01:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:13] awesome yeah that's exactly what I want
[01:21:23] !log T272713 [Deploy envoy for `wdqs-internal`] Test queries to `wdqs1003.eqiad.wmnet` passed, and metrics in Grafana (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs-internal&from=1611706751381&to=1611710190405) look good. Rolling out to rest of fleet
[01:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:27] T272713: Failing HTTP check on WDQS servers after latest deployment - https://phabricator.wikimedia.org/T272713
[01:22:32] ryankemper: you can use "event_handler" in Icinga itself. basically that is "if alert X triggers then run command Y"
[01:22:38] example is class { 'icinga::event_handlers::raid':
[01:22:49] that is automatically creating tickets when there is RAID alert
[01:22:57] the command can also be automatic restart
[01:23:03] so it will be auto-fixing based on alerts
[01:23:04] RECOVERY - WDQS SPARQL on wdqs1008 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.086 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:23:56] RECOVERY - WDQS SPARQL on wdqs2005 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:23:56] !log T272713 [Deploy envoy for `wdqs-internal`] Roll-out complete. Will monitor `wdqs-internal` for any issues. All the remaining `WDQS SPARQL` alerts should clear shortly
[01:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:11] so .. that is all included in nagios/icinga already, glad to talk more about it later
[01:25:26] great, I'll probably reach out to ya later this week
[01:25:44] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2295.codfw.wmnet
[01:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:50] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2296.codfw.wmnet
[01:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:16] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:48] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:26] RECOVERY - WDQS SPARQL on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:37:26] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.291 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:38:45] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ee948e0]: transfer_to_es: Enable catchup
[01:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:57] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ee948e0]: transfer_to_es: Enable catchup (duration: 01m 11s)
[01:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:06] RECOVERY - WDQS SPARQL on wdqs2006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:48:11] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.61`. Pre-deploy tests passing on canary `wdqs1003`
[01:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:33] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@6c6b2cb]: 0.3.61
[01:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:38] RECOVERY - WDQS SPARQL on wdqs2008 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:50:18] !log [WDQS Deploy] Tests passing following deploy of `0.3.61` on canary `wdqs1003`; proceeding to rest of fleet
[01:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:23] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@6c6b2cb]: 0.3.61 (duration: 07m 50s)
[01:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:40] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[01:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:57] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[01:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:58:06] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[01:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:23] PROBLEM - Host analytics1073 is DOWN: PING CRITICAL - Packet loss = 100%
[02:21:17] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@9c85a21]: transfer_to_es: start date 2020 -> 2021
[02:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:40] SRE, CAS-SSO: idp.wikimedia.org asking twice for YubiKey - https://phabricator.wikimedia.org/T258029 (Krinkle)
[02:24:16] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@9c85a21]: transfer_to_es: start date 2020 -> 2021 (duration: 02m 59s)
[02:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:53] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:37] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:01] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:32:05] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:15] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 7.929 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:39:13] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:23] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:46:23] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:48:55] !log (Restarted `wdqs-blazegraph` on `wdqs1012`)
[03:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:35] (PS1) Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158)
[04:18:03] (PS2) Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158)
[04:19:40] (CR) jerkins-bot: [V: -1] [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: Jeena Huneidi)
[04:27:55] (PS3) Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158)
[04:29:33] (CR) jerkins-bot: [V: -1] [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: Jeena Huneidi)
[04:30:07] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:36] (PS4) Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158)
[04:37:17] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:36] (CR) Jeena Huneidi: "I am trying to find a way for us to do continuous deployment. I thought it might be good to do it from the deployment server directly sinc" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: Jeena Huneidi)
[04:52:29] SRE, CAS-SSO: SSO Portal: Fix "Remember me" checkbox alignment - https://phabricator.wikimedia.org/T273023 (Krinkle)
[04:53:28] (PS1) Krinkle: apereo_cas: Remove inlined duplicate copy of bootstrap.css [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658759
[04:53:30] (PS1) Krinkle: apereo_cas: Improve loginform design [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/658760 (https://phabricator.wikimedia.org/T273023)
[04:58:33] SRE, CAS-SSO, Patch-For-Review: SSO Portal: Fix "Remember me" checkbox alignment - https://phabricator.wikimedia.org/T273023 (Krinkle) ##### Before {F34035759 height=300} ##### After {F34035761 height=300}
[05:17:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:05] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:25] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:55] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:03] !log Deploy schema change on s3 T270055
[05:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:07] T270055: Schema change for timestamp field of uploadstash - https://phabricator.wikimedia.org/T270055
[05:51:55] twentyafterfour: hey! ready to start in 10 minutes?
[06:00:04] marostegui and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for m3 (phabricator) database master restart. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T0600).
[06:00:38] marostegui: ready when you are
[06:00:45] o/
[06:00:54] !log m3 master restart, phabricator will go on read only - T272596
[06:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:58] T272596: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596
[06:00:58] twentyafterfour: ready!
[06:01:53] !log phabricator is read-only
[06:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:57] marostegui: go :)
[06:02:21] restarting!
[06:02:44] twentyafterfour: done [06:03:13] !log phabricator is read-write [06:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:22] everything looks good [06:03:23] I can edit fine [06:03:26] https://phabricator.wikimedia.org/P13964 [06:03:39] !log phabricator appears to be up and running fine [06:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:55] twentyafterfour: thank you very much! [06:04:02] πŸ–’ [06:04:12] marostegui: you're welcome! no problem at all :) [06:04:16] <3 [06:13:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P13965 and previous config saved to /var/cache/conftool/dbconfig/20210127-061336-marostegui.json [06:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:42] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [06:15:09] (03PS1) 10Marostegui: db1160: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658782 (https://phabricator.wikimedia.org/T258361) [06:15:58] (03CR) 10Marostegui: [C: 03+2] db1160: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658782 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [06:17:40] (03PS1) 10Marostegui: install_server: Do not reimage db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658784 (https://phabricator.wikimedia.org/T258361) [06:18:23] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658784 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [06:21:57] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:07] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL 
- degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:06] 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Joe) >>! In T273003#6778171, @CDanis wrote: > It seems the User-Agent being used is `Peachy MediaWiki Bot API Versio... [06:30:59] (03CR) 10Ema: [C: 03+1] varnish: Set debug=1 in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [06:31:08] (03CR) 10Ema: [C: 03+1] varnish: include X-Client-Port in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) (owner: 10Effie Mouzeli) [06:31:37] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:43] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13966 and previous config saved to /var/cache/conftool/dbconfig/20210127-063930-marostegui.json [06:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:35] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [06:48:46] 10SRE, 10DBA, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar), 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Marostegui) Thanks @Krinkle - I will probably start first with s6 codfw (frwiki,jawiki,ruwiki), and using wikimediadebug to... [06:57:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13967 and previous config saved to /var/cache/conftool/dbconfig/20210127-065715-marostegui.json [06:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:19] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [07:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 T272008', diff saved to https://phabricator.wikimedia.org/P13968 and previous config saved to /var/cache/conftool/dbconfig/20210127-070502-marostegui.json [07:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:07] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [07:05:09] 10SRE, 10Traffic, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Joe) 
p:05Medium→03Low I also checked the logs from yesterday, and there was no error reported by the backend servers (in envoy o... [07:09:27] 10SRE, 10Traffic, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Joe) I'm not even sure this qualifies for the "production error" tags. We're talking about 50 events over the last week, that's way... [07:19:12] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Krinkle) [07:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13969 and previous config saved to /var/cache/conftool/dbconfig/20210127-072135-root.json [07:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:56] (03CR) 10Elukey: [C: 03+1] "LGTM, we can merge anytime, just ping Analytics when doing it so we are aware :)" [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [07:22:13] (03CR) 10Elukey: [C: 03+1] "LGTM, we can merge anytime, just ping Analytics when doing it so we are aware :)" [puppet] - 10https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) (owner: 10Effie Mouzeli) [07:24:48] (03PS7) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) [07:26:29] (03CR) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [07:26:49] !log powercycle analytics1073 - kernel soft lock up bug registered, os needs a reboot [07:26:51] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:22] RECOVERY - Host analytics1073 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [07:29:47] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/27685/" [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [07:30:32] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:49] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [07:30:57] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable bracket matching on the first wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch) [07:31:10] (03PS8) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) [07:33:16] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) [07:33:47] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) [07:34:03] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [07:36:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13970 and previous config saved to /var/cache/conftool/dbconfig/20210127-073638-root.json [07:36:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:49] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) Joe, not all requests are 502s. They are insignificant compared to the amount of requests returning null responses. This is a very problematic occu... [07:36:58] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:51:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13971 and previous config saved to /var/cache/conftool/dbconfig/20210127-075142-root.json [07:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/658455 (owner: 10Legoktm) [07:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13972 and previous config saved to /var/cache/conftool/dbconfig/20210127-075715-marostegui.json [07:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:23] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [08:00:57] (03CR) 10Effie Mouzeli: [V: 03+1] "-o modern includes slab_reassign, PCC https://puppet-compiler.wmflabs.org/compiler1003/27688/mc1025.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/656385 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [08:01:19] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] hiera: clean up memcached configuration [puppet] - 10https://gerrit.wikimedia.org/r/656385 
(https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [08:05:51] (03PS4) 10Effie Mouzeli: profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [08:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13973 and previous config saved to /var/cache/conftool/dbconfig/20210127-080645-root.json [08:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:06] (03PS1) 10Ladsgroup: Add WMCS to the exception of ratelimit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658890 (https://phabricator.wikimedia.org/T209011) [08:07:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P13974 and previous config saved to /var/cache/conftool/dbconfig/20210127-080753-marostegui.json [08:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121', diff saved to https://phabricator.wikimedia.org/P13975 and previous config saved to /var/cache/conftool/dbconfig/20210127-081150-marostegui.json [08:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:44] (03PS1) 10Elukey: Remove duplicate in Hadoop Analytics' HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/658895 [08:27:08] (03CR) 10Elukey: [C: 03+2] Remove duplicate in Hadoop Analytics' HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/658895 (owner: 10Elukey) [08:27:10] (03PS1) 10Marostegui: mariadb: Productionize db1169 in s1 [puppet] - 10https://gerrit.wikimedia.org/r/658896 (https://phabricator.wikimedia.org/T258361) [08:28:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1089 to clone db1169 T258361', diff saved to https://phabricator.wikimedia.org/P13976 and 
previous config saved to /var/cache/conftool/dbconfig/20210127-082826-marostegui.json [08:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:32] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [08:28:56] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 40.88 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [08:29:18] (03PS2) 10Marostegui: mariadb: Productionize db1169 in s1 [puppet] - 10https://gerrit.wikimedia.org/r/658896 (https://phabricator.wikimedia.org/T258361) [08:29:26] !log Stop mysql on db1089 to clone db1169 T258361 [08:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:14] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13978 and previous config saved to /var/cache/conftool/dbconfig/20210127-083618-marostegui.json [08:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:23] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [08:38:25] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) Hi @wiki_willy thanks a lot for following up! I re-done the calculations of the workers' distribution after the last racking and this is what I g... 
[08:39:00] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [08:53:58] (03PS1) 10Thiemo Kreuz (WMDE): Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) [09:00:02] (03CR) 10jerkins-bot: [V: 04-1] Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [09:03:33] !log swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837 [09:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:37] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [09:04:10] !log deploy fix to enable-puppet [09:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:14] (03CR) 10Jbond: [C: 03+2] enable-puppet: allow fall back to enable puppet disabled by root [puppet] - 10https://gerrit.wikimedia.org/r/657889 (https://phabricator.wikimedia.org/T272539) (owner: 10Jbond) [09:11:18] 10Puppet, 10SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10jbond) Seems to be working ` lang=console $ sudo disable-puppet 'test disable puppet with sudo' $ sudo puppet agent -t... 
[09:11:30] 10Puppet, 10SRE: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10jbond) 05Open→03Resolved [09:15:42] (03PS5) 10Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [09:16:10] (03CR) 10Effie Mouzeli: [C: 04-2] WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [09:17:24] (03CR) 10jerkins-bot: [V: 04-1] WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [09:19:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13979 and previous config saved to /var/cache/conftool/dbconfig/20210127-091909-marostegui.json [09:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:14] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [09:20:15] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1169 in s1 [puppet] - 10https://gerrit.wikimedia.org/r/658896 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [09:22:16] (03PS1) 10Gilles: Add snappy dependency to coal [puppet] - 10https://gerrit.wikimedia.org/r/658918 (https://phabricator.wikimedia.org/T273033) [09:31:08] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:21] (03CR) 10Arturo Borrero Gonzalez: "thanks for the patch, comment inline."
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658890 (https://phabricator.wikimedia.org/T209011) (owner: 10Ladsgroup) [09:33:06] (03CR) 10Vgutierrez: [C: 03+2] Migrate hiera() to lookup() and set datatypes in purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/658503 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [09:35:22] (03PS1) 10Filippo Giunchedi: alertmanager: route Icinga compat alerts to sink IRC channel [puppet] - 10https://gerrit.wikimedia.org/r/658919 (https://phabricator.wikimedia.org/T272453) [09:36:36] (03CR) 10Arturo Borrero Gonzalez: "is getting harder for me to track this change with each openstack version." [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [09:36:48] PROBLEM - MegaRAID on an-worker1099 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:36:50] ACKNOWLEDGEMENT - MegaRAID on an-worker1099 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T273034 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:36:54] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10ops-monitoring-bot) [09:37:32] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13980 and previous config saved to /var/cache/conftool/dbconfig/20210127-093802-marostegui.json [09:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:06] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [09:38:09] (03CR) 10Alexandros Kosiaris: service_proxy: add ipv6 config option on services_proxy config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [09:40:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [09:44:01] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10elukey) [09:59:08] (03CR) 10Vgutierrez: [C: 03+2] varnish: Set debug=1 in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [10:00:26] (03PS2) 10Thiemo Kreuz (WMDE): Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) [10:00:55] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: Improve loginform design [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658760 (https://phabricator.wikimedia.org/T273023) (owner: 10Krinkle) [10:02:15] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: Remove inlined duplicate copy of bootstrap.css [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658759 (owner: 
10Krinkle) [10:02:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13981 and previous config saved to /var/cache/conftool/dbconfig/20210127-100220-marostegui.json [10:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:26] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [10:03:43] (03PS1) 10Marostegui: instances.yaml: Add db1169 [puppet] - 10https://gerrit.wikimedia.org/r/658920 (https://phabricator.wikimedia.org/T258361) [10:03:47] (03PS1) 10DCausse: [cirrus] Swith to perfield builder for spaceless languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658921 (https://phabricator.wikimedia.org/T266027) [10:05:06] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1169 [puppet] - 10https://gerrit.wikimedia.org/r/658920 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [10:05:28] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Swith to perfield builder for spaceless languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658921 (https://phabricator.wikimedia.org/T266027) (owner: 10DCausse) [10:05:52] !log reboot matomo1002 for kernel upgrades [10:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:29] (03CR) 10Thiemo Kreuz (WMDE): "This backport was a little more complicated because of multiple merge conflicts. A good way to review the result is to compare it with the" [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [10:11:49] (03CR) 10Thiemo Kreuz (WMDE): "This backport was clean with no conflict. The only additional change is the tiny fix from Id923398." 
[extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [10:12:12] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet [10:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet [10:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:09] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet [10:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:43] (03PS1) 10ArielGlenn: build for buster [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/658923 [10:17:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet [10:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:55] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:18:38] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet [10:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1160 with final weight T258361', diff saved to https://phabricator.wikimedia.org/P13982 and previous config saved to /var/cache/conftool/dbconfig/20210127-102042-marostegui.json [10:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:47] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [10:22:02] (03PS2) 10DCausse: [cirrus] Swith to perfield 
builder for spaceless languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658921 (https://phabricator.wikimedia.org/T266027) [10:23:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet [10:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:27] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet [10:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:26] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:24:38] (03PS1) 10Jbond: add debian build instructions [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658924 [10:24:42] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:25:02] (03CR) 10Jbond: [V: 03+2 C: 03+2] add debian build instructions [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658924 (owner: 10Jbond) [10:27:21] (03PS1) 10Jbond: create debian release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658947 [10:28:06] (03CR) 10Jbond: [V: 03+2 C: 03+2] create debian release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658947 (owner: 10Jbond) [10:29:18] (03PS1) 10Alexandros Kosiaris: termbox: Stop referencing deprecated service-proxy.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/658948 [10:31:00] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Double checked on the deployment host, no diff, merging" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/658948 (owner: 10Alexandros Kosiaris) [10:32:50] (03PS5) 10JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) [10:33:07] (03Merged) 10jenkins-bot: termbox: Stop referencing deprecated service-proxy.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/658948 (owner: 10Alexandros Kosiaris) [10:33:13] (03PS3) 10Thiemo Kreuz (WMDE): Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) [10:33:27] (03PS6) 10JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) [10:33:43] (03CR) 10JMeybohm: Allow the kube-controller-manager to run without superuser permissions (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [10:36:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet [10:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:02] (03CR) 10Awight: [V: 03+1 C: 03+1] "Works for me locally." [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [10:37:28] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:31] (03CR) 10Awight: [V: 03+1 C: 03+1] Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [10:38:32] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:59] (03PS4) 10Jcrespo: mariadb: Remove mariadb module mysqld_safe [puppet] - 10https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559) [10:42:52] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:52] 10SRE, 10envoy, 10serviceops, 10Service-Architecture: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. 
- https://phabricator.wikimedia.org/T266855 (10Joe) [10:44:03] 10SRE, 10envoy, 10serviceops, 10Kubernetes, 10Service-Architecture: Allow canarying new envoy configurations in kubernetes - https://phabricator.wikimedia.org/T265882 (10Joe) [10:44:14] 10SRE, 10envoy, 10serviceops, 10Kubernetes, 10Service-Architecture: Improve envoy configuration CI checks - https://phabricator.wikimedia.org/T265881 (10Joe) [10:44:23] 10SRE, 10envoy, 10serviceops, 10Kubernetes, 10Service-Architecture: Upgrade envoy configuration to use the v3 API - https://phabricator.wikimedia.org/T265880 (10Joe) [10:44:40] 10SRE, 10envoy, 10serviceops, 10Kubernetes, 10Service-Architecture: Consider using a file-based xDS system for envoy in k8s - https://phabricator.wikimedia.org/T265879 (10Joe) [10:44:56] (03CR) 10Vgutierrez: [C: 03+2] varnish: include X-Client-Port in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) (owner: 10Effie Mouzeli) [10:45:12] (03PS2) 10Vgutierrez: varnish: include X-Client-Port in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) (owner: 10Effie Mouzeli) [10:45:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/658628 (owner: 10Alexandros Kosiaris) [10:47:58] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/658572 (owner: 10Hnowlan) [10:48:33] (03CR) 10JMeybohm: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/658628 (owner: 10Alexandros Kosiaris) [10:50:37] (03CR) 10Kormat: [C: 03+1] "Nuke away." 
[puppet] - 10https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo) [10:52:35] (03CR) 10Kormat: [C: 03+1] "Nuke away" [puppet] - 10https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo) [10:57:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143 for kernel upgrade and enablement of report_host', diff saved to https://phabricator.wikimedia.org/P13984 and previous config saved to /var/cache/conftool/dbconfig/20210127-105735-marostegui.json [10:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:48] (03PS1) 10Muehlenhoff: Enable ProxyPreserveHost for the debmonitor Apache site [puppet] - 10https://gerrit.wikimedia.org/r/658952 [10:58:57] (03CR) 10Muehlenhoff: debmonitor: Also allow localhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657793 (owner: 10Muehlenhoff) [10:59:04] (03Abandoned) 10Muehlenhoff: debmonitor: Also allow localhost [puppet] - 10https://gerrit.wikimedia.org/r/657793 (owner: 10Muehlenhoff) [10:59:47] (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove mariadb module mysqld_safe [puppet] - 10https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo) [11:00:07] (03PS7) 10Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - 10https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929) [11:00:09] (03PS1) 10Jcrespo: mariadb: Remove obsolete mariadb.server init script [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559) [11:03:56] (03CR) 10Jcrespo: "Followup/related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/657820/" [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo) [11:06:00] (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove mariadb module mylvmbackup [puppet] - 
10https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo) [11:06:07] (03PS4) 10Jcrespo: mariadb: Remove mariadb module mylvmbackup [puppet] - 10https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559) [11:08:30] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for netbox/apache [puppet] - 10https://gerrit.wikimedia.org/r/656187 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:12:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:12:11] 10SRE, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo) a:03jcrespo [11:14:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:30:32] (03PS3) 10Giuseppe Lavagetto: admin: also remove the old ed25519 key for the time being [puppet] - 10https://gerrit.wikimedia.org/r/635497 [11:32:09] (03PS1) 10Filippo Giunchedi: alertmanager: add phalerts webhook receiver [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) [11:32:12] (03PS1) 10Filippo Giunchedi: alertmanager: add receivers to create tasks from alerts [puppet] - 10https://gerrit.wikimedia.org/r/658957 (https://phabricator.wikimedia.org/T272453) [11:32:15] (03PS1) 10Filippo Giunchedi: prometheus: add job for alertmanager::phab [puppet] - 10https://gerrit.wikimedia.org/r/658958 (https://phabricator.wikimedia.org/T272453) [11:32:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13985 and previous config saved to /var/cache/conftool/dbconfig/20210127-113245-root.json [11:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:34] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/658952 (owner: 10Muehlenhoff) [11:47:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13986 and previous config saved to /var/cache/conftool/dbconfig/20210127-114749-root.json [11:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:59] (03PS1) 10Giuseppe Lavagetto: Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/658964 (https://phabricator.wikimedia.org/T264362) [11:48:52] (03CR) 10Giuseppe Lavagetto: "Please note; this is a cherry-pick of the already merged change 
I177ebb8521d87f2fdbc64550645e490c143a3a2c that was erroneously submitted t" [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/658964 (https://phabricator.wikimedia.org/T264362) (owner: 10Giuseppe Lavagetto) [11:53:42] (03CR) 10Vgutierrez: [C: 03+1] Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/658964 (https://phabricator.wikimedia.org/T264362) (owner: 10Giuseppe Lavagetto) [11:59:18] 10SRE, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1200). [12:00:05] awight and CFisch_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:43] (03CR) 10Muehlenhoff: [C: 03+2] Enable ProxyPreserveHost for the debmonitor Apache site [puppet] - 10https://gerrit.wikimedia.org/r/658952 (owner: 10Muehlenhoff) [12:02:18] o/ [12:02:23] (03PS1) 10ArielGlenn: include correct mysql client package for dumps on buster [puppet] - 10https://gerrit.wikimedia.org/r/658966 [12:02:48] awight: wanna do both, I'm might get interrupted here. [12:02:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13987 and previous config saved to /var/cache/conftool/dbconfig/20210127-120253-root.json [12:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:00] CFisch_WMDE: can do! [12:03:15] great! 
thanks [12:03:36] (03CR) 10Awight: [V: 03+1 C: 03+2] "backport window" [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [12:03:43] (03CR) 10Awight: [V: 03+1 C: 03+2] "backport window" [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [12:03:51] (03PS2) 10Awight: Enable bracket matching on the first wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch) [12:05:23] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) p:05Low→03Medium [12:05:49] (03PS1) 10Jbond: 6.2.7: release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658967 [12:05:56] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) Got another surge of bad responses from production.
[12:06:12] 10SRE, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo) [12:06:14] (03CR) 10Jbond: [V: 03+2 C: 03+2] 6.2.7: release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658967 (owner: 10Jbond) [12:07:16] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jcrespo) [12:08:04] (03CR) 10ArielGlenn: [C: 03+2] include correct mysql client package for dumps on buster [puppet] - 10https://gerrit.wikimedia.org/r/658966 (owner: 10ArielGlenn) [12:09:24] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jcrespo) I hav estarted the decommissioning process oh helium and heze and marked the tracking tasks on the description. It may take a bit to process them as we need to do lots of cleanup (beyond the previously pro... [12:09:59] (03Merged) 10jenkins-bot: Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658814 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [12:10:22] (03PS1) 10Jbond: gradle.properties: add proxy settings back [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658968 [12:10:25] (03Merged) 10jenkins-bot: Improve matchbrackets performance when moving the cursor [extensions/CodeMirror] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658815 (https://phabricator.wikimedia.org/T270317) (owner: 10Thiemo Kreuz (WMDE)) [12:10:28] (03CR) 10Kormat: [C: 03+1] mariadb: Remove obsolete mariadb.server init script [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo) [12:11:38] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27690/console" [puppet] - 10https://gerrit.wikimedia.org/r/656460 
(https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:11:55] (03PS1) 10Jcrespo: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) [12:11:57] (03CR) 10Jbond: [V: 03+2 C: 03+2] gradle.properties: add proxy settings back [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658968 (owner: 10Jbond) [12:13:25] Wikibase has rogue changes in the wmf.27 directory. [12:14:11] o.O [12:14:24] Maybe just something about how the local patches were applied. [12:15:13] CFisch_WMDE: both of our backports should be live on mwdebug1001 [12:15:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:15:49] (03PS2) 10Jcrespo: bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) [12:16:16] awight: +1 Hard do test with the features still disabled :-D [12:16:57] (03CR) 10Jcrespo: "This is basically a rebase of https://gerrit.wikimedia.org/r/c/operations/puppet/+/621038 But I would like to cleanup db grants/old bacula" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:17:08] CFisch_WMDE: Yeah all I can prove to myself so far is that we haven't broken anything obvious. Next time we should enable on testwiki or something... Anyway, going ahead with deployment now. 
[12:17:35] good point [12:17:35] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:17:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13988 and previous config saved to /var/cache/conftool/dbconfig/20210127-121756-root.json [12:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:31] And at least it seems nothing broken so far. [12:18:37] :-) [12:19:11] !log awight@deploy1001 Synchronized php-1.36.0-wmf.28/extensions/CodeMirror: Backport: [[gerrit:658815|Improve matchbrackets performance when moving the cursor (T270317)]] (duration: 01m 14s) [12:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:14] (03CR) 10Jcrespo: "Moritz: do we keep the denylist, empty, or do we remove it in its entirety?" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:19:15] T270317: Optimization and limits for bracket matching - https://phabricator.wikimedia.org/T270317 [12:19:42] brb [12:20:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:20:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:42] !log awight@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/CodeMirror: Backport: [[gerrit:658814|Improve matchbrackets performance when moving the cursor (T270317)]] (duration: 01m 06s) [12:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:50] (03PS23) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:20:56] (03CR) 10Awight: [C: 03+2] "Config window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch) [12:21:58] (03Merged) 10jenkins-bot: Enable bracket matching on the first wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch) [12:22:09] (03CR) 10Muehlenhoff: bacula: Remove helium and heze references from puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:22:20] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:22:22] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27691/console" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:22:32] CFisch_WMDE: Bracket matching enabled on mwdebug1001, for e.g. dewiki [12:23:15] works! [12:24:43] like a charm [12:25:09] (03CR) 10Jcrespo: "Thanks for the comments, on some part of the docs it recommended using spare, will do as suggested." 
[puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:25:22] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:658594|Enable bracket matching on the first wikis (T270238)]] (duration: 01m 07s) [12:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:26] T270238: Enable bracket matching on the first wikis - https://phabricator.wikimedia.org/T270238 [12:25:56] !log EU bacon done [12:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:31:02] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:37] (03PS1) 10Jcrespo: mariadb: Remove m1 references to old database bacula, leave only bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717) [12:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After upgrading the kernel', diff saved to https://phabricator.wikimedia.org/P13989 and previous config saved to /var/cache/conftool/dbconfig/20210127-123300-root.json [12:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:54] (03PS3) 10Jcrespo: bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) [12:38:00] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:10] (03CR) 10jerkins-bot: [V: 04-1] bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:39:40] (03PS24) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:41:15] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:41:47] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27692/console" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:42:28] (03PS4) 10Jcrespo: bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) [12:43:33] (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [12:44:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/658628 (owner: 10Alexandros Kosiaris) [12:44:27] (03CR) 10Jcrespo: "Not super-urgent, but I likely will need coordination with you, Manuel, to drop bacula db (deploy, documentation check, patch review, etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717) (owner: 10Jcrespo) [12:44:33] (03PS3) 10Alexandros Kosiaris: Absent /etc/helmfile-defaults/service-proxy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/658628 [12:44:59] (03CR) 10VolkerE: [C: 03+1] "This is great, do we mind additionally changing the primary button to Accent50 blue background https://design.wikimedia.org/style-guide/co" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658760 (https://phabricator.wikimedia.org/T273023) (owner: 10Krinkle) [12:45:10] (03CR) 10Marostegui: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717) (owner: 10Jcrespo) [12:46:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:46:12] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:39] (03CR) 10Jcrespo: [C: 04-2] "I will deploy this as is unless someone has improvements, but will block on the other maintenance needed as blocker." 
[puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [12:50:14] (03CR) 10Alexandros Kosiaris: service proxy: Add apertium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658629 (owner: 10Alexandros Kosiaris) [12:50:32] (03PS2) 10Alexandros Kosiaris: similar-users, linkrecommendation: Switch to production [puppet] - 10https://gerrit.wikimedia.org/r/658630 (https://phabricator.wikimedia.org/T265603) [12:50:34] (03PS3) 10Alexandros Kosiaris: service proxy: Add apertium [puppet] - 10https://gerrit.wikimedia.org/r/658629 [13:02:07] (03PS1) 10Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 [13:05:22] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Dreamy_Jazz) Just to note that cyberbot I has been blocked because of the blanking issues on enwiki. The bot has also been blocked on two other wikis for the blanki... [13:05:59] 10SRE, 10DBA, 10User-Kormat: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822 (10jcrespo) This looks very related to T242571, but not merging because it is a topic very likely to evolve. 
[13:06:07] (03PS1) 10Marostegui: db1089: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658973 [13:06:48] (03CR) 10Marostegui: [C: 03+2] db1089: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658973 (owner: 10Marostegui) [13:07:16] (03CR) 10jerkins-bot: [V: 04-1] Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [13:08:09] (03PS1) 10Muehlenhoff: Enable CAS for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/658974 [13:13:46] 10SRE, 10DBA, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo) [13:18:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] similar-users, linkrecommendation: Switch to production [puppet] - 10https://gerrit.wikimedia.org/r/658630 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [13:18:56] 10SRE, 10DBA, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo) [13:19:38] (03PS1) 10Hnowlan: Add two LVS IPs for similar-users. [dns] - 10https://gerrit.wikimedia.org/r/658976 (https://phabricator.wikimedia.org/T268837) [13:19:59] !log swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837 [13:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:03] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [13:20:20] 10SRE, 10DBA, 10User-Kormat: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822 (10Kormat) 05Open→03Resolved a:03Kormat Resolving this as tendril is going away.
[13:20:25] 10SRE, 10DBA, 10Epic, 10User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (10Kormat) [13:20:28] (03PS1) 10ArielGlenn: add snapshot03 to deployment-prep mw installation targets [puppet] - 10https://gerrit.wikimedia.org/r/658977 [13:24:43] 10SRE, 10DBA, 10Epic, 10User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (10Kormat) [13:24:51] (03CR) 10ArielGlenn: [C: 03+2] add snapshot03 to deployment-prep mw installation targets [puppet] - 10https://gerrit.wikimedia.org/r/658977 (owner: 10ArielGlenn) [13:25:24] (03PS1) 10Alexandros Kosiaris: similar-users, linkrecommendation: Add discovery [dns] - 10https://gerrit.wikimedia.org/r/658980 (https://phabricator.wikimedia.org/T265603) [13:25:44] !log swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837 [13:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:48] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [13:26:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [13:28:14] (03PS1) 10ArielGlenn: add new instance snapshot03 to dumps scap targets in deployment-prep [dumps/scap] - 10https://gerrit.wikimedia.org/r/658981 [13:30:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] mariadb: Remove m1 references to old database bacula, leave only bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717) (owner: 10Jcrespo) [13:31:47] (03PS1) 10Jbond: 6.3.0: updated ready for 6.3.0 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 [13:31:54] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:48] (03PS1) 10Jbond: gradle: remove -tomcat (used for local testing) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658984 [13:33:34] (03CR) 10Jbond: [V: 03+2 C: 03+2] gradle: remove -tomcat (used for local testing) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658984 (owner: 10Jbond) [13:35:00] (03PS1) 10Kosta Harlan: linkrecommendation: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/658985 [13:35:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:36:22] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 287817984 and 158 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:36:36] (03PS2) 10Kosta Harlan: linkrecommendation: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/658985 [13:36:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/658974 (owner: 10Muehlenhoff) [13:37:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:38:43] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/658985 (owner: 10Kosta Harlan) [13:38:44] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:09] (03Merged) 10jenkins-bot: linkrecommendation: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/658985 (owner: 10Kosta Harlan) [13:40:15] (03PS2) 10Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 [13:42:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [13:43:12] uh :) [13:43:44] !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [13:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:39] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] add new instance snapshot03 to dumps scap targets in deployment-prep [dumps/scap] - 10https://gerrit.wikimedia.org/r/658981 (owner: 10ArielGlenn) [13:44:45] mmhh interesting, icinga should be paging shortly too I think ? [13:44:53] re: the librenms alert [13:46:29] there is a spike in both tx and rx in the last 20m [13:46:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:47:39] (03CR) 10jerkins-bot: [V: 04-1] Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [13:47:41] parsercache is lagging again in codfw [13:47:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [13:48:12] 10SRE, 10SRE-Access-Requests: Add user to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10elukey) @gmodena hi! 
This should follow https://wikitech.wikimedia.org/wiki/Production_access, and also https://wikitech.wikimedia.org/wiki/Analytics/Data_access IIUC ssh access is not ne... [13:48:53] !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [13:48:53] !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [13:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:55] volans: do you have the link handy for librenms? [13:51:06] https://librenms.wikimedia.org/graphs/device=162/type=device_bits/from=1611751500/legend=yes/popup_title=Device+Traffic/to=1611755100/?_token=jRhvEa5o1DIrGuxZrUmNTzkcO7mmyqeEGev27n0O [13:51:09] <3 [13:52:06] ok so we (analytics) are transferring data to the backup cluster, so I am wondering if it is cross row traffic between worker ndoes [13:52:08] (03PS3) 10Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 [13:52:09] *nodes [13:52:31] (03PS1) 10Ottomata: Add ppel to analytics-privatedata-users with no ssh access [puppet] - 10https://gerrit.wikimedia.org/r/658992 (https://phabricator.wikimedia.org/T271602) [13:52:47] !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [13:52:47] !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [13:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:14] elukey: does this look right? 
[13:53:14] https://gerrit.wikimedia.org/r/c/operations/puppet/+/658992/ [13:53:54] checking [13:54:11] 10SRE, 10SRE-Access-Requests: Add user to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10gmodena) Hi @elukey! ack: ssh access is not required, only superset. @sdkim was onboarded in Hadoop, and can already login in Superset and access a subset of the data exposed. However, he... [13:55:19] (03CR) 10Elukey: [C: 03+1] Add ppel to analytics-privatedata-users with no ssh access [puppet] - 10https://gerrit.wikimedia.org/r/658992 (https://phabricator.wikimedia.org/T271602) (owner: 10Ottomata) [13:55:40] godog: would be nice if the email sent by alerts could retain/format the description having one item per line (Device Name, Severity, etc...) [13:55:59] #featurereuqest [13:57:28] volans: afaics it depends on librenms and how it formats the "description" annotation specifically [13:57:42] did we track down the problem? Sorry didn't follow [13:57:54] other (shorter) annotations/labels are one line per item [13:58:42] jumping in a meeting now, but LMK if I can help [13:58:57] elukey: what you mentioned makes sense to me, I saw also another alert for access port util for dumpsdata [13:58:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:00:18] (03CR) 10jerkins-bot: [V: 04-1] Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [14:01:50] I don't recall if there is a way to see bw usage for single nodes [14:02:03] err single ports on a router [14:02:16] s/router/switch [14:02:21] today it is not a good day :D [14:02:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:03:06] elukey: from what host to what other are you transferring? [14:03:24] or is the cluster rebalancing itself [14:03:33] volans: it is difficult to say, it is between two clusters, the mappers can be anywhere on analytics10* or an-worker* nodes [14:03:46] I see that some of them in asw2-c are using close to 1Gbps [14:03:57] we are copying data for backup [14:04:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:09] I am checking https://librenms.wikimedia.org/device/162/ports [14:04:14] and grepping for an-worker [14:06:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:07:29] volans: I killed the data transfer, wasn't able to reach Joseph [14:07:35] so in theory bw usage should decrease [14:07:39] let's see if it was us [14:08:02] (03PS4) 10Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 [14:08:12] ok [14:08:28] no bueno, 10g nodes can push a lot of data if they can [14:08:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:08:43] for this data transfer we can control bw [14:08:59] but in general we can't for cross hadoop worker node tcp conns [14:09:07] at least, I don't know if it is possible :D [14:10:21] but I am a bit confused, we should have 40G (aggregated) between asw2-c to cr2-eqiad [14:10:21] I don't think it can be done in Hadoop/at the app level. About the only way I can think of is OS-level tc, and even that would be hard to implement. [14:10:43] and from the alerts I see that the port to cr2 was using close to 10G [14:10:46] or am I misreading? [14:12:06] volans: am I right or do we also have asw2-a alarming? 
[14:12:38] this is not great [14:12:54] elukey: we have a and c flapping between warnign and critical [14:13:17] if you click the alert you get the ports [14:13:34] xe-2/0/45; xe-7/0/44; on asw2-a-eqiad [14:13:51] yes yes [14:14:00] that are the links to crX routers [14:14:34] (03CR) 10jerkins-bot: [V: 04-1] Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [14:14:41] but we're doing just few Gbps [14:15:05] (03PS1) 10Urbanecm: arwiki: Configure wgGEHomepageManualAssignmentMentorsList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658996 (https://phabricator.wikimedia.org/T273060) [14:15:32] see https://netbox.wikimedia.org/dcim/devices/1112/#interface_ae1 [14:15:34] volans: I recall that we should have aggregate eth (4x10g links) between asw-x and cr1/cr2, but I am not sure what the severity of the alert is [14:16:00] yes yes, but can a single 10g of the group alert? [14:16:05] indeed, it seems like if the aggregation is not evenly distributing among them [14:16:31] we have 2 alarming [14:17:46] elukey: they're all 4 very high [14:17:50] should we call in people? This has the potential for a big issue [14:18:57] ah snap I forgot to do one thing to kill the jobs [14:18:59] lemme try [14:19:09] (03CR) 10Andrew Bogott: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [14:19:10] ack [14:19:11] but https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=50&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-backup-hadoop&var-worker=All [14:19:15] seems to match [14:19:21] it is the usage of the backup cluster [14:19:33] volans: can you check timings and see if it matches? 
[14:19:37] I'll kill the job in the meantime [14:19:53] elukey: ack [14:20:17] elukey: yes it matches with few minutes of delay [14:20:22] the time to ramp-up the copy [14:21:22] sigh [14:21:41] ok I found a way to kill the job, but it is taking a bit [14:21:50] try with -9 :D [14:21:52] the usage should decrease [14:21:53] ahahha yes [14:22:08] I see prometheus graph going down [14:22:11] so you did something [14:22:17] let's see the network usage [14:22:20] so before I killed only the client app that Joseph was using, not the map-reduce job on the cluster [14:22:23] my bad [14:22:37] 10SRE, 10SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10Aklapper) [14:22:46] (03PS3) 10Andrew Bogott: Neutron: forward our dmz hacks from version Stein to Train [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) [14:22:48] (03PS1) 10Andrew Bogott: Neutron: update the l3 files with Neutron Train upstream [puppet] - 10https://gerrit.wikimedia.org/r/658999 (https://phabricator.wikimedia.org/T261135) [14:23:46] (03CR) 10Andrew Bogott: [C: 04-1] "Hm, something went wrong, let me try again" [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [14:24:17] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Aklapper) p:05Medium→03Low Please don't change the priority value if you don't plan to work on fixing this - thanks a lot!
:) [14:25:20] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 40486352 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:26:28] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 740768 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:26:39] (03CR) 10JMeybohm: [C: 03+1] service proxy: Add apertium [puppet] - 10https://gerrit.wikimedia.org/r/658629 (owner: 10Alexandros Kosiaris) [14:26:53] (03PS5) 10Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 [14:27:15] in theory we should see recovery [14:27:30] elukey: confermed graph going down [14:27:31] thanks [14:27:57] elukey: FYI i added some extra stuff to your really great data access docs [14:27:57] (03PS4) 10Andrew Bogott: Neutron: forward our dmz hacks from version Stein to Train [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) [14:28:14] ottomata: <3 [14:28:15] to hopefully make it easier for requeting users to figure out what to ask for [14:28:16] https://wikitech.wikimedia.org/wiki/Analytics/Data_access [14:28:31] yes thanks a lot, I forgot to send the last version to the team for review :( [14:28:39] (03CR) 10Andrew Bogott: "ok, there, does that make more sense?" [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [14:28:49] (03PS1) 10Jayprakash12345: Add accountcreator user rights group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659000 (https://phabricator.wikimedia.org/T269067) [14:29:37] ottomata: this is great !https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request? [14:29:50] ya lemme know if i got that right! 
[14:30:10] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:30:34] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:42] ottomata: yep all good! [14:33:00] (03CR) 10jerkins-bot: [V: 04-1] Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [14:34:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10Ottomata) [14:34:46] (03CR) 10Ottomata: [C: 03+2] Add ppel to analytics-privatedata-users with no ssh access [puppet] - 10https://gerrit.wikimedia.org/r/658992 (https://phabricator.wikimedia.org/T271602) (owner: 10Ottomata) [14:36:26] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10Ottomata) @ppelberg, I've applied your access patch, please try again! Also, it seems that in {T223351} you were given a shell user name of `ppel`, not `ppelberg` (as originally requ... [14:43:20] (03CR) 10Jbond: "saw this fly past so did a quick pass" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [14:44:53] (03CR) 10Hnowlan: [C: 03+1] similar-users, linkrecommendation: Add discovery [dns] - 10https://gerrit.wikimedia.org/r/658980 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [14:50:30] elukey: re: earlier -- was the data transfer one single TCP flow? 
[14:51:06] cdanis: o/ it was a map-reduce job from the backup cluster, pulling in parallel from the "analytics" one [14:51:15] ah okay [14:51:22] if it was a job with many workers it is odd it didn't spread itself out [14:53:05] cdanis: it did very well [14:53:07] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) This is weird. I don't think we have encountered this before. ExecStop in the systemd unit file runs `ifdown ens5` but running that on the host returns ` root@kafka-test1006:... [14:53:07] saturating all of them [14:53:12] some more some less [14:53:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/658629 (owner: 10Alexandros Kosiaris) [14:54:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/658980 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [14:54:15] (03PS2) 10Jbond: 6.3.0: updated ready for 6.3.0 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 [14:54:52] volans: ah okay, I thought I read only one link was saturated, got it [14:55:09] * cdanis still on first ☕ [14:55:14] ttyl :D [14:55:49] cdanis: my bad since I thought initially that was one port the problem, but then more alarmed etc.. 
[14:56:03] some of the links were worst than others though [14:56:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow talking to the registry over HTTP [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [14:56:09] cdanis: some links in the lacp were saturated but not however some of the an-workers were pushing much more traffic then others so i think that this is just an artifact of the hashing algo [14:56:28] yah makes sense, pretty typical [14:56:38] 10SRE, 10SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10Ottomata) Approved! [14:58:35] yep I am very ignorant on the lacp part :( [14:58:48] (03PS1) 10Jbond: apereo: update tomcat proxy setting post 6.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/659004 [15:00:47] (03CR) 10Ottomata: [C: 03+2] Add snappy dependency to coal [puppet] - 10https://gerrit.wikimedia.org/r/658918 (https://phabricator.wikimedia.org/T273033) (owner: 10Gilles) [15:00:58] elukey: i would need to dig into the exact details but i think its simlar to ecmp in that it hashs (source, destination) tuples. i took a quick look and couldn't see anything related to port so it probably is just src, dst. and without digging further im not sure if src/dst would be layer-2 or layer-3 (allthough it makes little difference in reality) [15:01:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] "IMHO, this line from the manual isn't very clear. 
The Debian Mentors FAQ[1] elaborates on this more and makes an important recommendation" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/657218 (owner: 10Legoktm) [15:03:00] elukey: https://www.juniper.net/documentation/en_US/junos/topics/topic-map/load-balancing-aggregated-ethernet-interfaces.html [15:03:04] jbond42: yes yes got it, I also asked some explanation to Faidon, now I feel less ignorant, but I was not taking into consideration that part of the network config [15:03:16] my mental model was 40g router<->switches [15:03:26] without really considering leaf/spine, lacp, etc.. [15:03:27] elukey: avoid it for servers :P. That's my recommendation [15:03:42] ack [15:04:12] akosiaris: I am thinking about forming a leaf/spine set up with hadoop worker nodes, I'll send you the details [15:04:13] but otherwise, yeah it splits traffic per a hashing algorith (configurable) in the 2 interfaces :-) [15:04:28] 2 or more to be pedantic [15:04:34] * elukey nods [15:04:40] elukey: please do. You got me interested [15:05:07] (03PS2) 10Jbond: apereo: update tomcat proxy setting post 6.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/659004 [15:06:16] akosiaris: ahahahah [15:06:47] (03CR) 10Jbond: "I have tested this locally and all seems fine, will plan to install on idp-test once the CSS changes have been migrated to production" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 (owner: 10Jbond) [15:10:58] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Logs [15:12:58] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10CDanis) Can you please provide a complete dump of a "null response", with both the complete response headers and the raw response body? What is the HTTP status cod... 
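[editor's note on the LACP exchange at 14:56-15:04] The explanation given in channel is that an aggregated link picks a member interface by hashing each (source, destination) pair, so a few heavy senders can saturate some members while others stay lighter. The following is only a minimal sketch of that idea, not the switch's actual algorithm: the CRC32 hash, the host names, the per-worker rates, and the 4-member bundle are all assumptions made up for illustration.

```python
import zlib


def pick_link(src: str, dst: str, n_links: int) -> int:
    """Map a (src, dst) pair to one member link.

    The same pair always lands on the same link, which is why a single
    heavy flow cannot be spread across the bundle. CRC32 is a stand-in
    here; real hardware uses its own (configurable) hash.
    """
    key = f"{src}->{dst}".encode()
    return zlib.crc32(key) % n_links


def link_load(flows, n_links):
    """Sum the traffic of each flow onto the link its hash selects.

    flows: iterable of (src, dst, gbps) tuples.
    Returns a list with the total Gbps carried by each member link.
    """
    load = [0.0] * n_links
    for src, dst, gbps in flows:
        load[pick_link(src, dst, n_links)] += gbps
    return load


# Hypothetical workers pushing unequal amounts of traffic to one destination.
rates = [8, 1, 1, 6, 2, 1, 7, 1]
flows = [(f"worker{i}", "backup1", r) for i, r in enumerate(rates)]

load = link_load(flows, n_links=4)
for idx, gbps in enumerate(load):
    print(f"member link {idx}: {gbps:.1f} Gbps")
```

Because placement depends only on the hashed pair and not on how much each pair sends, the per-link totals are generally unequal whenever the senders are unequal, which matches the "some more some less" observation above.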
[15:13:47] oof, I thought we'd fixed the tls listener of rsyslog, clearly not [15:15:12] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1245 days) https://wikitech.wikimedia.org/wiki/Logs [15:15:20] !log bounce rsyslog on centrallog1001 [15:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:44] if it happens again I'll reintroduce the former remedy :( [15:15:48] (03PS1) 10David Caro: puppet: add puppetmaster retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 [15:15:50] (03PS1) 10David Caro: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 [15:17:34] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) @akosiaris not reliably, but today I rebooted the 4 schema VMs and one of them got back with the same issue.. [15:19:00] (03PS1) 10Ottomata: Migrat 5 NavigationTiming eventlogging streams on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659010 (https://phabricator.wikimedia.org/T271208) [15:22:35] (03PS2) 10Ottomata: Migrate 5 NavigationTiming eventlogging streams on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659010 (https://phabricator.wikimedia.org/T271208) [15:22:45] (03CR) 10jerkins-bot: [V: 04-1] remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [15:22:56] (03CR) 10jerkins-bot: [V: 04-1] puppet: add puppetmaster retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [15:24:52] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! 
In T273026#6780528, @akosiaris wrote: > This is weird. I don't think we have encountered this before. > > ExecStop in the systemd unit file runs `ifdown ens5` but... [15:25:08] (03PS3) 10Ottomata: Migrate 5 NavigationTiming eventlogging streams on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659010 (https://phabricator.wikimedia.org/T271208) [15:26:22] (03CR) 10Alexandros Kosiaris: Add support for php deployments (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [15:29:54] (03CR) 10Muehlenhoff: Add a ferm module to Spicerack (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [15:29:58] (03CR) 10Ottomata: [C: 03+2] Migrate 5 NavigationTiming eventlogging streams on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659010 (https://phabricator.wikimedia.org/T271208) (owner: 10Ottomata) [15:31:42] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate 5 NavigationTiming schemas to Event Platform on group0 and group1 - T271208 (duration: 01m 07s) [15:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:46] T271208: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 [15:31:56] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:25] (03CR) 10JMeybohm: Add support for php deployments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [15:38:13] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) >>! In T273026#6780640, @MoritzMuehlenhoff wrote: >>>! In T273026#6780528, @akosiaris wrote: >> This is weird. 
I don't think we have encountered this before. >> >> ExecStop in... [15:38:34] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:49] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra::single_instance: use dedicated hiera key, don't use 'cluster' [puppet] - 10https://gerrit.wikimedia.org/r/658572 (owner: 10Hnowlan) [15:39:10] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) I recall VMs only from my past experience, I encountered this problem a couple of times before this one. [15:41:05] (03CR) 10Hashar: [C: 03+1] "I wasn't even aware about that script :)" [puppet] - 10https://gerrit.wikimedia.org/r/658485 (owner: 10Legoktm) [15:42:13] !log umount /var/hadoop/data/r on an-worker1099 and restart hadoop daemons - T273034 [15:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:17] T273034: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 [15:45:05] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) >>! In T269160#6777382, @elukey wrote: > Waiting for @JMeybohm's greenli... [15:46:23] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10elukey) @Ottomata @razzi this is the first datanode disk failure after the change that I made to use facter to populate the available partitions that Yarn and HDFS can use on a given worker node. In... 
[15:47:19] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) >>! In T269160#6780685, @JMeybohm wrote: >>>! In T269160#6777382, @elukey... [15:49:36] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! In T273026#6780670, @akosiaris wrote: > Do you by any chance remember if it was on VMs only? Or was it physical hosts too? From my memory only VMs. I've checked my... [15:56:53] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) ` Error: pods is forbidden: User "eventstreams-internal" cannot list resou... [15:58:25] (03CR) 10Herron: alertmanager: add phalerts webhook receiver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [15:59:11] (03CR) 10Herron: alertmanager: add phalerts webhook receiver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:00:45] 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware, 10Patch-For-Review: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (10RobH) [16:01:23] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) You probably have not yet depoyed the admin part (the new namespace etc.... 
[16:02:04] 10ops-codfw, 10Data-Persistence-Backup, 10decommission-hardware, 10Patch-For-Review: decommission heze and heze-array1 - https://phabricator.wikimedia.org/T273051 (10jcrespo) [16:04:16] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) >>! In T269160#6780761, @JMeybohm wrote: > You probably have not yet deplo... [16:05:37] (03CR) 10Herron: [C: 03+1] alertmanager: route Icinga compat alerts to sink IRC channel [puppet] - 10https://gerrit.wikimedia.org/r/658919 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:06:45] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) Apart from you testing my attention again (kube_env admin [codfw|eqiad])... 
[16:11:40] (03PS1) 10Jbond: P:idp: Add hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/659016 [16:16:16] (03CR) 10Cwhite: [C: 03+1] alertmanager: route Icinga compat alerts to sink IRC channel [puppet] - 10https://gerrit.wikimedia.org/r/658919 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:17:06] (03PS2) 10Jbond: P:idp: Add hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/659016 [16:17:24] 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10klausman) [16:17:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27696/console" [puppet] - 10https://gerrit.wikimedia.org/r/659016 (owner: 10Jbond) [16:18:29] !log installing python-bottle security updates [16:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:33] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10klausman) [16:19:30] (03CR) 10Cwhite: "LGTM caveat the token in the private repo or some default for PCC" [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:20:16] (03CR) 10Cwhite: [C: 03+1] prometheus: add job for alertmanager::phab [puppet] - 10https://gerrit.wikimedia.org/r/658958 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:20:59] 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10MoritzMuehlenhoff) Please use row B and D and either of A or C for the third instance (the latter is fairly full, while the former have ample space). [16:21:57] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[16:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp: Add hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/659016 (owner: 10Jbond) [16:28:11] (03CR) 10Cwhite: [C: 03+1] "Alternatively, phid lookup can also be done through conduit: https://phabricator.wikimedia.org/conduit/method/project.query/" [puppet] - 10https://gerrit.wikimedia.org/r/658957 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:30:46] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:06] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:20] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:29] (03CR) 10Volans: [C: 04-1] "Thanks for the patch, couple of things missing:" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [16:38:53] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [16:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:07] !log elukey@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [16:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:35] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27697/console" [puppet] - 10https://gerrit.wikimedia.org/r/658958 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:50:57] (03CR) 10Volans: [C: 04-1] "Did a first pass, see a couple of comments inline." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [16:51:01] (03PS1) 10Mforns: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659022 (https://phabricator.wikimedia.org/T271208) [16:51:17] (03CR) 10Filippo Giunchedi: "Thanks for the reviews! The dummy token is now in private.git" [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:53:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27698/console" [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [16:54:38] !log elukey@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [16:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:23] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) es-internal deployed in both eqiad and codfw, next steps are: - test loca... [16:57:14] (03CR) 10Cwhite: [C: 03+1] alertmanager: add phalerts webhook receiver [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [17:00:08] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) @elukey [[ https://logstash.wikimedia.org/goto/b408da9f4b39f66a0d0980062... 
[17:01:30] 10SRE, 10CAS-SSO: SSO Portal: Fix "Remember me" checkbox alignment - https://phabricator.wikimedia.org/T273023 (10Legoktm) p:05Triage→03Low [17:03:32] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10Legoktm) p:05Triage→03Low [17:03:39] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:05:28] (03PS1) 10Jbond: idp: update idp profile to support ldaps or ldap starttls [puppet] - 10https://gerrit.wikimedia.org/r/659024 [17:06:10] (03CR) 10Alexandros Kosiaris: Add support for php deployments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [17:06:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27699/console" [puppet] - 10https://gerrit.wikimedia.org/r/659024 (owner: 10Jbond) [17:07:01] (03CR) 10jerkins-bot: [V: 04-1] idp: update idp profile to support ldaps or ldap starttls [puppet] - 10https://gerrit.wikimedia.org/r/659024 (owner: 10Jbond) [17:08:18] (03PS2) 10Jbond: idp: update idp profile to support ldaps or ldap starttls [puppet] - 10https://gerrit.wikimedia.org/r/659024 [17:09:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27700/console" [puppet] - 10https://gerrit.wikimedia.org/r/659024 (owner: 10Jbond) [17:10:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] idp: update idp profile to support ldaps or ldap starttls [puppet] - 10https://gerrit.wikimedia.org/r/659024 (owner: 10Jbond) [17:10:28] 10SRE, 10Analytics: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 
(10akosiaris) I 'll take your word for it. +1 on the cleanup thing. [17:10:51] godog: fyi merged pruv repo change [17:11:03] godog: fyi i merged your priv repo change [17:11:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:11:38] (03Abandoned) 10Jeena Huneidi: [WIP] Apply global helmfile after pull [puppet] - 10https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: 10Jeena Huneidi) [17:12:11] jbond42: cheers! keep forgetting :( [17:12:25] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:12:34] godog: no worries :) [17:13:41] (03PS1) 10Jbond: hiera: fix lookup [puppet] - 10https://gerrit.wikimedia.org/r/659027 [17:13:52] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:13:55] (03CR) 10Volans: [C: 04-1] "Did a first pass, this will need more thoughts and coordination with the current efforts towards non-root cumin." 
(034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [17:13:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: fix lookup [puppet] - 10https://gerrit.wikimedia.org/r/659027 (owner: 10Jbond) [17:16:08] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 254589792 and 150 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:16:09] jouncebot: next [17:16:10] In 1 hour(s) and 43 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900) [17:16:10] In 1 hour(s) and 43 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900) [17:18:04] 10SRE, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) @Pchelolo new instance today, sadly. [17:18:54] (03PS1) 10Jbond: hiera: use correct prefix [puppet] - 10https://gerrit.wikimedia.org/r/659030 [17:18:55] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[17:19:21] (03CR) 10Jbond: [C: 03+2] hiera: use correct prefix [puppet] - 10https://gerrit.wikimedia.org/r/659030 (owner: 10Jbond) [17:19:24] (03PS4) 10Dzahn: profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) [17:19:45] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: use correct prefix [puppet] - 10https://gerrit.wikimedia.org/r/659030 (owner: 10Jbond) [17:21:12] (03CR) 10Herron: [C: 03+1] alertmanager: add phalerts webhook receiver [puppet] - 10https://gerrit.wikimedia.org/r/658956 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [17:21:24] 10SRE: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993 (10hashar) + @elukey cause he seems to have done a few changes to our Archiva setup beside @Ottomata The issue is still present with Archiva 2.2.4 and it also happens for non snapshot release ( https://archiva.wikimedia... [17:21:35] (03CR) 10Herron: [C: 03+1] alertmanager: add receivers to create tasks from alerts [puppet] - 10https://gerrit.wikimedia.org/r/658957 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [17:21:39] (03CR) 10Herron: [C: 03+1] prometheus: add job for alertmanager::phab [puppet] - 10https://gerrit.wikimedia.org/r/658958 (https://phabricator.wikimedia.org/T272453) (owner: 10Filippo Giunchedi) [17:23:00] (03Abandoned) 10Herron: kibana: change backend naming from kibana-next to kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/654294 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [17:24:49] 10SRE: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar) [17:24:54] (03CR) 10Dzahn: [C: 04-1] "it's fine on most hosts but a few cases have " parameter 'domain_search' expects a String value, got Tuple"" [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 
10Dzahn) [17:25:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1268.eqiad.wmnet with reason: REIMAGE [17:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:32] (03Abandoned) 10Herron: dns: rename kibana-next.svc to kibana7.svc [dns] - 10https://gerrit.wikimedia.org/r/618140 (owner: 10Herron) [17:27:07] 10SRE, 10Analytics: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993 (10elukey) [17:27:36] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 379976 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:27:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1268.eqiad.wmnet with reason: REIMAGE [17:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1407.eqiad.wmnet with reason: REIMAGE [17:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:38] (03PS1) 10Jbond: hiera - ldap: add ldap global config for cloud [puppet] - 10https://gerrit.wikimedia.org/r/659033 [17:29:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1406.eqiad.wmnet with reason: REIMAGE [17:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:19] (03PS2) 10Mforns: Migrate WebUIActionsTracking schemas to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658426 (https://phabricator.wikimedia.org/T267347) [17:30:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1407.eqiad.wmnet with reason: REIMAGE [17:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 
on mw1406.eqiad.wmnet with reason: REIMAGE [17:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2301.codfw.wmnet with reason: REIMAGE [17:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2301.codfw.wmnet with reason: REIMAGE [17:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:27] (03CR) 10Jbond: profile::base: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:35:52] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:36:53] 10SRE, 10Dumps-Generation, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn) I've built th package and set up a test instance in deployment-prep, but there's issues with mediawiki scripts there; see T273089 for the details. 
[17:37:03] (03CR) 10Dzahn: profile::base: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:38:03] (03PS5) 10Dzahn: profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) [17:38:05] (03CR) 10Ahmon Dancy: [C: 03+1] Allow talking to the registry over HTTP [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [17:39:33] (03CR) 10jerkins-bot: [V: 04-1] profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:45:58] (03PS2) 10Mstyles: bump memory for flink processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 [17:46:29] (03PS6) 10Dzahn: profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) [17:46:49] (03PS3) 10Mstyles: bump memory for flink processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 [17:47:00] (03CR) 10Bstorm: [C: 03+1] "Looks fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [17:47:03] (03CR) 10Mstyles: bump memory for flink processes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 (owner: 10Mstyles) [17:48:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2215.codfw.wmnet with reason: REIMAGE [17:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:20] (03PS1) 10Bstorm: wikireplicas: open the proxies for the new ports [puppet] - 10https://gerrit.wikimedia.org/r/659041 (https://phabricator.wikimedia.org/T271476) [17:50:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2215.codfw.wmnet with reason: REIMAGE [17:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:36] (03CR) 10Bstorm: [C: 03+2] wmcs: Migrate hiera() to lookup() and set datatypes in nfs primary [puppet] - 10https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [17:51:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1407.eqiad.wmnet'] ` an... [17:52:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1406.eqiad.wmnet'] ` an... 
[17:53:04] (03PS5) 10Jcrespo: bacula: Remove helium and heze references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) [17:53:05] 10SRE, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) [17:53:06] (03PS1) 10Jcrespo: bacula: Undo conditionals added while transitioning helium->backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) [17:53:08] (03CR) 10Bstorm: "confirmed noop on labstore1004" [puppet] - 10https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [17:54:34] (03CR) 10Jcrespo: "This is WIP, a first approach of the most obvious things that will likely fail tests. Big early feedback is welcome of big gaps not mentio" [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [17:54:35] PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [17:54:54] (03CR) 10jerkins-bot: [V: 04-1] bacula: Undo conditionals added while transitioning helium->backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [17:55:16] (03CR) 10Bstorm: "Regardless of the eventual ingress, the firewall rule in this will not need to change, so this will allow testing to begin. 
I've given pub" [puppet] - 10https://gerrit.wikimedia.org/r/659041 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [17:55:40] (03CR) 10Jcrespo: bacula: Undo conditionals added while transitioning helium->backup1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [17:57:14] (03CR) 10Jcrespo: "My editor apparently lost its setup :-(." [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [17:59:53] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 241680392 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:00:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2301.codfw.wmnet'] ` an... 
[18:01:03] (03PS1) 10Dzahn: etcd::replication: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) [18:01:17] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 593456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:02:33] (03CR) 10jerkins-bot: [V: 04-1] etcd::replication: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:03:38] (03PS2) 10Dzahn: etcd::replication: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) [18:03:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1406.eqiad.wmnet [18:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2301.codfw.wmnet [18:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1407.eqiad.wmnet [18:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:05:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2216.codfw.wmnet with reason: REIMAGE [18:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1407.eqiad.wmnet [18:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:23] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:07:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2216.codfw.wmnet with reason: REIMAGE [18:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:58] (03PS2) 10Jcrespo: bacula: Undo conditionals added while transitioning helium->backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/659046 (https://phabricator.wikimedia.org/T238048) [18:08:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:10:22] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1406.eqiad.wmnet [18:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:13:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2301.codfw.wmnet [18:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:06] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:14:48] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27703/" [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:15:08] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@e24f319]: Re-deploying ArcLamp to webperf1002 [18:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:13] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@e24f319]: Re-deploying ArcLamp to webperf1002 (duration: 00m 05s) [18:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:39] (03CR) 10Dzahn: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/658969 (https://phabricator.wikimedia.org/T273049) (owner: 10Jcrespo) [18:17:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:17:48] (03CR) 10Dzahn: [C: 03+2] scap: add deploy1002 and deploy2002 to mediawiki hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [18:18:55] 10SRE, 10Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (10jcrespo) Apparently, there is the following code on backup::set: ` $motd_content = "#!/bin/sh\necho \"Backed up on this host: ${name}\"" @motd::scrip... [18:19:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:20:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1268.eqiad.wmnet'] ` an... [18:21:50] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1268.eqiad.wmnet [18:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:46] (03PS1) 10Dzahn: parsoid::testreduce: let envoy listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/659051 (https://phabricator.wikimedia.org/T266509) [18:23:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1268.eqiad.wmnet [18:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:00] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:25:05] !log hashar@deploy1001 Started deploy [integration/docroot@da43ad4]: Add Shellbox to doc.wm.o , misc build related changes fdf0917..da43ad4 [18:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:15] !log hashar@deploy1001 Finished deploy [integration/docroot@da43ad4]: Add Shellbox to doc.wm.o , misc build related changes fdf0917..da43ad4 (duration: 00m 10s) [18:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:22] !log hashar@deploy1001 Started deploy [integration/docroot@da43ad4]: Add Shellbox to doc.wm.o , misc build related changes fdf0917..da43ad4 [18:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:31] !log hashar@deploy1001 Finished deploy [integration/docroot@da43ad4]: Add Shellbox to doc.wm.o , misc build related changes fdf0917..da43ad4 (duration: 00m 07s) [18:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:43] not sure why there are dupes bah [18:26:01] (03PS1) 10Bstorm: cloud-maps-proxy: small change in the reply messages of maps proxy config [puppet] - 10https://gerrit.wikimedia.org/r/659053 [18:26:19] (03CR) 10Dzahn: [C: 03+2] parsoid::testreduce: let envoy listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/659051 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [18:26:43] (03CR) 10BryanDavis: [C: 03+1] cloud-maps-proxy: small change in the reply messages of maps proxy config [puppet] - 10https://gerrit.wikimedia.org/r/659053 (owner: 10Bstorm) [18:26:44] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) Hi @elukey - thanks for the mapping. What makes it tough is that the remaining 6x hosts need to be on 10g switches, which really limits our op... [18:27:23] Can someone lookup the full stack trace for T273094 please? 
[18:27:23] T273094: Uncaught SyntaxError: Unexpected identifier - CentralAuthLogin throwing SyntaxError - https://phabricator.wikimedia.org/T273094 [18:27:43] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:56] (03CR) 10Bstorm: [C: 03+2] cloud-maps-proxy: small change in the reply messages of maps proxy config [puppet] - 10https://gerrit.wikimedia.org/r/659053 (owner: 10Bstorm) [18:28:15] (03CR) 10Jeena Huneidi: "correction: pipeline meeting, not repo" [puppet] - 10https://gerrit.wikimedia.org/r/658750 (https://phabricator.wikimedia.org/T214158) (owner: 10Jeena Huneidi) [18:30:05] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:30:20] (03CR) 10Jeena Huneidi: [C: 03+1] bump memory for flink processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 (owner: 10Mstyles) [18:30:48] !log Creating the table securepoll_log in votewiki and testwiki (T271270) [18:30:51] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:53] T271270: Create new logging table in SecurePoll - https://phabricator.wikimedia.org/T271270 [18:31:15] 10SRE, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) [18:32:17] PROBLEM - cassandra CQL 10.64.32.8:9042 on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:34:35] PROBLEM - cassandra service on maps1009 is CRITICAL: CRITICAL - Unit cassandra is active but 
reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:35:39] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:13] RECOVERY - mediawiki-installation DSH group on deploy1002 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:36:51] PROBLEM - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [18:37:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2217.codfw.wmnet with reason: REIMAGE [18:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2217.codfw.wmnet with reason: REIMAGE [18:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:55] (03PS1) 10Jcrespo: bacula: Fix bug on realizing backup sets on motd [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) [18:40:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2218.codfw.wmnet with reason: REIMAGE [18:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:25] (03CR) 10Jcrespo: "I think this was a bug, but please clarify if it was weirdly disabled for some reason." 
[puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: 10Jcrespo) [18:41:49] (03CR) 10jerkins-bot: [V: 04-1] bacula: Fix bug on realizing backup sets on motd [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: 10Jcrespo) [18:42:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2218.codfw.wmnet with reason: REIMAGE [18:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2219.codfw.wmnet with reason: REIMAGE [18:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:04] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2215.codfw.wmnet'] ` an... 
[18:45:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2219.codfw.wmnet with reason: REIMAGE [18:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1263.eqiad.wmnet with reason: REIMAGE [18:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:47:43] (03PS2) 10Jcrespo: bacula: Fix bug on realizing backup sets on motd [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) [18:48:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:49:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1263.eqiad.wmnet with reason: REIMAGE [18:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:33] (03CR) 10jerkins-bot: [V: 04-1] bacula: Fix bug on realizing backup sets on motd [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: 10Jcrespo) [18:50:14] !log testreduce1001 - making nginx listen on IPv6 and restarting it T266509 [18:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:17] T266509: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 [18:53:05] (03CR) 10Bstorm: "Looks good https://puppet-compiler.wmflabs.org/compiler1003/27704/" [puppet] - 10https://gerrit.wikimedia.org/r/659041 
(https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [18:53:18] (03PS1) 10Dzahn: parsoid/testing: let nginx also listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/659058 (https://phabricator.wikimedia.org/T266509) [18:53:21] (03CR) 10Bstorm: [C: 03+2] wikireplicas: open the proxies for the new ports [puppet] - 10https://gerrit.wikimedia.org/r/659041 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [18:53:22] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2215.codfw.wmnet [18:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:27] (03PS3) 10Jcrespo: bacula: Fix bug on realizing backup sets on motd [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) [18:55:26] (03CR) 10jerkins-bot: [V: 04-1] bacula: Fix bug on realizing backup sets on motd [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: 10Jcrespo) [18:57:53] jouncebot: next [18:57:53] In 0 hour(s) and 2 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900) [18:57:53] In 0 hour(s) and 2 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900) [18:58:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2216.codfw.wmnet'] ` an... [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I οΏ½ Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900). [19:00:05] Jayprakash12345 and mforns: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:05] dancy and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T1900). [19:00:16] I can deploy today! [19:00:22] heya, I'm here [19:00:43] hi mforns [19:00:53] (03CR) 10Urbanecm: [C: 03+2] Migrate WebUIActionsTracking schemas to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658426 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns) [19:03:31] (03Merged) 10jenkins-bot: Migrate WebUIActionsTracking schemas to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658426 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns) [19:04:51] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:05:15] mforns: can you test the first patch at mwdebug1001, please? [19:05:22] yes Urbanecm [19:05:28] thanks, let me know how it goes [19:06:19] Urbanecm: tested, looks good! 
[19:06:24] thanks, syncing [19:06:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2215.codfw.wmnet [19:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:22] mutante: ftr, I'm scap'ing now [19:07:54] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9382a9879bd6823fd664c0d3721fd0a9dc0d56d8: Migrate WebUIActionsTracking schemas to Event Platform on all wikis (T267347,T271164) (duration: 01m 03s) [19:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:58] T267347: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 [19:07:58] T271164: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 [19:08:35] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:08:49] (03PS2) 10Urbanecm: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659022 (https://phabricator.wikimedia.org/T271208) (owner: 10Mforns) [19:08:59] (03CR) 10Urbanecm: [C: 03+2] Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659022 (https://phabricator.wikimedia.org/T271208) (owner: 10Mforns) [19:09:00] Urbanecm: new thing is that deploy1002/deploy2002 should be part of the scap [19:09:18] they were not previously? [19:09:22] no [19:09:22] that sounds weird [19:09:37] it's not, they are new [19:09:43] aha [19:10:01] will we have two deploy servers? or will deploy1001 be eventually removed? 
[19:10:18] deploy1001/deploy2001 will be removed eventually [19:10:25] this is just about stretch->buster [19:10:30] ah, got it [19:10:40] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) I've added additional logging data to the framework. ` Date/Time: Wed, 27 Jan 2021 19:08:52 +0000 Method: GET URL: https://en.wikipedia.org/w/api.p... [19:10:42] (03Merged) 10jenkins-bot: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659022 (https://phabricator.wikimedia.org/T271208) (owner: 10Mforns) [19:11:02] mforns: pulled onto mwdebug1001, please test [19:11:09] Urbanecm: ack [19:13:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:15:25] Urbanecm: 4 of the 6 schemas went through, I think I can not see the other ones because of low throughput, but it seems that the patch overall works! [19:15:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:15:33] good, I'll sync [19:17:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cabb2e2009f97bb86c1b8827c3cc61cc991c41a9: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki (T271208) (duration: 01m 00s) [19:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:04] T271208: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 [19:17:15] (03CR) 10Dzahn: [C: 03+2] parsoid/testing: let nginx also listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/659058 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [19:17:19] mforns: done. Anything else? [19:17:34] no :] that's all, thanks Urbanecm [19:17:40] no problem :) [19:17:58] (03PS2) 10Urbanecm: arwiki: Configure wgGEHomepageManualAssignmentMentorsList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658996 (https://phabricator.wikimedia.org/T273060) [19:18:02] (03CR) 10Urbanecm: [C: 03+2] arwiki: Configure wgGEHomepageManualAssignmentMentorsList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658996 (https://phabricator.wikimedia.org/T273060) (owner: 10Urbanecm) [19:18:53] (03Merged) 10jenkins-bot: arwiki: Configure wgGEHomepageManualAssignmentMentorsList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658996 (https://phabricator.wikimedia.org/T273060) (owner: 10Urbanecm) [19:19:25] (03CR) 10Mstyles: [C: 03+2] bump memory for flink processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 (owner: 10Mstyles) [19:19:37] !log reboot an-launcher1002 for kernel upgrades [19:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:20:08] Hi, I have a patch for deployment in Morning backport window. [19:20:18] It is closed? [19:20:34] Jayprakash12345: not yet [19:21:22] (03Merged) 10jenkins-bot: bump memory for flink processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 (owner: 10Mstyles) [19:21:38] I'll ping you when ready Jayprakash12345 [19:21:56] Urbanecm: Okay :) [19:22:45] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 53419ab6c0f2c306a68edb8979106bd42536211a: arwiki: Configure wgGEHomepageManualAssignmentMentorsList (T273060) (duration: 00m 59s) [19:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:50] T273060: Configure wgGEHomepageManualAssignmentMentorsList for ar.wikipedia.org - https://phabricator.wikimedia.org/T273060 [19:23:11] (03CR) 10Urbanecm: [C: 04-1] Add accountcreator user rights group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659000 (https://phabricator.wikimedia.org/T269067) (owner: 10Jayprakash12345) [19:23:19] ^^ Jayprakash12345 ^^ [19:24:57] RECOVERY - Check no envoy runtime configuration is left persistent on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [19:26:34] Urbanecm: Yes, It is already exist. Checked at https://mr.wikisource.org/w/index.php?title=Special:UserList&group=accountcreator. [19:26:46] Thanks for pointing out. 
[19:26:51] Jayprakash12345: I mean, it doesn't make sense to redefine it :) [19:28:02] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2217.codfw.wmnet'] ` an... [19:28:29] (03Abandoned) 10Jayprakash12345: Add accountcreator user rights group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659000 (https://phabricator.wikimedia.org/T269067) (owner: 10Jayprakash12345) [19:30:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2218.codfw.wmnet'] ` an... [19:30:06] (03PS4) 10Jcrespo: bacula: Fix bug on realizing backup sets on motd [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) [19:30:31] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:33] Jayprakash12345: anything else? [19:31:48] Urbanecm: Not for deployment, just wait for your comment on the task https://phabricator.wikimedia.org/T269067. I asked a question there. [19:31:55] will have a look Jayprakash12345 [19:32:36] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2219.codfw.wmnet'] ` an... 
[19:34:10] (03PS1) 10Krinkle: objectcache: fix broken for loop in RedisBagOStuff::doSetMulti() [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658939 (https://phabricator.wikimedia.org/T273006) [19:36:23] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1263.eqiad.wmnet'] ` an... [19:40:11] (03CR) 10Jcrespo: "I think this was disabled, probably not on purpose, by a refactoring by Faidon 7 years ago (according to blame). CCing him, please shout i" [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: 10Jcrespo) [19:40:35] (03PS1) 10Bstorm: wikireplicas: open the firewall for multiinstance databases [homer/public] - 10https://gerrit.wikimedia.org/r/659070 (https://phabricator.wikimedia.org/T271476) [19:44:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2221.codfw.wmnet with reason: REIMAGE [19:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:49] (03CR) 10Jcrespo: "Seems to work: https://puppet-compiler.wmflabs.org/compiler1001/27705/" [puppet] - 10https://gerrit.wikimedia.org/r/659056 (https://phabricator.wikimedia.org/T272686) (owner: 10Jcrespo) [19:46:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2221.codfw.wmnet with reason: REIMAGE [19:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:30] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "compiled on everything - noop, the special cases are already 404 in prod catalog: 
https://puppet-compiler.wmflabs.org/compiler1002/27702/" [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:50:50] Amir1: ^ base :p [19:51:37] "based" [19:54:34] (03CR) 10Nskaggs: [C: 03+1] "While this might not be the long term solution, pending https://phabricator.wikimedia.org/T267376, I'm in support of ensuring we can condu" [homer/public] - 10https://gerrit.wikimedia.org/r/659070 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [19:54:52] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27703/" [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:54:55] (03CR) 10Dzahn: [V: 03+1 C: 03+2] etcd::replication: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:56:58] (03CR) 10Dzahn: "noop conf1005,conf2003" [puppet] - 10https://gerrit.wikimedia.org/r/659048 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:00:05] dancy and brennen: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T2000). [20:01:11] (03PS1) 10Dzahn: base::certificates: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659071 (https://phabricator.wikimedia.org/T209953) [20:01:23] (03PS1) 10Urbanecm: Undeploy cswiki birthday logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659072 [20:01:38] dancy: brennen: mind me shipping the above? [20:02:49] (03CR) 10jerkins-bot: [V: 04-1] base::certificates: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659071 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:02:59] Urbanecm: go ahead. [20:03:03] thank you! 
[20:03:11] (03CR) 10Urbanecm: [C: 03+2] Undeploy cswiki birthday logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659072 (owner: 10Urbanecm) [20:03:56] (03PS1) 10Dzahn: tlsproxy::prometheus: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) [20:04:08] (03Merged) 10jenkins-bot: Undeploy cswiki birthday logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659072 (owner: 10Urbanecm) [20:04:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1263.eqiad.wmnet [20:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2219.codfw.wmnet [20:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2218.codfw.wmnet [20:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:41] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::prometheus: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:06:21] !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: 6c5dd65e6138eb32db8059720a2149d4728763e7: Undeploy cswiki birthday logo (duration: 01m 06s) [20:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:40] !log urbanecm@deploy1001 Synchronized logos/config.yaml: 6c5dd65e6138eb32db8059720a2149d4728763e7: Undeploy cswiki birthday logo (duration: 01m 05s) [20:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:47] dancy: all done, thanks again :) [20:07:55] np [20:09:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2216.codfw.wmnet [20:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:23] !log 
dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1263.eqiad.wmnet [20:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:13:08] (03CR) 10Ahmon Dancy: [C: 03+2] objectcache: fix broken for loop in RedisBagOStuff::doSetMulti() [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658939 (https://phabricator.wikimedia.org/T273006) (owner: 10Krinkle) [20:13:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2219.codfw.wmnet [20:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:23] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:16:33] PROBLEM - Memcached on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:18:41] RECOVERY - Memcached on mwdebug2001 is OK: TCP OK - 0.032 second response time on 10.192.0.98 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:18:55] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2218.codfw.wmnet [20:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:57] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:25:19] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2216.codfw.wmnet [20:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1405.eqiad.wmnet with reason: REIMAGE [20:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:55] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:59] !log 1.36.0-wmf.28 (T271342): taking over train while dancy is afk; waiting on [[gerrit:658939]] to merge and will sync for verification on testwikis [20:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:02] T271342: 1.36.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T271342 [20:31:11] !log 
dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1405.eqiad.wmnet with reason: REIMAGE [20:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:38] (03PS2) 10Dzahn: base::certificates: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659071 (https://phabricator.wikimedia.org/T209953) [20:31:58] (03PS2) 10Dzahn: tlsproxy::prometheus: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) [20:32:59] (03CR) 10Cwhite: profile: update netdev to output ECS-formatted logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:33:35] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::prometheus: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:34:48] (03CR) 10Cwhite: [C: 03+2] profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:35:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2221.codfw.wmnet'] ` an... 
[20:35:20] (03PS1) 10Dzahn: monitoring::service: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659075 [20:37:27] (03CR) 10jerkins-bot: [V: 04-1] monitoring::service: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659075 (owner: 10Dzahn) [20:37:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2221.codfw.wmnet [20:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:32] (03PS1) 10Jdlrobson: Enable language in header on beta cluster for QA purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659077 (https://phabricator.wikimedia.org/T260738) [20:40:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2221.codfw.wmnet [20:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:14] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2217.codfw.wmnet [20:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:45] PROBLEM - mediawiki-installation DSH group on mw2217 is CRITICAL: Host mw2217 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:43:57] (03PS1) 10Volans: mypy: temporary force upper version [software/spicerack] - 10https://gerrit.wikimedia.org/r/659078 [20:43:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2217.codfw.wmnet [20:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:03] (03Merged) 10jenkins-bot: objectcache: fix broken for loop in RedisBagOStuff::doSetMulti() [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658939 (https://phabricator.wikimedia.org/T273006) (owner: 10Krinkle) [20:44:15] PROBLEM - Memcached on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:44:37] !log dzahn@cumin1001 
conftool action : set/pooled=yes; selector: name=mw2299.codfw.wmnet [20:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2222.codfw.wmnet with reason: REIMAGE [20:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:46] (03PS1) 10Jdlrobson: Disable max-width on page namespace for wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659079 (https://phabricator.wikimedia.org/T260091) [20:47:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2222.codfw.wmnet with reason: REIMAGE [20:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:36] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10Dzahn) @klausman Could you add the new cluster prefixes for ml (ml-etcd and others) to https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers ? that would be nice, thank you! [20:51:52] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1405.eqiad.wmnet'] ` an... [20:53:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2246.codfw.wmnet with reason: REIMAGE [20:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:31] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock all pending CRs. I'll revert the temporary fix once upstream has fixed the issue." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/659078 (owner: 10Volans) [20:55:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2246.codfw.wmnet with reason: REIMAGE [20:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:28] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10Dzahn) a:03Dzahn [20:58:49] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.28/includes/libs/objectcache/RedisBagOStuff.php: Backport: [[gerrit:658780|objectcache: fix broken for loop in RedisBagOStuff::doSetMulti() (T273006)]] (duration: 01m 07s) [20:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:56] T273006: MediumSpecificBagOStuff.php Undefined offset errors - https://phabricator.wikimedia.org/T273006 [21:00:01] Krinkle, AaronSchulz: patch for above is currently on testwikis. [21:00:04] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210127T2100). 
[21:00:10] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [21:00:59] (03Merged) 10jenkins-bot: mypy: temporary force upper version [software/spicerack] - 10https://gerrit.wikimedia.org/r/659078 (owner: 10Volans) [21:05:10] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10RobH) [21:05:18] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10RobH) [21:05:28] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10RobH) [21:05:30] 10SRE, 10ops-codfw, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10RobH) [21:08:18] (03PS6) 10Volans: icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond) [21:08:32] (03PS6) 10Volans: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [21:08:44] (03PS2) 10Volans: puppet: add puppetmaster retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [21:08:58] (03PS2) 10Volans: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [21:09:18] (03PS2) 10Volans: (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - 10https://gerrit.wikimedia.org/r/658626 (owner: 10Jbond) [21:09:45] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@1c9d487]: airflow: hourly tasks must wait for yesterdays daily tank [21:09:45] !log ebernhardson@deploy1001 deploy aborted: airflow: hourly tasks must wait for yesterdays daily tank (duration: 00m 00s) 
[21:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:47] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@1c9d487]: airflow: hourly tasks must wait for yesterdays daily task [21:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:39] brennen: hey, what's the status of train? It would be great to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/655148/ into both wmf.28/wmf.27 if possible :) [21:15:04] (03CR) 10jerkins-bot: [V: 04-1] (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - 10https://gerrit.wikimedia.org/r/658626 (owner: 10Jbond) [21:15:06] (03CR) 10jerkins-bot: [V: 04-1] Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [21:15:11] (03CR) 10jerkins-bot: [V: 04-1] puppet: add puppetmaster retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [21:15:20] Urbanecm: awaiting validation of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/658780 [21:15:34] ack :) [21:16:36] 10SRE, 10SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10Legoktm) I believe we just need approval from @sdkim's manager now. (Also in the future please use the form at https://wikitech.wikimedia.org/wiki/Production_access#Filing_the_request wh... [21:16:42] (03CR) 10jerkins-bot: [V: 04-1] remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [21:17:11] (03CR) 10Urbanecm: "This change is ready for review." 
[core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658940 (https://phabricator.wikimedia.org/T271551) (owner: 10Urbanecm) [21:17:36] (03PS1) 10Urbanecm: Fix fetching ipblock-exempt within BlockManager::getUserBlock [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658941 (https://phabricator.wikimedia.org/T271551) [21:17:41] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@1c9d487]: airflow: hourly tasks must wait for yesterdays daily task (duration: 07m 54s) [21:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:47] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the addition!" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond) [21:19:49] 10SRE, 10SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10Legoktm) [21:20:54] 10SRE, 10SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10Legoktm) p:05Triage→03Medium [21:21:03] 10SRE, 10SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10Legoktm) a:03sdkim @sdkim: You will also need to agree to {L3}.
[21:21:59] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10Legoktm) a:05ppelberg→03Ottomata [21:23:41] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Legoktm) [21:23:53] thanks legoktm [21:24:06] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:24:28] :) [21:27:43] 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10Legoktm) p:05Triage→03Medium [21:28:49] 10SRE, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10Legoktm) p:05Triage→03Low [21:29:05] 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10akosiaris) LGTM. Docs for proceeding with the creation at https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM [21:30:18] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:36:25] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2222.codfw.wmnet'] ` an... [21:36:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:37:02] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:15] 10SRE, 10SRE-Access-Requests: Add sdkim to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10Ottomata) @Legoktm FYI, Seve does not need ssh access (but can certainly have it if he wants it!). The user in data.yaml can have an empty array for ssh_keys, e.g. `ssh_keys: []`. https... [21:37:40] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2246.codfw.wmnet'] ` an... [21:38:16] ottomata: so is L3 only needed if they're getting ssh access? I read that wiki page but it wasn't obvious to me [21:38:22] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10Dzahn) a:05Dzahn→03None [21:38:54] 10SRE, 10DNS, 10Mail, 10Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (10pkang) @drochford hey david, based on Andre's last comment, would the team be open to have the emails be sent from a subdomain like @zendesk.wikimedia.org? [21:40:42] Hmm, legoktm good q. I'd say they probably should read that, for the Handling sensitive data bit. but do they need to sign it? hm.
[21:41:01] this is a relatively new support for us, to be able to grant access to some of this data without ssh login [21:41:26] will send email to you and moritz and luca asking [21:41:39] ok, thanks [21:41:53] since we're still waiting on manager approval I don't think it'll slow anything down in the meantime [21:44:58] k [21:46:06] RECOVERY - mediawiki-installation DSH group on mw2217 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:48:52] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ae24e12]: repoint ores thresholds to yesterday [21:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:40] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:51:12] ^^ looking [21:51:15] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ae24e12]: repoint ores thresholds to yesterday (duration: 02m 23s) [21:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:09] (03PS1) 10Dzahn: base: adjust data type for $debdeploy_filter_services [puppet] - 10https://gerrit.wikimedia.org/r/659084 [21:52:34] (03PS1) 10Effie Mouzeli: WIP: memcached: enable the use of unix socket in memcached [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) [21:53:02] (03CR) 10jerkins-bot: [V: 04-1] WIP: memcached: enable the use of unix socket in memcached [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [21:53:37] (03CR) 10jerkins-bot: [V: 04-1] base: adjust data type for $debdeploy_filter_services [puppet] - 10https://gerrit.wikimedia.org/r/659084 (owner: 10Dzahn) [21:54:17] (03PS1) 10Gergő Tisza: Fix BaseModule::BASE_CSS_CLASS visibility [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) -
10https://gerrit.wikimedia.org/r/658943 (https://phabricator.wikimedia.org/T273099) [21:55:25] (03PS1) 10Ahmon Dancy: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659086 [21:55:27] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659086 (owner: 10Ahmon Dancy) [21:56:12] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659086 (owner: 10Ahmon Dancy) [21:57:12] brennen / dancy: the patch above fixes a probably cause of logspam / minor breakage on group1. I can deploy it in the backport window, if it doesn't fit into the train schedule, it doesn't affect that many users. [21:57:14] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:57:43] tgr_: we're just now rolling forward to group0 [21:57:48] !log dancy@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.28 [21:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:05] (03PS1) 10Andrew Bogott: base.pp: move value_type and merge behavior into the options hash [puppet] - 10https://gerrit.wikimedia.org/r/659087 [21:58:08] tgr_: Thanks for the offer! [21:58:22] tgr_, dancy: i do notice a few of those BASE_CSS_CLASS ones - maybe it would be good to have that before we go forward to group1? [21:58:29] if it seems likely to blow up... [21:58:57] Agreed. It's unclear what effect it has. A ticket for that got filed during log triage I think [21:58:59] (03PS2) 10Andrew Bogott: base.pp: move value_type and merge behavior into the options hash [puppet] - 10https://gerrit.wikimedia.org/r/659087 [21:59:18] yeah. 
https://phabricator.wikimedia.org/T273099 [21:59:20] (03PS1) 10Dzahn: base: remove Hash data type for $debdeploy_filter_services [puppet] - 10https://gerrit.wikimedia.org/r/659088 [21:59:35] It's a trivial fix, should be safe to backport. Most wikis using the feature are on group2 so up to you. [22:00:12] OK. I see the backport cherry pick: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/658943/ [22:00:40] (keeping one eye on things, but i've a meeting for the next ~hour) [22:01:01] (03CR) 10jerkins-bot: [V: 04-1] base: remove Hash data type for $debdeploy_filter_services [puppet] - 10https://gerrit.wikimedia.org/r/659088 (owner: 10Dzahn) [22:01:18] (03PS2) 10Effie Mouzeli: WIP: memcached: enable the use of unix socket in memcached [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) [22:01:25] (03CR) 10Kosta Harlan: [C: 03+1] Fix BaseModule::BASE_CSS_CLASS visibility [extensions/GrowthExperiments] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658943 (https://phabricator.wikimedia.org/T273099) (owner: 10Gergő Tisza) [22:01:54] tgr_: Go ahead and deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/658943 during your window. I'll watch logs as well [22:01:56] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [22:02:08] ack [22:04:08] you mean ignore it for the moment, right? The backport window is two hours from now. [22:05:03] Yeah. That should be ok. [22:06:06] That said, it doesn't look like it would conflict with anything to do right now.
[22:07:34] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:11:24] (03CR) 10Andrew Bogott: [C: 03+2] base.pp: move value_type and merge behavior into the options hash [puppet] - 10https://gerrit.wikimedia.org/r/659087 (owner: 10Andrew Bogott) [22:13:32] (03PS3) 10Effie Mouzeli: WIP: memcached: enable the use of unix socket in memcached [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) [22:14:18] (03CR) 10jerkins-bot: [V: 04-1] WIP: memcached: enable the use of unix socket in memcached [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [22:16:14] (03Abandoned) 10Dzahn: base: remove Hash data type for $debdeploy_filter_services [puppet] - 10https://gerrit.wikimedia.org/r/659088 (owner: 10Dzahn) [22:16:34] (03Abandoned) 10Dzahn: base: adjust data type for $debdeploy_filter_services [puppet] - 10https://gerrit.wikimedia.org/r/659084 (owner: 10Dzahn) [22:16:52] (03PS4) 10Effie Mouzeli: WIP: memcached: enable the use of unix socket in memcached [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) [22:19:23] (03PS3) 10Dzahn: tlsproxy::prometheus: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) [22:20:56] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::prometheus: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:23:20] (03PS4) 10Dzahn: tlsproxy::prometheus: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659073 (https://phabricator.wikimedia.org/T209953) [22:26:49] (03PS1) 10Ahmon Dancy: objectcache: return false during more error cases in RedisBagOStuff::*Multi() 
methods [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658945 [22:31:26] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:32:01] (03PS2) 10Legoktm: Allow talking to the registry over HTTP [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) [22:32:08] (03CR) 10Legoktm: [C: 03+2] Allow talking to the registry over HTTP [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [22:32:32] (03PS2) 10Dzahn: monitoring::service: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659075 [22:34:17] (03CR) 10jerkins-bot: [V: 04-1] monitoring::service: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659075 (owner: 10Dzahn) [22:34:35] (03Merged) 10jenkins-bot: Allow talking to the registry over HTTP [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [22:37:46] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:20] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1405.eqiad.wmnet [22:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:25] (03PS1) 10Legoktm: d/changelog: Bump version to 0.0.11 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/659091 [22:40:50] (03CR) 10jerkins-bot: [V: 04-1] d/changelog: Bump version to 0.0.11 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/659091 (owner: 10Legoktm) [22:41:06] (03PS3) 10Dzahn: monitoring::service: hiera -> lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/659075 [22:42:53] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1405.eqiad.wmnet [22:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2222.codfw.wmnet [22:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2222.codfw.wmnet [22:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2246.codfw.wmnet [22:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2246.codfw.wmnet [22:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:52] (03PS1) 10Cwhite: profile: place drop ecs filter in place of ecs pre-filter [puppet] - 10https://gerrit.wikimedia.org/r/659092 [22:49:47] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) running 'scap pull' on all hosts that are being re-pooled [22:49:54] PROBLEM - PHP opcache health on mwdebug2001 is 
CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:49:57] (03CR) 10Cwhite: [C: 03+2] profile: place drop ecs filter in place of ecs pre-filter [puppet] - 10https://gerrit.wikimedia.org/r/659092 (owner: 10Cwhite) [22:51:05] (03CR) 10Legoktm: "Hmm, was I not supposed to push the upstream/0.0.11 tag initially?" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/659091 (owner: 10Legoktm) [22:53:44] (03PS2) 10Jdlrobson: Disable max-width on page namespace for wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659079 (https://phabricator.wikimedia.org/T260091) [22:54:30] (03PS1) 10Cwhite: profile: get the ecs pre-filter filename right [puppet] - 10https://gerrit.wikimedia.org/r/659093 [22:54:55] (03CR) 10jerkins-bot: [V: 04-1] Disable max-width on page namespace for wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659079 (https://phabricator.wikimedia.org/T260091) (owner: 10Jdlrobson) [22:55:34] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [22:56:05] (03CR) 10Cwhite: [C: 03+2] profile: get the ecs pre-filter filename right [puppet] - 10https://gerrit.wikimedia.org/r/659093 (owner: 10Cwhite) [22:57:04] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [22:57:46] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [22:59:16] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [23:06:43] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses 
when querying the API - https://phabricator.wikimedia.org/T273003 (10Urbanecm) @Cyberpower678 This doesn't sound to be a complete dump of a raw request. I'm pretty confident the response has at least one line (the one starting with `... [23:10:44] (03PS1) 10Legoktm: docker_registry_ha: Have build-homepage talk directly to the registry [puppet] - 10https://gerrit.wikimedia.org/r/659095 (https://phabricator.wikimedia.org/T179696) [23:11:04] (03PS2) 10Legoktm: docker_registry_ha: Have build-homepage talk directly to the registry [puppet] - 10https://gerrit.wikimedia.org/r/659095 (https://phabricator.wikimedia.org/T179696) [23:14:13] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Legoktm) >>! In T273003#6778468, @Cyberpower678 wrote: > I believe it only does maxlag on write requests, like when it edits. These are all read requests. You sho... [23:16:09] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27710/console" [puppet] - 10https://gerrit.wikimedia.org/r/659095 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:16:52] (03CR) 10Legoktm: [C: 03+2] "Also tested on registry2002 manually." [puppet] - 10https://gerrit.wikimedia.org/r/659095 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:17:24] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) >>! In T273003#6782382, @Urbanecm wrote: > @Cyberpower678 This doesn't sound to be a complete dump of a raw request. I'm pretty confident the respons... 
[23:18:36] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:18] PROBLEM - SSH on logstash2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:21:46] ok, registry2002 should really stop flapping now [23:24:15] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10Legoktm) p:05Triage→03Medium [23:26:53] (03PS3) 10Jdlrobson: Disable max-width on page namespace for wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/659079 (https://phabricator.wikimedia.org/T260091) [23:30:26] !log reboot logstash2006 [23:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:47] legoktm: cool! [23:32:52] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [23:33:26] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [23:33:26] PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:48] RECOVERY - SSH on logstash2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:35:04] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [23:35:38] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [23:40:58] RECOVERY - Memcached on mwdebug2001 is OK: TCP OK - 0.032 second response time on 10.192.0.98 port 11210 https://wikitech.wikimedia.org/wiki/Memcached