[00:02:58] Krinkle: please do it in mw config, thanks [00:04:13] mutante: I'll need to know the values [00:05:21] could you place them in my home directory or something? [00:10:12] Krinkle: cat ~/xhgui-db for the password and the other stuff is all public in hieradata/role/common/webperf/xhgui.yaml host (m2-master), db (xhgui) and user (xhgui) https://phabricator.wikimedia.org/T254795#6253646 [00:10:55] thx, verified via 'mysql', works as intended [00:11:32] cool [00:14:29] !log krinkle@deploy1001 Synchronized private/PrivateSettings.php: T254795 - Set $wmgXhguiDBuser and $wmgXhguiDBpasswor (duration: 01m 06s) [00:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:34] T254795: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 [00:16:33] (03CR) 10Krinkle: "Settings are set and deployed in prod. @Dave Can you confirm that this patch should not have any effect on tungsten currently (e.g would s" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [00:19:12] PROBLEM - dump of x1 in eqiad on db2093 is CRITICAL: dump for x1 at eqiad taken more than 8 days ago: Most recent backup 2020-06-23 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:36:46] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19031096 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:38:36] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 198064 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:05:22] (03CR) 10Dave Pifke: "This uses a different role, so shouldn't touch tungsten. 
(Confirmed in beta: deployment-xhgui01 was unaffected when this was cherry-picke" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [01:23:28] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:12] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:36:40] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:49] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) [03:27:14] (03CR) 10Krinkle: [C: 03+2] findBadBlobs: better separate scan and mark modes. [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608667 (https://phabricator.wikimedia.org/T251778) (owner: 10Krinkle) [03:31:28] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:34:13] 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) Likely segfaulting on netflow1001 now too. Upstream has a patch; I'll attempt backporting it to our version in the morning. If it works I'll also file in Debian BTS with the patch and see about g... [03:36:54] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:49:28] (03Merged) 10jenkins-bot: findBadBlobs: better separate scan and mark modes. [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608667 (https://phabricator.wikimedia.org/T251778) (owner: 10Krinkle) [04:00:38] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:01:02] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.39/maintenance/findBadBlobs.php: I47c11190b665 (duration: 01m 08s) [04:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:06:06] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:35] (03CR) 10Bmansurov: Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [04:21:52] (03PS1) 10Cicalese: Deploy MediaModeration on all production wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608753 (https://phabricator.wikimedia.org/T247943) [04:31:44] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:14] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:48:34] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:06] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:36] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:14:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:17:12] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (labtestpuppetmaster2001, ...), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:32:28] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:37:58] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:06] <_joe_> !log restarting nfacctd on netflow1001, it's segfaulting [05:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:00] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:30] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:22] * elukey sees a _joe_ around \o/ [06:21:50] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10elukey) >>! 
In T224454#6269950, @CDanis wrote: > There's no alert yet for memcache NIC saturation, and I don't believe there's... [06:31:26] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:11] (03CR) 10VulpesVulpes825: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608124 (https://phabricator.wikimedia.org/T256521) (owner: 10Hamish) [06:38:48] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:36] (03PS1) 10Majavah: Enable SandboxLink extension in trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608799 (https://phabricator.wikimedia.org/T256782) [06:56:02] 10Operations, 10Traffic: Certain links being rejected by caching if opened in Internet Explorer - https://phabricator.wikimedia.org/T256302 (10Urbanecm) @Aklapper: Here you are, this is the complete message. ` Request from (REDACTED) via cp3062 frontend, Varnish XID 675554498 Upstream caches: cp3062 int Error... [07:16:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Overall LGTM. I would probably think of reducing the number of parameters of the profile, but that's really up to your best judgement. Thi" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [07:17:37] good morning [07:25:34] 10Operations, 10Beta-Cluster-Infrastructure, 10observability, 10serviceops, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10hashar) Just a note the Apache logs are still emitted to logstash for mw1262 and mw1276 ` name=hieradata/host... 
[07:28:58] (03PS2) 10ZPapierski: Configuration code for oauth proxy [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [07:30:13] (03CR) 10jerkins-bot: [V: 04-1] Configuration code for oauth proxy [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [07:32:04] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:00] 10Operations, 10Traffic, 10Patch-For-Review: purged crashes with "fatal error: concurrent map read and map write" - https://phabricator.wikimedia.org/T256479 (10ema) 05Open→03Resolved a:03ema Fixed by deploying https://gerrit.wikimedia.org/r/c/operations/software/purged/+/608045, the issue hasn't occur... [07:37:30] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:06] !log cp2041: restart purged, varnishkafka after librdkafka1 upgrade to 0.11.6-1.1wmf1 T256444 [07:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:11] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [07:40:16] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [07:40:17] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [07:42:49] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:32] (03PS3) 10ZPapierski: Configuration code for oauth proxy [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [08:00:51] (03CR) 10jerkins-bot: [V: 04-1] Configuration code for oauth proxy [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [08:01:01] !log rolling restart of esams cache nodes to catch up on kernel upgrades [08:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:25] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [08:05:26] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:36] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [08:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atskafka site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:13:38] (03PS1) 10Awight: Remove redundant beta config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608805 [08:17:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:19:36] (03CR) 10Jbond: [C: 03+1] mariadb: Use custom types to ensure role/section have valid values. 
[puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat) [08:20:14] Question: is there any reason for a db index that is going to be unique not to be explicitly labelled a unique index? I filed T256841 and T256842 before realizing there were many more unique indexes that weren't officially UNIQUE [08:20:15] T256841: slot_revision_origin_role should be a UNIQUE INDEX - https://phabricator.wikimedia.org/T256841 [08:20:15] T256842: iwlinks indexes should be UNIQUE INDEXes - https://phabricator.wikimedia.org/T256842 [08:22:59] (03PS3) 10Jbond: icinga: add ldap-icinga.wikimedia.org CNAME [dns] - 10https://gerrit.wikimedia.org/r/608301 (https://phabricator.wikimedia.org/T256628) [08:23:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [08:23:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:52] (03CR) 10Jbond: [C: 03+2] icinga: add ldap-icinga.wikimedia.org CNAME [dns] - 10https://gerrit.wikimedia.org/r/608301 (https://phabricator.wikimedia.org/T256628) (owner: 10Jbond) [08:26:34] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Gehel) @Cmjohnson: looking at the quote that was validated (T232649#5681830) it specified 8x 1.92T SSD (4 SSD per server)... 
[08:29:12] !log disable BGP to nfacct in eqiad - T256790 [08:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:17] T256790: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 [08:30:44] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:09] 10Operations, 10User-jbond: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10jbond) As there has been little activity on this task and the current ACL is quite wide i propse that i will remove the current ACL's and going forward... [08:35:02] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Indeed, this is obsolete since Ie0f135a." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608805 (owner: 10Awight) [08:36:39] (03CR) 10Kormat: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat) [08:37:34] (03CR) 10Kormat: [C: 03+2] mariadb: Use custom types to ensure role/section have valid values. 
[puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat) [08:38:55] (03CR) 10Kormat: [C: 03+2] cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat) [08:39:14] (03PS1) 10Jbond: profile::waf::apache::administrative: remove waf config [puppet] - 10https://gerrit.wikimedia.org/r/608806 [08:41:11] 10Operations, 10Traffic: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error - https://phabricator.wikimedia.org/T256302 (10Aklapper) [08:44:04] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [08:44:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:40] (03PS1) 10Jbond: block_abuse_nets: enable block abuse nets on misc sites [puppet] - 10https://gerrit.wikimedia.org/r/608807 (https://phabricator.wikimedia.org/T253632) [08:50:32] (03PS2) 10Jbond: profile::waf::apache::administrative: remove waf config [puppet] - 10https://gerrit.wikimedia.org/r/608806 (https://phabricator.wikimedia.org/T253632) [08:52:37] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [08:52:39] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:44] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2039-2040].codfw.wmnet ` [08:53:45] !log draining kubernetes staging node kubestage1001.eqiad.wmnet - T256786 [08:53:48] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:49] T256786: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 [08:59:16] (03CR) 10Arturo Borrero Gonzalez: "Other than the comment inline, this LGTM." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [09:04:39] (03PS1) 10Jbond: scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 [09:06:20] 10Operations, 10DBA, 10User-Kormat: Add monitoring to ensure that puppet/tendril/zarcillo all agree on the set of sections that exist - https://phabricator.wikimedia.org/T256845 (10Kormat) [09:06:28] 10Operations, 10DBA, 10User-Kormat: Add monitoring to ensure that puppet/tendril/zarcillo all agree on the set of sections that exist - https://phabricator.wikimedia.org/T256845 (10Kormat) p:05Triage→03Medium [09:06:31] (03PS7) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) [09:06:42] (03PS3) 10Muehlenhoff: Switch puppetboard to only use CAS for authentication [puppet] - 10https://gerrit.wikimedia.org/r/607239 [09:07:06] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [09:07:49] (03CR) 10jerkins-bot: [V: 04-1] scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 (owner: 10Jbond) [09:08:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:08:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:52] 10Puppet, 10DBA, 10cloud-services-team (Kanban): 
labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (10jcrespo) [09:08:56] (03CR) 10Volans: "see inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [09:10:32] (03PS2) 10Jbond: scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 [09:10:35] * kormat files a task to get volan's access dropped while he's supposed to be on vacation :P [09:10:56] (03PS3) 10Kormat: mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) [09:11:06] (03CR) 10jerkins-bot: [V: 04-1] scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 (owner: 10Jbond) [09:11:18] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={2,3,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasour [09:11:18] us/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [09:13:02] (03CR) 10Arturo Borrero Gonzalez: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [09:14:21] (03PS3) 10Jbond: scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 [09:15:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:15:05] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:13] (03CR) 10jerkins-bot: [V: 04-1] mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [09:15:48] (03PS8) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) [09:15:59] (03CR) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [09:17:19] (03PS1) 10Elukey: archiva::proxy: allow nginx to serve content from repositories [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) [09:17:33] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Switch puppetboard to only use CAS for authentication [puppet] - 10https://gerrit.wikimedia.org/r/607239 (owner: 10Muehlenhoff) [09:18:02] 10Operations, 10OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524 (10Joe) 05Open→03Invalid [09:18:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Use chained cert for mail relay TLS [puppet] - 10https://gerrit.wikimedia.org/r/608720 (https://phabricator.wikimedia.org/T256806) (owner: 10BryanDavis) [09:21:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:06] !log restarting dockerd on kubestage1002.eqiad.wmnet - T256786 [09:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:11] T256786: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 
[09:23:38] PROBLEM - puppetboard.wikimedia.org on puppetboard2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 302 Found https://wikitech.wikimedia.org/wiki/Puppet%23PuppetDB [09:23:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:25:50] PROBLEM - puppetboard.wikimedia.org on puppetboard1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 302 Found https://wikitech.wikimedia.org/wiki/Puppet%23PuppetDB [09:27:57] ^ I'll fix that [09:28:50] ACKNOWLEDGEMENT - puppetboard.wikimedia.org on puppetboard1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 302 Found Muehlenhoff Switching to CAS https://wikitech.wikimedia.org/wiki/Puppet%23PuppetDB [09:28:50] ACKNOWLEDGEMENT - puppetboard.wikimedia.org on puppetboard2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 302 Found Muehlenhoff Switching to CAS https://wikitech.wikimedia.org/wiki/Puppet%23PuppetDB [09:34:02] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:34:02] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:43] (03CR) 10Awight: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608805 (owner: 10Awight) [09:39:40] (03Abandoned) 10Hnowlan: deployment-docker-changeprop01: override docker configuration [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [09:42:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:42:28] !log vgutierrez@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) [09:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:33] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2041-2042].codfw.wmnet ` [09:44:23] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [09:46:35] !log cordoning kubernetes[2001-2004].codfw.wmnet,kubernetes[1001-1004].eqiad.wmnet - T256786 [09:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:40] T256786: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 [09:47:01] (03PS4) 10Kormat: mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) [09:52:10] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:52:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:27] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:01:56] (03CR) 10Jcrespo: "Let's test it and compare results as Manuel suggested before merging." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:03:48] RECOVERY - dump of x1 in eqiad on db2093 is OK: Last dump for x1 at eqiad (db1102.eqiad.wmnet:3320) taken on 2020-07-01 08:52:39 (30 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:09:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/607535 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [10:09:24] !log draining and docker restart (one at a time) kubernetes[2001-2004].codfw.wmnet [10:09:26] (03PS1) 10Jcrespo: mariadb: Add monitoring to temporary test ipd database db1077 [puppet] - 10https://gerrit.wikimedia.org/r/608820 (https://phabricator.wikimedia.org/T256120) [10:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:18] (03CR) 10Kormat: "> Patch Set 4:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:10:47] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add monitoring to temporary test ipd database db1077 [puppet] - 10https://gerrit.wikimedia.org/r/608820 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [10:10:57] (03PS2) 10Jcrespo: mariadb: Add monitoring to temporary test ipd database db1077 [puppet] - 10https://gerrit.wikimedia.org/r/608820 (https://phabricator.wikimedia.org/T256120) [10:11:37] ^ kormat: could I get a review, as you may know better what is the latest puppet proper code? [10:11:48] sure thing [10:12:03] with the new sections and stuff [10:14:34] o/ anyone around here familiar with robot policies in mediawiki configuration? 
[10:14:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:14:56] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:57] somehow robot policies set in T255538 aren't working (or even present in the meta tag) [10:15:57] T255538: Disable search engine indexing (with noindex) in specific namespaces of Turkish Wikipedia - https://phabricator.wikimedia.org/T255538 [10:16:00] 10Operations, 10CAS-SSO, 10Patch-For-Review: icinga Blocked by X-Frame-Options Policy - https://phabricator.wikimedia.org/T251513 (10MoritzMuehlenhoff) This still happens after the CASScope change, if I log in and the (non long term) session expires, it fails to connect to the IDP for renewing the session:... [10:16:22] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:16:22] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:07] !log renumber NTT transit links - T254877 [10:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:18] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608820 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [10:22:23] (03CR) 10Jbond: [C: 03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/608820 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [10:25:43] (03CR) 10Jcrespo: "Question:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:27:10] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:27:36] ^ probably me, I'll check [10:28:28] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [10:29:00] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:29:36] (03CR) 10Jcrespo: "It is fair if the answer is "I will do in depth testing when the new code is in place", BTW (I will +1 if testing is delayed for more code" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:30:19] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add monitoring to temporary test ipd database db1077 [puppet] - 10https://gerrit.wikimedia.org/r/608820 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [10:30:20] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:42] deneb is happy again \o/ [10:31:14] on the other hand, restbase2009 has been down for a day now [10:31:44] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:32:40] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:32:57] that's me ^ [10:33:10] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:33:52] (03CR) 10Kormat: "> Patch Set 4:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:34:31] !log power-cycle restbase2009 [10:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:20] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:37:44] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:38:22] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:38:52] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:40:28] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:42:14] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 81 probes of 655 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:44:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:45:41] !log draining and docker restart (one at a time) kubernetes[1001-1004].eqiad.wmnet - T256786 [10:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:45] T256786: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 [10:47:14] 
RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:47:22] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 5 probes of 655 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:47:27] (03PS4) 10ZPapierski: Configuration code for oauth proxy [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [10:47:40] (03CR) 10jerkins-bot: [V: 04-1] Configuration code for oauth proxy [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [10:49:23] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jcrespo) a:03jbond The host has been added the missing monitoring and open ports, as well as updated on tendril and zarcillo. If this is all... 
[10:50:00] (03PS1) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) [10:50:02] (03CR) 10Jcrespo: [C: 03+1] mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:50:04] !log power on restbase2009 [10:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:32] (03CR) 10Kormat: [C: 03+2] mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:51:17] (03CR) 10jerkins-bot: [V: 04-1] Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [10:52:50] (03Merged) 10jenkins-bot: mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:54:39] (03PS1) 10Ema: varnish: add SystemTap script to debug VCL discard issues [puppet] - 10https://gerrit.wikimedia.org/r/608846 (https://phabricator.wikimedia.org/T236754) [10:55:27] (03CR) 10Ema: [C: 03+2] varnish: add SystemTap script to debug VCL discard issues [puppet] - 10https://gerrit.wikimedia.org/r/608846 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [10:56:34] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:31] (03PS1) 10Muehlenhoff: Update debmonitor links to point to puppetboard instead of cas-puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/608848 [10:57:48] (03PS2) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) [10:58:03] 10Operations, 10DBA, 10SRE-tools, 10Patch-For-Review, 10User-Kormat: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10Kormat) 05Open→03Resolved `mysql_legacy` is now updated, so i think this can be closed. [10:58:06] 10Operations: FY2020-2021 Q1 eqiad -> codfw switchover - https://phabricator.wikimedia.org/T243316 (10Kormat) [10:58:11] ema: did not last long :-| ... this is still my fault, though [10:58:18] (deneb crying) [10:59:03] (03CR) 10jerkins-bot: [V: 04-1] Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1100). [11:00:04] micgro42, tgr, and Majavah: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:00:11] o/ [11:00:27] hi :) [11:00:35] (03PS1) 10Arturo Borrero Gonzalez: toolforge: mailexchange: add prometheus bits to know about queue length [puppet] - 10https://gerrit.wikimedia.org/r/608849 (https://phabricator.wikimedia.org/T256737) [11:01:03] 10Operations, 10ops-codfw: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10ema) [11:01:22] micgro42: should I start with your maintenance script? 
[11:01:35] o/ [11:01:38] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10ema) [11:01:54] @Lucas: no, I'm in a video call with Amir to do it :) [11:02:04] thank you :) [11:02:04] (03PS2) 10Arturo Borrero Gonzalez: toolforge: mailexchange: add prometheus bits to know about queue length [puppet] - 10https://gerrit.wikimedia.org/r/608849 (https://phabricator.wikimedia.org/T256737) [11:02:20] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2009.codfw.wmnet [11:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:34] !log restbase2009 depooled T256863 [11:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:38] (03PS3) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) [11:02:38] T256863: restbase2009 down - https://phabricator.wikimedia.org/T256863 [11:02:46] can you tell him to join IRC then? [11:02:54] if he’s doing the window… [11:03:01] ah wati [11:03:03] *wait [11:03:05] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10akosiaris) p:05High→03Low I 'll lower priority for this. We may have the solution. h... 
[11:03:08] looked for the wrong nickname, sorry 🤦 [11:03:12] Amir1 is here [11:03:26] I'm here [11:03:29] meeting [11:03:50] (03CR) 10jerkins-bot: [V: 04-1] Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [11:03:52] (03PS3) 10Arturo Borrero Gonzalez: toolforge: mailexchange: add prometheus bits to know about queue length [puppet] - 10https://gerrit.wikimedia.org/r/608849 (https://phabricator.wikimedia.org/T256737) [11:05:23] Lucas_WMDE: we are doing the maintenance script [11:05:32] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) [11:05:38] (03PS4) 10Arturo Borrero Gonzalez: toolforge: mailexchange: add prometheus bits to know about queue length [puppet] - 10https://gerrit.wikimedia.org/r/608849 (https://phabricator.wikimedia.org/T256737) [11:07:30] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[11:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:57] !log Changing datatype of several properties with mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php (T255241) [11:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:01] T255241: Convert some properties string->external ID - https://phabricator.wikimedia.org/T255241 [11:09:14] Cool, we are done :) [11:09:38] (03PS5) 10Arturo Borrero Gonzalez: toolforge: mailexchange: add prometheus bits to know about queue length [puppet] - 10https://gerrit.wikimedia.org/r/608849 (https://phabricator.wikimedia.org/T256737) [11:11:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailexchange: add prometheus bits to know about queue length [puppet] - 10https://gerrit.wikimedia.org/r/608849 (https://phabricator.wikimedia.org/T256737) (owner: 10Arturo Borrero Gonzalez) [11:11:28] can we continue with the config patches then? [11:11:28] ok [11:13:06] yeah [11:13:16] (03PS2) 10Lucas Werkmeister (WMDE): Fully set MW_NO_SESSION for browser metadata endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608713 (owner: 10Gergő Tisza) [11:13:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fully set MW_NO_SESSION for browser metadata endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608713 (owner: 10Gergő Tisza) [11:15:08] (03Merged) 10jenkins-bot: Fully set MW_NO_SESSION for browser metadata endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608713 (owner: 10Gergő Tisza) [11:15:46] tgr: how should I sync that change? four sync-files, probably? 
[11:15:58] yeah, order doesn't matter [11:16:26] just quickly checking mwdebug1001 [11:17:01] /w has some debug logging code so probably better not to sync the whole directory [11:17:16] thanks, good to now [11:17:17] *know [11:17:39] no strange log messages for mwdebug, syncing [11:19:03] (03PS2) 10Ema: 5.1.3-1wm16: add 0039-probe-cold-state-race.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) [11:19:19] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm16: add 0039-probe-cold-state-race.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [11:19:34] !log lucaswerkmeister-wmde@deploy1001 Synchronized w/extract2.php: Config: [[gerrit:608713|Fully set MW_NO_SESSION for browser metadata endpoints]], 1/4 (duration: 01m 16s) [11:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:30] !log lucaswerkmeister-wmde@deploy1001 Synchronized w/favicon.php: Config: [[gerrit:608713|Fully set MW_NO_SESSION for browser metadata endpoints]], 2/4 (duration: 01m 04s) [11:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:54] !log lucaswerkmeister-wmde@deploy1001 Synchronized w/robots.php: Config: [[gerrit:608713|Fully set MW_NO_SESSION for browser metadata endpoints]], 3/4 (duration: 01m 03s) [11:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two nits inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/607612 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [11:24:28] !log lucaswerkmeister-wmde@deploy1001 Synchronized w/touch.php: Config: [[gerrit:608713|Fully set MW_NO_SESSION for browser metadata endpoints]], 4/4 (duration: 01m 06s) [11:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:28] thanks Lucas_WMDE! 
[11:25:52] Majavah: the translations mentioned in the task probably won’t be live yet, is that okay? [11:26:02] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update debmonitor links to point to puppetboard instead of cas-puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/608848 (owner: 10Muehlenhoff) [11:26:32] Lucas_WMDE: Hmm, good point [11:26:33] (03PS3) 10Ema: 5.1.3-1wm16: add 0039-probe-cold-state-race.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) [11:26:46] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm16: add 0039-probe-cold-state-race.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [11:27:11] actually yeah, probably waiting is a good idea if the link target changes [11:27:20] wmf.39 seems to have a newer portlet-label than wmf.38, but subpage-name is still “sandbox” [11:27:22] yeah [11:27:52] should I comment on Phabricator or do you want to do it? [11:27:59] anything is fine with me [11:28:01] or backport i18n patches, I guess [11:28:06] (03CR) 10Gergő Tisza: "The relevant task is T127233, forgot about that when writing the patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/608713 (owner: 10Gergő Tisza) [11:28:07] but the new subpage-name isn’t even in master yet [11:28:09] so let’s wait [11:28:11] I’ll comment [11:28:27] yeah sounds good, thanks [11:28:55] (03CR) 10Alexandros Kosiaris: Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [11:32:26] !log EU B&C window done [11:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:49] (03CR) 10Majavah: [C: 04-1] "Let's wait until updated translations are deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608799 (https://phabricator.wikimedia.org/T256782) (owner: 10Majavah) [11:37:02] (03CR) 10Addshore: "FInally got around to checking this on test commons and it looks good, so I'll continue with the rest of the config changes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605645 (owner: 10Addshore) [11:38:27] (03PS1) 10VulpesVulpes825: Change the Simplified Chinese logo for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608853 [11:39:53] (03PS14) 10Addshore: Wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:40:36] jouncebot: now [11:40:36] For the next 0 hour(s) and 19 minute(s): European mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1100) [11:40:47] * addshore will borrow the next 20 mins as noone else is using them [11:41:14] (03CR) 10Addshore: [C: 03+2] Wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:42:06] (03Merged) 10jenkins-bot: Wikidata: Define entity sources configuration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:46:23] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T254315 Wikidata: Define entity sources configuration [[gerrit:569258]] (duration: 01m 06s) [11:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:55] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [11:47:13] (03PS15) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:47:56] !log A:cp upgrade librdkafka1 to 0.11.6-1.1wmf1 and restart purged, varnishkafka T256444 [11:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:01] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [11:48:53] addshore: please make sure that patch does not break beta cluster testwiki, as beta cluster has no testwikidatawiki [11:49:15] (03CR) 10Addshore: Wikidata client wikis: Define entity sources configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:49:52] Majavah: that last patch only applied to wikidatawiki, so should be fine (the one already deployed), the one I just rebased looks like it needs some alterations [11:50:04] (03CR) 10Muehlenhoff: "The patch itself looks good to me, but one of the type changes will break, maybe that one can simply be backed out for now (or the call si" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608811 (owner: 10Jbond) [11:50:25] addshore: that other patch was exactly why I was commenting [11:51:41] (03CR) 10Majavah: [C: 04-1] "Left couple minor inline comments.
Also please make sure this does not break beta cluster testwiki as beta cluster has no testwikidatawiki" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:52:37] 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10MoritzMuehlenhoff) What's the status here, any feedback from Dell on replacements etc? [11:54:53] (03PS16) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:55:43] (03PS17) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:55:46] (03CR) 10jerkins-bot: [V: 04-1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:56:20] (03PS2) 10VulpesVulpes825: Change the Simplified Chinese logo for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608853 (https://phabricator.wikimedia.org/T256839) [11:56:44] (03CR) 10jerkins-bot: [V: 04-1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [11:57:22] aaah, using the db lists it fails the linting [11:57:49] oh that's annoying [11:58:27] * addshore will think about the db lists, as its probably them that needs fixing, not the config itself [11:58:44] but i need to get a grasp on the whole of the IS.php to figure that out [11:59:06] (03CR) 10Addshore: [C: 04-1] "Looks like some poking perhaps needs to happen with the db lists in order to allow this config to be "nice"" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1200) [12:00:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (but needs the quoting fixed merged first)" [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:00:58] (03PS1) 10Jcrespo: transfer.py: improve default options on Transferer class [software/transferpy] - 10https://gerrit.wikimedia.org/r/608856 [12:03:19] (03CR) 10Jcrespo: "While working on T256749 I found some light improvements to default options. Let me know what you think." [software/transferpy] - 10https://gerrit.wikimedia.org/r/608856 (owner: 10Jcrespo) [12:05:45] (03CR) 10Privacybatm: [C: 03+1] "It looks good to me." [software/transferpy] - 10https://gerrit.wikimedia.org/r/608856 (owner: 10Jcrespo) [12:08:55] (03CR) 10Jcrespo: [C: 03+2] transfer.py: improve default options on Transferer class [software/transferpy] - 10https://gerrit.wikimedia.org/r/608856 (owner: 10Jcrespo) [12:09:48] 10Operations, 10DBA, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat) [12:09:55] 10Operations, 10DBA, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat) p:05Triage→03Medium [12:12:57] (03PS1) 10Jbond: security::access::config: Add types to define [puppet] - 10https://gerrit.wikimedia.org/r/608859 [12:13:13] (03CR) 10ArielGlenn: [C: 03+1] "I don't depend on systemd alerts for the dumpsdata boxes, so let's do this." 
[puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [12:14:33] (03PS2) 10Jbond: security::access::config: Add types to define [puppet] - 10https://gerrit.wikimedia.org/r/608859 [12:18:59] (03PS4) 10Jbond: scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 [12:19:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I tested the new base config on sretest1002 and could successfully create a user after it was applied" [puppet] - 10https://gerrit.wikimedia.org/r/608489 (owner: 10Dzahn) [12:19:31] (03PS8) 10Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [12:19:38] (03PS5) 10Jbond: scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 [12:20:04] (03PS3) 10VulpesVulpes825: Change the Simplified Chinese logo for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608853 (https://phabricator.wikimedia.org/T256839) [12:21:34] (03CR) 10Jbond: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/23597/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608811 (owner: 10Jbond) [12:25:06] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:42] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17769888 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:27:22] (03PS3) 10Jbond: security::access::config: Add types to define [puppet] - 10https://gerrit.wikimedia.org/r/608859 [12:27:43] (03PS6) 10Jbond: scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 [12:28:09] (03CR) 10Muehlenhoff: "recheck" [puppet] - 
10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:28:32] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 18872 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:29:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608811 (owner: 10Jbond) [12:29:47] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10ayounsi) I cleaned up some default config leftovers as well as refactored the interfaces the same way we have them in codfw (with storm control). It's ready to b... [12:31:28] 10Operations, 10DBA, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat) [12:32:04] (03CR) 10Elukey: "Looks good to me, what does pcc say? If everything is as expected I think it could be good to test it somewhere like mwdebug." [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [12:32:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:33:23] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608853 (https://phabricator.wikimedia.org/T256839) (owner: 10VulpesVulpes825) [12:36:34] Majavah: any idea where I can read up on these config yaml files vs the db lists ? 
[12:37:33] addshore: not sure if you can, sorry :/ [12:37:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608859 (owner: 10Jbond) [12:38:20] mediawiki-config doesn't have any documentation that isn't like "here's how you do this incredibly simple and common thing" [12:38:31] * addshore will wait for James F :P [12:38:41] was just about to suggest that lol [12:38:46] What do you need? [12:39:13] Most of it is work-out-able with the CI tests output [12:39:20] Well, it looks like I'm going to want to make some new db lists, so wondering if that's even the right thing to do any more, or if I should be making yml files? [12:39:44] dblists shouldn't be manual [12:39:59] why do you need a new dblist? [12:39:59] You add tags to the yml files (which exist for each wiki), which will then be used to create the dblists [12:40:02] aaah, so db lists are made from the yml files now? [12:40:08] gotcha! [12:40:11] 10Operations, 10DBA, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat) [12:40:31] It's not a comment in every one.. [12:40:31] https://github.com/wikimedia/operations-mediawiki-config/blob/master/dblists/flow.dblist [12:40:34] But most [12:40:35] ># NOTE: This file is automatically generated. Do not edit it directly, run 'composer buildDBLists' instead.
[12:40:48] 10Operations, 10DBA, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat) [12:42:09] so, the current lists exist from a time before we had multiple repos really, well just test and commons, it would be cleaner to rename them and also have an extra one or 2, maybe [12:43:00] wikibase-repo-(wikidata|testwikidata|common|testcommons) << (well this is just 1 site each, so could probably just not be a list), but then the same for clients which are lists, so wikibase-client-(wikidata|testwikidata|common|testcommons) [12:44:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:48:10] 10Operations, 10DBA, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat) [12:50:44] (03PS9) 10Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [12:53:31] 10Operations, 10Wikimedia-Mailing-lists: Creation of mailinglist for Board of WUG Esperanto and Free Knowledge - https://phabricator.wikimedia.org/T255951 (10KuboF) Thank you ssingh and Dzahn!
[12:53:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:53:49] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:36] !log T256790 ✔️ cdanis@apt1001.wikimedia.org ~ 🕘☕ sudo -E reprepro -C main include buster-wikimedia pmacct_1.7.2-3+wmf1_amd64.changes [12:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:41] T256790: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 [12:56:22] jouncebot: now [12:56:22] For the next 0 hour(s) and 3 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1200) [12:58:05] (03PS1) 10Hashar: group1 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608871 [12:58:50] !log T256790 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘☕ sudo debdeploy deploy -u 2020-07-01-pmacct.yaml -s netflow [12:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:46] hashar: o/ [13:00:05] hashar and twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1300). [13:00:09] if you have a minute for a weird jenkins -1 trouble for analytics, let me know :) [13:01:35] (03CR) 10Ppchelko: [C: 03+1] Deploy MediaModeration on all production wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/608753 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [13:03:19] (03PS18) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:03:25] !log T256790 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘☕ sudo cumin 'netflow[3-5]001*' 'systemctl restart nfacctd' [13:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:30] T256790: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 [13:04:01] (03PS19) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:04:05] (03CR) 10jerkins-bot: [V: 04-1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:05:13] 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) 05Open→03Resolved a:03CDanis Backport deployed, all seems well (fortunately netflow1001 was still receiving the triggering BGP data when I went to test, so verified old version still crashed,... [13:06:47] (03PS1) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. 
[puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [13:06:50] cdanis: really nice job [13:07:02] elukey: aw, thanks :) [13:07:16] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [13:07:53] grmbllblbb [13:08:04] it went a lot easier than I expected after I installed libc6-dbg on netflow2001 and saw that the crashes within malloc weren't, as I was hoping, a crash because it was asking for 2^64 bytes or similar, but instead `malloc(24)` was crashing [13:08:12] never something you want to see ;) [13:08:18] hashar: that's a terrible password ;) [13:08:36] !log ✔️ cdanis@netflow2001.codfw.wmnet ~ 🕘☕ sudo apt remove valgrind libc6-dbg [13:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:18] (03PS20) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:09:40] elukey: you should be able to reproduce locally with utils/run_ci_locally.sh ? 
or bundle update && bundle exec rake -j1 test [13:09:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:09:56] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:59] CI runs the rake tasks in parallel and the output of each task ends up multiplexed which makes it hard to spot the error [13:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:10:12] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:15] rake -j1 forces rake to run the tasks serially, that usually makes it easier to find the error [13:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:27] hashar: so the issue seems to be always "fail to merge dependencies etc.." 
like in https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/608872 [13:10:32] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608871 (owner: 10Hashar) [13:10:37] (03PS21) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:10:43] but I am a bit confused about the error msg [13:11:15] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608871 (owner: 10Hashar) [13:11:18] elukey: hmm that might be on the server side ( zuul-merger does a git merge against tip of branch, the repo might had exploded there ) [13:11:51] hashar: yeah this is why I was asking for your wisdom :) [13:12:42] GitCommandError: Cmd('git') failed due to: exit code(1) [13:12:43] cmdline: git fetch --tags -v origin [13:12:46] yeah that is server side [13:12:49] it is wrong [13:12:57] somehow :\ maybe cause a tag got renamed [13:13:00] (03PS13) 10Addshore: Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:13:31] ah yes it might have happened during the last release, we had to rollback [13:13:33] elukey: https://phabricator.wikimedia.org/T252310 [13:13:38] but it is supposedly fixed [13:13:44] will look at it after the train [13:13:53] it should git fetch --force --tags [13:14:05] so the zuul deployment does not have the right code [13:14:07] :-\ [13:14:09] (03CR) 10Addshore: [C: 04-1] "I think it is better to explicitly set these things rather than set them in "default" as default includes a bunch of stuff not even connec" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569261 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [13:14:22] hashar: yes please keep going with the 
train sorry, anytime :) [13:14:23] thanks a lot [13:14:24] * hashar does the train [13:14:38] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:42] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.39 [13:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:46] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.39 (duration: 01m 04s) [13:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:51] elukey: I just reopen the task .Seems like I haven't deployed the update to contint2001 / forgot about it [13:16:56] (03PS2) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [13:16:59] (03PS22) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:17:27] (03PS14) 10Addshore: Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:19:23] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove cas-logstash from caches [puppet] - 10https://gerrit.wikimedia.org/r/607508 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff) [13:19:33] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10thcipriani) [13:19:53] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10thcipriani) > - access request (or expansion) has sign off of 
WMF sponsor/manager (sponser for volunteers, manager for wmf staff) Approved [13:20:01] Jul 01 13:09:14 deneb docker-report-releng[18394]: docker-registry.wikimedia.org/releng/ci-common:0.2 [FAIL] [13:20:03] Jul 01 13:09:14 deneb systemd[1]: docker-reporter-releng-images.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED [13:22:03] (03PS1) 10Jbond: apero_cas: add abbility to configure per service properties [puppet] - 10https://gerrit.wikimedia.org/r/608877 (https://phabricator.wikimedia.org/T251513) [13:22:05] (03PS1) 10Jbond: role::idp: disable X-frame-options for icinga [puppet] - 10https://gerrit.wikimedia.org/r/608878 (https://phabricator.wikimedia.org/T256536) [13:22:18] there is no error logs with the train .. I am disappointed [13:22:44] (03PS1) 10Elukey: maven: remove mirrored repository from main settings [puppet] - 10https://gerrit.wikimedia.org/r/608879 (https://phabricator.wikimedia.org/T252767) [13:24:40] (03PS3) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [13:25:20] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,2,4,5} site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:28:22] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10akosiaris) show system1/log1 etc has 2 telling entries ` hpiLO-> show system1/log1/record19 status=0 status_tag=COMMAND COMPLETED Wed Jul 1 13:24:25 2020 /system1/log1/record19 Targets Propert...
[13:28:33] (03CR) 10Jbond: [C: 03+1] zuul: replace user/group with systemd-sysuser and reserved UID [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:28:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/608600 (https://phabricator.wikimedia.org/T256726) (owner: 10Alexandros Kosiaris) [13:28:44] !log hashar@deploy1001 Started deploy [zuul/deploy@00f69b3]: (no justification provided) [13:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:49] elukey: ^^ [13:29:02] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:29:07] well at least trying to [13:29:16] !log hashar@deploy1001 Finished deploy [zuul/deploy@00f69b3]: (no justification provided) (duration: 00m 32s) [13:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:05] hashar: thankssss! 
I am rechecking to see if it works [13:30:32] still failing, I guess I have to wait a sec probably [13:30:35] !log hashar@deploy1001 Started deploy [zuul/deploy@00f69b3]: (no justification provided) [13:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:43] !log hashar@deploy1001 Finished deploy [zuul/deploy@00f69b3]: (no justification provided) (duration: 00m 08s) [13:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:46] !log Restarting zuul-merger on contint2001 # T252310 [13:35:47] 10Operations, 10wmf-sre-laptop: Split SRE-specific components into an SRE sub-package; create sub-packages for other teams as well - https://phabricator.wikimedia.org/T256872 (10CDanis) [13:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:50] T252310: pywikibot get merge rejections due to zuul-merger not being able to update tags - https://phabricator.wikimedia.org/T252310 [13:36:53] elukey: that worked [13:37:18] hashar: thanks a lot! [13:37:23] !log contint1001 stopped zuul-merger for a test. 
started it again [13:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:44] that will teach me to "lets restart that tomorrow when it is less of a nuisance" [13:37:48] hmm [13:37:52] nuisance might be french [13:38:52] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [13:44:47] (03CR) 10Ottomata: [C: 03+1] scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 (owner: 10Jbond) [13:45:03] (03PS11) 10Addshore: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [13:45:58] (03PS1) 10Muehlenhoff: Make Graphite httpd site configurable [puppet] - 10https://gerrit.wikimedia.org/r/608881 [13:47:12] (03CR) 10jerkins-bot: [V: 04-1] Make Graphite httpd site configurable [puppet] - 10https://gerrit.wikimedia.org/r/608881 (owner: 10Muehlenhoff) [13:50:34] (03PS1) 10Ayounsi: Depool eqsin for cr3-eqsin setup [dns] - 10https://gerrit.wikimedia.org/r/608882 (https://phabricator.wikimedia.org/T255766) [13:52:03] (03CR) 10Kormat: "PCC is happy." [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [13:52:36] (03PS2) 10Muehlenhoff: Make Graphite httpd site configurable [puppet] - 10https://gerrit.wikimedia.org/r/608881 [13:53:49] (03PS2) 10Cicalese: Deploy MediaModeration on all production wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608753 (https://phabricator.wikimedia.org/T247943) [13:58:53] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10danshick-wmde) Thank you! Could someone let me know what credentials to use? 
[13:59:16] 10Operations, 10wmf-sre-laptop: Split SRE-specific components into an SRE sub-package; create sub-packages for other teams as well - https://phabricator.wikimedia.org/T256872 (10hashar) A bit of history: In https://gerrit.wikimedia.org/r/plugins/gitiles/wmf-utils/+/refs/heads/master there is: * `wmf-clone` by... [13:59:56] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:01:40] so well 1.35.0-wmf.39 on group1 looks all fine [14:01:54] so hmm I am taking a break and will be back later for some more adventures [14:02:11] folks can ring me if needed though I might not be immediately available [14:04:08] 10Operations: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [14:04:25] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:04:26] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:44] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, 10User-brennen: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10JMeybohm) Unfortunately removing all tags of an image (e.g. repository) does not remove the repository itself from the registry[1][2].... 
[14:09:35] (03PS2) 10ArielGlenn: add options to make testing dumps rsync easier [puppet] - 10https://gerrit.wikimedia.org/r/608006 [14:11:08] (03CR) 10ArielGlenn: [C: 03+2] add options to make testing dumps rsync easier [puppet] - 10https://gerrit.wikimedia.org/r/608006 (owner: 10ArielGlenn) [14:18:23] (03PS1) 10ArielGlenn: fix the long-running dumps exception checker issue [puppet] - 10https://gerrit.wikimedia.org/r/608885 (https://phabricator.wikimedia.org/T254856) [14:18:29] (03PS4) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [14:20:27] (03CR) 10ArielGlenn: [C: 03+2] fix the long-running dumps exception checker issue [puppet] - 10https://gerrit.wikimedia.org/r/608885 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [14:23:59] (03CR) 10Jbond: "LGTM but comments inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [14:26:50] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:27:48] (03CR) 10Jcrespo: [C: 03+1] "Everything here seems right, but I think there is missing roles? 
(dbstores)" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [14:28:30] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:28:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] (03CR) 10Jcrespo: [C: 03+1] "General misc ones are missing too? Maybe those were added before, I haven't checked." [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [14:36:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 57 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:36:24] (03PS4) 10Ema: 5.1.3-1wm16: add discard patches [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) [14:36:36] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm16: add discard patches [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [14:38:28] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:40:47] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (10JMeybohm) a:03akosiaris [14:41:26] 10Operations: Handle sunset of stretch-backports - 
https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) [14:41:43] 10Operations, 10DBA, 10User-Kormat: Remove unused parameters from profile::mariadb::monitor::prometheus - https://phabricator.wikimedia.org/T256879 (10Kormat) [14:41:48] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:41:49] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, 10User-brennen: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10JMeybohm) [14:43:02] (03PS1) 10Giuseppe Lavagetto: Handle cases of repositories with no tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608889 [14:43:04] (03PS1) 10Giuseppe Lavagetto: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608890 [14:43:13] <_joe_> jayme: ^^ [14:44:28] (03CR) 10jerkins-bot: [V: 04-1] Handle cases of repositories with no tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608889 (owner: 10Giuseppe Lavagetto) [14:44:32] (03PS2) 10Giuseppe Lavagetto: Handle cases of repositories with no tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608889 [14:44:34] (03PS2) 10Giuseppe Lavagetto: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608890 [14:44:40] (03CR) 10jerkins-bot: [V: 04-1] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608890 (owner: 10Giuseppe Lavagetto) [14:45:09] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23605/" [puppet] - 10https://gerrit.wikimedia.org/r/608881 (owner: 10Muehlenhoff) [14:45:18] <_joe_> oh sigh damn flake8 [14:45:52] (03CR) 10jerkins-bot: [V: 04-1] Handle cases of repositories with no tags [docker-images/docker-report] 
- 10https://gerrit.wikimedia.org/r/608889 (owner: 10Giuseppe Lavagetto) [14:45:55] (03CR) 10jerkins-bot: [V: 04-1] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608890 (owner: 10Giuseppe Lavagetto) [14:46:20] (03PS1) 10Gergő Tisza: Remove old incorrect GrowthExperiments survey config from beta kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608892 (https://phabricator.wikimedia.org/T256828) [14:46:29] (03PS3) 10Giuseppe Lavagetto: Handle cases of repositories with no tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608889 [14:46:31] (03PS3) 10Giuseppe Lavagetto: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608890 [14:47:45] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10Dzahn) @danshick-wmde (cc: @ssingh @Nuria ) This sounds like it's another case of T252703#6152109 / T252703#6154468 Please try to use "Dan S... [14:49:00] (03CR) 10QChris: [C: 03+1] gerrit: remove all database parameters / support [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: 10Dzahn) [14:49:02] (03PS1) 10Alexandros Kosiaris: redis::misc: Correct typo in instance_overrides [puppet] - 10https://gerrit.wikimedia.org/r/608893 (https://phabricator.wikimedia.org/T256726) [14:49:30] (03CR) 10JMeybohm: [C: 03+1] "Maybe add a back reference to https://phabricator.wikimedia.org/T242604, but LGTM!" 
[docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608889 (owner: 10Giuseppe Lavagetto) [14:58:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] redis::misc: Correct typo in instance_overrides [puppet] - 10https://gerrit.wikimedia.org/r/608893 (https://phabricator.wikimedia.org/T256726) (owner: 10Alexandros Kosiaris) [14:59:49] (03CR) 10Ayounsi: [C: 03+2] Depool eqsin for cr3-eqsin setup [dns] - 10https://gerrit.wikimedia.org/r/608882 (https://phabricator.wikimedia.org/T255766) (owner: 10Ayounsi) [15:00:35] !log depool eqsin for routers work - T255766 [15:00:37] (03PS2) 10Krinkle: mediawiki: Add api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [15:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:40] T255766: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 [15:03:09] !log move vrrp master to cr2-eqsin - T255766 [15:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:03:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:47] XioNoX: did my homer-public patch to do that work, btw? 
[15:05:32] cdanis: I did it manually 😅 but I applied the priority to the group instead of every interfaces [15:05:39] ahah [15:05:52] close enough ;) [15:09:48] !log bump eqsin-codfw ospf link cost - T255766 [15:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:53] T255766: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 [15:10:32] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 45.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:11:58] (03PS4) 10Giuseppe Lavagetto: Handle cases of repositories with no tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608889 (https://phabricator.wikimedia.org/T242604) [15:12:02] (03PS4) 10Giuseppe Lavagetto: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608890 [15:12:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Handle cases of repositories with no tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608889 (https://phabricator.wikimedia.org/T242604) (owner: 10Giuseppe Lavagetto) [15:13:02] !log disable cr1-eqsin transit/peering BGP - T255766 [15:13:04] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:43] (03PS1) 10Jbond: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 [15:13:45] (03PS1) 10Jbond: role::tendril: move tendril to new profile::mariadb::misc [puppet] - 10https://gerrit.wikimedia.org/r/608896 [15:14:02] (03Merged) 
10jenkins-bot: Handle cases of repositories with no tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608889 (https://phabricator.wikimedia.org/T242604) (owner: 10Giuseppe Lavagetto) [15:15:23] !log disable BGP to pybal on cr1-eqsin - T255766 [15:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:28] T255766: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 [15:16:58] !log re0.cr1-eqsin> request system power-off both-routing-engines - T255766 [15:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:28] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:21:30] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 84, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:22:32] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:24:05] (03PS4) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) [15:24:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:25:37] (03CR) 10jerkins-bot: [V: 04-1] Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [15:27:03] (03PS2) 10Jbond: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 [15:27:17] (03PS2) 10Jbond: role::tendril: move 
tendril to new profile::mariadb::misc [puppet] - 10https://gerrit.wikimedia.org/r/608896 [15:27:22] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (10akosiaris) 05Open→03Resolved Double checked across all nodes, this has been applied successfully. R... [15:27:26] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10akosiaris) [15:29:02] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 47 probes of 656 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:31:17] (03PS1) 10Muehlenhoff: Explicitly depend on git-review 1.27 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608901 [15:32:47] (03CR) 10CDanis: [C: 03+1] Explicitly depend on git-review 1.27 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608901 (owner: 10Muehlenhoff) [15:33:40] (03PS10) 10Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [15:34:31] (03CR) 10jerkins-bot: [V: 04-1] [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [15:35:08] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Explicitly depend on git-review 1.27 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608901 (owner: 10Muehlenhoff) [15:36:08] (03PS3) 10Jbond: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 [15:37:16] RECOVERY - OSPF status on cr1-codfw is 
OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:38:04] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:38:10] (03PS4) 10Jbond: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 [15:39:05] (03PS5) 10Jbond: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 [15:39:22] (03PS3) 10Jbond: role::tendril: move tendril to new profile::mariadb::misc [puppet] - 10https://gerrit.wikimedia.org/r/608896 [15:41:27] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) Fantastic :) Thanks for the quick turn-around @akosiaris. 
[15:43:21] (03PS6) 10Jbond: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 [15:43:50] (03PS4) 10Jbond: role::tendril: move tendril to new profile::mariadb::misc [puppet] - 10https://gerrit.wikimedia.org/r/608896 [15:45:41] (03PS7) 10Jbond: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 [15:46:34] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 656 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:50:42] (03PS5) 10Jbond: role::tendril: move tendril to new profile::mariadb::misc [puppet] - 10https://gerrit.wikimedia.org/r/608896 [15:52:44] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:55:32] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Dzahn) a:03ssingh [15:56:13] (03CR) 10Ayounsi: [C: 03+2] cr1-eqsin -> cr3-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/606425 (https://phabricator.wikimedia.org/T255766) (owner: 10Ayounsi) [15:56:49] (03PS1) 10ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) [15:57:21] (03CR) 10jerkins-bot: [V: 04-1] Authenticate with MW oauth 1.0a for WCQS [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [15:59:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of minor inline comments, otherwise LGTM" (032 comments) [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/608088 (https://phabricator.wikimedia.org/T253843) (owner: 
10JMeybohm) [16:00:13] !log updating eqsin LVS BGP neighbors IPs - T255766 [16:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:18] T255766: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 [16:02:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608890 (owner: 10Giuseppe Lavagetto) [16:05:12] (03Merged) 10jenkins-bot: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/608890 (owner: 10Giuseppe Lavagetto) [16:11:34] (03PS11) 10Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [16:16:22] (03CR) 10Dzahn: "@paladox is this still -1 or +1 now after the upgrade?" [puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [16:18:04] (03CR) 10Privacybatm: "This change is ready for review." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [16:18:50] (03CR) 10Jcrespo: "> Patch Set 11:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [16:20:02] (03PS12) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [16:23:54] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 86, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:26] (03PS14) 10Dzahn: webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [16:28:11] (03CR) 10Dzahn: [C: 03+2] "noop on tungsten: https://puppet-compiler.wmflabs.org/compiler1003/23612/tungsten.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [16:29:39] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:30:40] (03CR) 10Dzahn: "on webperf1001: Function lookup() did not find a value for the name 'profile::webperf::site::xhgui_old_host'" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [16:34:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[16:34:36] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:35:42] 👀 ooh, has Zayo opened a case against our service? [16:39:09] (03PS1) 10Dzahn: webperf: set xhgui_old_host parameter to tungsten, xhgui host to xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/608911 (https://phabricator.wikimedia.org/T180761) [16:41:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:43:33] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/608911" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [16:49:02] (03PS2) 10Elukey: maven: remove main /etc/maven/settings.xml [puppet] - 10https://gerrit.wikimedia.org/r/608879 (https://phabricator.wikimedia.org/T252767) [16:49:28] ACKNOWLEDGEMENT - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T256863 [16:49:42] (03CR) 10Elukey: [C: 03+2] maven: remove main /etc/maven/settings.xml [puppet] - 10https://gerrit.wikimedia.org/r/608879 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [16:55:32] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Edtadros) Thank you! I'll do the reading. 
[16:57:28] !log restart cr2-eqsin for software upgrade - T243080 [16:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:13] (03Abandoned) 10Dzahn: webperf: set xhgui_old_host parameter to tungsten, xhgui host to xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/608911 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [16:59:21] (03PS1) 10Jeena Huneidi: Revert "Revert "blubberoid: Update to latest image"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/608836 [17:00:54] (03PS2) 10Jeena Huneidi: Revert "Revert "blubberoid: Update to latest image"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/608836 [17:02:22] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:03:01] (03PS4) 10Dzahn: webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) [17:03:10] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 77, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:03:11] (03CR) 10Dzahn: "instead amended to https://gerrit.wikimedia.org/r/c/operations/puppet/+/552357" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:03:44] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:03:53] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "Revert "blubberoid: Update to latest image"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/608836 (owner: 10Jeena Huneidi) [17:04:50] (03Merged) 10jenkins-bot: Revert "Revert "blubberoid: Update to latest image"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/608836 (owner: 10Jeena Huneidi) 
[17:05:02] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:05:36] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:04] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:08:47] (03PS5) 10Dzahn: webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) [17:09:31] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Jrbranaa) Approved on my end. [17:09:32] (03CR) 10Krinkle: [C: 04-1] "We won't be using old_host in prod. That's just for beta to test the transition." [puppet] - 10https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: 10Dzahn) [17:10:36] (03CR) 10Krinkle: [C: 04-1] "once the packages are published and the new host is ready we can just flip the switch here, and then import the rest in the minutes/hours " [puppet] - 10https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: 10Dzahn) [17:10:36] 10Operations, 10netops: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 (10ayounsi) [17:14:58] !log set flex-flow-sizing to cr2-eqsin - T248394 [17:15:32] Flex Flow Sizing ENABLED?: PFE-0: Yes [17:15:37] cdanis: ^ [17:15:42] paravoid: ^ [17:15:54] XioNoX: wait, that wasn't enabled? how? 
[17:16:11] cdanis: paravoid found that they fixed flex flow on the MX204 [17:16:17] since we upgraded them all [17:16:18] ohhhhhh [17:16:20] oh oh oh [17:16:22] good [17:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:44] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [17:17:10] pushing it to eqsin now that it's depooled, then will send a CR to update them all tomorrow [17:17:15] \o/ [17:17:46] I think it's hitless to enable? [17:18:34] yep [17:18:47] but at least we can see overnight if it's working as expected [17:18:51] yeah, nice [17:19:32] and now we know about the `show services accounting flow inline-jflow` command as well [17:19:35] so, easier to check [17:19:44] yep [17:19:47] for posterity, https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1356072 [17:21:50] task updated [17:21:55] (for posterity) [17:21:56] :) [17:27:43] dpifke: hi, are you around? I merged your change but now i have some follow-up for it [17:28:11] !log repool eqsin - T255766 [17:29:57] Yeah. Sorry about not catching xhgui_old_host ahead of time. [17:30:37] For now, let's point both xhgui_host and xhgui_old_host at tungsten. I need to finish the data migration before we can flip the switch on the former. [17:30:37] dpifke: no worries. so the compiler tells me that my follow-up will not change tungsten and do "something" on webperf1001 [17:30:58] by something i mean "either it unbreaks the puppet change and does nothing else or it unbreaks and does more stuff", heh [17:31:10] dpifke: aha, ok. amending! 
[17:31:55] (03Restored) 10Dzahn: webperf: set xhgui_old_host parameter to tungsten, xhgui host to xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/608911 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [17:33:38] (03PS2) 10Dzahn: webperf: set xhgui_old_host parameter to tungsten [puppet] - 10https://gerrit.wikimedia.org/r/608911 (https://phabricator.wikimedia.org/T180761) [17:33:57] dpifke: ^ +1 ? [17:34:31] (03CR) 10Dave Pifke: [C: 03+1] "LGTM. This fixes webperf1001 for now. We'll flip the switch to xhgui1001 in the other patch once the data migration is ready." [puppet] - 10https://gerrit.wikimedia.org/r/608911 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [17:34:52] (03CR) 10Dzahn: "for now we will set both old and new host to tungsten https://gerrit.wikimedia.org/r/c/operations/puppet/+/608911" [puppet] - 10https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: 10Dzahn) [17:34:57] (03CR) 10Dzahn: [C: 03+2] webperf: set xhgui_old_host parameter to tungsten [puppet] - 10https://gerrit.wikimedia.org/r/608911 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [17:36:47] dpifke: puppet is fixed on webperf[12]001. It did change the content of sites-available/50-performance-wikimedia-org.conf [17:37:01] the proxypass to xhgui-old got added [17:37:30] so that seems expected and good. and we made good progress here. thanks for your work on removing mongo :) [17:37:34] Looks good from this end. [17:37:40] great, ack [17:37:49] 10Operations, 10ops-eqsin, 10netops, 10Patch-For-Review: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10ayounsi) Replacement went smooth! Last step is to update Netbox. [17:38:04] The other pending bit of this is going to be getting the .debs uploaded to our APT repository. [17:38:09] I'm assuming I don't have permissions to do that myself. [17:38:54] e.g. https://gerrit.wikimedia.org/r/c/performance/debs/xhgui/+/602203 [17:39:50] That's right. 
Could you add an update to the ticket which debs need to be uploaded and which component/distro it should be? [17:40:03] or even make a new one, up to you [17:40:16] Will do. [17:40:43] it will be more steps, first building on the build host and then uploading on the APT repo host [17:41:00] and then using reprepro to import them [17:41:40] Gotcha. We use git-build-package on the build host, right? [17:41:44] i see you already have a ticket to package it, we can also reuse that for uploading it, if you want [17:44:14] dpifke: no, last time it was pdebuild [17:45:16] it keeps changing, afaict [17:49:24] (03CR) 10Ayounsi: [C: 03+2] Replace cr1-eqsin with cr3-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/606419 (https://phabricator.wikimedia.org/T255766) (owner: 10Ayounsi) [17:50:44] OK, I'll make sure everything builds with that and then ping the ticket. [17:51:26] (03Abandoned) 10Dzahn: xhgui: let perf-team admins have access to xhgui DB (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/608456 (https://phabricator.wikimedia.org/T254795) (owner: 10Dzahn) [17:51:56] (03PS8) 10Ottomata: Add eventlogging_legacy job to Refine EventLogging events are migrated to EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [17:52:07] dpifke: perfect.
thx [17:53:09] (03CR) 10jerkins-bot: [V: 04-1] Add eventlogging_legacy job to Refine EventLogging events are migrated to EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [17:54:28] (03CR) 10Dzahn: [C: 03+2] systemd::sysuser: quote the gecos field to avoid errors [puppet] - 10https://gerrit.wikimedia.org/r/608489 (owner: 10Dzahn) [17:54:31] (03PS9) 10Ottomata: Add eventlogging_legacy Refine job for events migrated to EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [17:55:46] (03CR) 10jerkins-bot: [V: 04-1] Add eventlogging_legacy Refine job for events migrated to EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [17:56:16] (03CR) 10Dzahn: "changes /etc/sysusers.d/sysusers-base-config.conf like this everywhere:" [puppet] - 10https://gerrit.wikimedia.org/r/608489 (owner: 10Dzahn) [17:57:38] (03PS10) 10Ottomata: Add eventlogging_legacy Refine job for events migrated to EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [17:58:50] (03CR) 10jerkins-bot: [V: 04-1] Add eventlogging_legacy Refine job for events migrated to EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [17:59:18] (03PS11) 10Ottomata: Add eventlogging_legacy Refine job for events migrated to EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [18:00:04] hashar and twentyafterfour: Dear deployers, time to do the Train log triage with CPT deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning backport window(Max 6 patches) deploy. 
Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1800). [18:00:04] Pchelolo: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:02:04] (03CR) 10Ottomata: [C: 03+2] Add eventlogging_legacy Refine job for events migrated to EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [18:02:09] CindyCicaleseWMF: im gonna start with yours [18:02:31] (03CR) 10Ppchelko: [C: 03+2] Deploy MediaModeration on all production wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608753 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:03:22] (03Merged) 10jenkins-bot: Deploy MediaModeration on all production wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608753 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:06:36] (03PS1) 10Ottomata: Refine - KaiOSAppConsent exclusion was accidentally lost, re-add it [puppet] - 10https://gerrit.wikimedia.org/r/608918 [18:07:23] (03PS2) 10Ottomata: Refine - KaiOSAppConsent exclusion was accidentally lost, re-add it [puppet] - 10https://gerrit.wikimedia.org/r/608918 [18:07:28] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deploy MediaModeration on all production wikis gerrit:608753 (duration: 01m 07s) [18:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:39] (03PS4) 10Ppchelko: Enable kafka purges on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607590 (https://phabricator.wikimedia.org/T250781) [18:08:22] (03CR) 10Ppchelko: [C: 03+2] Enable kafka purges on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607590 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:08:40] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Refine - 
KaiOSAppConsent exclusion was accidentally lost, re-add it [puppet] - 10https://gerrit.wikimedia.org/r/608918 (owner: 10Ottomata) [18:09:52] (03PS9) 10Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) [18:10:36] (03Merged) 10jenkins-bot: Enable kafka purges on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607590 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:18:48] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable kafka purges on wikitech gerrit:607590 IS-labs.php (duration: 01m 03s) [18:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:38] !log joal@deploy1001 Started deploy [analytics/refinery@114bfed]: Regular analytics weekly train [analytics/refinery@114bfed] [18:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:55] jouncebot: now [18:24:55] For the next 0 hour(s) and 35 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1800) [18:24:56] For the next 0 hour(s) and 35 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1800) [18:25:07] * addshore checks what was happening in swat [18:25:19] I am on train with twenty.after.four as backup [18:25:19] !log joal@deploy1001 Finished deploy [analytics/refinery@114bfed]: Regular analytics weekly train [analytics/refinery@114bfed] (duration: 03m 41s) [18:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:38] anything bad going on ? [18:25:45] Sorry for the rollback folks, a quickfix had been forgotten - redeploying in minutes [18:26:44] * addshore would like to swat a thing or 2 once Pchelolo is done! 
:) [18:26:54] (03CR) 10Krinkle: [WIP] wgEventStreams - Allow for some default stream config settings (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:27:29] addshore: my thing doesn't break anything but it doesn't work.. so I'll stop for now and debug it. feel free to go [18:27:43] haha! okay :) I know that feeling! [18:27:44] !log joal@deploy1001 Started deploy [analytics/refinery@8b7bddf]: Regular analytics weekly train [analytics/refinery@8b7bddf] [18:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:16] (03PS23) 10Addshore: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [18:28:48] James_F: just wanted to express how much I <3 this diffConfig thing :) [18:29:17] (03CR) 10Addshore: [C: 03+2] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [18:30:01] (03Merged) 10jenkins-bot: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [18:31:58] addshore: Happy to hear it. [18:33:06] (03CR) 10Dzahn: [C: 03+1] "@hashar I am ready to do this but we will have to follow-up on contint* and releases*, i would say delete old user, run puppet, run find /" [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [18:34:25] James_F: as it is my first sync with these crazy yaml files being touched, any advice? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/569259 [18:34:46] or, i guess the yaml files ulteimatly are not actually used? only the generated db list? 
[18:35:53] !log joal@deploy1001 Finished deploy [analytics/refinery@8b7bddf]: Regular analytics weekly train [analytics/refinery@8b7bddf] (duration: 08m 09s) [18:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:09] !log joal@deploy1001 Started deploy [analytics/refinery@8b7bddf] (thin): Regular analytics weekly train THIN [analytics/refinery@8b7bddf] [18:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:28] !log joal@deploy1001 Finished deploy [analytics/refinery@8b7bddf] (thin): Regular analytics weekly train THIN [analytics/refinery@8b7bddf] (duration: 02m 19s) [18:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:39] !log addshore@deploy1001 sync-file aborted: T254315 Wikidata client wikis: Define entity sources configuration [[gerrit:569259]] (duration: 00m 38s) [18:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:43] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [18:41:49] noop thats not okay [18:42:03] (03PS1) 10Addshore: Revert "Wikidata client wikis: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608838 [18:42:08] (03CR) 10Addshore: [V: 03+2 C: 03+2] Revert "Wikidata client wikis: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608838 (owner: 10Addshore) [18:43:18] only made it to the canary hosts, syncing the revert to them now, saw log spam, so backtracking, will check logs after.... 
[18:43:40] !log addshore@deploy1001 Synchronized wmf-config: REVERT T254315 Wikidata client wikis: Define entity sources configuration [[gerrit:569259]] (duration: 01m 04s) [18:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:26] Also, fun time to figure out that the new gerrit UI won't let you revert a change until you write a reason into the commit msg window that pops up! [18:44:46] (03PS1) 10Addshore: Revert "Revert "Wikidata client wikis: Define entity sources configuration"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 [18:45:08] (03CR) 10Addshore: [V: 04-1 C: 04-2] "Not ready as the last time this was deployed we got errors (ticket TBA)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (owner: 10Addshore) [18:45:08] addshore: stuff blowing up on mediawiki-errors, all known? [18:45:13] undefined methods, undefined namespaces [18:45:29] Yup, that was that config change, should be all reverted now, only made it to canary hosts before i stopped it [18:45:46] going to look at the logs and write tickets now [18:45:47] (03CR) 10Ottomata: "Talked with Timon in IRC, we think it makes more sense to put this logic in EventStreamConfig. Will do there." (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:45:59] (03Abandoned) 10Ottomata: [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:47:59] thanks for checking Krinkle :) [18:48:56] Krinkle: is there a way around the phabricator pahtality require URI too long thign? 
:/ [18:49:37] (03PS1) 10Ahmon Dancy: Fixed paths to SETUP.* files in README [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608928 [18:49:40] (03PS1) 10Ahmon Dancy: Fixed a minor typo [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608929 [18:49:57] (03CR) 10Addshore: [C: 04-2] "blocked on what was the parent change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [18:55:23] (03PS1) 10QChris: gerrit: Drop old version [puppet] - 10https://gerrit.wikimedia.org/r/608931 [18:55:25] (03PS1) 10QChris: gerrit: Move new version's homedir to default place [puppet] - 10https://gerrit.wikimedia.org/r/608932 [18:55:27] (03PS1) 10QChris: gerrit: Drop removal of javamelody-deps jar [puppet] - 10https://gerrit.wikimedia.org/r/608933 [19:00:04] hashar and twentyafterfour: That opportune time is upon us again. Time for a Mediawiki train - European+American Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T1900). [19:01:15] (03PS2) 10Addshore: Revert "Revert "Wikidata client wikis: Define entity sources configuration"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) [19:01:24] (03CR) 10Addshore: [V: 04-1 C: 04-2] Revert "Revert "Wikidata client wikis: Define entity sources configuration"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [19:01:34] * twentyafterfour is here [19:02:51] twentyafterfour: I am still in the hangout ;) [19:05:22] addshore: should we hold the train until you figure out the wikidata stuff? [19:05:27] no, your all good! 
[19:05:34] tis all reverted and stable [19:05:35] cool Danke Schon [19:05:49] (03Abandoned) 10Ammarpad: Require editinterface to edit NS_CONFIG [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608212 (https://phabricator.wikimedia.org/T256278) (owner: 10Ammarpad) [19:07:46] (03CR) 10Addshore: [C: 04-2] "It looks like this change should have been in the queue before the client wikis change. and this would likely have avoided the issues, see" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [19:09:44] (03PS15) 10Addshore: Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T256906) (owner: 10WMDE-leszek) [19:13:02] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:15:47] (03PS3) 10Addshore: Revert "Revert "Wikidata client wikis: Define entity sources configuration"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) [19:16:12] (03PS1) 1020after4: group2 wikis to 1.35.0-wmf.39 refs T254176 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608937 [19:16:14] (03CR) 1020after4: [C: 03+2] group2 wikis to 1.35.0-wmf.39 refs T254176 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608937 (owner: 1020after4) [19:16:29] (03CR) 10Addshore: [C: 03+1] "This one is now good to go, and looking at the diff it is doing all the right things." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T256906) (owner: 10WMDE-leszek) [19:16:56] (03Merged) 10jenkins-bot: group2 wikis to 1.35.0-wmf.39 refs T254176 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608937 (owner: 1020after4) [19:18:22] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.35.0-wmf.39 refs T254176 [19:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:27] T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176 [19:18:59] (03CR) 10Addshore: [C: 03+1] "The diff of this change looks good now and commons remains untouched which should should eliminate the issues seen the first time this was" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [19:19:24] (03PS12) 10Addshore: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [19:20:40] (03PS13) 10Addshore: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [19:22:16] (03PS6) 10Dzahn: webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) [19:22:21] (03PS1) 10Andrew Bogott: Keystone: rename the 'password_whitelist' auth module to 'password_safelist' [puppet] - 10https://gerrit.wikimedia.org/r/608943 [19:22:44] (03CR) 10jerkins-bot: [V: 04-1] Keystone: rename the 'password_whitelist' auth module to 'password_safelist' [puppet] - 10https://gerrit.wikimedia.org/r/608943 (owner: 10Andrew Bogott) [19:22:56] (03PS1) 10Addshore: Wikibase: stop using wmgUseEntitySourceBasedFederation 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/608944 (https://phabricator.wikimedia.org/T241975) [19:23:37] !log 1.35.0-wmf.39 is now deployed to group2 wikis, everything appears to be normal. refs T254176 [19:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:42] T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176 [19:23:49] (03PS14) 10Addshore: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [19:24:52] (03CR) 10Addshore: [C: 03+1] Wikibase: stop using wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608944 (https://phabricator.wikimedia.org/T241975) (owner: 10Addshore) [19:25:01] (03PS2) 10Andrew Bogott: Keystone: rename the 'password_whitelist' auth module to 'password_safelist' [puppet] - 10https://gerrit.wikimedia.org/r/608943 [19:25:23] (03CR) 10Addshore: [C: 03+1] Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [19:26:14] train is all quiet! \o/ [19:26:22] thank you twentyafterfour for handling it [19:28:54] jouncebot: next [19:28:54] In 0 hour(s) and 31 minute(s): Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T2000) [19:29:05] (03PS1) 10Andrew Bogott: Keystone.conf: added codfw1dev-proxy-dns-manager to password safelist [puppet] - 10https://gerrit.wikimedia.org/r/608946 [19:29:10] * addshore will leave his config stuff until the morrow [19:30:08] (03CR) 10Dzahn: "We stopped using letsencrypt::cert::integrated in Gerrit. 
The only places still using it seem to be:" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [19:31:25] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: rename the 'password_whitelist' auth module to 'password_safelist' [puppet] - 10https://gerrit.wikimedia.org/r/608943 (owner: 10Andrew Bogott) [19:33:53] (03CR) 10Andrew Bogott: [C: 03+2] Keystone.conf: added codfw1dev-proxy-dns-manager to password safelist [puppet] - 10https://gerrit.wikimedia.org/r/608946 (owner: 10Andrew Bogott) [19:34:13] (03PS2) 10Andrew Bogott: Keystone.conf: added codfw1dev-proxy-dns-manager to password safelist [puppet] - 10https://gerrit.wikimedia.org/r/608946 [19:50:24] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [19:59:10] (03PS1) 10Dzahn: hiera: remove old "do_acme: false" that should do nothing nowadays [puppet] - 10https://gerrit.wikimedia.org/r/608950 [20:00:04] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T2000). [20:00:24] No ORES deployment today :) [20:05:36] (03PS1) 10Dzahn: dumps: remove do_acme parameter [puppet] - 10https://gerrit.wikimedia.org/r/608951 [20:08:23] (03CR) 10Dzahn: [C: 04-1] hiera: remove old "do_acme: false" that should do nothing nowadays (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608950 (owner: 10Dzahn) [20:14:21] (03CR) 10ArielGlenn: "Didn't it used to be that do_acme was only true on one server, the active one? 
So that would be the one that would rsync web logs off to s" [puppet] - 10https://gerrit.wikimedia.org/r/608951 (owner: 10Dzahn) [20:16:53] addshore: not that I know, if there is it probably won't beat 3x copy paste though [20:19:28] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash1024.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:19:50] (03PS4) 10Krinkle: Wikidata client wikis: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [20:20:27] ^ please ignore [20:20:37] (03CR) 10Krinkle: [C: 04-1] "To be applied to arclamp repo instead." [puppet] - 10https://gerrit.wikimedia.org/r/598292 (https://phabricator.wikimedia.org/T253679) (owner: 10Aaron Schulz) [20:21:11] mutante: could you help roll out https://gerrit.wikimedia.org/r/c/operations/puppet/+/607370 today? [20:21:13] dpifke: ^ [20:21:18] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:31:49] (03PS1) 10QChris: gerrit: Make implicit its templates explicit [puppet] - 10https://gerrit.wikimedia.org/r/608953 (https://phabricator.wikimedia.org/T256729) [20:31:51] (03PS1) 10QChris: gerrit: Format short gerrit URLs in phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/608954 (https://phabricator.wikimedia.org/T256729) [20:33:31] (03CR) 10Krinkle: [C: 04-1] "To be submitted to the new repo instead." 
[puppet] - 10https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920) (owner: 10Krinkle) [20:39:29] (03CR) 10EBernhardson: Configuration code for oauth proxy (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [20:39:56] (03PS1) 10Dzahn: hiera: delete yaml files for non-existing hosts [puppet] - 10https://gerrit.wikimedia.org/r/608955 [20:40:56] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:41:54] Krinkle: yes. does anything block it from being merged right now? [20:42:10] PROBLEM - snapshot of s1 in eqiad on db2093 is CRITICAL: snapshot for s1 at eqiad taken more than 3 days ago: Most recent backup 2020-06-28 20:31:54 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [20:42:13] mutante: nope, I'm here to verify [20:42:19] Now's good :) [20:42:56] ok, and you don't have to deploy it immediately afterwards via scap? [20:43:04] then i'll go ahead [20:43:27] mutante: I don't have a change currently that I want to deploy, no. [20:43:44] or do you mean that scap packages require doing a no-op deploy? [20:43:49] I don't remember if it does [20:43:52] I can do one for sure [20:44:11] no, i just mean that we don't have to push through follow-ups to make scap work right away [20:44:13] I've perf.wm.o open and the webperf1002 server in ssh to see the change and verify everythign is still fine [20:44:50] are you saying it needs follow-ups to make scap work? 
I'm not getting it :) [20:45:05] usually when stuff is added to scap it means several follow-ups and takes a while [20:45:18] i am cool merging this one to get closer to it [20:46:08] My expectation is that this puppet change will remove the files from puppet, remove them from the webperf..2 hosts, and automatically git-clone the repo and set up new systemd services and have a working cron job for arclamp svgs same as now. [20:46:16] And it also sets up the needed stuff on deploy1001 [20:46:20] i just didn't want to necessarily commit to the rest of that [20:46:21] does it not do that? [20:46:44] if we can't do a deploy right away that's fine, but I'd like to know what else is needed. [20:46:54] i wouldn't expect it to just work on deploy1001, no [20:47:12] for example new deployment keys and keyholder [20:47:41] and other stuff i forgot but usually "move X to scap" has a couple issues [20:49:07] (03PS10) 10Dzahn: arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [20:49:11] (03PS1) 10Mholloway: Mobileapps: Update to 2020-07-01-151702-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608957 [20:49:50] compiling it on deploy1001 [20:50:29] (03CR) 10Mholloway: [C: 03+2] Mobileapps: Update to 2020-07-01-151702-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608957 (owner: 10Mholloway) [20:50:41] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23621/deploy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [20:50:45] I'm vaguely aware that scap3/trebuchet involves ssh and stuff indeed, but I don't know how to make that work if it isn't done by puppet or scap automatically [20:51:01] we have almost a 100 deployed repos with scap3 though [20:51:07] is not not a clear cut documented process? 
[20:51:12] (03CR) 10QChris: [C: 04-1] "Since the parent changes got merged, but it seems some got undone (e.g.: I50b564c60c3b98ad4cd4fedbb9e9472fe95ea937)" [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [20:51:19] https://wikitech.wikimedia.org/wiki/Scap3/Migration_Guide#First_Deployment [20:51:30] I don't know if that's up to date, but that looks fairly complicated indeed [20:51:39] (03Merged) 10jenkins-bot: Mobileapps: Update to 2020-07-01-151702-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608957 (owner: 10Mholloway) [20:51:43] I don't recall doing any of that for navtiming though [20:51:48] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@d7476f5]: Update mobileapps to 953fc41a [20:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:10] I just recall that basically every time stuff is moved to scap it seems very complicated. [20:52:17] we will see in a minute [20:52:37] ok :) [20:52:50] so long as the service itself works, I'm fine figuring out the rest another day [20:53:31] yep, cool. and for me it is 'as long as keyholder and puppet isn't broken on deployment servers" [20:53:34] 20:53:06 deploy failed: Command ' [20:53:53] 20:53:06 Unhandled error: [20:54:19] scap/sync/2020-07-01/0001 e779f4ca3f3ff00ca8df6ecf122325bcc50bb57f [20:54:19] ' returned non-zero exit status 128 [20:55:57] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@d7476f5]: Update mobileapps to 953fc41a (duration: 04m 08s) [20:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:27] !log krinkle@deploy1001 Ran `scap deploy --init` for /srv/deployment/performance/arc-lamp [20:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:51] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[20:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:12] PROBLEM - snapshot of s1 in codfw on db2093 is CRITICAL: snapshot for s1 at codfw taken more than 3 days ago: Most recent backup 2020-06-28 20:36:34 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [20:58:34] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [20:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:24] mutante: from where is that error? [21:00:17] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [21:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:16] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [21:03:28] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10Dzahn) [21:04:11] Krinkle: deploy1001 puppet run [21:04:52] mutante: can you paste it with more context? I don't understand what failed [21:05:15] puppet will run in ~ 3min on webperf1002 [21:05:39] Krinkle: it's gone now [21:06:01] i could still paste it if you care.. but on the next run it did not happen [21:06:15] it's probably normal for the first run [21:06:37] yeah i'd like to know at least which command it was executing from puppet [21:06:43] e.g. 
'git clone' or something [21:07:41] Krinkle: https://phabricator.wikimedia.org/P11723 [21:08:44] ok, yeah, I guess that's normal [21:09:31] puppet is running now on schedule on webperf1002 [21:09:48] ok, great [21:13:40] PROBLEM - snapshot of s8 in eqiad on db2093 is CRITICAL: snapshot for s8 at eqiad taken more than 3 days ago: Most recent backup 2020-06-28 20:42:28 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [21:14:28] mutante: where are the crons stored for non-root? [21:14:34] Jul 1 21:09:48 webperf1002 puppet-agent[6351]: (/Stage[main]/Arclamp/Cron[arclamp_generate_svgs]/command) command changed '/usr/local/bin/arclamp-generate-svgs > /dev/null' to '/srv/deployment/performance/arc-lamp/arclamp-generate-svgs >/dev/null' [21:15:06] the user has no homedir, so not sure where to look, it's not in /etc/cron* [21:15:12] PROBLEM - snapshot of s8 in codfw on db2093 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2020-06-28 20:48:16 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [21:17:32] Krinkle: /var/spool/cron/crontabs/xenon [21:18:55] ack, thanks! [21:19:24] https://en.wikipedia.org/wiki/Spooling - never gets old [21:21:55] (03PS1) 10Dave Pifke: webperf: Serve different robots.txt on beta site [puppet] - 10https://gerrit.wikimedia.org/r/608962 (https://phabricator.wikimedia.org/T255092) [21:22:19] I've kicked off an svg run, waiting for it to finish for one of the files. [21:22:21] but looking good so far [21:22:49] I also see /srv/arclamp/logs/daily/2020-07-01.excimer.all.log is still being appended to regularly by the new log process [21:28:39] (03PS2) 10Dave Pifke: webperf: Serve different robots.txt on beta site [puppet] - 10https://gerrit.wikimedia.org/r/608962 (https://phabricator.wikimedia.org/T255092) [21:29:32] mutante: all good. 
[21:29:35] dpifke: :) [21:30:08] dpifke: also, I'm deleting the performance-beta manual web proxy now, in favour of the *.wikimedia/apache proper handling we have now. [21:30:29] your commit msg reminded me that we still have the GUI proxy that bypasses the traffic layer etc. [21:30:32] Krinkle: great:) [21:32:42] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [21:33:03] (03CR) 10Krinkle: [C: 03+1] "LGTM. though the beta instance is mainly at https://performance.wikimedia.beta.wmflabs.org now. I've deleted the old pre-puppet GUI proxy " [puppet] - 10https://gerrit.wikimedia.org/r/608962 (https://phabricator.wikimedia.org/T255092) (owner: 10Dave Pifke) [21:37:22] Krinkle: Was the scap --init command needed to get it to deploy? Or did we just need to wait for puppet run on both deploy1001 and webperf1002? [21:37:43] Because if the former, we should fix the deploy puppet script to run that when a new repo is added. [21:37:49] https://wikitech.wikimedia.org/wiki/Scap3/Migration_Guide#First_Deployment [21:37:59] dpifke: it's complicated :) [21:38:15] what I know is that puppet tries to run that command already but fails, it succeeds when I run it manually however [21:38:54] I suspect that that is a coincidence however [21:39:01] it fails because the destination doesn't have it set up yet [21:39:07] OK. Because I believe there is code in puppet to avoid the manual step. [21:39:10] so the first roll out half-works on deploy1001 and works on the destination [21:39:26] and then afterwards it works the second try, which is usually done by an impatient human [21:39:53] Right. 
If puppet agent ran on webperf1002 before it ran on deploy1001, it would have failed. [21:40:14] I guess we should have submitted it as two separate changes. [21:41:18] I'm not entirely sure. I think there's a cycle here. You're right that 'scap pull' would not work on webperf unless it runs on deploy first. [21:41:35] but afaik the puppet script isn't doing a scap pull but a git clone, so the destination part should be fine to run ahead of deploy [21:41:51] the issue might be the other way around, where the scap init step expects the destination to have the directory and user set up [21:42:07] which the puppet resource scap::target doesn't allow separating [21:42:10] I *think* that adding the repo to hieradata should do the work of setting it up on the deployment host. [21:42:15] it's one resource that ensures both parts depending on the host it runs on [21:42:23] Ah yeah, that might work [21:42:49] might be worth filing a task for releng to confirm/document accordingly if that's the best practice [21:43:31] https://performance.wikimedia.beta.wmflabs.org/robots.txt?boo [21:43:34] https://performance.wikimedia.org/robots.txt [21:43:35] I had to do some hunting to figure out that's where it was looking. :) [21:43:36] both live :) [21:44:07] want to cherry pick the other robots change in beta? [21:44:31] Yeah. I need to un-cherry-pick the XHGui patch as well, now that it's merged. Will do so now. 
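Note on the ordering discussion above: the guess that the init step on the deployment host only succeeds once the target has its user and directory can be illustrated with a toy model. This is a rough sketch only. The step names, the prerequisite facts, and the function `run_until_converged` are hypothetical and do not describe what puppet's scap::target or scap really do; the sketch just shows why a failing first run followed by a passing second run is consistent with that guess.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical model: each step maps to (facts it needs, fact it provides).
# These prerequisites encode the guess made in the chat, not verified behaviour.
STEPS: Dict[str, Tuple[Set[str], str]] = {
    "deploy1001: scap init": ({"target_ready"}, "deploy_initialised"),
    "webperf1002: user, directory, git clone": (set(), "target_ready"),
}

DEPLOY_FIRST = ["deploy1001: scap init", "webperf1002: user, directory, git clone"]
TARGET_FIRST = list(reversed(DEPLOY_FIRST))


def run_until_converged(order: List[str], max_rounds: int = 5) -> List[List[str]]:
    """Apply the steps in the given order, round after round.

    Returns, for each round, the list of steps that failed because their
    prerequisites were not yet met. Stops at the first round with no failures.
    """
    done: Set[str] = set()
    history: List[List[str]] = []
    for _ in range(max_rounds):
        failed: List[str] = []
        for name in order:
            needs, provides = STEPS[name]
            if provides in done:
                continue  # already applied in an earlier round
            if needs <= done:
                done.add(provides)
            else:
                failed.append(name)
        history.append(failed)
        if not failed:
            break
    return history


if __name__ == "__main__":
    print(run_until_converged(DEPLOY_FIRST))
    print(run_until_converged(TARGET_FIRST))
```

Under these assumptions, running the deployment host first needs two rounds, while running the target first converges in one, which is one way to read the "works on the second try" observation.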
[21:55:50] (03Abandoned) 10Dzahn: dumps: remove do_acme parameter [puppet] - 10https://gerrit.wikimedia.org/r/608951 (owner: 10Dzahn) [21:57:03] (03Abandoned) 10Dzahn: hiera: remove old "do_acme: false" that should do nothing nowadays [puppet] - 10https://gerrit.wikimedia.org/r/608950 (owner: 10Dzahn) [21:57:16] https://performance.wikimedia.beta.wmflabs.org/robots.txt?foo [22:02:41] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena) [22:02:49] (03PS3) 10Dave Pifke: webperf: Serve different robots.txt on beta site [puppet] - 10https://gerrit.wikimedia.org/r/608962 (https://phabricator.wikimedia.org/T255092) [22:05:42] (03PS1) 10Krinkle: Revert "Title: fix subpage split for degenerate cases" [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608844 [22:06:45] (03CR) 10Krinkle: [C: 03+1] "LGTM. Ready to land." [puppet] - 10https://gerrit.wikimedia.org/r/608962 (https://phabricator.wikimedia.org/T255092) (owner: 10Dave Pifke) [22:07:02] brennen: ok to roll out the above revert? ^ [22:08:41] parser caches are populated with potentially bad content as we speak [22:09:22] * Krinkle also prepares wmf config patch to invalidate pc from group2 for past few hours [22:09:24] Krinkle: sorry, no context - i haven't been involved in deploys this week... uh, i think probably ok? we're a touch past that 3pm cutoff but sounds like an emergency fix sorta situation [22:09:45] twentyafterfour is train backup this week, i think. 
[22:10:01] ah ok, yeah, it's a parser bug [22:11:22] Krinkle: I pushed a fix to master [22:11:41] So hopefully no need to revert [22:12:36] Daimona: I'll revert, then tomorrow Daniel can test this on beta and look for other cases perhaps [22:12:37] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena) We've published a cassandra image to docker-registry.wikimedia.org/releng/cassandra... [22:12:45] I don't want to assume this is the only case we missed right now [22:12:59] it's end of day for them already [22:13:06] Sure [22:13:37] (03CR) 10Krinkle: [C: 03+2] Revert "Title: fix subpage split for degenerate cases" [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608844 (owner: 10Krinkle) [22:14:56] simple repro at https://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&oldid=965535645 [22:15:09] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena) [22:15:31] Quality repro page ™ [22:16:32] (I'm referring to the content above) [22:17:30] My apologies [22:17:33] (03PS1) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 [22:17:35] I did not see the text that was there [22:18:22] (03CR) 10jerkins-bot: [V: 04-1] Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (owner: 10Ryan Kemper) [22:18:24] (03CR) 10Andrew Bogott: [C: 03+1] hiera: delete yaml files for non-existing hosts [puppet] - 10https://gerrit.wikimedia.org/r/608955 (owner: 10Dzahn) [22:18:26] (03CR) 10Dzahn: "> I checked the instance "relic-stretch" in the project toolserver-legacy which uses letsencrypt::cert::integrated and to my surprise 
it d" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [22:18:33] (03CR) 10Dzahn: [C: 03+2] gerrit: Drop removal of javamelody-deps jar [puppet] - 10https://gerrit.wikimedia.org/r/608933 (owner: 10QChris) [22:19:53] Ahah I removed that :) [22:20:51] I was just like "wtf is this? how has it got anything to do with the bug?" then read the lines below [22:29:59] (03PS1) 10Legoktm: toolforge: Update comment reflecting source of tesseract packages [puppet] - 10https://gerrit.wikimedia.org/r/608966 (https://phabricator.wikimedia.org/T256881) [22:33:44] (03Merged) 10jenkins-bot: Revert "Title: fix subpage split for degenerate cases" [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608844 (owner: 10Krinkle) [22:34:13] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/23624/" [puppet] - 10https://gerrit.wikimedia.org/r/608955 (owner: 10Dzahn) [22:35:04] 10Operations: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10Legoktm) From a [[https://codesearch.wmflabs.org/operations/?q=stretch-backports&i=nope&files=&repos=|puppet codesearch]], I see: * librdkafka1 - eventlogging * 'librados2', 'librgw2', 'librbd1',... [22:36:16] (03CR) 10EBernhardson: Scale largest shards to be closer to 30GB (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (owner: 10Ryan Kemper) [22:37:33] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.39/includes/Title.php: I8d5bad9c654c4ab (duration: 01m 00s) [22:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:23] (03PS1) 10Krinkle: Use RejectParserCacheValue to reject parser output from 19:10–22:40 UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608971 (https://phabricator.wikimedia.org/T256922) [22:42:31] Daimona: can you double chedk? 
[22:42:51] Sure [22:43:04] (the config patch) [22:43:19] Oh [22:43:24] I checked the sandbox and it looks perfect [22:43:33] Aside from the penis thing that appeared again lol [22:44:41] Why in the wikidata section? [22:45:50] (03PS2) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 [22:46:35] (03CR) 10jerkins-bot: [V: 04-1] Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (owner: 10Ryan Kemper) [22:48:11] Krinkle: i was cleaning up planet feed database and just noticed this is 404: https://codepen.io/Krinkle/post/feed is there a replacement? [22:49:30] Krinkle: replacing with https://blog.codepen.io/feed/ :) [22:49:52] eh.. I don't write on that blog [22:49:54] https://codepen.io/Krinkle/posts/ [22:49:57] that's the codepen blog [22:50:01] I guess they removed support for RSS [22:50:06] anyway, I don't post anything new there [22:50:18] so if my timotijhof.net feed is there, this one can be removed [22:50:37] ok, ACK, doing that! [22:50:55] jouncebot: now [22:50:55] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [22:51:00] jouncebot: next [22:51:00] In 0 hour(s) and 8 minute(s): Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T2300) [22:52:46] Daimona: eh, I guess that's an old conditional. all repos I care about are clients though [22:53:03] in fact, is there any public/SUL wikii that isn't a client? even commons/wikidata are self-clients, right? 
[22:53:09] (03PS2) 10Krinkle: Use RejectParserCacheValue to reject parser output from 19:10–22:40 UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608971 (https://phabricator.wikimedia.org/T256922) [22:54:58] Gotcha [22:54:58] 851/946 wikis [22:55:02] Unsure about that [22:55:02] anyway, removed [22:56:21] updating your feed from https://timotijhof.net/category/tools/feed/ to https://timotijhof.net/feed.xml [22:57:42] (03CR) 10Bstorm: [C: 03+2] dumps-distribution: don't monitor systemd directly for paging [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening backport window(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200701T2300). [23:00:05] VulpesVulpes825: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
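Note on the parser cache patch above: the idea of rejecting cached output rendered during the 19:10–22:40 UTC window can be sketched as a simple timestamp check. This is an illustration in Python, not the actual mediawiki-config change (which is PHP and uses the RejectParserCacheValue hook). The names `BAD_START`, `BAD_END`, and `should_reject` are hypothetical, and the date 2020-07-01 is assumed from the date of this log.

```python
from datetime import datetime, timezone

# Assumed window: the patch title gives the times; the date is taken from the log.
BAD_START = datetime(2020, 7, 1, 19, 10, tzinfo=timezone.utc)
BAD_END = datetime(2020, 7, 1, 22, 40, tzinfo=timezone.utc)


def should_reject(cached_at: datetime) -> bool:
    """Return True if a cache entry was rendered inside the suspect window.

    `cached_at` must be timezone-aware; comparing a naive datetime with the
    aware bounds above raises TypeError.
    """
    return BAD_START <= cached_at <= BAD_END


if __name__ == "__main__":
    sample = datetime(2020, 7, 1, 20, 0, tzinfo=timezone.utc)
    print(should_reject(sample))
```

Entries rendered before or after the window stay valid, which matches the later observation in the log that older and more recent pages still produce cache hits.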
[23:00:38] (03CR) 10Addshore: [C: 03+1] Use RejectParserCacheValue to reject parser output from 19:10–22:40 UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608971 (https://phabricator.wikimedia.org/T256922) (owner: 10Krinkle) [23:00:44] !log set a short downtime on labstore1006/7 to prevent alert while disabling direct systemd monitoring [23:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:52] (03CR) 10Krinkle: [C: 03+2] Use RejectParserCacheValue to reject parser output from 19:10–22:40 UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608971 (https://phabricator.wikimedia.org/T256922) (owner: 10Krinkle) [23:02:24] I can do the SWAT [23:02:31] But VulpesVulpes825 isn't here [23:02:52] (03Merged) 10jenkins-bot: Use RejectParserCacheValue to reject parser output from 19:10–22:40 UTC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608971 (https://phabricator.wikimedia.org/T256922) (owner: 10Krinkle) [23:03:09] RoanKattouw: hold on [23:03:16] OK, waiting [23:03:25] The requester isn't here anyway [23:03:28] :) [23:04:33] (03PS1) 10Dave Pifke: [WIP] webperf: Enable prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/608973 (https://phabricator.wikimedia.org/T215740) [23:04:40] * addshore can make use of some of the time if there is indeed free swat time :) (can also do his own deploys) [23:05:28] (03CR) 10jerkins-bot: [V: 04-1] [WIP] webperf: Enable prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/608973 (https://phabricator.wikimedia.org/T215740) (owner: 10Dave Pifke) [23:05:40] Sorry, being a little bit late. 
I am here now waiting for patch deployment [23:06:18] RoanKattouw: ^ [23:07:29] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [23:08:16] OK [23:08:23] Krinkle: Ping me when you're done? [23:08:35] yeah, still testing it to be sure [23:09:19] (03PS2) 10Dave Pifke: [WIP] webperf: Enable prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/608973 (https://phabricator.wikimedia.org/T215740) [23:10:29] I'm not getting any parser cache hits on mwdebug1001 [23:10:37] oh, right, I'm en-gb [23:10:41] :P [23:10:43] ok, nvm, mystery solved [23:10:58] classic british problems [23:11:09] * Krinkle brexits [23:12:42] (03PS3) 10Ryan Kemper: Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) [23:13:24] (03CR) 10jerkins-bot: [V: 04-1] Scale largest shards to be closer to 30GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [23:14:15] confirmed I get a miss for stuff with the matching dates in wgPageParseReport.cachereport when using mwdebug1001 [23:14:17] and miss on 1002 [23:14:24] and still hit for older and more recent stuff [23:14:38] I've pulled it down to a few app servers first to spread out the load a tiny bit [23:14:48] and to jobrunners and mwmaint [23:15:01] giving it 2 minutes and then doing the rest [23:18:08] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: Ibb42db7fd1ee (duration: 00m 55s) [23:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:30] RoanKattouw: ddone [23:19:07] 
Alright, thanks [23:19:16] VulpesVulpes825: I'll start with your patches now, sorry for the delay [23:19:34] RoanKattouw: No worries, and sorry for late arrival [23:20:09] (03CR) 10Catrope: [C: 03+2] Change the Simplified Chinese logo for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608853 (https://phabricator.wikimedia.org/T256839) (owner: 10VulpesVulpes825) [23:20:56] (03Merged) 10jenkins-bot: Change the Simplified Chinese logo for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608853 (https://phabricator.wikimedia.org/T256839) (owner: 10VulpesVulpes825) [23:28:40] addshore: https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-1h&to=now [23:28:42] fyi :) [23:28:53] * addshore was watching ;) [23:30:50] (03PS1) 10Dzahn: planet: remove broken feeds, update feed URLs [puppet] - 10https://gerrit.wikimedia.org/r/608974 (https://phabricator.wikimedia.org/T168459) [23:32:30] !log catrope@deploy1001 Synchronized static/images/project-logos/: Change Simplified Chinese logo for zhwiki (T256839) (duration: 00m 55s) [23:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:36] T256839: Change the Simplified Chinese logo of Chinese Wikipedia - https://phabricator.wikimedia.org/T256839 [23:34:02] (03PS2) 10Catrope: Set $wgForceUIMsgAsContentMsg for Chinese Wikiquote, Wiktionary, Wikinews, Wikisource, Wikiversity and Wikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608124 (https://phabricator.wikimedia.org/T256521) (owner: 10Hamish) [23:34:24] (03CR) 10Catrope: [C: 03+2] Set $wgForceUIMsgAsContentMsg for Chinese Wikiquote, Wiktionary, Wikinews, Wikisource, Wikiversity and Wikibooks. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/608124 (https://phabricator.wikimedia.org/T256521) (owner: 10Hamish) [23:34:30] * addshore will do his config patches tomorrow [23:35:09] (03Merged) 10jenkins-bot: Set $wgForceUIMsgAsContentMsg for Chinese Wikiquote, Wiktionary, Wikinews, Wikisource, Wikiversity and Wikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608124 (https://phabricator.wikimedia.org/T256521) (owner: 10Hamish) [23:41:05] RoanKattouw: Thank you for your help. [23:41:42] VulpesVulpes825: Not done yet [23:41:48] VulpesVulpes825: I still need you to test that last patch sorry [23:42:08] RoanKattouw: Oh, Okay, I will test it really quick. [23:42:14] It's on mwdebug1002 now [23:42:21] (the logo went straight to production) [23:46:50] RoanKattouw: I need 3 more min for testing, sorry for the delay. [23:49:57] PROBLEM - Long running screen/tmux on kubernetes1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 32232, 1738540s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [23:50:38] RoanKattouw: Patch works on Chinese Wikiquote, Wiktionary, Wikinews, Wikisource, Wikiversity and Wikibooks. LGTM [23:51:39] Great, deploying [23:53:48] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set $wgForceUIAsContentMsg for zhwikibooks, zhwikinews, zhwikiquote, zhwikisource, zhwikiversity, zhwiktionary (T256521) (duration: 00m 55s) [23:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:53] T256521: No Chinese variant translation displayed in for Chinese site name at Chinese Wikiquote, Wiktionary, Wikinews, Wikisource, Wikiversity and Wikibooks - https://phabricator.wikimedia.org/T256521 [23:58:05] <VulpesVulpes825> RoanKattouw: thank you for your help to deploy the two patches. [23:58:21] <RoanKattouw> No problem! Thank you for your patience