[00:00:59] !log restarting wikibugs [00:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:00] !log wb2-grrrri was not running and wikibugs had no more Gerrit updates since a while [00:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:32] (03CR) 10Dzahn: "lalalala" [dns] - 10https://gerrit.wikimedia.org/r/623468 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [00:02:55] there we go, gerrit notifications were gone for a while and back now [00:03:25] (03CR) 10Dzahn: [C: 03+2] remove releases2001 [dns] - 10https://gerrit.wikimedia.org/r/623468 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [00:09:50] (03CR) 10Dzahn: [C: 03+2] devtools: unbreak deployment_server by mcrouter use_onhost_memcache: false [puppet] - 10https://gerrit.wikimedia.org/r/623446 (owner: 10Dzahn) [00:10:14] (03CR) 10Dzahn: [C: 03+2] "The last Puppet run was at Tue Jul 21 10:46:38 UTC 2020 (59840 minutes ago). 😞" [puppet] - 10https://gerrit.wikimedia.org/r/623446 (owner: 10Dzahn) [00:10:28] PROBLEM - dump of es4 in codfw on icinga1001 is CRITICAL: Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2020-09-01 00:00:01 is 1 GB, but previous one was 611 GB, a change of 99.8% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:13:02] PROBLEM - dump of es5 in codfw on icinga1001 is CRITICAL: Last dump for es5 at codfw (es2025.codfw.wmnet) taken on 2020-09-01 00:00:01 is 1 GB, but previous one was 589 GB, a change of 99.7% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:30:37] (03PS2) 10Dzahn: add dns-disc for releases servers [dns] - 10https://gerrit.wikimedia.org/r/623465 [00:42:04] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 107610664 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:43:56] RECOVERY - Postgres Replication Lag on maps2003 is OK:
POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 37464 and 79 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:28] (03Abandoned) 10Dzahn: releases: switch the active server from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/623081 (owner: 10Dzahn) [00:48:09] (03Abandoned) 10Dzahn: releases: switch backend from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/623082 (owner: 10Dzahn) [01:17:18] !log updated the pynetbox package to 5.0.7 and uploaded to buster [01:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:30] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): wikimedia.pl returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T261506 (10AntiCompositeNumber) [01:28:34] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10AntiCompositeNumber) [01:30:30] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10AntiCompositeNumber) [01:30:34] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10AntiCompositeNumber) [02:07:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.7 [core] (wmf/1.36.0-wmf.7) - 10https://gerrit.wikimedia.org/r/623473 [02:10:18] ^ should that be abandoned? 
I thought there was no deployment this week [02:27:10] (03PS1) 10KartikMistry: Update cxserver to 2020-08-30-011854-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439) [05:04:42] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:17:11] (03PS1) 10Marostegui: Revert "dbproxy1021,1017: Test db1128 as m5 master" [puppet] - 10https://gerrit.wikimedia.org/r/623374 [05:18:06] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1021,1017: Test db1128 as m5 master" [puppet] - 10https://gerrit.wikimedia.org/r/623374 (owner: 10Marostegui) [05:22:04] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) [05:22:11] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Marostegui) [05:30:30] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) @Papaul any chances we can place es2034 into A4 or A8 instead of A6?
[06:10:18] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:10:36] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:17:33] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10Urbanecm) >>! In T251780#6425547, @DStrine wrote: > This is done from my perspective. I'... [06:20:50] !log Install query killers on db2137:3314 T243373 [06:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:58] T243373: Enable DB replication codfw -> eqiad before the switchover and some other checks - https://phabricator.wikimedia.org/T243373 [06:22:18] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-privatedata-users for cparle - https://phabricator.wikimedia.org/T260450 (10elukey) @Cparle I suggest to read https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide for Kerberos, especially https://wiki... [06:23:02] the router link down (ulsfo - eqord) is due to Telia maintenance [06:24:40] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10elukey) @Jclark-ctr I'd need to schedule the maintenance in advance to let people know that we are rebooting (a lot of users use these hosts)... 
[06:32:04] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) @Jclark-ctr they will not but we can do one host at the time anyway when you have time! [06:32:21] (03CR) 10Jcrespo: wikireplicas: create multiinstance roles and profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [06:36:06] (03CR) 10Jcrespo: wikireplicas: create multiinstance roles and profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [06:43:25] (03CR) 10Jcrespo: wikireplicas: create multiinstance roles and profiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [06:44:15] !log reimage kafka-jumbo1002 to Buster [06:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:40] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [06:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Deploy window No deploys all day! See meta:Tech/Server_switch_2020. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200901T0700) [07:00:19] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 103 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [07:03:02] this is expected -^ [07:05:36] !log restarting jenkins on releases1002 to pick up Java security updates [07:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:36] !log installing 4.9.228 kernel on stretch systems (only installing the deb, reboots separately) [07:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:29] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [07:21:48] RECOVERY - dump of es4 in codfw on icinga1001 is OK: Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2020-08-25 00:00:01 (611 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [07:24:24] RECOVERY - dump of es5 in codfw on icinga1001 is OK: Last dump for es5 at codfw (es2025.codfw.wmnet) taken on 2020-08-25 00:00:01 (589 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [07:24:47] (03PS1) 10Kormat: dbtools: Add script to help check heartbeat health [software] - 10https://gerrit.wikimedia.org/r/623521 [07:27:19] (03CR) 10Marostegui: [C: 03+1] "Let's merge as is, this is useful already" [software] - 10https://gerrit.wikimedia.org/r/623521 (owner: 10Kormat) [07:27:33] (03CR) 10Kormat: [C: 03+2] dbtools: Add script to help check heartbeat health [software] - 
10https://gerrit.wikimedia.org/r/623521 (owner: 10Kormat) [07:33:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:34:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:39:03] (03CR) 10JMeybohm: [C: 04-1] "LGTM, apart from the note on `state`." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623464 (owner: 10Dzahn) [07:43:12] (03CR) 10JMeybohm: [C: 04-1] "You will also need to add discovery stanzas to etcd: https://wikitech.wikimedia.org/wiki/LVS#etcd_data_for_DNS_Discovery" [puppet] - 10https://gerrit.wikimedia.org/r/623464 (owner: 10Dzahn) [07:46:49] (03PS1) 10Muehlenhoff: Remove obsolete chromium-admin group [puppet] - 10https://gerrit.wikimedia.org/r/623523 [07:47:12] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: import reprepro 'updates' public keys [puppet] - 10https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [07:47:21] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: add note re: multiple keys for 'updates' [puppet] - 10https://gerrit.wikimedia.org/r/623359 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [07:47:26] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: import current reprepro 'updates' keys [puppet] - 10https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [07:55:58] (03PS14) 10Vgutierrez: Release 6.0.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621694 (https://phabricator.wikimedia.org/T260702) [07:56:00] (03PS1) 10Vgutierrez: Drop 0003-vsm-perms.patch 
[debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/623524 (https://phabricator.wikimedia.org/T260702) [07:57:15] (03CR) 10jerkins-bot: [V: 04-1] Drop 0003-vsm-perms.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/623524 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:01:24] (03PS1) 10Jcrespo: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623525 (https://phabricator.wikimedia.org/T138562) [08:02:22] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623525 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:13:04] (03PS1) 10Kormat: Fix import order for test_WMFMariaDB.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623526 [08:13:41] (03PS2) 10Jcrespo: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623525 (https://phabricator.wikimedia.org/T138562) [08:14:01] (03CR) 10jerkins-bot: [V: 04-1] Fix import order for test_WMFMariaDB.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623526 (owner: 10Kormat) [08:14:55] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623525 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:15:12] (03CR) 10jerkins-bot: [V: 04-1] Release 6.0.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621694 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [08:15:53] (03PS2) 10Kormat: Fix import order for test_WMFMariaDB.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623526 [08:16:56] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:56] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:17:25] (03PS3) 10Kormat: Use fixed versions for black and isort. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623526 [08:18:31] (03CR) 10Jcrespo: [C: 03+1] Use fixed versions for black and isort. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623526 (owner: 10Kormat) [08:18:48] (03CR) 10Kormat: [C: 03+2] Use fixed versions for black and isort. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623526 (owner: 10Kormat) [08:19:45] (03Merged) 10jenkins-bot: Use fixed versions for black and isort. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623526 (owner: 10Kormat) [08:22:32] (03PS3) 10Jcrespo: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623525 (https://phabricator.wikimedia.org/T138562) [08:24:34] 10Operations, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review, 10User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Volans) [follow up from a chat in #wikimedia-databases] Another aspect that should be clarified is how to manage changes introduced by the c... 
[08:32:13] (03PS7) 10Kormat: mariadb: Create profile::mariadb::packages_wmf [puppet] - 10https://gerrit.wikimedia.org/r/622972 (https://phabricator.wikimedia.org/T256972) [08:32:15] (03PS4) 10Kormat: mariadb: Allow overriding of wmf-mariadb version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/622995 (https://phabricator.wikimedia.org/T256972) [08:36:48] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10ema) Varnish 6.0.x fails to start in labs with our current setup: ` Sep 01 07:42:53 traffic-cache-atstext-buster varnish-frontend[7196]: rm: cannot remove '_.vsm_mgt/_.Arg.8c... [08:38:34] (03PS1) 10Jcrespo: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623530 (https://phabricator.wikimedia.org/T138562) [08:39:28] (03PS1) 10Ema: varnish: give CAP_DAC_OVERRIDE back to root [puppet] - 10https://gerrit.wikimedia.org/r/623531 (https://phabricator.wikimedia.org/T261632) [08:39:35] (03PS1) 10Jcrespo: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623532 (https://phabricator.wikimedia.org/T138562) [08:40:01] (03Abandoned) 10Jcrespo: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623530 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:40:10] (03Abandoned) 10Jcrespo: mariadb-backups: Fix missing check on no ERROR msgs on logs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623525 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:42:36] (03PS1) 10Ema: varnish: do not explicitly install libvarnishapi1 [puppet] - 10https://gerrit.wikimedia.org/r/623533 (https://phabricator.wikimedia.org/T261487) [08:44:56] 10Operations, 10Traffic, 10conftool, 10serviceops, 10Patch-For-Review: confd's watch functionality appears to be partially broken 
when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10Joe) etcd-mirror is ok, it's just that writes need to be serialized and etcd2 is orders of magnit... [08:47:10] (03PS1) 10Giuseppe Lavagetto: confd: only read from the master during the switchover [puppet] - 10https://gerrit.wikimedia.org/r/623535 (https://phabricator.wikimedia.org/T260889) [08:48:54] (03CR) 10Hashar: [C: 03+1] "Ah I thought the charts directory has not been moved when it just got relocated somewhere else. So yeah I guess we will adjust whatever sy" [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [08:49:43] 10Operations, 10serviceops, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10hashar) @Dzahn from the Gerrit change, I -1 ed it cause we had lost https://releases.wikimedia.org/charts/ , turns out that has been moved somewhere else. So we can decom releas... [08:51:50] !log uploaded apache 2.4.10-10+deb8u16+wmf1 for jessie-wikimedia [08:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:55] (03PS1) 10Jcrespo: mariadb-backups: Update backup logic to check errors on log [puppet] - 10https://gerrit.wikimedia.org/r/623538 (https://phabricator.wikimedia.org/T138562) [08:52:27] (03CR) 10Volans: [C: 03+1] "Looking also at the SRV records in the dns repo LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623535 (https://phabricator.wikimedia.org/T260889) (owner: 10Giuseppe Lavagetto) [08:54:07] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Update backup logic to check errors on log [puppet] - 10https://gerrit.wikimedia.org/r/623538 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:00:08] (03CR) 10Alexandros Kosiaris: [C: 03+1] confd: only read from the master during the switchover [puppet] - 10https://gerrit.wikimedia.org/r/623535 (https://phabricator.wikimedia.org/T260889) (owner: 10Giuseppe Lavagetto) 
[09:01:04] !log installing Java 8 sec updates on contint* [09:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:44] (03PS1) 10Filippo Giunchedi: aptrepo: let puppet build/manage the GPG public keyring [puppet] - 10https://gerrit.wikimedia.org/r/623540 (https://phabricator.wikimedia.org/T260883) [09:05:33] (03PS1) 10Effie Mouzeli: Add entries for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) [09:05:57] (03CR) 10jerkins-bot: [V: 04-1] Add entries for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [09:06:54] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:07:22] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi) With a fixed `blocksize=64k` for both sequential reads and writes: ` write_seq: (groupid=1, jobs=24): err= 0: pid=227769: Tue Sep 1 08:40:17 2020 write:... [09:09:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623540 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [09:09:42] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: let puppet build/manage the GPG public keyring [puppet] - 10https://gerrit.wikimedia.org/r/623540 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [09:12:40] (03PS1) 10Filippo Giunchedi: hieradata: fix undefined for aptrepo gpg_pubring [puppet] - 10https://gerrit.wikimedia.org/r/623542 [09:15:05] (03PS2) 10Filippo Giunchedi: hieradata: fix undefined for aptrepo gpg_pubring [puppet] - 10https://gerrit.wikimedia.org/r/623542 [09:18:46] PROBLEM - Check systemd state on an-tool1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:32] (03Abandoned) 10Jforrester: Branch commit for wmf/1.36.0-wmf.7 [core] (wmf/1.36.0-wmf.7) - 10https://gerrit.wikimedia.org/r/623473 (owner: 10TrainBranchBot) [09:24:04] (03PS1) 10Effie Mouzeli: Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) [09:24:35] (03CR) 10jerkins-bot: [V: 04-1] Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [09:25:20] (03PS3) 10Filippo Giunchedi: hieradata: fix undefined for aptrepo gpg_pubring [puppet] - 10https://gerrit.wikimedia.org/r/623542 [09:25:43] (03CR) 10Kormat: wikireplicas: create multiinstance roles and profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [09:27:13] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix undefined for aptrepo gpg_pubring [puppet] - 10https://gerrit.wikimedia.org/r/623542 (owner: 10Filippo Giunchedi) [09:27:21] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623542 (owner: 10Filippo Giunchedi) [09:27:30] (03PS1) 10Volans: sre.ganeti.makevm: adapt to Netbox DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) [09:29:39] (03CR) 10Volans: [C: 04-1] "This can be merged only after the second cutoff date: https://wikitech.wikimedia.org/wiki/DNS/Netbox#Cutoff_dates" [cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [09:32:25] (03CR) 10Jbond: [C: 04-1] "one issue with defaults to the lookup function otherwise looks good" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621771 (owner: 10Dzahn) [09:33:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] confd: only read from the master during the
switchover [puppet] - 10https://gerrit.wikimedia.org/r/623535 (https://phabricator.wikimedia.org/T260889) (owner: 10Giuseppe Lavagetto) [09:38:28] !log systemctl restart docker-reporter-releng-images.service on deneb to clear out alert because of temporary HTTP 504 from debmonitor [09:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:06] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:03] !log reserve cr2-eqiad:xe-3/3/7 for new Telia port [09:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:49] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: support contactgroups stubs [puppet] - 10https://gerrit.wikimedia.org/r/622588 (owner: 10Filippo Giunchedi) [09:49:58] (03PS2) 10Jbond: sslcert::x509_to_pkcs12: add define for creating p12 files [puppet] - 10https://gerrit.wikimedia.org/r/623361 (https://phabricator.wikimedia.org/T253957) [09:50:25] (03PS3) 10Jbond: base::puppet: add ability to create p12 puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/623362 (https://phabricator.wikimedia.org/T253957) [09:52:21] (03PS4) 10Jbond: base::puppet: add ability to create p12 puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/623362 (https://phabricator.wikimedia.org/T253957) [09:52:36] (03PS3) 10Jbond: puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - 10https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) [09:56:51] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) The natural followup of this is in the 2020 network refresh project (clousw/cloudgw): https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementP... 
[10:00:38] (03PS2) 10Vgutierrez: Release 2.0.91-3wm [software/varnish/libvmod-tbf] (debian) - 10https://gerrit.wikimedia.org/r/623396 (https://phabricator.wikimedia.org/T261632) [10:05:23] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:31] (03PS1) 10Kormat: mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 [10:17:40] (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [10:26:45] (03PS7) 10Itamar Givon: Add `wmgWikibaseClientItemAndPropertySourceName` to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622612 (https://phabricator.wikimedia.org/T258060) [10:27:04] (03PS2) 10Kormat: mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) [10:28:56] (03CR) 10Kormat: "PCC run: https://puppet-compiler.wmflabs.org/compiler1001/24847/" [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [10:29:17] (03PS1) 10Hnowlan: api-gateway: Use envoyproxy.io annotation for metrics gathering. [deployment-charts] - 10https://gerrit.wikimedia.org/r/623568 (https://phabricator.wikimedia.org/T254910) [10:31:37] (03PS8) 10Kormat: WIP mariadb: simplify package_wmf [puppet] - 10https://gerrit.wikimedia.org/r/622569 [10:31:48] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Use envoyproxy.io annotation for metrics gathering. [deployment-charts] - 10https://gerrit.wikimedia.org/r/623568 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [10:33:00] (03Merged) 10jenkins-bot: api-gateway: Use envoyproxy.io annotation for metrics gathering. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/623568 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [10:33:12] (03CR) 10Mark Bergsma: [C: 03+1] admin: Add user klausman [puppet] - 10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [10:33:25] (03PS1) 10Arturo Borrero Gonzalez: nftables: add infrastructure for customizing the ruleset [puppet] - 10https://gerrit.wikimedia.org/r/623569 (https://phabricator.wikimedia.org/T261724) [10:34:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (10mark) Approved. [10:36:53] 10Operations, 10Gerrit, 10Wikimedia-GitHub, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): operations-puppet repo doesn't seem sync'ed with github's - https://phabricator.wikimedia.org/T261105 (10Ssgg21211) [10:37:19] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[10:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:53] (03PS4) 10Filippo Giunchedi: prometheus: minimal default alerts for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/622557 (https://phabricator.wikimedia.org/T258948) [10:37:55] (03PS3) 10Filippo Giunchedi: prometheus: move beta to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/622561 (https://phabricator.wikimedia.org/T258948) [10:37:57] (03PS5) 10Filippo Giunchedi: prometheus: add 'alertmanagers' setting to all instances [puppet] - 10https://gerrit.wikimedia.org/r/622558 (https://phabricator.wikimedia.org/T258948) [10:37:59] (03PS4) 10Filippo Giunchedi: icinga: redirect to https if not already proxied [puppet] - 10https://gerrit.wikimedia.org/r/622566 (https://phabricator.wikimedia.org/T258948) [10:38:01] (03PS1) 10Filippo Giunchedi: alertmanager: use amtool check-config as validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/623570 (https://phabricator.wikimedia.org/T258948) [10:38:03] (03PS1) 10Filippo Giunchedi: alertmanager: send only icinga criticals to irc [puppet] - 10https://gerrit.wikimedia.org/r/623571 (https://phabricator.wikimedia.org/T258948) [10:38:32] 10Operations, 10Gerrit, 10Wikimedia-GitHub, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): operations-puppet repo doesn't seem sync'ed with github's - https://phabricator.wikimedia.org/T261105 (10Majavah) [10:39:10] (03PS2) 10Arturo Borrero Gonzalez: nftables: add infrastructure for customizing the ruleset [puppet] - 10https://gerrit.wikimedia.org/r/623569 (https://phabricator.wikimedia.org/T261724) [10:39:43] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/623569 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:39:48] (03PS2) 10Filippo Giunchedi: alertmanager: use amtool check-config as validate_cmd [puppet] - 
10https://gerrit.wikimedia.org/r/623570 (https://phabricator.wikimedia.org/T258948) [10:40:00] (03CR) 10jerkins-bot: [V: 04-1] nftables: add infrastructure for customizing the ruleset [puppet] - 10https://gerrit.wikimedia.org/r/623569 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:40:27] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: use amtool check-config as validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/623570 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:41:19] (03PS3) 10Arturo Borrero Gonzalez: nftables: add infrastructure for customizing the ruleset [puppet] - 10https://gerrit.wikimedia.org/r/623569 (https://phabricator.wikimedia.org/T261724) [10:43:05] (03PS4) 10Filippo Giunchedi: prometheus: move beta to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/622561 (https://phabricator.wikimedia.org/T258948) [10:43:37] (03PS3) 10Kormat: mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) [10:44:20] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: move beta to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/622561 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:44:59] (03PS2) 10Filippo Giunchedi: alertmanager: send only icinga criticals to irc [puppet] - 10https://gerrit.wikimedia.org/r/623571 (https://phabricator.wikimedia.org/T258948) [10:46:11] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: send only icinga criticals to irc [puppet] - 10https://gerrit.wikimedia.org/r/623571 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:47:25] (03PS8) 10Itamar Givon: Add `wmgWikibaseClientItemAndPropertySourceName` to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622612 (https://phabricator.wikimedia.org/T258060) [10:47:27] (03PS3) 10Itamar Givon: Use 
`wmgWikibaseClientItemAndPropertySourceName` instead of `wmgWikibaseClientLocalEntitySourceName` in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622993 (https://phabricator.wikimedia.org/T258060) [10:47:29] (03PS3) 10Itamar Givon: Remove `wmgWikibaseClientLocalEntitySourceName` from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622994 (https://phabricator.wikimedia.org/T258060) [10:47:49] (03Abandoned) 10Muehlenhoff: Remove support for jessie/wmf-mariadb10 from mariadb::packages_wmf [puppet] - 10https://gerrit.wikimedia.org/r/524721 (owner: 10Muehlenhoff) [10:49:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: add infrastructure for customizing the ruleset [puppet] - 10https://gerrit.wikimedia.org/r/623569 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:51:20] (03CR) 10Muehlenhoff: mariadb: Add profile::mariadb::packages_client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [10:52:55] (03PS1) 10Arturo Borrero Gonzalez: nftables: override systemd service file [puppet] - 10https://gerrit.wikimedia.org/r/623573 (https://phabricator.wikimedia.org/T261724) [10:56:13] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/24850/" [puppet] - 10https://gerrit.wikimedia.org/r/623573 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:56:16] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] nftables: override systemd service file [puppet] - 10https://gerrit.wikimedia.org/r/623573 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:01:22] (03CR) 10Muehlenhoff: mariadb: Add profile::mariadb::packages_client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [11:02:11] (03PS4) 10Elukey: admin: Add user klausman [puppet] - 
10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [11:05:14] (03PS1) 10Arturo Borrero Gonzalez: nftables: fix service unit override [puppet] - 10https://gerrit.wikimedia.org/r/623575 (https://phabricator.wikimedia.org/T261724) [11:07:01] (03CR) 10Muehlenhoff: admin: Add user klausman (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [11:07:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: fix service unit override [puppet] - 10https://gerrit.wikimedia.org/r/623575 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:12:50] (03CR) 10Jcrespo: mariadb: Add profile::mariadb::packages_client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [11:17:46] (03PS1) 10Arturo Borrero Gonzalez: nftables: mask the service if not running [puppet] - 10https://gerrit.wikimedia.org/r/623578 [11:18:03] (03PS2) 10Arturo Borrero Gonzalez: nftables: mask the service if not running [puppet] - 10https://gerrit.wikimedia.org/r/623578 (https://phabricator.wikimedia.org/T261724) [11:18:28] (03CR) 10jerkins-bot: [V: 04-1] nftables: mask the service if not running [puppet] - 10https://gerrit.wikimedia.org/r/623578 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:19:03] (03PS3) 10Arturo Borrero Gonzalez: nftables: mask the service if not running [puppet] - 10https://gerrit.wikimedia.org/r/623578 (https://phabricator.wikimedia.org/T261724) [11:20:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: mask the service if not running [puppet] - 10https://gerrit.wikimedia.org/r/623578 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:28:32] (03CR) 10Elukey: admin: Add user klausman (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623357 
(https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [11:28:34] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) a:03Kormat [11:29:26] (03PS5) 10Elukey: admin: Add user klausman [puppet] - 10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [11:29:40] moritzm: ---^ good to go? [11:29:58] looking [11:30:55] (03CR) 10Muehlenhoff: [C: 03+1] admin: Add user klausman [puppet] - 10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [11:30:58] ship it :-) [11:31:10] super thanks :) I'll wait after the switchover [11:31:14] and make sure to also add Tobias to the cn=ops LDAP group after merging [11:31:23] yep will do [11:31:23] one or two services are cn=ops only [11:32:11] (03CR) 10Elukey: "This is good to go, I am going to merge after the switchover. Note to self: add Tobias to the ops LDAP group too." 
[puppet] - 10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [11:33:59] (03PS1) 10Arturo Borrero Gonzalez: nftables: reorder masking operations to don't race with service state changes [puppet] - 10https://gerrit.wikimedia.org/r/623580 (https://phabricator.wikimedia.org/T261724) [11:34:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: reorder masking operations to don't race with service state changes [puppet] - 10https://gerrit.wikimedia.org/r/623580 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:42:07] (03PS9) 10Kormat: WIP mariadb: simplify mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/622569 [11:42:09] (03PS1) 10Kormat: WIP mariadb: mariadb::config [puppet] - 10https://gerrit.wikimedia.org/r/623582 [11:42:15] (03PS1) 10Ema: varnish: stop installing libvmod-tbf [puppet] - 10https://gerrit.wikimedia.org/r/623583 (https://phabricator.wikimedia.org/T261632) [11:46:04] (03CR) 10Jcrespo: "General idea looking good." [puppet] - 10https://gerrit.wikimedia.org/r/622569 (owner: 10Kormat) [11:49:50] (03PS1) 10Arturo Borrero Gonzalez: nftables: use wmflib::ensure for the service parameter [puppet] - 10https://gerrit.wikimedia.org/r/623584 (https://phabricator.wikimedia.org/T261724) [11:51:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: use wmflib::ensure for the service parameter [puppet] - 10https://gerrit.wikimedia.org/r/623584 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:54:05] (03PS2) 10Kormat: mariadb: Make mariadb::config basedir required. 
[puppet] - 10https://gerrit.wikimedia.org/r/623582 [11:55:25] (03PS1) 10Arturo Borrero Gonzalez: nftables: drop service params [puppet] - 10https://gerrit.wikimedia.org/r/623585 (https://phabricator.wikimedia.org/T261724) [11:55:50] (03CR) 10jerkins-bot: [V: 04-1] nftables: drop service params [puppet] - 10https://gerrit.wikimedia.org/r/623585 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:58:53] (03PS3) 10Jason Linehan: Enable MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) [12:03:45] (03CR) 10Jason Linehan: "Abandoning this and issuing another because something seems messed up with my mediawiki-config repo that's causing a bunch of merge confli" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [12:03:50] (03Abandoned) 10Jason Linehan: Enable MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [12:04:14] (03PS1) 10Jason Linehan: Enable MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623587 (https://phabricator.wikimedia.org/T255585) [12:12:46] 10Operations, 10observability, 10Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (10CDanis) >>! In T205396#6422696, @jbond wrote: >>>! In T205396#4955858, @CDanis wrote: >> @jbond kindly backported the buster version of rasdaemon to stret... 
[12:13:48] (03Abandoned) 10Jason Linehan: Enable MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623587 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [12:13:53] (03Restored) 10Jason Linehan: Enable MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [12:18:19] (03PS6) 10Gilles: Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) [12:22:31] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [12:29:03] (03CR) 10Kormat: "PCC run: https://puppet-compiler.wmflabs.org/compiler1003/24852/" [puppet] - 10https://gerrit.wikimedia.org/r/623582 (owner: 10Kormat) [12:33:17] 10Operations, 10ORES, 10Machine Learning Platform (Current), 10Patch-For-Review: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10akosiaris) 05Open→03Resolved [12:35:44] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) @Marostegui the only chance placing it in A4 is to use the 10G port since A4 is a 10G switch. In A8 I need to check to see if i have available... [12:36:26] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623582 (owner: 10Kormat) [12:36:35] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) Thanks, let's see if there are available spaces on A8, if not, let's leave it on A6. 
[12:38:53] (03CR) 10Marostegui: [C: 03+1] "+1 but as the others, to be merged after the DC switchover" [puppet] - 10https://gerrit.wikimedia.org/r/623582 (owner: 10Kormat) [12:41:17] (03PS4) 10Kormat: mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) [12:44:36] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10akosiaris) I think we can do all of these in a single maint window (say a 2-3hours). Since gracefully powering off a host (via a press of the power button) will also depo... [12:54:12] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) @akosiaris welcome back . I hope you had a great vacation. You can proceed to the downtime, I will take care of powering the servers and adding the DIMMS when on... [12:57:21] (03PS8) 10Kormat: mariadb: Create profile::mariadb::packages_wmf [puppet] - 10https://gerrit.wikimedia.org/r/622972 (https://phabricator.wikimedia.org/T256972) [12:57:23] (03PS5) 10Kormat: mariadb: Allow overriding of wmf-mariadb version in hiera [puppet] - 10https://gerrit.wikimedia.org/r/622995 (https://phabricator.wikimedia.org/T256972) [12:57:25] (03PS5) 10Kormat: mariadb: Add profile::mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/623567 (https://phabricator.wikimedia.org/T256972) [12:57:27] (03PS3) 10Kormat: mariadb: Make mariadb::config basedir required. 
[puppet] - 10https://gerrit.wikimedia.org/r/623582 [12:57:29] (03PS10) 10Kormat: WIP mariadb: simplify mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/622569 [13:03:55] (03PS2) 10Arturo Borrero Gonzalez: nftables: drop service params [puppet] - 10https://gerrit.wikimedia.org/r/623585 (https://phabricator.wikimedia.org/T261724) [13:06:51] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/24853/" [puppet] - 10https://gerrit.wikimedia.org/r/623585 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [13:07:11] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] nftables: drop service params [puppet] - 10https://gerrit.wikimedia.org/r/623585 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [13:08:14] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [13:15:21] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10Vgutierrez) [13:16:40] FYI: in about 15 minutes we'll start coordinating pre-switchover work in this channel [13:17:08] if you have any production changes in progress, please either (a) don't, or (b) wrap them up ASAP [13:17:14] Good luck [13:17:30] thanks :D [13:17:32] (a) don't or (b) really don't [13:17:51] (a) don't've (preferred) or (b) don't [13:18:05] but, but... it works on my laptop! 
[13:18:13] okay, ema's allowed, NOBODY ELSE [13:18:19] unless you also test on ema's laptop [13:18:44] * mark locks ema's laptop [13:20:24] No problem rzl [13:21:02] |log upgrading esams to varnish 6 [13:21:08] * vgutierrez runs away for coffee [13:21:21] it's just a tiny cumin cmd, don't you worry rzl [13:21:28] okay! I trust you vgutierrez [13:21:31] I know you would never lie to me [13:21:54] did we consider having some sort of lock with puppet-merge? [13:22:00] where at least a warning is given [13:22:06] and/or maybe motd on all hosts or something ;) [13:22:21] I can start something merging and just not press "y" [13:22:23] (for the future, obviously) [13:22:25] that'll lock it up pretty good [13:22:32] but sure, will add to AIs [13:23:10] what's going on in here? [13:23:37] XioNoX: just shifting some traffic around [13:23:46] :) [13:23:56] XioNoX: small production change, almost not worth mentioning [13:25:06] (03PS4) 10Kormat: mariadb: Make mariadb::config basedir required. [puppet] - 10https://gerrit.wikimedia.org/r/623582 (https://phabricator.wikimedia.org/T256972) [13:25:08] (03PS11) 10Kormat: mariadb: simplify mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) [13:25:10] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on kafka-jumbo1001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {flush_l1d, md_clear} https://wikitech.wikimedia.org/wiki/Microcode [13:25:16] nice timing [13:25:38] especially as we start doing this more often and it becomes less of a deal, ensuring we have no conflicting things going on becomes more important [13:28:28] I need to reboot jumbo1001 but I am of course waiting for a quieter moment to do it :) [13:29:11] elukey: anything called jumbo can't make that big an impact [13:29:38] (03CR) 10Gehel: "recheck" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/616602 (owner: 10Ebernhardson) [13:29:59] 
kormat: It would have to be Jumbo-Jumbo to have an effect ;) [13:30:00] hello! [13:30:26] for anyone following along, the plan is to move mediawiki's active DC from eqiad to codfw in about half an hour [13:30:46] we'll coordinate in here, unless we get flooded out by icinga-wm, in which case we'll move to #wikimedia-sre [13:30:46] Using a lot of trucks and strong movers. [13:31:02] rzl: I'm in [13:31:17] for anyone with root on cumin1001, I'm sharing my terminal, and you can follow along by running `sudo -i tmux attach -rt switchdc` on that machine [13:31:30] (if you modify flags, please keep -r so that your session is read-only) [13:31:57] the procedure is at https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki and we'll be starting at phase 0 [13:32:31] let's get the drums rolling [13:32:55] Trizek should be setting site-wide banners on all wikis right about now [13:33:10] They are supposed to be live. [13:33:14] perfect thanks [13:33:27] hmm I'm not seeing them [13:33:39] I do [13:33:48] I see a banner on enwiki, but not on commons [13:33:48] * akosiaris not seeing them either [13:34:04] sounds like they're coming in inconsistently, probably just caching somewhere, I'm fine with that [13:34:09] ok, seeing them now [13:34:12] they'll appear for everyone before long [13:34:26] There is a bit of cache to work on for the banners. [13:34:29] not everyone IIRC. Readers will not see them at all, right? [13:34:31] 👍 [13:34:39] dewiki has banners up as well [13:34:48] akosiaris, readers will see them, since we have a lot of readers who edit. [13:35:00] I'm getting it here without being logged [13:35:09] okay, I'm going to start prep steps [13:35:16] Trizek: I meant non logged in users, I stand corrected [13:35:30] but I indeed don't get them for enwiki in an incognito window (yet) [13:35:34] I got what you meant, akosiaris. 
:) [13:35:35] _joe_, volans: check my flags please :) [13:35:48] <_joe_> rzl: sure [13:36:17] <_joe_> +1 [13:36:19] rzl: +1 [13:36:37] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [13:36:39] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [13:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:57] for anyone following along, we expect some retries in 00-reduce-ttl, that's normal [13:37:07] it's possible we'll have to rerun the step, that's fine too [13:37:09] <_joe_> rzl: should be better today :) [13:37:15] I've heard that before :D [13:37:18] but glad :) [13:37:26] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [13:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:29] <_joe_> well it was, it was just lagging, not broken :) [13:37:43] whoever has the small window attached to the tmux, can you please make it bigger? [13:37:50] <_joe_> +1 [13:37:51] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [13:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:01] in particular higher [13:38:18] future 'attach to tmux' commands should implement a minimum terminal size requirement ;) [13:38:26] yes [13:38:33] rzl: logs look good [13:38:35] so it was you kormat [13:38:38] ok, 4/10 .. 
this went better I gather [13:38:43] <_joe_> yes [13:38:49] yep, thanks _joe_ [13:39:08] I'm going to start the cache warmups, in the meantime we can spot-check DNS records just to be sure [13:39:21] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [13:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:36] the logs are quite clear but sure [13:39:38] <_joe_> ahahahha [13:39:43] <_joe_> "hi volans" [13:40:16] (he's made fun of me for answering that prompt too quickly without stopping to consider the philosophical implications) [13:40:26] ahahah [13:40:53] <_joe_> rzl: I was thinking we might want to run the warmup step twice [13:40:53] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [13:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:05] volans: AI for later, we should put the 5-minute sleep back into 00-reduce-ttl now that the cache warmup is so much faster [13:41:14] <_joe_> wut [13:41:15] we actually did it multiple times the last time IIRC [13:41:21] <_joe_> this was way too fast [13:41:23] no worries right now since I'm waiting until 14:00 anyway [13:41:33] because we were printig the times of the requests [13:41:41] _joe_: faster is expected, Krinkle removed a bunch of redundant URLs from it [13:41:41] until they were converging to some normal value [13:41:56] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-1h&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 [13:42:01] <_joe_> codfw had a spike in response times [13:42:07] yeah but it's recovering [13:42:13] <_joe_> ofc [13:42:15] rerunning now [13:42:16] <_joe_> the script is over [13:42:27] actually I'll pause another moment [13:42:30] <_joe_> rzl: wait for ~ 1 more minute yes [13:42:31] just so the spikes are clearer to see on the graph [13:42:34] <_joe_> yes 
[13:42:42] will start at 13:44 [13:43:04] <_joe_> just to be clear: before the url removal, we made ~ 1k requests/s during warmup [13:43:15] <_joe_> I am worried we're not warming up enough [13:43:19] banners up on elwiki as well for non logged in user, I think they should be everywhere by now [13:43:21] Krinkle: here? [13:43:40] btw, umatrix blocks them (for anyone using that extension) [13:43:40] yes [13:43:41] yep banner is showing for me now too [13:43:50] <_joe_> akosiaris: itwiki as well [13:43:56] For those wanting to check codfw DB health https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=codfw&var-group=core&var-shard=All&var-role=All&from=now-12h&to=now [13:44:04] Krinkle: can you talk about the warmup script being much faster to complete? [13:44:08] we can see some of the warm up effects [13:44:14] (rerunning now) [13:44:17] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [13:44:19] seeing banner on enwiki over here [13:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:40] it's faster. I removed urls that were warming things up that we no longer need warm up for or that just dont' exist anymore. [13:44:47] I also removed a long tail of urls for closed and small wikis. 
[13:45:12] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [13:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:42] pausing here for graphsquintery [13:46:04] latency impact seems not as bad, this round [13:46:10] for anyone not currently looking at the warmup: please prepare, but don't save, test edits on one or two wikis of your choice [13:46:23] after we set read-only, I'll ask you to try to save them and confirm that you cannot [13:46:56] would be good if someone can also verify on mobile apps [13:47:44] already don't see an edit button on wikidata [13:47:48] * volans has meta [13:47:52] mutante: :-/ [13:48:37] * _joe_ has itwiki [13:48:42] * marostegui has eswiki [13:48:49] * mark has nlwiki [13:48:50] <_joe_> can someone do dewiki/enwiki please? [13:48:56] * cdanis enwiki [13:49:00] I do dewiki [13:49:03] <_joe_> lol [13:49:04] mutante: wikidata items never have an edit button anyway :P [13:49:16] Someone to take commonswiki too? [13:49:22] _joe_: does warmup look okay? [13:49:31] go/no-go basically [13:49:39] <_joe_> rzl: it looks better, the latency hit was much smaller AFAICS [13:49:42] the message is mis-translated and talks about the creation of a second DC, but it's there :-) [13:49:49] <_joe_> should we run it one last time [13:49:54] <_joe_> ? [13:49:58] moritzm: can you add to the incident doc please? 
link in the topic of the other channel [13:50:05] will do [13:50:06] won't do anything about it now but we can fix it for next time [13:50:07] thanks [13:50:13] i'll take commonswiki [13:50:16] +1 for a third time [13:50:16] _joe_: sure, starting now [13:50:17] kormat: thanks :* [13:50:23] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [13:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:42] while this is running, a quick note on backout plans: [13:50:56] if we have problems any time before phase 2, we can just stop -- if we cancel for the day, we'll run phase 8 to unprep [13:51:13] after phase 2-3 (e.g. failure to set RO) we can run phases 6-7 to set RW, still in eqiad, and investigate [13:51:20] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [13:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:29] * Krinkle marks yesterday switch and warmup in Grafana annotations {operations, mediawiki, traffic, performance} [13:51:43] if we have trouble after phase 4-5, we can rerun them with the args flipped to move back to eqiad, then 6-7 to go RW again [13:51:53] grafana annotations would be a good addition to the cookbooks too [13:52:02] * mark adds the suggestion to the doc [13:52:03] <_joe_> uhm [13:52:26] <_joe_> can we please keep the discussion here about things happening? I've just seen a spike of mcrouter errors [13:52:30] mark: there's a long-open task about making better automated use of grafana annotations :) [13:52:38] _joe_: which dc? [13:52:41] <_joe_> codfw [13:52:46] <_joe_> during the warmup [13:52:54] <_joe_> let me investigate further. [13:52:59] ack, holding here [13:54:02] some connect timeouts [13:54:05] <_joe_> Sep 1 13:26:20 mw2256 mcrouter[1046]: W0901 13:26:20.432327 1197 TLSTicketKeyManager.cpp:192] No keys configured, falling back to default [13:54:13] <_joe_> this is new to me... 
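(Editor's sketch, not part of the channel traffic.) The backout plan rzl outlined at 13:50-13:51 can be written down as a small lookup. This is only an illustration under stated assumptions: `backout_steps` and its argument are hypothetical names, nothing like this is known to exist in the switchdc cookbooks, and the returned strings simply paraphrase what was said in the channel. Failures after phase 5 are not covered by the plan as stated, so the sketch refuses to guess.

```python
from typing import List


def backout_steps(furthest_phase_started: int) -> List[str]:
    """Return the backout actions for a failure at the given phase.

    Hypothetical helper: phase numbers follow the 00-08 prefixes of the
    sre.switchdc.mediawiki cookbooks seen in this log; the actions are
    paraphrased from the plan stated in the channel.
    """
    if furthest_phase_started < 0:
        raise ValueError("phase numbers start at 0")
    if furthest_phase_started < 2:
        # Nothing has been set read-only yet, so stopping is enough.
        return [
            "stop",
            "if cancelling for the day, run phase 8 to unprep",
        ]
    if furthest_phase_started <= 3:
        # Read-only was (or may have been) set, but MediaWiki is still in eqiad.
        return [
            "run phases 6-7 to set read-write, still in eqiad",
            "investigate",
        ]
    if furthest_phase_started <= 5:
        # MediaWiki and sessions may already point at codfw.
        return [
            "rerun phases 4-5 with the args flipped to move back to eqiad",
            "run phases 6-7 to go read-write again",
        ]
    raise ValueError("the plan stated in the channel does not cover this phase")


if __name__ == "__main__":
    for phase in (0, 3, 5):
        print(phase, backout_steps(phase))
```

The point of the lookup is that the decision depends only on how far the run got, which matches how the plan was phrased in the channel.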
[13:54:26] T-6m [13:54:35] is that about TLS ticket session resumption? [13:54:45] <_joe_> maybe? [13:54:45] it's also only a warn [13:54:48] _joe_ I already seen it and opened an issue to upstream, no answer [13:54:56] akosiaris: 14:00 is just the start of an hour-long window, no problem starting later [13:54:56] <_joe_> ok [13:55:08] I'd rather take our time and be confident before starting [13:55:29] ⛔ 🤠 [13:55:35] _joe_: the new issue correlates with the 'hard' TKOs + the 'connect timeout's [13:55:36] rzl: yup, agreed. Just pointing it out cause it makes me feel a bit like we are launching Apollo 11 [13:55:42] 👍 [13:56:18] akosiaris: does this feel like something of a moonshot to you? [13:56:31] every single time [13:56:33] <_joe_> https://grafana.wikimedia.org/d/000000549/mcrouter?viewPanel=6&orgId=1&var-source=codfw%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All this worries me [13:56:35] I wouldn't like to be a codfw database right now....they are about to have such a nice wake up [13:56:44] I wish we had weighted balancing [13:56:46] alas [13:56:59] let's keep this channel focused on the mcrouter question for now please :) [13:57:10] <_joe_> 10.192.0.85:11211 marked soft TKO [13:57:30] <_joe_> mc2021.codfw.wmnet suffered [13:57:38] <_joe_> ok, nothing "strange" I guess [13:57:43] (03PS1) 10Joal: Update analytics snapshots data purge [puppet] - 10https://gerrit.wikimedia.org/r/623601 (https://phabricator.wikimedia.org/T237047) [13:57:49] <_joe_> let's go I would say [13:57:54] I don't see saturation in memcached shards [13:58:10] okay, going ahead with 01-stop-maintenance [13:58:21] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:58:23] +1 [13:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:34] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [13:58:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:41] apergos: there were some snapshots running so maybe this ^ killed them [13:58:48] Im here now, just in time [13:58:50] rzl: Stray php processes still present on the maintenance host, please check [13:59:01] apergos: they look s3 related [13:59:04] volans: I think that's the dumps still running [13:59:05] _joe_ part of the cache warm up ended up in the gutter, not a big deal but I just checked [13:59:21] _joe_: can you check what's still running on mwmaint1? [13:59:22] pgrep -c php is what the script checks [13:59:24] if it's s3 then they can pick up again this evening, no worries [13:59:27] thanks for the heads up [13:59:32] rzl: yep, I see them running still [13:59:33] mwmaint1002 that is [13:59:47] marostegui: dumps but nothing else php, right? [13:59:59] rzl: yep [14:00:01] /srv/mediawiki/php-1.35.0-wmf.1/extensions/ConfirmEdit/captcha-old.py [14:00:04] only s3 as well [14:00:06] sounds good [14:00:19] <_joe_> no it's generate captcha [14:00:22] <_joe_> yes [14:00:28] okay! phases 2 through 7 are the critical period: once we're RO we want to be RW as soon as we can do so safely [14:00:33] <_joe_> I'd say let's go? [14:00:34] I'm not going to stop between steps [14:00:43] +1 [14:00:45] if anything comes up, say "stop" in here [14:00:49] ok [14:00:52] ok [14:00:52] ack [14:00:56] volans: those captcha-old.py processes... are from 2019? [14:00:58] /bin/bash /usr/local/bin/mwscript extensions/FlaggedRevs/maintenance/updateStats.php enwikibooks [14:00:58] (03PS1) 10Ottomata: camus - don't check eqiad topics while DC switchover to codfw is ongoing [puppet] - 10https://gerrit.wikimedia.org/r/623602 [14:01:00] ? [14:01:05] mwmaint1002 hasn't been rebooted for 713 days [14:01:09] heh I see one still running so they might have all survived! 
[14:01:10] just started, right on 14:00 UTC mark [14:01:15] a moment after I run phase 2, please save your test edits, and if you succeed, please say "stop" as we might still be read-write [14:01:18] cdanis: yes :) [14:01:25] rzl: wilco [14:01:26] mwmaint1002 discussion for later please :) [14:01:32] ok [14:01:45] any objections to going read-only now? [14:01:48] +1 [14:01:53] +1 [14:01:53] +1 [14:01:53] <_joe_> go [14:01:54] go [14:02:00] yoloooooo [14:02:03] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:02:04] !log rzl@cumin1001 MediaWiki read-only period starts at: 2020-09-01 14:02:04.851006 [14:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:28] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [14:02:29] test edits now, please [14:02:31] enwiki confirmed r/o [14:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:33] Spanish read only stuff [14:02:33] my edit on itwiki got rejected as expected [14:02:35] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:02:35] ES is locked [14:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:42] nlwiki is locked [14:02:43] commonswiki is locked [14:02:45] fiwiki confirmed ro [14:02:47] Warning: The database has been locked for maintenance, so you will not be able to save your edits right now [14:02:49] RO [14:02:56] confirmed ro on en.wp mobile app and wikidata in browser [14:03:07] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:03:08] confirmed on commons/desktop [14:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:10] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:03:13] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:13] elwiki locked [14:03:15] RO confirmed on Brazilian wikipedia [14:03:19] confirmed meta [14:03:24] en mobile ro [14:03:31] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1003/24854/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623602 (owner: 10Ottomata) [14:03:48] same for dewiki [14:03:59] volans: Sleeping 23.459 seconds to reach the 10 seconds mark? [14:04:04] lol [14:04:07] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:04:07] yeah dunno [14:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:13] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions [14:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:16] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) [14:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:20] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:04:23] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:26] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:04:27] db rows written almost to 0 [14:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:33] test edits again after this please [14:04:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:04:47] I see writes on codfw masters [14:04:53] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=99) [14:04:56] edit worked on wd [14:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:01] site info failed [14:05:01] edit worked on ES [14:05:04] edit confirmed on nlwiki [14:05:04] is siteinfo timing out? [14:05:06] failed to get siteinfo? [14:05:07] <_joe_> response times are skyrocketing [14:05:09] volans: do I rerun? [14:05:11] edited wikidata [14:05:13] edit saved on enwiki [14:05:14] <_joe_> rzl: wait [14:05:15] rzl: yes I'd say so [14:05:17] edit confirmed commonswiki [14:05:17] waiting [14:05:17] <_joe_> please all wait [14:05:24] they are supposed to be idempotent [14:05:29] I guess one host failed to reply [14:05:32] to the siteinfo call [14:05:38] <_joe_> let's see if we can recover [14:05:39] we get it randomly from the load balancer [14:05:41] reads are happening on codfw dbs [14:05:41] edit confirmed on en mobile [14:06:04] marostegui: I don't see them on prometh. 
aggregated yet [14:06:11] jynus: I am starting to see them [14:06:14] codfw memcached traffic is picking up quite a lot, mediawiki latency looks to be coming back down [14:06:17] marostegui: yes, now [14:06:19] <_joe_> yes [14:06:20] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m [14:06:24] also the _check_siteinfo_dry_run_aware [14:06:25] is the last action [14:06:31] so it's just a check [14:06:35] <_joe_> yes [14:06:36] PROBLEM - MariaDB read only x1 #page on db1103 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.13-MariaDB-log, Uptime 7186097s, event_scheduler: True, 60.59 QPS, connection latency: 0.002161s, query latency: 0.000454s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:06:37] PROBLEM - MariaDB read only s2 #page on db1122 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 9104653s, event_scheduler: True, 203.89 QPS, connection latency: 0.002333s, query latency: 0.000760s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:06:37] volans: ack thanks [14:06:40] ^is that the stalls due to ro or real latency? 
[14:06:48] well I see a bunch of edits in rc on wd so clearly read/write is happening [14:06:51] PROBLEM - MariaDB read only s1 #page on db1083 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 8931862s, event_scheduler: True, 75.65 QPS, connection latency: 0.002598s, query latency: 0.000666s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:06:52] PROBLEM - MariaDB read only s4 #page on db1081 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 8841195s, event_scheduler: True, 503.92 QPS, connection latency: 0.004594s, query latency: 0.000488s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:06:52] PROBLEM - MariaDB read only s7 #page on db1086 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 10141435s, event_scheduler: True, 41.58 QPS, connection latency: 0.003542s, query latency: 0.001225s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:06:53] marostegui: ^ looking? 
[14:06:53] PROBLEM - MariaDB read only s8 #page on db1109 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 9104645s, event_scheduler: True, 71.88 QPS, connection latency: 0.002317s, query latency: 0.001074s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:06:55] <_joe_> sigh [14:06:57] we hav to ack that [14:06:57] lol [14:06:57] ugh [14:07:01] that's a race condition yeah [14:07:02] <_joe_> that's just expected [14:07:05] or puppet needs to run [14:07:08] we forgot to downtime [14:07:09] ignoring for now, thanks [14:07:12] <_joe_> ok rzl [14:07:17] will be out of sync until puppet runs [14:07:18] PROBLEM - MariaDB read only es5 #page on es1024 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.12-MariaDB-log, Uptime 10288368s, event_scheduler: True, 25.22 QPS, connection latency: 0.002614s, query latency: 0.000549s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:07:19] PROBLEM - MariaDB read only es4 #page on es1021 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.13-MariaDB-log, Uptime 4596161s, event_scheduler: True, 28.15 QPS, connection latency: 0.002521s, query latency: 0.000423s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:07:19] jynus: let's run puppet? [14:07:20] PROBLEM - MariaDB read only s6 #page on db1093 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.44-MariaDB, Uptime 4349659s, event_scheduler: True, 98.31 QPS, connection latency: 0.002759s, query latency: 0.000729s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:07:20] <_joe_> response times are back to normal in codfw [14:07:20] jynus or kormat can you run puppet on eqiad masters? 
[14:07:21] PROBLEM - MariaDB read only s5 #page on db1100 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 10313104s, event_scheduler: True, 52.52 QPS, connection latency: 0.002177s, query latency: 0.000697s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:07:24] 95%ile latency on appservers is down to under 800ms [14:07:26] looking good [14:07:27] _joe_: reurunning [14:07:29] marostegui: on it. [14:07:35] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:07:35] codfw slaves are behaving fine so far [14:07:36] !log rzl@cumin1001 MediaWiki read-only period ends at: 2020-09-01 14:07:36.305500 [14:07:36] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:07:38] kormat: cheers [14:07:38] confoirmed edit on meta fwiw [14:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:42] passed this time, easy [14:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:47] <_joe_> ok [14:07:55] (we'll use the first RW timestamp from the SAL as that's when edits restarted) [14:08:00] Brazilian wiki write successful [14:08:05] <_joe_> the real end of read-only was the first [14:08:07] <_joe_> btw [14:08:08] yeah [14:08:10] 500 errors here: https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&from=now-1h&to=now&var-site=codfw&var-site=eqiad&var-site=eqsin&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-status_type=5&var-method=GET&var-method=HEAD&var-method=POST [14:08:17] let's hold here, we can run the cleanups when we're sure everything is healthy [14:08:19] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:08:19] rzl: yea that was just the check that failed, all the actions were ok [14:08:23] PROBLEM - MariaDB read only s4 on db2090 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.43-MariaDB, Uptime 21015063s, event_scheduler: True, 869.78 QPS, connection latency: 0.002024s, query latency: 0.000492s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:08:28] PROBLEM - MariaDB read only s3 #page on db1123 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 10141557s, event_scheduler: True, 91.40 QPS, connection latency: 0.002181s, query latency: 0.000572s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:08:29] PROBLEM - MariaDB read only s7 on db2118 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.43-MariaDB, Uptime 22147079s, event_scheduler: True, 854.69 QPS, connection latency: 0.002119s, query latency: 0.000507s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:08:38] RECOVERY - MariaDB read only s2 #page on db1122 is OK: Version 10.1.43-MariaDB, Uptime 9104775s, read_only: True, event_scheduler: True, 34.04 QPS, connection latency: 0.002150s, query latency: 0.000517s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:08:43] PROBLEM - MariaDB read only x1 on db2096 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.13-MariaDB-log, Uptime 5534448s, event_scheduler: True, 211.86 QPS, connection latency: 0.001927s, query latency: 0.000489s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:08:44] That was fast! I didn't noticed the read-only. [14:08:48] we need to run puppet also on codfw masters, kormat [14:08:49] does anyone see latency issues that are *not* yet recovering? [14:09:00] <_joe_> no [14:09:08] traffic on replicas seems stable now [14:09:09] PROBLEM - MariaDB read only es4 on es2021 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.13-MariaDB-log, Uptime 4857864s, event_scheduler: True, 74.76 QPS, connection latency: 0.002533s, query latency: 0.000588s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:09:10] rzl: there was a small bump of mysql traffic [14:09:18] probably due to user's retries [14:09:19] restbase is serving 500s [14:09:21] PROBLEM - MariaDB read only s6 on db2129 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.43-MariaDB, Uptime 22829589s, event_scheduler: True, 256.15 QPS, connection latency: 0.001705s, query latency: 0.000471s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:09:21] PROBLEM - MariaDB read only s1 on db2112 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.43-MariaDB, Uptime 22320011s, event_scheduler: True, 579.65 QPS, connection latency: 0.003793s, query latency: 0.000587s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:09:25] I do not see the 500 errors recovering [14:09:33] 500s are leveling off, not yet recovering [14:09:38] PROBLEM - MariaDB read only s2 on db2107 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.43-MariaDB, Uptime 20905341s, event_scheduler: True, 460.22 QPS, connection latency: 0.002296s, query latency: 0.000459s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:09:41] <_joe_> ema:can you go check on logstash please? 
[14:09:43] average response time on the "red dashboard" is lower than it was before the spike [14:09:43] PROBLEM - MariaDB read only s3 on db2105 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.43-MariaDB, Uptime 18161570s, event_scheduler: True, 513.49 QPS, connection latency: 0.001826s, query latency: 0.000507s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:09:44] looking [14:09:46] kormat: are you sure it is puppet that has to run on masters or on icinga, cannot say? [14:09:48] _joe_: ema: every 500 i've looked at is restbase [14:09:52] <_joe_> because I don't see the same data on the backend [14:09:56] sigh let's move to #-sre [14:09:56] better do it in both [14:10:01] moving to #wikimedia-sre [14:10:01] PROBLEM - MariaDB read only es5 on es2023 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.13-MariaDB-log, Uptime 7090540s, event_scheduler: True, 69.84 QPS, connection latency: 0.002846s, query latency: 0.000516s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:10:01] PROBLEM - MariaDB read only s8 on db2079 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.44-MariaDB, Uptime 11573822s, event_scheduler: True, 263.63 QPS, connection latency: 0.002119s, query latency: 0.000570s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:10:08] kormat: first masters and then icinga IIRC [14:10:14] how they are puppetized [14:10:16] yeah [14:10:19] jynus: first db and then icinga host, most likely [14:10:20] confirmed, the 500s seem to be restbase [14:10:28] #wikimedia-sre [14:10:30] s1 is still having bad response time from codfw dbs, but there is not apparent lag, so it is a matter of time [14:10:31] let's not be in two places :) [14:10:34] RECOVERY - MariaDB read only x1 #page on db1103 is OK: Version 10.4.13-MariaDB-log, Uptime 7186334s, read_only: True, 
event_scheduler: True, 27.45 QPS, connection latency: 0.002205s, query latency: 0.000669s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:10:42] recovery, thanks [14:10:45] RECOVERY - MariaDB read only s7 #page on db1086 is OK: Version 10.1.43-MariaDB, Uptime 10141669s, read_only: True, event_scheduler: True, 102.64 QPS, connection latency: 0.003426s, query latency: 0.001327s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:10:46] RECOVERY - MariaDB read only s8 #page on db1109 is OK: Version 10.1.43-MariaDB, Uptime 9104879s, read_only: True, event_scheduler: True, 71.90 QPS, connection latency: 0.003940s, query latency: 0.000661s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:11:10] RECOVERY - MariaDB read only es5 #page on es1024 is OK: Version 10.4.12-MariaDB-log, Uptime 10288600s, read_only: True, event_scheduler: True, 46.55 QPS, connection latency: 0.002860s, query latency: 0.000713s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:11:15] RECOVERY - MariaDB read only s1 on db2112 is OK: Version 10.1.43-MariaDB, Uptime 22320125s, read_only: False, event_scheduler: True, 759.13 QPS, connection latency: 0.001765s, query latency: 0.000420s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:11:21] I am checking the parsercache and weirdly, it has better hit rate now [14:11:31] maybe because of the read only? 
[14:11:53] in any case it is working very well [14:11:57] RECOVERY - MariaDB read only s8 on db2079 is OK: Version 10.1.44-MariaDB, Uptime 11573937s, read_only: False, event_scheduler: True, 402.58 QPS, connection latency: 0.002244s, query latency: 0.000816s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:12:06] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for holger - https://phabricator.wikimedia.org/T261754 (10holger.knust) [14:12:31] I see very little lag on codfw dbs [14:12:41] RECOVERY - MariaDB read only s1 #page on db1083 is OK: Version 10.1.43-MariaDB, Uptime 8932212s, read_only: True, event_scheduler: True, 61.19 QPS, connection latency: 0.002677s, query latency: 0.000707s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:12:42] RECOVERY - MariaDB read only s4 #page on db1081 is OK: Version 10.1.43-MariaDB, Uptime 8841545s, read_only: True, event_scheduler: True, 322.12 QPS, connection latency: 0.004580s, query latency: 0.000972s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:13:09] RECOVERY - MariaDB read only s6 on db2129 is OK: Version 10.1.43-MariaDB, Uptime 22829818s, read_only: False, event_scheduler: True, 239.70 QPS, connection latency: 0.003502s, query latency: 0.000517s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:13:27] RECOVERY - MariaDB read only s2 on db2107 is OK: Version 10.1.43-MariaDB, Uptime 20905569s, read_only: False, event_scheduler: True, 407.51 QPS, connection latency: 0.001657s, query latency: 0.000557s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:13:33] RECOVERY - MariaDB read only s3 on db2105 is OK: Version 10.1.43-MariaDB, Uptime 18161799s, read_only: False, event_scheduler: True, 775.46 QPS, connection latency: 0.001840s, query latency: 0.000282s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:13:36] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for holger - https://phabricator.wikimedia.org/T261754 (10WDoranWMF) As @holger.knust 's manager I support the request for these rights which he needs as part of his daily work. [14:13:49] RECOVERY - MariaDB read only es5 on es2023 is OK: Version 10.4.13-MariaDB-log, Uptime 7090768s, read_only: False, event_scheduler: True, 83.44 QPS, connection latency: 0.004945s, query latency: 0.000505s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:14:05] RECOVERY - MariaDB read only s4 on db2090 is OK: Version 10.1.43-MariaDB, Uptime 21015405s, read_only: False, event_scheduler: True, 957.10 QPS, connection latency: 0.001928s, query latency: 0.000476s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:14:13] RECOVERY - MariaDB read only s7 on db2118 is OK: Version 10.1.43-MariaDB, Uptime 22147423s, read_only: False, event_scheduler: True, 424.66 QPS, connection latency: 0.002355s, query latency: 0.000489s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:14:23] RECOVERY - MariaDB read only x1 on db2096 is OK: Version 10.4.13-MariaDB-log, Uptime 5534789s, read_only: False, event_scheduler: True, 157.00 QPS, connection latency: 0.002689s, query latency: 0.000346s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:14:41] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [14:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:49] RECOVERY - MariaDB read only es4 on es2021 is OK: Version 10.4.13-MariaDB-log, Uptime 4858204s, read_only: False, event_scheduler: True, 73.83 QPS, connection latency: 0.002595s, query latency: 0.000610s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:15:07] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [14:15:08] puppet run twice on all eqiad and codfw db masters [14:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db2083 weight', diff saved to https://phabricator.wikimedia.org/P12429 and previous config saved to /var/cache/conftool/dbconfig/20200901-141521-marostegui.json [14:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:26] i think this particular check is local to the db hosts themselves, but i'm also running puppet on icinga1001 now too. [14:15:39] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:14] PROBLEM - MariaDB read only x1 #page on db1103 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.13-MariaDB-log, Uptime 7186674s, event_scheduler: True, 58.49 QPS, connection latency: 0.002301s, query latency: 0.000828s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:16:37] kormat: missed this one? [14:17:04] volans: no [14:17:10] <_joe_> can someone please ack the pages? 
[14:17:13] <_joe_> on victorops [14:17:17] on it [14:17:21] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [14:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:33] PROBLEM - Check the last execution of mediawiki_job_wikibase_repo_prune_test on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_wikibase_repo_prune_test https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:18:26] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-update-tendril [14:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:38] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) [14:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:53] {done} vo ack [14:19:24] wikibase_repo_prune_test from above needs a run of puppet too? [14:19:39] <_joe_> volans: what do you mean? [14:20:01] <_joe_> no I think it's another issue [14:20:13] PROBLEM - Check the last execution of mediawiki_job_wikibase_repo_prune2 on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_wikibase_repo_prune2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:20:22] the icinga problem, if it needs to run puppet there and then on icinga or is an issue to investigate [14:20:32] <_joe_> no [14:20:37] <_joe_> it just needs investigation [14:22:03] <_joe_> !log restarted confd on mwmaint1002 [14:22:05] <_joe_> sigh [14:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:18] <_joe_> volans: ^^ this was the issue apparently [14:22:29] :/ [14:22:52] do we expect that logstash is a bit lagged in what it's reporting? 
[14:24:38] I still see read only errors, specially on commonswiki [14:24:44] but I can edit/see other edits [14:24:55] apergos: if you mean ongoing fatals, see discussion in -sre [14:24:58] jynus: apergos: please see #-sre [14:24:59] RECOVERY - MariaDB read only s5 #page on db1100 is OK: Version 10.1.43-MariaDB, Uptime 10314163s, read_only: True, event_scheduler: True, 162.78 QPS, connection latency: 0.002195s, query latency: 0.000613s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:25:10] kormat: so that recovered ^ [14:25:11] yeah I did after writing here and getting no answer :-) [14:25:15] same concerns too heh [14:26:14] kormat: did you run puppet on icinga too? [14:26:18] volans: yes [14:26:22] apergos: no lag afaict https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&from=1598966777547&to=1598970377548&var-datasource=eqiad%20prometheus%2Fops&var-input=kafka%2Frsyslog-udp-localhost-eqiad&refresh=5m&viewPanel=21 [14:26:25] let's check on puppetboard the diff [14:28:28] PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 32691973.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:57] <_joe_> !log restarting envoy on all eqiad jobrunners [14:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:10] I will check that m2 replica [14:29:11] RECOVERY - MariaDB read only s3 #page on db1123 is OK: Version 10.1.43-MariaDB, Uptime 10142800s, read_only: True, event_scheduler: True, 209.08 QPS, connection latency: 0.002375s, query latency: 0.000618s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:29:18] m2 isn't user facing [14:30:08] (03CR) 10Jason Linehan: [C: 03+2] Enable MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 
10Jason Linehan) [14:30:22] RECOVERY - Check the last execution of mediawiki_job_wikibase_repo_prune2 on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_wikibase_repo_prune2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:30:54] (03Merged) 10jenkins-bot: Enable MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [14:31:28] PROBLEM - Host mw2267 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:31] RECOVERY - MariaDB read only s6 #page on db1093 is OK: Version 10.1.44-MariaDB, Uptime 4351111s, read_only: True, event_scheduler: True, 117.46 QPS, connection latency: 0.005608s, query latency: 0.000566s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:31:46] rzl: Host mw2267 is DOWN [14:32:06] <_joe_> that definitely doesn't need rzl [14:32:08] RECOVERY - Host mw2267 is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [14:32:09] <_joe_> can someone check? [14:32:23] rebooted [14:32:28] I'm having a look [14:32:39] thanks volans [14:33:03] (03CR) 10Jason Linehan: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [14:33:11] nothing in SEL for mw2267 since 2018 [14:33:18] PROBLEM - MariaDB Replica Lag: m2 on db2133 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 32692264.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:33:29] (03CR) 10Krinkle: "This is an odd time for a deployment. Is this intentional?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [14:33:59] or grafana... 
[14:33:59] RECOVERY - MariaDB read only x1 #page on db1103 is OK: Version 10.4.13-MariaDB-log, Uptime 7187739s, read_only: True, event_scheduler: True, 21.53 QPS, connection latency: 0.002779s, query latency: 0.000433s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:34:00] or syslog [14:35:53] (03PS1) 10Krinkle: Revert "Enable MediaWiki client errors on commonswiki and metawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623547 [14:35:57] (03CR) 10Krinkle: [C: 03+2] Revert "Enable MediaWiki client errors on commonswiki and metawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623547 (owner: 10Krinkle) [14:36:23] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10AMooney) [14:36:42] (03Merged) 10jenkins-bot: Revert "Enable MediaWiki client errors on commonswiki and metawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623547 (owner: 10Krinkle) [14:36:56] (03CR) 10Jason Linehan: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [14:37:42] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Papaul) @Jgreen can you please provide me with the VLAN information for both servers? 
thanks [14:37:56] RECOVERY - Check the last execution of mediawiki_job_wikibase_repo_prune_test on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_wikibase_repo_prune_test https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:39:30] RECOVERY - MariaDB read only es4 #page on es1021 is OK: Version 10.4.13-MariaDB-log, Uptime 4598093s, read_only: True, event_scheduler: True, 35.10 QPS, connection latency: 0.005224s, query latency: 0.000568s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:39:36] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:39:36] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:40:48] godog: --^ [14:41:14] elukey: taking a look [14:41:56] 10Operations, 10Diff-blog, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Aklapper) [14:42:10] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:42:10] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:42:34] see just mentioning Filippo is enough [14:42:51] haha I wish, sadly it is a known rsyslog failure/segv :( [14:43:02] RECOVERY - MariaDB Replica Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:43:56] PROBLEM - MariaDB read only s8 on db1109 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 9106869s, event_scheduler: True, 284.02 QPS, connection latency: 0.002275s, query latency: 0.000756s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:45:38] PROBLEM - MariaDB read only s4 on db1081 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 8843522s, event_scheduler: True, 268.15 QPS, connection latency: 0.002750s, query latency: 0.000569s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:46:06] kormat: should I downtime those? [14:46:22] back now, i'll take it [14:47:09] see databases-, value is flapping, it is not just lag [14:47:28] it seems underlying data is unreliable [14:48:29] I will downtime db1077 to make sure it doesn't bother us [14:48:33] jynus: yes, that that's exactly why I ask, if I should downtime them more [14:48:52] kormat I think was taking that [14:48:58] Someone is asking on a talk page if the read-only time happen. [14:49:10] Trizek: yes, all done :) [14:49:11] Trizek: it happened already [14:49:19] :D [14:49:34] I forget, what was the final duration? 
[14:49:37] marostegui: see volans on -sre, he may have seen the cause [14:49:44] yep, I am following [14:49:51] cdanis: volans said 2m49s, I haven't checked his arithmetic [14:50:02] but we beat mark's target by eleven seconds :D [14:50:09] :D [14:50:15] uh oh [14:50:17] you remembered that [14:50:22] <_joe_> lol [14:50:26] next time I will keep quiet :) [14:50:27] The comment I mentioned: https://meta.wikimedia.org/w/index.php?title=Talk:Tech/Server_switch_2020&diff=20414361&oldid=20413558 [14:50:49] rzl: I didn't do the math precisely, might be +/- 1s [14:50:50] Trizek: that's lovely thank you, I may frame it [14:50:55] there were decimals :D [14:51:57] rzl: I took 2020-09-01 14:04:53,619 rzl 28797 [INFO log.py:161 in log_task_end] END (FAIL) [14:52:11] <_joe_> that's the correct one yes [14:52:15] although technically we had that failure, but within seconds it should have converged everywhere [14:52:20] <_joe_> yes [14:53:33] volans, do you have a translated version of your sentence about the time it took? :D [14:54:22] (03PS1) 10Jbond: role::mx: parameterise otrs db variables [puppet] - 10https://gerrit.wikimedia.org/r/623607 (https://phabricator.wikimedia.org/T244792) [14:54:24] (03PS1) 10Jbond: role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972) [14:54:46] Trizek: from 2020-09-01 14:07:36.305500 to 2020-09-01 14:04:53,619 [14:54:52] <_joe_> Trizek: read-only lasted 2 minutes 49 seconds [14:54:58] volans: he was asking for a human readable one ;) [14:55:07] yes, I was about to do the correct math [14:55:11] not an approximate one :D [14:55:16] Thanks volans! 
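Note on the duration arithmetic: the following is a minimal sketch, not part of the switchdc cookbooks, that recomputes the read-only duration from the timestamps quoted in this log. It assumes the read-only period runs from the SAL "read-only period starts at" timestamp (14:02:04.851006) to the END (FAIL) line of the first 07-set-readwrite run (14:04:53,619), as discussed above. The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the SAL entries; the cookbook log line quoted in
# the channel uses a comma before the fractional seconds, so it is normalised.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def read_only_duration(start: str, end: str) -> timedelta:
    """Return end - start for two timestamps copied from the log."""
    parsed_start = datetime.strptime(start.replace(",", "."), FMT)
    parsed_end = datetime.strptime(end.replace(",", "."), FMT)
    return parsed_end - parsed_start


# Values copied verbatim from the log above.
RO_START = "2020-09-01 14:02:04.851006"   # read-only period starts (SAL)
FIRST_RW = "2020-09-01 14:04:53,619"      # END (FAIL) of the first set-readwrite run
SECOND_RW = "2020-09-01 14:07:36.305500"  # read-only period ends (SAL, second run)

if __name__ == "__main__":
    # Duration under the assumption stated above.
    print("start -> first RW :", read_only_duration(RO_START, FIRST_RW))
    # The interval between the two set-readwrite runs, for comparison.
    print("first -> second RW:", read_only_duration(FIRST_RW, SECOND_RW))
```

The first interval rounds to the 2m49s figure quoted in the channel. The second is the gap between the two set-readwrite runs, which is what you get if the 14:07:36 and 14:04:53 timestamps are subtracted from each other instead.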
[14:55:18] <_joe_> mark: don't belittle our human linter [14:55:27] <_joe_> :P [14:55:46] (03CR) 10jerkins-bot: [V: 04-1] role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [14:56:05] wait I get 2 minutes 42 seconds [14:56:33] <_joe_> rzl: volans indicated the wrong times [14:56:53] <_joe_> the correct times are from 14:02:04.851006 to 2020-09-01 14:04:53,619 [14:56:55] oh was that including the duration of the set-RO step? [14:56:57] yeah okay [14:57:26] <_joe_> 2 mins 49 secs 242 ms [14:57:27] _joe_: why? [14:57:33] from SAL [14:57:33] 14:07 rzl@cumin1001: MediaWiki read-only period ends at: 2020-09-01 14:07:36.305500 [14:57:37] ah sorry [14:57:38] my bad [14:57:42] 14:02 rzl@cumin1001: MediaWiki read-only period starts at: 2020-09-01 14:02:04.851006 [14:57:45] bad paste [14:57:50] <_joe_> yes [14:57:51] yeag [14:57:56] I think it's very funny that we're arguing about sub-second precision on something that takes ~10 seconds to get re-read by all appservers :P [14:57:59] <_joe_> so ok, 2 min 49 secs [14:58:06] cdanis: some things are important [14:58:11] cdanis: we include that basically [14:58:13] because of the checks [14:58:14] <_joe_> cdanis: the irony wasn't lost on me ;) [14:58:15] 🤔 [14:58:31] <_joe_> I'm just trolling riccardo [14:58:40] <_joe_> you might have noticed I didn't include nanoseconds [14:58:42] volans: we include that *plus* some extra error of a few seconds time between retries [14:58:45] so again [14:58:51] talking about milliseconds is ridiculous :P [14:58:59] <_joe_> yes [14:59:17] don't spoil our party by talking abuot margins of error :-P [14:59:17] <_joe_> ofc the read-only time lasted between that time and ~ 10 seconds less [14:59:20] <_joe_> depending on user [14:59:41] apergos: tbh I expected more worrying about error bars from the multiple physicists here 🙃 [14:59:43] RECOVERY - MariaDB Replica Lag: m2 on db2133 is OK: OK
slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:59:47] <_joe_> anyways, this is probably the fastest we can go at this point [14:59:49] :-D [14:59:51] and also the start time is rounded up, we log before actually going RO [14:59:55] so it's definitely less [15:00:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:00:16] we could say that it's a higer lomit :D [15:00:18] *limit [15:00:27] <_joe_> anyways, the best testament to all this is decidedly https://meta.wikimedia.org/w/index.php?title=Talk:Tech/Server_switch_2020&diff=20414361&oldid=20413558 [15:00:30] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) The only chance to have it in A8 is if heze is decom before I get the server onsite. [15:00:33] for any given user is less or equal [15:00:44] cdanis: well, it's tricky too, because not all of our users are in the same inertial reference frame [15:00:56] _joe_: lol [15:01:14] for example, if one user is on a yacht traveling at some fraction of the speed of light relative to eqiad, [15:01:26] rzl: rotfl [15:01:35] not during the switchover [15:09:31] <_joe_> so regrouping for a sec: are we done with the switchover? [15:10:17] yep, I asked in -sre if anyone had lingering issues and it seems not [15:10:34] <_joe_> ok, cool! [15:10:49] great. thanks everyone. [15:11:05] πŸ‘ [15:12:43] 10Operations, 10OTRS: Research whether it makes sense to have OTRS installation in an HA setup - https://phabricator.wikimedia.org/T169322 (10akosiaris) 05Open→03Declined I am gonna decline this for now.
It's been 3 years with no action and no other similar incident that warrants this for now. We can reope... [15:13:17] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) Let's go for A6 then :) Thanks for checking! [15:13:22] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10LSobanski) [15:14:59] (03CR) 10Ottomata: [C: 03+2] camus - don't check eqiad topics while DC switchover to codfw is ongoing [puppet] - 10https://gerrit.wikimedia.org/r/623602 (owner: 10Ottomata) [15:15:29] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:15:34] (03Abandoned) 10Dzahn: switch deployment_server from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/623085 (owner: 10Dzahn) [15:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:20] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10RobH) [15:17:19] so the "mwmaint.discovery" record that apergos mentioned earlier. it should just affect which DC hosts the noc.wikimedia.org website. 
we could switch it if we want to [15:18:22] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10colewhite) p:05Triage→03Medium a:03colewhite [15:18:55] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10colewhite) [15:19:15] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10colewhite) [15:19:35] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10LSobanski) [15:22:42] (03CR) 10Elukey: [C: 03+2] admin: Add user klausman [puppet] - 10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [15:24:39] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10mark) Approved. [15:25:42] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Papaul) [15:25:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (10elukey) Also added Tobias' username to the ops LDAP group.
[15:25:48] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:18] (03PS1) 10Cwhite: admin: lsobanski onboarding [puppet] - 10https://gerrit.wikimedia.org/r/623612 (https://phabricator.wikimedia.org/T261760) [15:27:33] (03CR) 10jerkins-bot: [V: 04-1] admin: lsobanski onboarding [puppet] - 10https://gerrit.wikimedia.org/r/623612 (https://phabricator.wikimedia.org/T261760) (owner: 10Cwhite) [15:27:55] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10colewhite) [15:28:28] (03PS1) 10Vgutierrez: Release 1.3.1-4 [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/623614 (https://phabricator.wikimedia.org/T261632) [15:28:55] (03CR) 10jerkins-bot: [V: 04-1] Release 1.3.1-4 [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/623614 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [15:30:20] (03CR) 10Dzahn: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [15:30:48] (03PS2) 10Cwhite: admin: lsobanski onboarding [puppet] - 10https://gerrit.wikimedia.org/r/623612 (https://phabricator.wikimedia.org/T261760) [15:31:04] who is this sobanski new person?? Do we trust him?? 
:D [15:31:11] 🤔 [15:31:14] yea [15:32:03] elukey: Nothing to see here, move along [15:32:19] ahahahah [15:32:59] Or I will tell everyone whose fault it is that I'm here [15:38:56] (03PS6) 10Dzahn: prometheus: hiera()->lookup() and data types in exporters, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/621771 [15:42:18] (03CR) 10Jbond: [C: 03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/621771 (owner: 10Dzahn) [15:49:10] (03PS1) 10SBassett: Make wmcs db views username-suppression aware [puppet] - 10https://gerrit.wikimedia.org/r/623616 (https://phabricator.wikimedia.org/T205460) [15:50:49] (03CR) 10Rush: [C: 03+2] Make wmcs db views username-suppression aware [puppet] - 10https://gerrit.wikimedia.org/r/623616 (https://phabricator.wikimedia.org/T205460) (owner: 10SBassett) [15:52:46] (03PS3) 10Cwhite: admin, icinga: lsobanski onboarding [puppet] - 10https://gerrit.wikimedia.org/r/623612 (https://phabricator.wikimedia.org/T261760) [15:52:53] 10Operations, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (10RLazarus) [15:56:04] !log labsdb* puppet agent --test; sudo /usr/local/sbin/maintain-views --all-databases --table user --replace-all; sudo /usr/local/sbin/maintain-views --all-databases --table user_old --replace-all [15:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:56] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10akosiaris) >>! In T259908#6426710, @Papaul wrote: > @akosiaris welcome back . I hope you had a great vacation. You can proceed to the downtime, I will take care of power...
[15:58:00] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [16:04:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (10klausman) SSH confirmed working: ` ~ $ ssh stat1005.eqiad.wmnet Linux stat1005 4.19.0-10-amd64 #1 SMP Debian 4.19.132-1 (2020-07-24) x86_64 Debian GNU/Linux... [16:06:53] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10Papaul) off tomorrow Thursday [16:07:52] (03PS1) 10Arturo Borrero Gonzalez: cloud: bootstrap the cloudgw role/profile [puppet] - 10https://gerrit.wikimedia.org/r/623618 [16:12:00] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10akosiaris) Hi @Jclark-ctr. I think we can follow the same process for this as I outlined in T259908#6426689. Do you have a date preference? [16:13:09] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) p:05High→03Medium Some thoughts about this, in random order. They will be reused for the retro I plan to do next week. *... [16:13:11] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10akosiaris) >>! In T259908#6427243, @Papaul wrote: > off tomorrow Thursday It's a date! I 'll schedule downtime for about 6h (just to be on the safe side) on Thursday then.
[16:16:33] 10Puppet, 10DBA, 10SRE-tools, 10conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (10jcrespo) [16:22:11] (03CR) 10Marostegui: [C: 03+1] admin, icinga: lsobanski onboarding [puppet] - 10https://gerrit.wikimedia.org/r/623612 (https://phabricator.wikimedia.org/T261760) (owner: 10Cwhite) [16:39:11] (03Abandoned) 10Herron: lists: copy incoming mail to standby server [puppet] - 10https://gerrit.wikimedia.org/r/607612 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [16:52:06] (03CR) 10BryanDavis: [C: 03+1] prometheus: minimal default alerts for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/622557 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [17:00:26] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:02:12] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:13:09] o/ [17:14:54] Hi hauskatze [17:16:50] 10Operations, 10InternetArchiveBot, 10Traffic: Support TLSv1.3 in IABot - https://phabricator.wikimedia.org/T251414 (10Cyberpower678) 05Open→03Resolved a:03Cyberpower678 Added support in v2.0.7, but it will remain backward compatible with previous versions of PHP.
[17:22:59] (03CR) 10Bstorm: wikireplicas: create multiinstance roles and profiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:23:46] (03PS4) 10Bstorm: wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) [17:24:48] (03CR) 10Bstorm: wikireplicas: create multiinstance roles and profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:28:46] (03PS1) 10Bstorm: wikireplicas: test removing deprecated passwords module from role [puppet] - 10https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) [17:30:18] !log Starting wdqs deploy [17:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:07] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@7920fbe]: 0.3.46 [17:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:25] !log `wdqs1003` (the canary instance) is failing tests now, going to rollback [17:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:50] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@7920fbe]: 0.3.46 (duration: 03m 43s) [17:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:26] !log wdqs [canary] rollback complete, tests passing now. Will need to dig into source of failure [17:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:52] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:45] (03CR) 10Dzahn: [C: 03+1] "looks good to me, except the Icinga contact does not exist in the private repo yet, so having a contactgroup here with a member that does " [puppet] - 10https://gerrit.wikimedia.org/r/623612 (https://phabricator.wikimedia.org/T261760) (owner: 10Cwhite) [17:46:14] 10Puppet, 10DBA, 10SRE-tools, 10conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (10Volans) The context of the outdated info was confd stuck on one of the puppetmaster, so when one... [17:49:32] (03Abandoned) 10Dzahn: switch mwmaint backend from eqiad to codfw (noc.wikimedia.org) [dns] - 10https://gerrit.wikimedia.org/r/623089 (owner: 10Dzahn) [17:50:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [17:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:00] 10Operations, 10serviceops, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `releases1001.eqiad.wmnet` - releases1001.eqiad.wmnet (**PASS**) - Downtimed host on... 
[17:53:50] (03CR) 10Cwhite: [C: 03+2] admin, icinga: lsobanski onboarding [puppet] - 10https://gerrit.wikimedia.org/r/623612 (https://phabricator.wikimedia.org/T261760) (owner: 10Cwhite) [17:54:21] (03PS1) 10Hnowlan: api-gateway: Add mappings for ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) [17:57:03] (03CR) 10Bstorm: "PCC seems to come up clean https://puppet-compiler.wmflabs.org/compiler1002/24856/" [puppet] - 10https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:58:14] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:56] (03PS1) 10Dzahn: site/DHCP: decom releases1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/623625 (https://phabricator.wikimedia.org/T260742) [18:00:33] (03PS2) 10Bstorm: wikireplicas: test removing deprecated passwords module from role [puppet] - 10https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) [18:05:13] (03CR) 10Ppchelko: [C: 04-1] api-gateway: Add mappings for ratelimit service (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [18:10:13] (03PS1) 10Ssingh: wikidough: add an option to set the landing page [puppet] - 10https://gerrit.wikimedia.org/r/623630 (https://phabricator.wikimedia.org/T252132) [18:12:12] (03PS1) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [18:12:20] (03CR) 10Dzahn: [C: 03+2] site/DHCP: decom releases1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/623625 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [18:12:44] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, 
interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:12:54] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:13:44] (03PS1) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) [18:14:37] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/24857/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623630 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:15:42] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team), 10Release Pipeline (Blubber), 10Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (10thcipriani) [18:15:43] (03PS1) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/623634 (https://phabricator.wikimedia.org/T256973) [18:15:45] (03PS2) 10Hnowlan: api-gateway: Add mappings for ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) [18:17:07] (03PS2) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [18:17:13] (03CR) 10Hnowlan: api-gateway: Add mappings for ratelimit service (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [18:17:29] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10Nintendofan885) 
05Open→03Resolved [18:18:46] (03CR) 10Ppchelko: [C: 04-1] api-gateway: Add mappings for ratelimit service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [18:19:42] 10Operations, 10SRE-Access-Requests: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (10Nintendofan885) 05Open→03Resolved [18:23:21] * cdanis emailed Telia re: router interfaces alerts [18:26:02] (03Abandoned) 10Dzahn: decom releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [18:30:56] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:31:08] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:32:07] (03PS5) 10Bstorm: wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) [18:35:38] (03PS1) 10Dzahn: decom releases1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/623635 (https://phabricator.wikimedia.org/T260742) [18:39:55] (03CR) 10Dzahn: [C: 03+2] decom releases1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/623635 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [18:40:01] (03PS2) 10Dzahn: decom releases1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/623635 (https://phabricator.wikimedia.org/T260742) [18:48:10] (03CR) 10Bstorm: "Adding the requires naturally doesn't seem to harm existing roles https://puppet-compiler.wmflabs.org/compiler1001/24858/labsdb1009.eqiad."
[puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [18:53:52] (03PS6) 10Bstorm: wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) [18:54:16] (03PS2) 10Ahmon Dancy: WIP: Add support for dev realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [18:55:09] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add support for dev realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [18:58:00] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Dwisehaupt) @Papaul These hosts will both go in the frack-listenerdmz VLAN. [19:06:05] (03PS7) 10Bstorm: wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) [19:09:06] 10Operations, 10Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10Gehel) p:05Medium→03High [19:09:52] 10Operations, 10Discovery-Search, 10User-MoritzMuehlenhoff: Also use java::security on elasticsearch/relforge - https://phabricator.wikimedia.org/T251540 (10Gehel) p:05Medium→03High [19:12:58] (03CR) 10Cwhite: [C: 03+1] icinga: redirect to https if not already proxied [puppet] - 10https://gerrit.wikimedia.org/r/622566 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [19:13:27] (03CR) 10Bstorm: wikireplicas: create multiinstance roles and profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [19:13:46] (03CR) 10CRusnov: [C: 03+1] "Looks good to me!"
[cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [19:15:13] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037 (10Gehel) [19:15:16] (03CR) 10Herron: [C: 03+1] icinga: redirect to https if not already proxied [puppet] - 10https://gerrit.wikimedia.org/r/622566 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [19:27:02] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/623523 (owner: 10Muehlenhoff) [19:27:25] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Gehel) [19:27:48] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Enable nginx prometheus metrics for all elastic nodes - https://phabricator.wikimedia.org/T216681 (10Gehel) 05Open→03Declined This has not been a need for a while [19:29:24] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/623630 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [19:29:54] 10Operations, 10Discovery, 10Discovery-Search, 10MediaWiki-Search, 10observability: Search service monitoring should fail if search results only return exact matches and suggestions don't work - https://phabricator.wikimedia.org/T101914 (10Gehel) p:05Medium→03Triage [19:30:56] (03Abandoned) 10Dzahn: aptrepo: switch active server from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/623083 (owner: 10Dzahn) [19:30:58] (03Abandoned) 10Dzahn: planet: switch backend from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/623087 (owner: 10Dzahn) [19:31:18] (03Abandoned) 10Dzahn: people: switch backend from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/623088 (owner: 10Dzahn) [19:31:20] (03Abandoned) 10Dzahn: switch webserver_misc_apps from eqiad to codfw [dns] -
10https://gerrit.wikimedia.org/r/623090 (owner: 10Dzahn) [19:37:08] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on kafka-jumbo1002 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {flush_l1d, md_clear} https://wikitech.wikimedia.org/wiki/Microcode [19:39:37] (03PS3) 10Ahmon Dancy: WIP: Add support for dev realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [19:39:51] (03CR) 10Ssingh: [C: 03+2] wikidough: add an option to set the landing page [puppet] - 10https://gerrit.wikimedia.org/r/623630 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [19:52:59] (03PS4) 10Ahmon Dancy: Add support for dev realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [20:06:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (10RobH) p:05Triage→03Medium [20:07:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (10RobH) [20:08:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (10RobH) [20:11:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (10RobH) [20:14:48] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:25:42] 10Operations, 10Android-app-Bugs, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 4 others: Incorrect language variant returned for PCS endpoints - https://phabricator.wikimedia.org/T249284 (10holger.knust) Is this related to T256491?
[20:29:37] 10Operations, 10MediaWiki-extensions-CodeReview, 10Platform Engineering: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10holger.knust) [20:35:20] 10Operations, 10WM-Bot: wm-bot doesn't set charset=utf-8, which breaks (amongst other things) emoji rendering - https://phabricator.wikimedia.org/T250104 (10CDanis) @RLazarus encountered this today while doing some retrospective on #datacenter-switchover. @bd808 do you know who runs/owns wm-bot? [20:40:12] 10Operations, 10WM-Bot: wm-bot doesn't set charset=utf-8, which breaks (amongst other things) emoji rendering - https://phabricator.wikimedia.org/T250104 (10bd808) >>! In T250104#6428142, @CDanis wrote: > @bd808 do you know who runs/owns wm-bot? I would consider @Petrb to be the primary "owner" of wm-bot, but... [21:55:49] (03PS7) 10Dzahn: prometheus: hiera()->lookup() and data types in exporters, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/621771 [21:57:30] 10Operations, 10serviceops, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10Dzahn) 05Stalled→03Resolved [21:57:34] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) [22:16:52] (03PS8) 10Bstorm: wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) [22:27:21] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10Dzahn) a:05Dzahn→03None [22:37:42] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain.
- https://phabricator.wikimedia.org/T261803 (10AntiCompositeNumber) [22:39:13] !log [urbanecm@mwmaint2001 ~]$ mwscript extensions/OATHAuth/maintenance/disableOATHAuthForUser.php --wiki=sysop_itwiki Pierpao (T261722) [22:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:21] T261722: Please reset 2FA for User:Pierpao on sysop_itwiki - https://phabricator.wikimedia.org/T261722 [22:45:44] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (10colewhite) p:05Triage→03Medium [22:46:02] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for holger - https://phabricator.wikimedia.org/T261754 (10colewhite) p:05Triage→03Medium a:03colewhite [22:48:57] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for holger - https://phabricator.wikimedia.org/T261754 (10colewhite) [22:49:52] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for holger - https://phabricator.wikimedia.org/T261754 (10colewhite) [22:51:21] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10colewhite) p:05Triage→03Medium [22:52:52] (03CR) 10Dzahn: [C: 04-1] "ran compiler on * and found actual issue: https://puppet-compiler.wmflabs.org/compiler1001/24860/wdqs2002.codfw.wmnet/change.wdqs2002.codf" [puppet] - 10https://gerrit.wikimedia.org/r/621771 (owner: 10Dzahn) [22:53:48] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for UG Greece - https://phabricator.wikimedia.org/T261607 (10colewhite) a:03colewhite [22:55:36] (03PS8) 10Dzahn: prometheus: hiera()->lookup() and data types in exporters, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/621771 [23:24:24] (03CR) 10Dzahn: [C: 03+2] "amended and fixed for wdqs: https://puppet-compiler.wmflabs.org/compiler1003/24861/ was already
noop on everything else: https://puppet-co" [puppet] - 10https://gerrit.wikimedia.org/r/621771 (owner: 10Dzahn) [23:25:05] (03CR) 10Dzahn: [C: 03+2] prometheus: hiera()->lookup() and data types in exporters, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/621771 (owner: 10Dzahn) [23:28:34] (03PS3) 10Dzahn: order langlist.tmpl entries alphabetically [dns] - 10https://gerrit.wikimedia.org/r/623143 (https://phabricator.wikimedia.org/T253439) (owner: 10Gerrit maintenance bot) [23:29:11] (03PS4) 10Dzahn: order langlist.tmpl entries alphabetically [dns] - 10https://gerrit.wikimedia.org/r/623143 (https://phabricator.wikimedia.org/T253439) (owner: 10Gerrit maintenance bot) [23:30:00] (03CR) 10Dzahn: [C: 03+2] order langlist.tmpl entries alphabetically [dns] - 10https://gerrit.wikimedia.org/r/623143 (https://phabricator.wikimedia.org/T253439) (owner: 10Gerrit maintenance bot) [23:40:28] (03PS1) 10Dzahn: cache::base: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623662 [23:45:00] (03CR) 10Dzahn: [C: 03+2] Remove obsolete chromium-admin group [puppet] - 10https://gerrit.wikimedia.org/r/623523 (owner: 10Muehlenhoff) [23:45:07] (03PS2) 10Dzahn: Remove obsolete chromium-admin group [puppet] - 10https://gerrit.wikimedia.org/r/623523 (owner: 10Muehlenhoff) [23:45:12] (03PS2) 10Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 [23:46:20] (03CR) 10jerkins-bot: [V: 04-1] cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn)