[00:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T0000).
[00:01:28] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:43] SRE, ops-eqiad, SRE-swift-storage, User-fgiunchedi: Decom ms-be[1019-1026] - https://phabricator.wikimedia.org/T272836 (wiki_willy) a:Cmjohnson
[00:04:56] ops-eqiad, cloud-services-team (Hardware): labstore1007 crashed after storage controller errors--replace disk? - https://phabricator.wikimedia.org/T281045 (wiki_willy) a:Jclark-ctr
[00:06:18] !log T280382 `wdqs1013.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv`
[00:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:26] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[00:06:56] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:59] (PS1) Tim Starling: Fix query error in ImageListPager [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/683403 (https://phabricator.wikimedia.org/T281405)
[00:09:37] (CR) Tim Starling: [C: +2] Fix query error in ImageListPager [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/683403 (https://phabricator.wikimedia.org/T281405) (owner: Tim Starling)
[00:11:13] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage`
[00:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:31] ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RKemper)
[00:31:21] (Merged) jenkins-bot: Fix query error in ImageListPager [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/683403 (https://phabricator.wikimedia.org/T281405) (owner: Tim Starling)
[00:34:12] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (Legoktm) >>! In T279482#7044044, @felipedafonseca wrote: > Hi. I think there is an error in the list, or I am not sure exactly how to manage it. I received a pending approval notice by email, b...
[00:40:58] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.3/includes/specials/pagers/ImageListPager.php: T281405 (duration: 01m 08s)
[00:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:06] T281405: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'img_actor' - https://phabricator.wikimedia.org/T281405
[00:58:03] RECOVERY - snapshot of s2 in eqiad on alert1001 is OK: Last snapshot for s2 at eqiad (db1171.eqiad.wmnet:3312) taken on 2021-04-28 23:27:11 (835 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:58:14] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RKemper)
[01:01:30] SRE, Wikimedia-Mailing-lists: Mailman3 is not importing list footers ("templates") as expected - https://phabricator.wikimedia.org/T281425 (Legoktm) a:Legoktm Templates are very not-properly implemented upstream so my plan is hackish: * Set global English defaults for templates via Puppet. This will...
[01:03:32] (PS5) Jforrester: wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default [mediawiki-config] - https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712)
[01:04:13] (CR) Jforrester: [C: +1] "Safe to deploy whenever." [mediawiki-config] - https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[01:13:03] (PS1) Jforrester: Undeploy JADE from production, Part I [mediawiki-config] - https://gerrit.wikimedia.org/r/683452 (https://phabricator.wikimedia.org/T281418)
[01:13:05] (PS1) Jforrester: Undeploy JADE from production, Part II [mediawiki-config] - https://gerrit.wikimedia.org/r/683453 (https://phabricator.wikimedia.org/T281418)
[01:13:07] (PS1) Jforrester: Undeploy JADE from production, Part III [mediawiki-config] - https://gerrit.wikimedia.org/r/683454 (https://phabricator.wikimedia.org/T281418)
[01:18:14] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[01:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:45] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RKemper)
[01:19:11] !log T280382 Aborted data transfer; `wdqs2007` is hosed (see https://phabricator.wikimedia.org/T281437)
[01:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:20] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[01:19:43] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:19:43] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[01:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:20:06] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:20:07] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[01:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:02] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:21:03] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[01:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:40] (PS2) Krinkle: externalstore: convert some log messages to WARNING [core] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/682720
[01:22:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:22:10] (Abandoned) Krinkle: externalstore: convert some log messages to WARNING [core] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/682720 (owner: Krinkle)
[01:23:04] (PS6) Krinkle: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677734
[01:23:06] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage`
[01:23:08] (CR) Krinkle: [C: +2] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677734 (owner: Krinkle)
[01:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:31] AaronSchulz: fyi ^
[01:24:21] (Merged) jenkins-bot: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677734 (owner: Krinkle)
[01:29:41] * Krinkle staging on mwdebug1002
[01:44:40] !log krinkle@deploy1002 Synchronized wmf-config/mc.php: I5869b3c3ba4a (duration: 01m 08s)
[01:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:36] (PS1) Legoktm: mailman3: Puppetize Amir's migration script [puppet] - https://gerrit.wikimedia.org/r/683462
[01:46:38] (PS1) Legoktm: mailman3: Improve migration script [puppet] - https://gerrit.wikimedia.org/r/683463
[01:47:05] (CR) jerkins-bot: [V: -1] mailman3: Puppetize Amir's migration script [puppet] - https://gerrit.wikimedia.org/r/683462 (owner: Legoktm)
[01:47:16] (CR) jerkins-bot: [V: -1] mailman3: Improve migration script [puppet] - https://gerrit.wikimedia.org/r/683463 (owner: Legoktm)
[01:48:20] (PS2) Legoktm: mailman3: Improve migration script [puppet] - https://gerrit.wikimedia.org/r/683463
[01:48:57] (CR) Legoktm: "I'm not really inclined to fix the shellcheck errors since I'm rewriting it in the next commit." [puppet] - https://gerrit.wikimedia.org/r/683462 (owner: Legoktm)
[01:54:04] Krinkle: I see it
[01:55:36] (CR) Aaron Schulz: "Blocks https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/683023/1" [puppet] - https://gerrit.wikimedia.org/r/654330 (owner: Aaron Schulz)
[02:14:23] (PS2) Aaron Schulz: Set $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - https://gerrit.wikimedia.org/r/683023
[02:14:25] (PS1) Aaron Schulz: Set $wgCentralAuthTokenCacheType to mcrouter-master-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/683465
[02:26:29] (CR) Reedy: [C: -2] "Needs to wait for wmf.4 to land everywhere and be stable. So potentially deployable from the 7th May." [mediawiki-config] - https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[02:44:12] !log milimetric@deploy1002 Started deploy [analytics/refinery@740226b]: Hotfix for referrer job
[02:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:58:31] PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: archiva.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:52] !log milimetric@deploy1002 Finished deploy [analytics/refinery@740226b]: Hotfix for referrer job (duration: 14m 40s)
[02:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:00] !log milimetric@deploy1002 Started deploy [analytics/refinery@740226b] (thin): Hotfix for referrer job
[02:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:07] !log milimetric@deploy1002 Finished deploy [analytics/refinery@740226b] (thin): Hotfix for referrer job (duration: 00m 06s)
[02:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:03:15] RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:33] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:23:20] (Abandoned) Aaron Schulz: Set $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - https://gerrit.wikimedia.org/r/683023 (owner: Aaron Schulz)
[03:25:49] (PS2) Aaron Schulz: Set $wgCentralAuthTokenCacheType to mcrouter-master-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392)
[03:40:55] PROBLEM - SSH on wdqs2007 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:46:24] SRE, DBA, Wikimedia-Mailing-lists: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Legoktm)
[03:47:55] SRE, DBA, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Legoktm)
[04:16:30] SRE, DBA, Wikimedia-Mailing-lists, Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (Marostegui) p:Triage→Medium Let me know when you want this to be done, if it requires coordination with you or can be done any...
[04:23:05] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (lists1001), Fresh: 94 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:27:50] (PS1) Marostegui: db1118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/683473 (https://phabricator.wikimedia.org/T278214)
[04:27:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118 for reimage', diff saved to https://phabricator.wikimedia.org/P15623 and previous config saved to /var/cache/conftool/dbconfig/20210429-042757-marostegui.json
[04:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:24] (CR) Marostegui: [C: +2] db1118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/683473 (https://phabricator.wikimedia.org/T278214) (owner: Marostegui)
[04:36:10] (PS1) Marostegui: instances.yaml: Add db1156 to dbctl [puppet] - https://gerrit.wikimedia.org/r/683474 (https://phabricator.wikimedia.org/T258361)
[04:36:44] (CR) Marostegui: [C: +2] instances.yaml: Add db1156 to dbctl [puppet] - https://gerrit.wikimedia.org/r/683474 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[04:38:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1156 to dbctl T258361', diff saved to https://phabricator.wikimedia.org/P15624 and previous config saved to /var/cache/conftool/dbconfig/20210429-043812-marostegui.json
[04:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:21] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[04:38:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1156 into s2 for the first time with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15625 and previous config saved to /var/cache/conftool/dbconfig/20210429-043857-marostegui.json
[04:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:49] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1156 pooled in s2 with minimal weight
[04:41:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1118.eqiad.wmnet with reason: REIMAGE
[04:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1118.eqiad.wmnet with reason: REIMAGE
[04:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1156 into s2 for the first time with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15626 and previous config saved to /var/cache/conftool/dbconfig/20210429-044458-marostegui.json
[04:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:06] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[04:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1156 into s2 for the first time with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15627 and previous config saved to /var/cache/conftool/dbconfig/20210429-045015-marostegui.json
[04:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:23] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[04:55:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1156 into s2 for the first time with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15629 and previous config saved to /var/cache/conftool/dbconfig/20210429-045557-marostegui.json
[04:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:05] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[05:00:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15630 and previous config saved to /var/cache/conftool/dbconfig/20210429-050045-root.json
[05:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:59] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) Automatically pooling db1156 into s2.
[05:01:08] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:01:27] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) All the hosts in this task have been productionized. Pending: decommission the old ones.
[05:06:00] (PS1) Marostegui: db1083: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/683476 (https://phabricator.wikimedia.org/T278214)
[05:06:50] (CR) Marostegui: [C: +2] db1083: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/683476 (https://phabricator.wikimedia.org/T278214) (owner: Marostegui)
[05:09:05] (PS1) Legoktm: mailman3: Properly import templates for admins [puppet] - https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425)
[05:09:53] (CR) jerkins-bot: [V: -1] mailman3: Properly import templates for admins [puppet] - https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425) (owner: Legoktm)
[05:10:43] (PS2) Legoktm: mailman3: Properly import templates for admins [puppet] - https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425)
[05:12:03] (PS3) Legoktm: mailman3: Properly import templates for admins [puppet] - https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425)
[05:15:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 15%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15631 and previous config saved to /var/cache/conftool/dbconfig/20210429-051549-root.json
[05:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 for tables checking', diff saved to https://phabricator.wikimedia.org/P15632 and previous config saved to /var/cache/conftool/dbconfig/20210429-052146-marostegui.json
[05:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:22] !log Check tables on db1121 (this will cause lag on s4 commonswiki, on wikireplicas)
[05:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:05] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 95 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:30:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 20%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15633 and previous config saved to /var/cache/conftool/dbconfig/20210429-053052-root.json
[05:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:53] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:45:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15634 and previous config saved to /var/cache/conftool/dbconfig/20210429-054556-root.json
[05:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:30] (CR) Giuseppe Lavagetto: "I'm not sure I see a good reason for this change:" [puppet] - https://gerrit.wikimedia.org/r/654330 (owner: Aaron Schulz)
[06:01:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 30%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15635 and previous config saved to /var/cache/conftool/dbconfig/20210429-060100-root.json
[06:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 40%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15636 and previous config saved to /var/cache/conftool/dbconfig/20210429-061603-root.json
[06:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:18] (CR) Filippo Giunchedi: [C: +1] remove icinga[12]001 addresses from firewall rules [homer/public] - https://gerrit.wikimedia.org/r/683420 (https://phabricator.wikimedia.org/T279601) (owner: Herron)
[06:22:56] (CR) Filippo Giunchedi: [C: +1] remove all references to icinga1001 [puppet] - https://gerrit.wikimedia.org/r/682992 (https://phabricator.wikimedia.org/T279601) (owner: Herron)
[06:23:14] (CR) Filippo Giunchedi: [C: +1] remove all references to icinga2001 [puppet] - https://gerrit.wikimedia.org/r/682999 (https://phabricator.wikimedia.org/T279602) (owner: Herron)
[06:28:56] (PS1) Marostegui: db1121: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/683483 (https://phabricator.wikimedia.org/T280492)
[06:29:31] (CR) Marostegui: [C: +2] db1121: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/683483 (https://phabricator.wikimedia.org/T280492) (owner: Marostegui)
[06:31:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15637 and previous config saved to /var/cache/conftool/dbconfig/20210429-063107-root.json
[06:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:52] SRE, DBA, Data-Persistence-Backup, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (jcrespo) Will update here next time db backups run there. I also saw a filesystem backup of 1.6GB here: https://grafana.wikimedia.org/d/413r2vbWk/bacu...
[06:39:00] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:43:38] SRE, ops-codfw, DC-Ops, observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (fgiunchedi) >>! In T265435#7042614, @wiki_willy wrote: > Hi @fgiunchedi - Chatsworth has been pretty flexible with the amount of time we have for testing it, so I think we should be...
[06:46:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 60%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15639 and previous config saved to /var/cache/conftool/dbconfig/20210429-064611-root.json
[06:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:42] (PS1) Marostegui: Revert "db1083: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/683405
[06:53:25] (CR) Marostegui: [C: +2] Revert "db1083: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/683405 (owner: Marostegui)
[06:53:51] !log add 100G to prometheus/ops in eqiad
[06:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1083 (re)pooling @ 10%: Repool db1083', diff saved to https://phabricator.wikimedia.org/P15640 and previous config saved to /var/cache/conftool/dbconfig/20210429-065445-root.json
[06:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:01] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:57:57] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:58:17] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:01:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15641 and previous config saved to /var/cache/conftool/dbconfig/20210429-070114-root.json
[07:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:40] (CR) Muehlenhoff: "This is good to go, the conf2* hosts have been decommissioned." [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo)
[07:05:47] (CR) Muehlenhoff: [C: +1] bacula: Revert TLS 1.0 downgrade on storage servers (including director) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo)
[07:09:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1083 (re)pooling @ 25%: Repool db1083', diff saved to https://phabricator.wikimedia.org/P15642 and previous config saved to /var/cache/conftool/dbconfig/20210429-070949-root.json
[07:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:06] (CR) Jcrespo: "yay! Except that now I am fully aware no backups are running of etc (or zookeeper)." [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo)
[07:16:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 80%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15643 and previous config saved to /var/cache/conftool/dbconfig/20210429-071618-root.json
[07:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:57] (PS1) Marostegui: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/683515 (https://phabricator.wikimedia.org/T273281)
[07:17:13] (PS28) Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442)
[07:17:15] (PS3) Jcrespo: bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182)
[07:17:35] (PS4) Jcrespo: bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182)
[07:18:39] (CR) Ayounsi: customscript: fix VIP assignment in PuppetDB import (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/683365 (owner: Volans)
[07:19:32] (CR) Ayounsi: [C: +1] remove icinga[12]001 addresses from firewall rules [homer/public] - https://gerrit.wikimedia.org/r/683420 (https://phabricator.wikimedia.org/T279601) (owner: Herron)
[07:20:22] (CR) Jcrespo: "@JMeybohm please check if you have time to reenable etcd backups (following Joe's guidance), as I believe they run no more. I still think " [puppet] - https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: Jcrespo)
[07:24:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1083 (re)pooling @ 40%: Repool db1083', diff saved to https://phabricator.wikimedia.org/P15644 and previous config saved to /var/cache/conftool/dbconfig/20210429-072453-root.json
[07:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:40] (CR) Jcrespo: [C: +1] "Looks good, but double check server configuration (read_only), remove incoming replication, etc." [mediawiki-config] - https://gerrit.wikimedia.org/r/683515 (https://phabricator.wikimedia.org/T273281) (owner: Marostegui)
[07:25:57] (CR) Marostegui: "yep, thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/683515 (https://phabricator.wikimedia.org/T273281) (owner: Marostegui)
[07:26:05] (CR) Marostegui: [C: +2] db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/683515 (https://phabricator.wikimedia.org/T273281) (owner: Marostegui)
[07:26:51] (Merged) jenkins-bot: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/683515 (https://phabricator.wikimedia.org/T273281) (owner: Marostegui)
[07:28:15] (CR) Muehlenhoff: [C: +1] bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo)
[07:28:29] !log marostegui@deploy1002 Synchronized wmf-config/db-eqiad.php: Depool pc1007 (duration: 01m 08s)
[07:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:45] !log Stop mysql and upgrade kernel on pc1007
[07:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:01] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/683526
[07:29:09] (CR) Marostegui: [C: -2] "not ready yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/683526 (owner: Marostegui)
[07:30:39] (CR) Jcrespo: [C: +2] bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo)
[07:31:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 90%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15645 and previous config saved to /var/cache/conftool/dbconfig/20210429-073122-root.json
[07:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:49] (CR) Jcrespo: [C: +2] bacula: Revert TLS 1.0 downgrade on storage servers (including director) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo)
[07:33:16] SRE, ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (hashar) + @Volans due to overall knowledge about python linters / tox etc. I don't have the issue locally. From the output of `py35-flake8` I grabbed the list of installed packages an...
[07:33:58] (PS3) Volans: customscript: fix VIP assignment in PuppetDB import [software/netbox-extras] - https://gerrit.wikimedia.org/r/683365
[07:36:16] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/python-poolcounter] - https://gerrit.wikimedia.org/r/683517
[07:38:10] (CR) Ayounsi: [C: +1] customscript: fix VIP assignment in PuppetDB import [software/netbox-extras] - https://gerrit.wikimedia.org/r/683365 (owner: Volans)
[07:39:12] (PS8) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[07:39:38] (CR) jerkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [software/python-poolcounter] - https://gerrit.wikimedia.org/r/683517 (owner: Hashar)
[07:39:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1083 (re)pooling @ 50%: Repool db1083', diff saved to https://phabricator.wikimedia.org/P15646 and previous config saved to /var/cache/conftool/dbconfig/20210429-073956-root.json
[07:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:49] SRE, Data-Persistence-Backup, Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (jcrespo) Doing on backup1001|1002|2001|2002: ` rm /etc/ssl/openssl.cnf apt install --reinstall -o Dpkg::Options::="--force-confask...
[07:46:09] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683526 (owner: 10Marostegui) [07:46:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Slowly pool into s2 db1156', diff saved to https://phabricator.wikimedia.org/P15647 and previous config saved to /var/cache/conftool/dbconfig/20210429-074625-root.json [07:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:50] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683526 (owner: 10Marostegui) [07:48:16] !log marostegui@deploy1002 Synchronized wmf-config/db-eqiad.php: Repool pc1007 (duration: 01m 07s) [07:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:24] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) [07:48:39] !log rolling restart of bacula hosts T273182 [07:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:49] T273182: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 [07:52:39] 10SRE, 10ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (10Volans) In `setup.py` there is a very specific version of prospector from 2018 (`'prospector[with_everything]==1.1.6.2'`). This, in conjunction with the fact that the same virtualenv i... [07:52:45] 10SRE, 10ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (10hashar) I have nuked the cache, but it still pick up the "wrong" flake8: ` Collecting flake8 Downloading flake8-3.9.1-py2.py3-none-any.whl (73 kB) ` And it picked up flake8-3.5.0 :-... 
[07:53:29] 10SRE, 10ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (10Volans) Also if you want to support bullseye it would be good to add 3.9 support to tox and setup.py too. [07:54:06] !log Upgrade kernel on db2089 [07:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:24] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [07:54:44] (03CR) 10Volans: customscript: fix VIP assignment in PuppetDB import (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683365 (owner: 10Volans) [07:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1083 (re)pooling @ 60%: Repool db1083', diff saved to https://phabricator.wikimedia.org/P15648 and previous config saved to /var/cache/conftool/dbconfig/20210429-075500-root.json [07:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:32] (03CR) 10Volans: [C: 03+2] customscript: fix VIP assignment in PuppetDB import [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683365 (owner: 10Volans) [08:01:21] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] kubernetes::deployment_server: also add kafka broker, pass CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [08:03:04] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#7040458, @dpifke wrote: > https://varnish-cache.org/docs/trunk/users-guide/vcl-grace.html An important th... [08:03:30] (03CR) 10JMeybohm: "> Sadly, while this is desirable, etcd3 needs the backup script to be adapted a bit." 
[puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [08:03:38] (03PS4) 10JMeybohm: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [08:05:16] (03PS4) 10Legoktm: mailman3: Properly import templates for admins [puppet] - 10https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425) [08:05:18] (03PS1) 10Legoktm: mailman3: Fix URL postorius provides to fetch templates [puppet] - 10https://gerrit.wikimedia.org/r/683520 (https://phabricator.wikimedia.org/T281425) [08:05:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29269/console" [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [08:08:29] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (10Legoktm) I think it can be done anytime but @ladsgroup should confirm. 
[08:08:54] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (10Marostegui) Thanks - I will wait for the confirmation [08:10:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1083 (re)pooling @ 70%: Repool db1083', diff saved to https://phabricator.wikimedia.org/P15649 and previous config saved to /var/cache/conftool/dbconfig/20210429-081004-root.json [08:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:17] (03PS1) 10MMandere: Add mmandere shell account [puppet] - 10https://gerrit.wikimedia.org/r/683522 (https://phabricator.wikimedia.org/T281344) [08:12:31] !log Upgrade kernel on db2109 [08:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:34] (03PS5) 10JMeybohm: configcluster: Enable etcd v2 backups [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [08:14:05] (03PS1) 10Volans: customscript: fix typo in interface automation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683524 [08:14:12] 10SRE, 10serviceops: Support etcd v3 backups with ::etcd::backup - https://phabricator.wikimedia.org/T281447 (10JMeybohm) [08:16:08] (03CR) 10JMeybohm: [C: 03+2] configcluster: Enable etcd v2 backups [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [08:18:19] 10SRE, 10Data-Persistence-Backup, 10Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (10jcrespo) [08:18:24] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10jcrespo) [08:19:00] (03CR) 10Volans: [C: 03+2] customscript: fix typo in interface automation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683524 (owner: 
10Volans) [08:19:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/683522 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [08:19:42] !log Upgrade kernel on db2114 [08:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:40] metrics from backups may pause for a second while director restarts [08:23:16] (03CR) 10Ayounsi: [C: 03+1] customscript: fix typo in interface automation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683524 (owner: 10Volans) [08:23:24] 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10serviceops: decommission conf200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T281374 (10JMeybohm) >>! In T281374#7042326, @ops-monitoring-bot wrote: > - COMMON_STEPS (**FAIL**) > - **Failed to run the sre.dns.netbox cookbook**: Cumin execution fa... [08:23:25] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v0.2.7 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/683256 (owner: 10Volans) [08:23:41] (03PS1) 10WMDE-Fisch: Enable suggested values in TemplateData and VisualEditor InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683547 (https://phabricator.wikimedia.org/T273857) [08:23:43] (03PS1) 10WMDE-Fisch: Enable suggested values in TemplateData and VisualEditor CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683548 (https://phabricator.wikimedia.org/T273857) [08:24:01] 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10serviceops: decommission conf200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T281374 (10JMeybohm) [08:24:56] 10SRE, 10serviceops: Support etcd v3 backups with ::etcd::backup - https://phabricator.wikimedia.org/T281447 (10JMeybohm) p:05Triage→03Medium [08:25:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1083 (re)pooling @ 80%: Repool db1083', diff saved to https://phabricator.wikimedia.org/P15651 and previous config saved to 
/var/cache/conftool/dbconfig/20210429-082507-root.json [08:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:30] 10SRE, 10serviceops: Support etcd v3 backups with ::etcd::backup - https://phabricator.wikimedia.org/T281447 (10JMeybohm) [08:25:33] 10SRE, 10serviceops: Support proxying to etcd v3 storage on buster or later - https://phabricator.wikimedia.org/T275600 (10JMeybohm) [08:25:36] !log volans@deploy1002 Started deploy [homer/deploy@89cd07c]: Release v0.2.7 [08:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:20] (03CR) 10DCausse: rdf-streaming-updater: enable HA capability (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [08:26:22] (03PS5) 10DCausse: rdf-streaming-updater: enable HA capability [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [08:26:24] (03PS4) 10DCausse: rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [08:26:27] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10JMeybohm) 05Open→03Resolved [08:27:48] !log Upgrade kernel on db2115 [08:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:32] (03CR) 10DCausse: rdf-streaming-updater: use session mode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [08:28:44] !log volans@deploy1002 Finished deploy [homer/deploy@89cd07c]: Release v0.2.7 (duration: 03m 08s) [08:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:27] !log Upgrade kernel on db2120 [08:33:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:51] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) [08:38:21] 10SRE, 10Data-Persistence-Backup, 10Patch-For-Review: Revert OpenSSL min version configuration introduced for bacula compatibility - https://phabricator.wikimedia.org/T273182 (10jcrespo) 05Open→03Resolved This has been successfully reverted and a backup each has been run from both stretch and buster host... [08:38:31] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable suggested values in TemplateData and VisualEditor CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683548 (https://phabricator.wikimedia.org/T273857) (owner: 10WMDE-Fisch) [08:39:12] !log Upgrade kernel on db2121 [08:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:20] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable suggested values in TemplateData and VisualEditor InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683547 (https://phabricator.wikimedia.org/T273857) (owner: 10WMDE-Fisch) [08:40:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1083 (re)pooling @ 100%: Repool db1083', diff saved to https://phabricator.wikimedia.org/P15652 and previous config saved to /var/cache/conftool/dbconfig/20210429-084011-root.json [08:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:41] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29270/console" [puppet] - 10https://gerrit.wikimedia.org/r/683522 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [08:41:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:42:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ema) p:05Triage→03Medium [08:42:40] (03CR) 10Elukey: Add mmandere shell account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683522 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [08:42:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ema) [08:44:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:46:16] !log Upgrade kernel on db2122 [08:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:52] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:00] (03PS1) 10JMeybohm: Rename role configcluster_stretch to configcluster [puppet] - 10https://gerrit.wikimedia.org/r/683551 (https://phabricator.wikimedia.org/T271573) [08:50:54] !log Upgrade kernel on db2124 [08:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:26] !log Upgrade kernel on db2125 [08:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:26] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:56:19] 10SRE, 10Platform Engineering, 10Services, 10Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - 
https://phabricator.wikimedia.org/T278516 (10Mvolz) >>! In T278516#7043057, @Legoktm wrote: >>>! In T278516#7041242, @Mvolz wrote: >> I've replied on the list... [08:57:45] !log Upgrade kernel on db2133 [08:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:36] (03PS1) 10Volans: setup.py: limit max version of pynetbox [software/homer] - 10https://gerrit.wikimedia.org/r/683554 [09:01:33] !log stop replication and checking data of db2100:s7 [09:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:51] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (10Ladsgroup) Yes. Sorry I just woke up. Let's do it! [09:06:07] (03CR) 10Ayounsi: [C: 03+1] setup.py: limit max version of pynetbox [software/homer] - 10https://gerrit.wikimedia.org/r/683554 (owner: 10Volans) [09:06:09] (03CR) 10Volans: [C: 03+2] setup.py: limit max version of pynetbox [software/homer] - 10https://gerrit.wikimedia.org/r/683554 (owner: 10Volans) [09:06:35] (03CR) 10Ladsgroup: [C: 03+1] "But I like bash :D" [puppet] - 10https://gerrit.wikimedia.org/r/683463 (owner: 10Legoktm) [09:06:38] (03CR) 10JMeybohm: [C: 04-1] networkpolicy: add autogenerated egress rules (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/683379 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [09:07:19] (03CR) 10Ladsgroup: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/683462 (owner: 10Legoktm) [09:07:58] (03CR) 10Ladsgroup: [C: 03+1] mailman3: Fix URL postorius provides to fetch templates [puppet] - 10https://gerrit.wikimedia.org/r/683520 (https://phabricator.wikimedia.org/T281425) (owner: 10Legoktm) [09:08:06] (03PS1) 10Legoktm: mailman3: Fork and improve upstream templates [puppet] - 10https://gerrit.wikimedia.org/r/683555 (https://phabricator.wikimedia.org/T281425) 
[09:08:30] (03CR) 10Ladsgroup: [C: 03+1] "It is weird. The socket thingy is also the default. It seems it's default config is not consistent." [puppet] - 10https://gerrit.wikimedia.org/r/683520 (https://phabricator.wikimedia.org/T281425) (owner: 10Legoktm) [09:09:21] (03Merged) 10jenkins-bot: setup.py: limit max version of pynetbox [software/homer] - 10https://gerrit.wikimedia.org/r/683554 (owner: 10Volans) [09:11:41] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.2.8 [software/homer] - 10https://gerrit.wikimedia.org/r/683556 [09:14:58] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.2.8 [software/homer] - 10https://gerrit.wikimedia.org/r/683556 (owner: 10Volans) [09:17:30] (03PS2) 10Giuseppe Lavagetto: networkpolicy: add autogenerated egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/683379 (https://phabricator.wikimedia.org/T253058) [09:17:35] (03CR) 10Giuseppe Lavagetto: networkpolicy: add autogenerated egress rules (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/683379 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [09:18:20] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.2.8 [software/homer] - 10https://gerrit.wikimedia.org/r/683556 (owner: 10Volans) [09:20:21] (03CR) 10JMeybohm: [C: 03+1] "Nice!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/683379 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [09:29:33] (03PS1) 10Volans: Upstream release v0.2.8 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/683560 [09:30:04] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v0.2.8 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/683560 (owner: 10Volans) [09:32:55] I'm stopping mailman3 and its web services so we can run a schema change [09:33:50] marostegui: stopped go [09:34:32] Amir1: done, check -databases [09:34:43] cool [09:34:50] it was faster I could write a SAL message [09:35:41] !log volans@deploy1002 Started deploy [homer/deploy@e394769]: Release v0.2.8 [09:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:39] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (10Marostegui) 05Open→03Resolved a:03Marostegui All done: ` root@db1128.eqiad.wmnet[mailman3]> ALTER TABLE mailinglist MODIFY COLUM... [09:36:42] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman3 "info" field is small (only 255 characters) - https://phabricator.wikimedia.org/T281426 (10Marostegui) [09:38:07] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Deploy schema change making mailman3.mailinglist info column bigger - https://phabricator.wikimedia.org/T281444 (10Ladsgroup) I confirm it fixed the issue (tried it on https://lists.wikimedia.org/postorius/lists/lgbt.lists.wikimedia.org/) mailman w... 
[09:39:11] !log volans@deploy1002 Finished deploy [homer/deploy@e394769]: Release v0.2.8 (duration: 03m 30s) [09:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:20] (03PS1) 10Mvolz: Switch to a different contact email [deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) [09:42:25] !log uploaded pynetbox 5.3.0-2 to bullseye-wikimedia on apt.w.o [09:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:09] (03CR) 10Legoktm: [C: 04-1] Switch to a different contact email (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [09:45:56] (03PS2) 10Mvolz: Switch to a different contact email [deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) [09:46:59] (03PS3) 10Mvolz: Switch to a different contact email [deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) [09:47:53] (03PS4) 10Mvolz: Switch to a different contact email [deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) [09:56:49] (03CR) 10Ladsgroup: "Generally looks fine. Just if you can somehow test it, it'd be great." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425) (owner: 10Legoktm) [09:59:50] (03CR) 10Urbanecm: [C: 03+1] "indeed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [10:00:05] mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T1000). 
[10:00:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [10:00:58] (03PS1) 10Ema: vcl: remove cluster_fe_hit hooks [puppet] - 10https://gerrit.wikimedia.org/r/683571 (https://phabricator.wikimedia.org/T264398) [10:01:00] (03PS1) 10Ema: vcl: make vcl_hit invoke default VCL [puppet] - 10https://gerrit.wikimedia.org/r/683572 (https://phabricator.wikimedia.org/T264398) [10:09:02] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:09:34] 10ops-codfw: Move YubiHSM from auth2001 to pki2001 - https://phabricator.wikimedia.org/T281459 (10MoritzMuehlenhoff) [10:10:58] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream, 10User-Ladsgroup: Mailman3 "info" field is small (only 255 characters) - https://phabricator.wikimedia.org/T281426 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup So it's done. Tested and it works. [10:13:58] (03CR) 10Legoktm: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425) (owner: 10Legoktm) [10:15:08] Some people are saying CX is down [10:16:26] Amir1: I saw an UBN ticket about that, saying it's caused by recent PHP update. Lemme find it. [10:16:45] https://phabricator.wikimedia.org/T281346 [10:16:48] https://phabricator.wikimedia.org/T281346 [10:16:54] oh, Majavah's faster :) [10:17:26] I just chose the first UBN task I saw on #contenttranslation phab workboard, didn't even wait for the text to load :D [10:17:29] lol Majavah is always faster [10:17:36] it has a patch merged in master and pending backports, i guess we can just fix it? :D [10:18:07] Amir1: will you, or should I? 
:D [10:18:12] I shall [10:18:16] :D [10:18:38] go for it then :) [10:18:58] (03CR) 10Ladsgroup: [C: 03+2] "backporting" [extensions/ContentTranslation] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683135 (https://phabricator.wikimedia.org/T281346) (owner: 10KartikMistry) [10:19:08] (03CR) 10Ladsgroup: [C: 03+2] "backporting UBN" [extensions/ContentTranslation] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683134 (https://phabricator.wikimedia.org/T281346) (owner: 10KartikMistry) [10:20:31] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Did yesterday, thanks. [10:21:45] hey Amir1 and Urbanecm, we have a trainee for the backports window today! see https://phabricator.wikimedia.org/T281458 [10:22:09] oh, sorry, you are doing work... shutting up until it's done [10:22:24] apergos: good :). This is an UBN, but I can add some patch to the window if you want me to ;). [10:22:35] :-D [10:22:47] (03PS15) 10DCausse: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [10:22:49] (03PS6) 10DCausse: rdf-streaming-updater: enable HA capability [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [10:22:51] (03PS5) 10DCausse: rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [10:22:52] I have the jade undeployment [10:23:04] but tbh, I have a headache atm [10:23:15] there are two mw core and two mw config patches already in the window, that should be fine [10:23:16] hope the pills kick in [10:23:29] I hope so too [10:23:48] apergos: the backports are...actually the UBN fixes Amir1's deploying now :-). cc kart_ [10:23:53] oh :-D [10:23:55] all of them? 
[10:24:13] the two backports, not the configs AFAICS [10:24:20] ah, ok the two configs should be fine then [10:24:31] yup [10:24:44] I hope one or both of you will be here and in the meet for our trainee, I don't want to be the solo trainer here :-) [10:24:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15654 and previous config saved to /var/cache/conftool/dbconfig/20210429-102447-marostegui.json [10:24:53] Majavah: I'm going to break beta soon :D [10:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:02] Amir1: already broken! [10:25:07] d;oh :-D [10:25:11] more broken [10:25:12] T281418 [10:25:13] T281418: Undeploy JADE extension from beta cluster - https://phabricator.wikimedia.org/T281418 [10:25:23] T281450 [10:25:24] T281450: Exception: Table 'enwiki.wb_items_per_site' doesn't exist - https://phabricator.wikimedia.org/T281450 [10:25:34] oh boy [10:25:35] :D [10:26:04] ohdear [10:26:09] isn't that something that broke in master of Wikibase? :D [10:26:15] I know what's happening. Nature hates me. When I'm not on vacation, everything is fine, when I am. 
Everything breaks [10:26:38] :D [10:26:42] Amir1: Then never take vacations, please [10:26:44] Urbanecm: I doubt it, it's a table that should be there [10:26:55] marostegui: Per German law I must :D [10:26:58] that's trying to read it from the wrong database [10:27:00] it doesn't exist on production enwiki either Amir1 https://www.irccloud.com/pastebin/oDI0vw5X/ [10:27:09] Amir1: :( [10:27:24] once HR sent me an email like, "either you take your vacation or we will make you take them" [10:27:35] :D [10:27:36] !log Upgrade kernel on db1110 [10:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:04] Urbanecm: yes, on wikidata it should be (channeling my inner yoda) [10:28:05] Urbanecm: that table is in wikidatawiki, but someone broke wikibase to accidentally read it from the wrong database (enwiki) [10:28:07] "we have ways to make you take them" [10:28:32] Amir1: yeah, it exists in wikidatawiki. So...I still think something in master Wikibase made it use the wrong database :D [10:28:44] yeah, needs investigation [10:28:49] yup [10:28:50] did you read the ticket? [10:29:06] https://phabricator.wikimedia.org/T281450#7044763 [10:29:48] (03CR) 10Ladsgroup: [C: 03+1] "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425) (owner: 10Legoktm) [10:30:40] https://phabricator.wikimedia.org/T281457 Marius is already on it [10:32:14] cool [10:33:25] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) Have they tried the UI or they sent an email to `wikisul-join@` ? We disabled email subscription. [10:37:36] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) They did it through the website. 
The current situation is: 1) I received an email advising of a subscription request (sent to you) which did not show up in the "pending approva... [10:39:37] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) About point "1' I think we are in two admins and the other admin approved (I didn't know we were in two admins), that must have happened because there are three memebers on the... [10:39:54] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) I see your emails https://lists.wikimedia.org/hyperkitty/list/wikisul@lists.wikimedia.org/ I think if no one is subscribed yet, it won't show it to them. [10:40:18] (03CR) 10Jbond: [C: 03+2] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [10:40:18] Urbanecm: oh, I forgot about 'training' part and we have two more patches to deploy today. [10:41:08] kart_: AFAIK, for requesting devs, the "training part" means the trainers do the deploy, and comment it to the trainees in a video call. So, nothing should change for you (the deploys might be bit slower due to the commentary) [10:41:09] kart_: This https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/683514/? [10:41:41] Amir1: You mentioned about https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/683514 [10:41:41] the trainers may do the deploy or they may talk someone else through steps of it, assuming that other person has deploy rights [10:41:45] but yeah [10:41:51] Amir1: yes. [10:42:10] Urbanecm: Thanks. Might join next time :) [10:42:15] great! 
[10:42:38] okay, let me just backport them [10:42:40] we need more deployers, the more we share the load the less anxiety there is for anyone [10:43:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:43:05] (03PS1) 10Ladsgroup: Another fix for token cookie handling [extensions/ContentTranslation] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683533 (https://phabricator.wikimedia.org/T281346) [10:43:17] (03CR) 10Gergő Tisza: [C: 03+1] Set wgGEMentorshipMigrationStage to SCHEMA_COMPAT_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683430 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [10:43:28] (03PS1) 10Ladsgroup: Another fix for token cookie handling [extensions/ContentTranslation] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683534 (https://phabricator.wikimedia.org/T281346) [10:43:30] (03CR) 10Jbond: [C: 03+2] netbase: add new module to manage /etc/services (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [10:43:39] (03CR) 10Ladsgroup: [C: 03+2] Another fix for token cookie handling [extensions/ContentTranslation] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683534 (https://phabricator.wikimedia.org/T281346) (owner: 10Ladsgroup) [10:43:48] (03CR) 10Ladsgroup: [C: 03+2] Another fix for token cookie handling [extensions/ContentTranslation] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683533 (https://phabricator.wikimedia.org/T281346) (owner: 10Ladsgroup) [10:44:41] (03CR) 10Jbond: [C: 03+2] netbase: add new module to manage /etc/services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [10:44:44] apergos: I think "we need more deployers 
who actually deploy" is more precise 🙂 [10:44:50] heh [10:45:04] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) Maybe the email went to spam? [10:45:09] well if it turns out we are having a great old time in here deploying, others will feel they are missing out on the fun :-D [10:45:11] (03PS16) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [10:45:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:45:41] apergos: maybe we should do activity audits for deployers more often. That way, the number of deployers would at least be representative. [10:45:52] :-D [10:46:02] (03PS16) 10Jbond: P:netbase: parse the service catalogue and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [10:47:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P15655 and previous config saved to /var/cache/conftool/dbconfig/20210429-104700-root.json [10:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:24] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) The emails are being sent, but I'm not receiving them. They didn't go to spam, I just checked. Under "member options" there is an option "Delivery status" and this option was... [10:52:06] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) GMail deduplicates emails you send, so you wouldn't receive it yourself. Did others receive it? 
[10:52:54] (03Merged) 10jenkins-bot: Fix CX token cookie [extensions/ContentTranslation] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683135 (https://phabricator.wikimedia.org/T281346) (owner: 10KartikMistry) [10:52:58] (03Merged) 10jenkins-bot: Fix CX token cookie [extensions/ContentTranslation] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683134 (https://phabricator.wikimedia.org/T281346) (owner: 10KartikMistry) [10:54:49] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/ContentTranslation/modules/base/mw.cx.SiteMapper.js: Backport: [[gerrit:683134|Fix CX token cookie (T281346)]] (duration: 01m 09s) [10:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:57] T281346: Cannot start or resume a translation for articles with spaces or non-ascii characters in the title - https://phabricator.wikimedia.org/T281346 [10:56:25] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/ContentTranslation/modules/base/mw.cx.SiteMapper.js: Backport: [[gerrit:683135|Fix CX token cookie (T281346)]] (duration: 01m 08s) [10:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:57] !log updating apt on buster (SUA 198), which eases bullseye upgrades T275873 [11:00:01] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Ok, that must be it. I will check it out, thanks. [11:00:04] Amir1, Lucas_WMDE, apergos, and duesen: Dear deployers, time to do the EU Backport and Config training deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T1100). [11:00:04] kart_ and CFisch_WMDE: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [11:00:19] o/ [11:00:47] Yeah here and seems Amir1 just deployed the first set :) [11:00:58] the second set is almost there [11:01:47] hey one of Amir1 or Urbanecm, wanna join the google meet? [11:02:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P15656 and previous config saved to /var/cache/conftool/dbconfig/20210429-110204-root.json [11:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:28] 10SRE, 10Patch-For-Review, 10User-jbond: PKI/CFSSL Next steps - https://phabricator.wikimedia.org/T268882 (10jbond) [11:06:45] 10SRE, 10Patch-For-Review, 10User-jbond: PKI/CFSSL Next steps - https://phabricator.wikimedia.org/T268882 (10jbond) 05Open→03Resolved a:03jbond [11:07:00] Given long CI wait time, we should have bit bigger backport window than 1 hour, I feel. [11:08:28] (03PS1) 10Jbond: P:pki::root_ca: enable backups [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) [11:09:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 (owner: 10Jbond) [11:12:45] apergos: should I? I mean, I don't think I'll learn sth there. But i can if you want me to ;) [11:14:01] (03CR) 10Ladsgroup: [C: 03+2] Enable suggested values in TemplateData and VisualEditor InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683547 (https://phabricator.wikimedia.org/T273857) (owner: 10WMDE-Fisch) [11:15:13] CFisch_WMDE: I'm about to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/683547, are you able to test it when it's done? 
[11:15:36] (03Merged) 10jenkins-bot: Enable suggested values in TemplateData and VisualEditor InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683547 (https://phabricator.wikimedia.org/T273857) (owner: 10WMDE-Fisch) [11:15:43] no but when you do the 2nd one [11:15:49] thesocialdev: ^ [11:15:58] thanks [11:16:15] Urbanecm: it was more if you want to help train :-D but Amir is there so that's two trainers [11:16:18] we're good :-) [11:16:33] (03Merged) 10jenkins-bot: Another fix for token cookie handling [extensions/ContentTranslation] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683534 (https://phabricator.wikimedia.org/T281346) (owner: 10Ladsgroup) [11:16:36] (03Merged) 10jenkins-bot: Another fix for token cookie handling [extensions/ContentTranslation] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683533 (https://phabricator.wikimedia.org/T281346) (owner: 10Ladsgroup) [11:16:43] okay, cool :) [11:17:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P15657 and previous config saved to /var/cache/conftool/dbconfig/20210429-111708-root.json [11:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:37] CFisch_WMDE: deploying the first mediawiki-config patch [11:21:54] thesocialdev: thanks [11:24:29] !log mbsantos@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:683547|Enable suggested values in TemplateData and VisualEditor InitialiseSettings (T273857)]] (duration: 01m 07s) [11:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:37] T273857: Enable suggested values - https://phabricator.wikimedia.org/T273857 [11:27:12] CFisch_WMDE: the first deployment went through, please let me know if you notice anything wrong about it [11:29:07] I will.
Should be fine though until the 2nd patch is deployed :-D [11:30:02] (03CR) 10Ladsgroup: [C: 03+2] Enable suggested values in TemplateData and VisualEditor CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683548 (https://phabricator.wikimedia.org/T273857) (owner: 10WMDE-Fisch) [11:32:01] (03Merged) 10jenkins-bot: Enable suggested values in TemplateData and VisualEditor CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683548 (https://phabricator.wikimedia.org/T273857) (owner: 10WMDE-Fisch) [11:32:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P15658 and previous config saved to /var/cache/conftool/dbconfig/20210429-113211-root.json [11:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:50] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/ContentTranslation/specials/SpecialContentTranslation.php: Backport: [[gerrit:683533|Another fix for token cookie handling (T281346)]] (duration: 01m 08s) [11:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:58] T281346: Cannot start or resume a translation for articles with spaces or non-ascii characters in the title - https://phabricator.wikimedia.org/T281346 [11:34:34] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/ContentTranslation/specials/SpecialContentTranslation.php: Backport: [[gerrit:683534|Another fix for token cookie handling (T281346)]] (duration: 01m 07s) [11:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:18] CFisch_WMDE: deploying the second patch [11:35:20] (03PS2) 10Jbond: P:pki::root_ca: enable backups [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) [11:36:16] Amir1: Done? 
:) [11:36:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29277/console" [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:36:25] kart_: yup [11:36:35] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: wdqs1010.eqiad.wmnet, maps1009.eqiad.wmnet, webperf1001.eqiad.wmnet, wdqs1004.eqiad.wmnet, db1115.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:36:38] Cool. Seems CX is working again now! :) [11:36:46] Amir1: Thanks a lot!! [11:37:02] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:37:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, let's give it a shot today." [puppet] - 10https://gerrit.wikimedia.org/r/683408 (owner: 10Jbond) [11:38:23] !log mbsantos@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:683548|Enable suggested values in TemplateData and VisualEditor CommonSettings (T273857)]] (duration: 01m 07s) [11:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:31] T273857: Enable suggested values - https://phabricator.wikimedia.org/T273857 [11:38:41] CFisch_WMDE: it's done, please let me know if everything is alright [11:39:48] thesocialdev: looks good, seems to work fine [11:40:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [11:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:22] Thanks again thesocialdev! [11:41:40] CFisch_WMDE: awesome! Thanks for the patience :) [11:44:46] who is the keeper of git repo groups these days? 
I forget [11:44:47] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on sretest1002 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [11:44:53] I shall tag them on this task [11:45:12] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [11:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:22] (03PS2) 10Ladsgroup: Undeploy JADE from production, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683452 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:45:40] I quickly undeploy Jade. Nothing affecting production. [11:45:50] (03CR) 10Ladsgroup: [C: 03+2] Undeploy JADE from production, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683452 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:47:10] (03Merged) 10jenkins-bot: Undeploy JADE from production, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683452 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:49:09] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:683452|Undeploy JADE from production, Part I (T281418)]] (duration: 01m 07s) [11:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:18] T281418: Undeploy JADE extension from beta cluster - https://phabricator.wikimedia.org/T281418 [11:49:57] (03PS2) 10Ladsgroup: Undeploy JADE from production, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683453 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:51:12] (03CR) 10Ladsgroup: [C: 03+2] Undeploy JADE from production, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683453 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:51:23] RECOVERY - Disk space on mwlog1001 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [11:52:08] (03PS2) 10Ladsgroup: Undeploy JADE from production, Part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683454 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:52:28] (03Merged) 10jenkins-bot: Undeploy JADE from production, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683453 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:52:58] hey thesocialdev please check that you have +2 on mw config now, you should have been added (see task for the training) [11:53:14] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [11:53:47] apergos: confirmed, I have! Thanks for doing it [11:53:57] (03PS1) 10Jbond: P:base::certificates: Add pki certs to default trust store [puppet] - 10https://gerrit.wikimedia.org/r/683583 (https://phabricator.wikimedia.org/T281369) [11:53:59] I didn't but Reedy did, all good! [11:54:19] thanks to the involved parts :) [11:54:21] hey can I close that task?
It seems like you came and were trained :-D [11:54:43] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:683453|Undeploy JADE from production, Part II (T281418)]], Part I (duration: 01m 06s) [11:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:51] T281418: Undeploy JADE extension from beta cluster - https://phabricator.wikimedia.org/T281418 [11:55:20] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: Add pki certs to default trust store [puppet] - 10https://gerrit.wikimedia.org/r/683583 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:55:58] well I will be bold and close it so get a tab back in firefox :-D [11:56:55] (03CR) 10Ladsgroup: [C: 03+2] Undeploy JADE from production, Part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683454 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:57:27] (03PS2) 10Jbond: P:base::certificates: Add pki certs to default trust store [puppet] - 10https://gerrit.wikimedia.org/r/683583 (https://phabricator.wikimedia.org/T281369) [11:57:56] (03Merged) 10jenkins-bot: Undeploy JADE from production, Part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683454 (https://phabricator.wikimedia.org/T281418) (owner: 10Jforrester) [11:58:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29279/console" [puppet] - 10https://gerrit.wikimedia.org/r/683583 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:59:56] !log ladsgroup@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:683454|Undeploy JADE from production, Part III (T281418)]] (duration: 01m 07s) [12:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T1200) [12:00:04] T281418: Undeploy JADE 
extension from beta cluster - https://phabricator.wikimedia.org/T281418 [12:00:09] (03PS2) 10Jbond: O:debmonitor::server: request cert using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683408 [12:01:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base::certificates: Add pki certs to default trust store [puppet] - 10https://gerrit.wikimedia.org/r/683583 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [12:02:48] I've tried to make the link list on the backports deployers page a bit more obvious and a bit more ordered but ymmv [12:05:11] why there are unittest_* tables on beta cluster [12:05:49] (03PS2) 10MMandere: Add mmandere shell account [puppet] - 10https://gerrit.wikimedia.org/r/683522 (https://phabricator.wikimedia.org/T281344) [12:06:15] (03CR) 10Jbond: [C: 03+2] O:debmonitor::server: request cert using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683408 (owner: 10Jbond) [12:06:43] !log update debmonitor.discover.wmnet ssl cert [12:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:38] Amir1: uhhhh very good question [12:07:59] I don't want to know the answer [12:08:32] please don't tell me someone tried to run phpunit in beta cluster [12:08:35] anyway. [12:08:55] pfft [12:09:04] I thought people wanted things tested in "production" [12:09:49] (03PS1) 10Jbond: Revert "O:debmonitor::server: request cert using cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/683539 [12:09:51] as the old saying, one man's production is another man's test [12:10:18] Amir1: looks like at least they only exist on enwiki [12:10:33] what are the jade tables called? 
[12:10:44] I dropped them from enwiki [12:10:46] (03PS2) 10ArielGlenn: make snapshot1011 the new wikidata dumper and snapshot1012 the new enwiki dumper [puppet] - 10https://gerrit.wikimedia.org/r/683293 (https://phabricator.wikimedia.org/T281330) [12:10:50] but the rest of wikis left [12:11:00] (03CR) 10Jcrespo: [C: 03+1] "Looks good to me, when you merge, I can do a test run." [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [12:11:05] they all start with jade_ [12:11:24] does drop tables support wildcards [12:11:28] (03CR) 10Jbond: [C: 03+2] Revert "O:debmonitor::server: request cert using cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/683539 (owner: 10Jbond) [12:11:44] looks like they don [12:12:08] (03CR) 10ArielGlenn: [C: 03+2] make snapshot1011 the new wikidata dumper and snapshot1012 the new enwiki dumper [puppet] - 10https://gerrit.wikimedia.org/r/683293 (https://phabricator.wikimedia.org/T281330) (owner: 10ArielGlenn) [12:12:33] Amir1: is there a reason why I shouldn't do something like foreachwikiindblist all-labs mysql.php -- -e "drop table jade_XXXX;" [12:12:56] I think it needs to go on --write [12:13:25] otherwise it breaks replication (if not already broken for other reasons) [12:13:38] it would just error out, the replicas are read only [12:13:57] they weren't previously [12:14:09] oh fun [12:14:16] so you could do exactly as amir just said [12:14:20] and break replication [12:14:21] it was fun [12:15:42] yup, IIRC it happened in production [12:15:47] oops [12:16:25] but I should be able to do that with `--write`? no issues would be caused after dropping them? [12:17:24] fingers cross [12:17:30] mediawiki is full of surprises [12:18:43] I'll do that [12:23:39] Amir1: gone everywhere [12:23:49] 10SRE, 10Wikimedia-Mailing-lists: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10Ladsgroup) By recognized, I mean by affcom. 
I found the recognition announcement by Kiril. https://lists.wikimedia.org/postorius/lists/wikimedia-ha.lists.wikimedia.org/ Added both of you as owners... [12:23:55] 10SRE, 10Wikimedia-Mailing-lists: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10Ladsgroup) 05Open→03Resolved [12:24:12] Majavah: Thanks! [12:24:23] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10Ladsgroup) a:05The_Living_love→03Ladsgroup [12:24:40] (03CR) 10Muehlenhoff: [C: 03+2] profile::java: Remove jessie, add bullseye [puppet] - 10https://gerrit.wikimedia.org/r/683357 (owner: 10Muehlenhoff) [12:26:05] !log installing grub2 updates from buster point release [12:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:04] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) I assume the size is for the search index. 
[12:30:18] (03CR) 10Muehlenhoff: [C: 03+2] zookeeper: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/683288 (owner: 10Muehlenhoff) [12:36:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog1002.eqiad.wmnet with reason: Testing migration of processors to eventlog1003 [12:36:30] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog1002.eqiad.wmnet with reason: Testing migration of processors to eventlog1003 [12:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:36] !log upgrading Quiddity to admin in mailman3 [12:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:45] (03PS1) 10Jbond: P:debmonitor::client: dont specify an explicit CA [puppet] - 10https://gerrit.wikimedia.org/r/683591 [12:44:10] (03PS2) 10Jbond: P:debmonitor::client: dont specify an explicit CA [puppet] - 10https://gerrit.wikimedia.org/r/683591 [12:45:23] (03CR) 10MMandere: Add mmandere shell account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683522 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [12:45:34] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/683591 (owner: 10Jbond) [12:46:10] (03CR) 10Jbond: [C: 03+2] P:debmonitor::client: dont specify an explicit CA [puppet] - 10https://gerrit.wikimedia.org/r/683591 (owner: 10Jbond) [12:52:20] (03PS1) 10Jbond: P:debmonitor::server: siwitch debmonitor.discovery.wmnet to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683593 [12:53:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29284/console" [puppet] - 10https://gerrit.wikimedia.org/r/683593 (owner: 10Jbond) [12:53:37] (03CR) 10Jbond: P:debmonitor::server: siwitch debmonitor.discovery.wmnet to cfssl 
[puppet] - 10https://gerrit.wikimedia.org/r/683593 (owner: 10Jbond) [12:54:52] (03PS2) 10Jbond: P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683593 [12:58:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/683593 (owner: 10Jbond) [13:00:04] liw and longma: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - European+American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T1300). [13:02:20] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [13:03:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:04:03] (03PS1) 10Lars Wirzenius: group1 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683598 [13:04:05] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683598 (owner: 10Lars Wirzenius) [13:04:16] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds like a good idea!" 
[puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676383 (owner: 10Filippo Giunchedi) [13:05:08] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683598 (owner: 10Lars Wirzenius) [13:05:17] the latency is going up btw [13:05:57] liw: https://phabricator.wikimedia.org/T281455 and https://phabricator.wikimedia.org/T281456 are the same error [13:06:24] !log liw@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 [13:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:32] !log liw@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.3 (duration: 01m 07s) [13:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:45] (03CR) 10Jbond: [C: 03+2] P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 (owner: 10Jbond) [13:07:50] (03PS17) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [13:08:09] !log merge netbase change to manage /etc/services [13:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:34] Majavah, ack, closed one [13:08:57] train is at group1 again [13:09:30] I think that's happening because https://github.com/wikimedia/mediawiki-extensions-MassMessage/blob/master/includes/SpecialMassMessage.php#L262 is not executed, but that file hasn't been touched in a while [13:10:25] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:10:40] (03CR) 10Ssingh: [C: 03+2] Add 
mmandere shell account [puppet] - 10https://gerrit.wikimedia.org/r/683522 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [13:10:53] (03CR) 10Muehlenhoff: "Looks good, although I'm not sure if the Hiera flag is even needed? In Cloud VPS the backup::sets are simply a NOP, aren't they? At least " [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [13:11:27] !log installing postgresql-11 security updates [13:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:53] (03CR) 10Ottomata: "> Well it depends on what you want to do, but yes, you'll have all IP addresses of a kafka cluster in a list, you can reuse the data struc" [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [13:21:56] (03PS1) 10Jbond: P:trafficserver::backend: use ca-certificates.crt to talk to backends [puppet] - 10https://gerrit.wikimedia.org/r/683604 [13:25:09] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:26:27] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10The_Living_love) Yes! We are recognised by affcom since 2019. Thank you. [13:26:50] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10The_Living_love) Registered. 
Thanks [13:28:14] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683593 (owner: 10Jbond) [13:32:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:33:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] networkpolicy: add autogenerated egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/683379 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [13:33:11] (03PS1) 10Jbond: Revert "P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/683626 [13:33:39] (03CR) 10jerkins-bot: [V: 04-1] Revert "P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/683626 (owner: 10Jbond) [13:34:36] (03Merged) 10jenkins-bot: networkpolicy: add autogenerated egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/683379 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [13:34:43] (03PS2) 10Jbond: Revert "P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/683626 [13:35:12] (03CR) 10jerkins-bot: [V: 04-1] Revert "P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/683626 (owner: 10Jbond) [13:36:04] (03PS3) 10Jbond: Revert "P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/683626 [13:37:36] !log otto@deploy1002 Started deploy [analytics/refinery@b3c5820]: update 
event_sanitized_main allowlst on an-launcher1002 - T273789 [13:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:45] T273789: Sanitize and ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789 [13:37:51] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-processor@client-side-11 eventlogging-processor@client-side-10 eventlogging-processor@client-side-09 eventlogging-processor@client-side-08 eventlogging-processor@client-side-07 eventlogging-processor@client-side-06 eventlogging-processor@client-side-05 eventlogging-processor@client-side-04 eventloggin [13:37:51] t-side-03 eventlogging-processor@client-side-02 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [13:38:55] (03CR) 10Jbond: [C: 03+2] Revert "P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/683626 (owner: 10Jbond) [13:38:57] i think this is a false alarm due to migration to eventlog1003 ping hnowlan ;) [13:39:00] ^^^ [13:40:17] (03PS1) 10MMandere: Add mmandere to ops group [puppet] - 10https://gerrit.wikimedia.org/r/683611 (https://phabricator.wikimedia.org/T281344) [13:40:19] (03PS1) 10Volans: setup.py: add missing config [software/cumin] - 10https://gerrit.wikimedia.org/r/683612 [13:40:35] !log otto@deploy1002 Finished deploy [analytics/refinery@b3c5820]: update event_sanitized_main allowlst on an-launcher1002 - T273789 (duration: 02m 59s) [13:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:21] yep, that's me [13:41:28] thanks ottomata, downtiming [13:42:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:42:45] !log 
hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog1002.eqiad.wmnet with reason: eventlog1003 migration [13:42:45] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog1002.eqiad.wmnet with reason: eventlog1003 migration [13:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:58] (03PS1) 10Ottomata: Enable event_sanitized_main_immediate job [puppet] - 10https://gerrit.wikimedia.org/r/683614 (https://phabricator.wikimedia.org/T273789) [13:43:20] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog1003.eqiad.wmnet with reason: eventlog1003 migration [13:43:21] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog1003.eqiad.wmnet with reason: eventlog1003 migration [13:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:44:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:44:30] (03CR) 10jerkins-bot: [V: 04-1] Enable event_sanitized_main_immediate job [puppet] - 10https://gerrit.wikimedia.org/r/683614 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [13:44:58] !log installing Java security updates on stat* hosts [13:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:23] (03PS2) 10Ottomata: Enable event_sanitized_main_immediate job [puppet] - 10https://gerrit.wikimedia.org/r/683614 (https://phabricator.wikimedia.org/T273789) [13:45:25] (03PS1) 10Jbond: cfssl::cert: ensure the CN is also in the Alt name [puppet] - 10https://gerrit.wikimedia.org/r/683616 [13:45:27] (03CR) 10Ema: [C: 03+2] vcl: remove cluster_fe_hit hooks [puppet] - 10https://gerrit.wikimedia.org/r/683571 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [13:47:26] (03PS3) 10Ottomata: Enable event_sanitized_main_immediate job [puppet] - 10https://gerrit.wikimedia.org/r/683614 (https://phabricator.wikimedia.org/T273789) [13:47:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29290/console" [puppet] - 10https://gerrit.wikimedia.org/r/683616 (owner: 10Jbond) [13:48:37] (03PS2) 10Jbond: cfssl::cert: ensure the CN is also in the Alt name [puppet] - 10https://gerrit.wikimedia.org/r/683616 [13:49:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29291/console" [puppet] - 10https://gerrit.wikimedia.org/r/683616 (owner: 10Jbond) [13:49:52] (03CR) 10Elukey: "Looks good thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [13:50:44] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [13:51:12] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [13:51:22] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [13:51:44] (03CR) 10Ema: [C: 03+2] vcl: make vcl_hit invoke default VCL [puppet] - 10https://gerrit.wikimedia.org/r/683572 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [13:51:56] (03PS2) 10Ema: vcl: make vcl_hit invoke default VCL [puppet] - 10https://gerrit.wikimedia.org/r/683572 (https://phabricator.wikimedia.org/T264398) [13:55:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl::cert: ensure the CN is also in the Alt name [puppet] - 10https://gerrit.wikimedia.org/r/683616 (owner: 10Jbond) [13:55:23] (03CR) 10Volans: [C: 03+2] "trivial, self-merging" [software/cumin] - 10https://gerrit.wikimedia.org/r/683612 (owner: 10Volans) [13:55:39] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:55:52] (03PS1) 10Jbond: P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683628 [13:57:14] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: switch debmonitor.discovery.wmnet to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683628 (owner: 10Jbond) [13:58:53] (03PS1) 10Ottomata: Add comment in refine about 
mediawiki_page_properties_change [puppet] - 10https://gerrit.wikimedia.org/r/683622 (https://phabricator.wikimedia.org/T273789) [14:00:43] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29292/console" [puppet] - 10https://gerrit.wikimedia.org/r/683614 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:02:54] (03Merged) 10jenkins-bot: setup.py: add missing config [software/cumin] - 10https://gerrit.wikimedia.org/r/683612 (owner: 10Volans) [14:07:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/683611 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [14:07:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:08:15] (03CR) 10Andrew Bogott: [C: 03+2] Trove: set low default quotas per project but big potential DB size [puppet] - 10https://gerrit.wikimedia.org/r/683092 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [14:08:44] (03PS1) 10Andrew Bogott: nova vendordata: try to avoid a race with the puppet agent [puppet] - 10https://gerrit.wikimedia.org/r/683646 [14:09:28] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: try to avoid a race with the puppet agent [puppet] - 10https://gerrit.wikimedia.org/r/683646 (owner: 10Andrew Bogott) [14:10:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:11:05] (03CR) 10Jbond: [C: 03+1] "LGTM however just for clarity this will run on the puppet master not the agent" [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676384 (owner: 10Filippo Giunchedi) [14:12:25] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:16:47] (03CR) 10Jbond: "seems fine to me but will leave joe/alex to +1 as I'm not familiar with current use" [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676385 (owner: 10Filippo Giunchedi) [14:22:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:25:52] (03PS3) 10Jbond: P:pki::root_ca: enable backups [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) [14:31:23] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:31:47] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [14:32:03] PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:16] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog[1002-1003].eqiad.wmnet with reason: eventlog1003 migration [14:35:17] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog[1002-1003].eqiad.wmnet with reason: eventlog1003 migration [14:35:19] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:44] 10SRE, 10Security, 10User-MoritzMuehlenhoff: Investigate iptables replacements - https://phabricator.wikimedia.org/T279683 (10MoritzMuehlenhoff) [14:38:06] 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) a:03JMeybohm [14:39:40] (03PS2) 10Muehlenhoff: Switch to failoid1002 [puppet] - 10https://gerrit.wikimedia.org/r/682108 [14:44:20] (03CR) 10Muehlenhoff: [C: 03+2] Switch to failoid1002 [puppet] - 10https://gerrit.wikimedia.org/r/682108 (owner: 10Muehlenhoff) [14:47:59] ACKNOWLEDGEMENT - Check status of defined 
EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-processor@client-side-11 eventlogging-processor@client-side-10 eventlogging-processor@client-side-09 eventlogging-processor@client-side-08 eventlogging-processor@client-side-07 eventlogging-processor@client-side-06 eventlogging-processor@client-side-05 eventlogging-processor@client-side-04 eve [14:47:59] or@client-side-03 eventlogging-processor@client-side-02 eventlogging-processor@client-side-01 eventlogging-processor@client-side-00 Hnowlan eventlog1003 now handles these jobs. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [14:48:39] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Waihorace) I am a list admin for wikinews-zh and I guess we are ready for the upgrade. But before that I have two questions: 1. For creating an account on Mailman3, apart from... [14:49:35] (03CR) 10Gehel: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [14:50:28] 10SRE, 10Security, 10User-MoritzMuehlenhoff: Investigate iptables replacements - https://phabricator.wikimedia.org/T279683 (10jbond) [14:55:49] (03CR) 10Vgutierrez: "why we need support for public issued certificates on the ats-be as a TLS client? 
IIRC this isn't supported by design" [puppet] - 10https://gerrit.wikimedia.org/r/683604 (owner: 10Jbond) [14:59:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:04:49] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/683604 (owner: 10Jbond) [15:13:39] (03PS1) 10Andrew Bogott: Hiera: remove some long-abandoned yaml [puppet] - 10https://gerrit.wikimedia.org/r/683667 [15:15:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/683667 (owner: 10Andrew Bogott) [15:15:41] PROBLEM - Prometheus cloudmetrics1002/labs restarted: beware possible monitoring artifacts on cloudmetrics1002 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [15:17:17] RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:22:04] (03CR) 10Jbond: [C: 04-1] P:trafficserver::backend: use ca-certificates.crt to talk to backends (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/683604 (owner: 10Jbond) [15:26:03] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:26:46] (03CR) 10Andrew Bogott: [C: 03+2] Hiera: remove some long-abandoned yaml [puppet] - 10https://gerrit.wikimedia.org/r/683667 (owner: 10Andrew Bogott) [15:27:22] (03PS1) 10Andrew Bogott: OpenStack Nova: Remove some settings from nova.conf on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/683670 (https://phabricator.wikimedia.org/T281384) [15:31:03] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:34:31] !log [WDQS] pooled `wdqs2001` [15:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:44] !log [WDQS] ^ scratch that, depooled `wdqs2001` [15:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:15] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:31] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:36:45] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Enable event_sanitized_main_immediate job [puppet] - 
10https://gerrit.wikimedia.org/r/683614 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:36:52] (03CR) 10Ottomata: [C: 03+2] Add comment in refine about mediawiki_page_properties_change [puppet] - 10https://gerrit.wikimedia.org/r/683622 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:37:10] (03PS2) 10Andrew Bogott: OpenStack Nova: Remove some settings from nova.conf on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/683670 (https://phabricator.wikimedia.org/T281384) [15:37:12] (03PS1) 10Andrew Bogott: wmcs virt nodes: standardize some hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/683672 [15:37:33] !log [WDQS] `sudo systemctl restart wdqs-blazegraph` && `sudo systemctl restart wdqs-updater` on `wdqs2001` [15:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:39] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:38:39] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2001 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:38:39] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:38:41] RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.210 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:38:42] 10Puppet, 10Machine-Learning-Team, 10ORES: Restructure ORES labs redis puppet role - https://phabricator.wikimedia.org/T281495 (10Halfak) [15:41:47] (03PS4) 10Jbond: P:pki::root_ca: enable backups [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) 
[15:42:48] (03PS2) 10Andrew Bogott: wmcs virt nodes: standardize some hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/683672 [15:42:50] (03PS3) 10Andrew Bogott: OpenStack Nova: Remove some settings from nova.conf on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/683670 (https://phabricator.wikimedia.org/T281384) [15:43:24] !log [WDQS] `wdqs2001` is high on update lag but otherwise functioning; will repool when lag is caught up [15:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:38] (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: enable backups [puppet] - 10https://gerrit.wikimedia.org/r/683578 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [15:44:04] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (trying reimaging this host one final time, if this fails again will need to do a deeper investigation into what's going wrong here) [15:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:11] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [15:44:43] RECOVERY - Prometheus cloudmetrics1002/labs restarted: beware possible monitoring artifacts on cloudmetrics1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [15:45:13] (03PS4) 10Andrew Bogott: OpenStack Nova: Remove some settings from nova.conf on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/683670 (https://phabricator.wikimedia.org/T281384) [15:45:19] (03CR) 10Andrew Bogott: [C: 03+2] wmcs virt nodes: standardize some hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/683672 (owner: 10Andrew Bogott) [15:46:03] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#7045717, @gerritbot wrote: > Change 683572 **merged** by Ema: > %%%[operations/puppet@production] vcl: mak... [15:46:07] (03PS5) 10Andrew Bogott: OpenStack Nova: Remove some settings from nova.conf on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/683670 (https://phabricator.wikimedia.org/T281384) [15:47:18] (03PS1) 10Jbond: P:backup::host: add sets parameter [puppet] - 10https://gerrit.wikimedia.org/r/683675 [15:47:20] (03PS1) 10Jbond: O:pki::root: move backup sets to hiera [puppet] - 10https://gerrit.wikimedia.org/r/683676 [15:48:03] (03CR) 10Jbond: "Small refactor to move config into hiera feel free to reject if this dosen't work for you workflows" [puppet] - 10https://gerrit.wikimedia.org/r/683675 (owner: 10Jbond) [15:50:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:52:02] T281480 is now a train blocker [15:52:03] T281480: 
Cannot access the database: Too many connections - https://phabricator.wikimedia.org/T281480 [15:52:38] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:58:52] (03CR) 10Jbond: "can we also add a reasonable default to cloud.yaml, i would suggest" [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676383 (owner: 10Filippo Giunchedi) [16:00:04] jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T1600). [16:00:54] PROBLEM - WDQS high update lag on wdqs2001 is CRITICAL: 1.245e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:02:21] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 (owner: 10Volans) [16:03:48] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [16:05:14] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 113881 bytes in 7.883 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [16:12:31] !log powerdown thanos-fe2001 for memory swap [16:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:52] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:14:18] (03PS1) 10Ryan Kemper: wdqs: shift 1 codfw internal host to codfw public [puppet] - 10https://gerrit.wikimedia.org/r/683679 (https://phabricator.wikimedia.org/T281498) [16:14:34] PROBLEM - Host thanos-fe2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:46] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:15:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Ladsgroup) Welcome! 
[16:15:24] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [16:15:37] !log otto@deploy1002 Started deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlst on an-launcher1002 - T273789 [16:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:45] T273789: Sanitize and ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789 [16:16:13] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [16:16:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:17:45] (03CR) 10Jbond: [C: 03+2] wdqs: shift 1 codfw internal host to codfw public [puppet] - 10https://gerrit.wikimedia.org/r/683679 (https://phabricator.wikimedia.org/T281498) (owner: 10Ryan Kemper) [16:18:17] !log otto@deploy1002 Finished deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlst on an-launcher1002 - T273789 (duration: 02m 39s) [16:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:16] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 113824 bytes in 7.781 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [16:19:36] (03Merged) 10jenkins-bot: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [16:20:01] (03PS27) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 
10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [16:20:10] PROBLEM - Host elastic2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:20:21] 10SRE, 10ops-codfw: thanos-fe2001 machine check exception and crash/stall - https://phabricator.wikimedia.org/T280782 (10Papaul) 05Open→03Resolved Swapped DIMM A1 with DIMM B1 resolving this task for now if we do have the same problem on B1 i will request a replacement [16:20:40] !log T281498 `ryankemper@wdqs2004:~$ sudo run-puppet-agent` [16:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:48] T281498: Transfer one codfw wdqs-internal host over to codfw wdqs (public) - https://phabricator.wikimedia.org/T281498 [16:22:43] !log T281498 `ryankemper@wdqs2004:~$ sudo depool` [16:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:58] RECOVERY - Host thanos-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms [16:24:48] (03PS1) 10Effie Mouzeli: hieradata: disable onhost memcached socket on appservers [puppet] - 10https://gerrit.wikimedia.org/r/683682 [16:25:36] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Thd other guys have receive, but I dont receive their. The problem is just with me. 
[16:27:03] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [16:27:17] !log liw@deploy1002 sync-wikiversions aborted: Revert "group[0|1] wikis to [VERSION]" (duration: 00m 01s) [16:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:17] (03PS28) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [16:28:53] !log liw@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.1" [16:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:35] (03PS1) 10Lars Wirzenius: Revert "group1 wikis to 1.37.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683683 [16:29:37] (03CR) 10Lars Wirzenius: [C: 03+2] Revert "group1 wikis to 1.37.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683683 (owner: 10Lars Wirzenius) [16:29:47] !log T281498 `sudo -E cumin 'C:role::lvs::balancer' 'sudo run-puppet-agent'` [16:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:58] T281498: Transfer one codfw wdqs-internal host over to codfw wdqs (public) - https://phabricator.wikimedia.org/T281498 [16:30:28] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [16:32:12] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.37.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683683 (owner: 10Lars Wirzenius) [16:35:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29294/console" [puppet] - 10https://gerrit.wikimedia.org/r/683682 (owner: 10Effie Mouzeli) 
[16:35:59] (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/683682 (owner: 10Effie Mouzeli) [16:36:32] RECOVERY - Host elastic2043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:37:05] effie: fyi ^ [16:37:20] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [16:39:22] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: disable onhost memcached socket on appservers [puppet] - 10https://gerrit.wikimedia.org/r/683682 (owner: 10Effie Mouzeli) [16:41:52] jbond42: cheers [16:44:22] expect some memcached alerts, please ignore [16:44:32] <_joe_> effie: should we do a roll depool / run puppet /restart or it solves quickly enough? [16:44:44] <_joe_> err repool, not restart [16:45:43] no it doesnt matter, memcached doesnt restart itself on updates [16:45:49] I am running puppet now [16:45:55] and I will do the rolling restart [16:46:13] but icinga will start complaining [16:46:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10Papaul) 05Open→03Resolved Drained power from the host, the host is back online. Please change the status in Netbox when the serv... [16:46:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:46:54] no need for depooling either since requests will be forwarded to the main cluster [16:48:36] ACKNOWLEDGEMENT - MD RAID on wdqs2007 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T281504 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:48:41] 10SRE, 10ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281504 (10ops-monitoring-bot) [16:53:03] PROBLEM - memcached socket on mw2258 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:53:51] PROBLEM - memcached socket on mw2274 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:54:01] PROBLEM - memcached socket on mw2275 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:54:31] PROBLEM - memcached socket on mw2310 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:55:05] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [16:55:27] PROBLEM - memcached socket on mw2309 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:55:27] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:59] PROBLEM - 
memcached socket on mw2333 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:55:59] PROBLEM - memcached socket on mw2329 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:56:19] PROBLEM - memcached socket on mw2315 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:56:21] PROBLEM - memcached socket on mw2325 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:56:45] PROBLEM - memcached socket on mw2331 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:56:45] PROBLEM - memcached socket on mw2336 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:56:45] PROBLEM - memcached socket on mw2353 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:56:55] PROBLEM - memcached socket on mw2337 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:56:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) 05Open→03Resolved - Drained power - update the CPLD firmware -Relocate the server from U 3 to U 32 - Change switch power xe-7/0/1... [16:57:21] PROBLEM - memcached socket on mw2357 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:57:24] ^ Is the above intended? 
[16:57:30] yes, expected [16:57:36] <_joe_> W13: yes, see above :) [16:57:59] ah I see, over read it ^^ [16:58:05] PROBLEM - memcached socket on mw2363 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:58:06] <_joe_> effie: maybe we can force a puppet run on icinga? [16:58:16] I already did [16:58:19] wtf [16:58:21] (03PS29) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [16:58:23] (03CR) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [16:58:24] W13: that's also 2* so codfw and passive (until multi dc exists) [16:58:39] PROBLEM - memcached socket on mw2361 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:58:42] I am running it again [16:58:43] PROBLEM - memcached socket on mw2383 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:58:44] <_joe_> effie: maybe you forced the puppet run only in eqiad? 
[16:58:55] <_joe_> that would explain why codfw is complaining [16:59:01] RECOVERY - SSH on wdqs2007 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:59:21] PROBLEM - memcached socket on mw2390 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:59:29] PROBLEM - memcached socket on mw2385 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:59:29] PROBLEM - memcached socket on mw2375 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:59:31] PROBLEM - memcached socket on mw2387 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:59:37] I think I did codfw too [16:59:37] PROBLEM - memcached socket on mw2388 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:59:37] PROBLEM - memcached socket on mw2386 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:59:53] PROBLEM - memcached socket on mw2406 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [17:00:05] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T1700). 
[17:00:17] PROBLEM - memcached socket on mw2392 is CRITICAL: connect to file socket /run/memcached/memcached.sock: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [17:00:19] I did on both DCs, I seriously can't understand [17:00:31] anyway, running on icinga* hosts again [17:01:54] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:01] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [17:10:30] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:56] (03PS30) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [17:13:20] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 113879 bytes in 8.939 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [17:13:50] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:24] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [17:20:03] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Multichill) As a work around on pages like https://www.wikidata.org/wiki/Wikidata:WikiPr... 
[17:30:52] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10dpifke) Thank you for the excellent commit message on the vcl_hit change! It makes the issue very clear. (Future readers of... [17:34:04] (03PS31) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [17:37:22] (03PS1) 10Krinkle: objectcache: set ATTR_DURABILITY in MemcachedBagOStuff [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683629 [17:37:33] (03PS1) 10Krinkle: objectcache: set ATTR_DURABILITY in MemcachedBagOStuff [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683630 [17:39:16] (03PS2) 10Krinkle: objectcache: set ATTR_DURABILITY in MemcachedBagOStuff [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683629 (https://phabricator.wikimedia.org/T281480) [17:39:22] (03CR) 10Krinkle: [C: 03+2] objectcache: set ATTR_DURABILITY in MemcachedBagOStuff [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683629 (https://phabricator.wikimedia.org/T281480) (owner: 10Krinkle) [17:39:28] (03PS2) 10Krinkle: objectcache: set ATTR_DURABILITY in MemcachedBagOStuff [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683630 (https://phabricator.wikimedia.org/T281480) [17:47:45] * Krinkle acqs deploy lock [17:47:50] * Krinkle will be staging on mwdebug1002 [18:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:02:47] (03Merged) 10jenkins-bot: objectcache: set ATTR_DURABILITY in MemcachedBagOStuff [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683629 (https://phabricator.wikimedia.org/T281480) (owner: 10Krinkle) [18:04:25] (03CR) 10Krinkle: [C: 03+2] objectcache: set ATTR_DURABILITY in MemcachedBagOStuff [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683630 (https://phabricator.wikimedia.org/T281480) (owner: 10Krinkle) [18:04:44] * Krinkle the wmf.3/group0 patch is staged on mwdebug1002 [18:10:52] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10serviceops: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10Papaul) [18:10:57] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.3/includes/libs/objectcache/MemcachedBagOStuff.php: I926797a9d494a31, T281480 (duration: 01m 09s) [18:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:05] T281480: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections) - https://phabricator.wikimedia.org/T281480 [18:12:25] 10SRE, 10Security-Team: Request to Join Security Mailing List - https://phabricator.wikimedia.org/T281357 (10Dzahn) p:05Triage→03Medium [18:12:54] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10serviceops: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (10Papaul) 05Open→03Resolved complete [18:13:50] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) faulty switch shipped out today. Tracking information below https://www.ups.com/track?loc=en_US&Requester=NES&tracknum=1ZA19A021293539705&AgreeToTermsAndConditions=yes&WT.z_eCTAid=ct1_... 
[18:31:05] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/683695 [18:31:57] (03Merged) 10jenkins-bot: objectcache: set ATTR_DURABILITY in MemcachedBagOStuff [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/683630 (https://phabricator.wikimedia.org/T281480) (owner: 10Krinkle) [18:32:38] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Guys, everything is ok now. Thanks for the help. [18:32:53] staged wmf.1/group1-2 patch on mwdebug1002 [18:33:24] !log LDAP - added mmandere to wmf group (T281344) [18:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:32] T281344: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 [18:38:23] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.1/includes/libs/objectcache/MemcachedBagOStuff.php: I926797a9d494a31, T281480 (duration: 01m 08s) [18:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:31] T281480: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections) - https://phabricator.wikimedia.org/T281480 [18:41:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) [18:43:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) @MMandere You have been added to the "wmf" LDAP group. This means many new logins that should now work. Among them for example https://alerts.wikimedia.org and many others... 
[18:44:56] (03PS1) 10Ottomata: Remove refinery-drop-webrequest-sampled-druid job from test cluster [puppet] - 10https://gerrit.wikimedia.org/r/683698 (https://phabricator.wikimedia.org/T273789) [18:45:23] (03PS2) 10Ottomata: Remove refinery-drop-webrequest-sampled-druid job from test cluster [puppet] - 10https://gerrit.wikimedia.org/r/683698 (https://phabricator.wikimedia.org/T273789) [18:46:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) Here is more info what this group does: https://wikitech.wikimedia.org/wiki/LDAP/Groups#wmf_group [18:47:07] (03PS3) 10Herron: logstash: move kafka input configs to profile::logstash::kafka_inputs [puppet] - 10https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) [18:47:33] (03CR) 10Ottomata: [C: 03+2] Remove refinery-drop-webrequest-sampled-druid job from test cluster [puppet] - 10https://gerrit.wikimedia.org/r/683698 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [18:47:45] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/29296/logstash1023.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [18:52:52] (03CR) 10Herron: [C: 03+1] hieradata: introduce 'public_domain' variable [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676383 (owner: 10Filippo Giunchedi) [18:53:12] (03PS1) 10Legoktm: planet: Add new wikimedia.brussels blog [puppet] - 10https://gerrit.wikimedia.org/r/683700 [18:53:29] (03PS2) 10Legoktm: planet: Add new wikimedia.brussels blog [puppet] - 10https://gerrit.wikimedia.org/r/683700 [18:53:31] (03CR) 10Herron: [C: 03+1] pontoon: enable sso for alerts in cloud [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676386 (owner: 10Filippo Giunchedi) [18:58:03] !log graphite1004/2003: prune 
/var/lib/carbon/whisper/MediaWiki_ExternalGuidance_init_Google_tr_fr (bad data from Nov 2019) [18:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:54] 10SRE, 10Platform Engineering, 10Services, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Dzahn) for the record: it's also possible to request a shared google group that acts like a... [18:59:36] !log graphite1004/2003: prune /var/lib/carbon/whisper/rl-minify-* (bad data from Aug 2018) [18:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] liw and longma: How many deployers does it take to do MediaWiki train - European+American Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T1900). [19:01:41] !log graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki/wanobjectcache/revision_row_1/ (bad data from Sep 2019) [19:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:09] (03CR) 10Herron: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [19:02:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:03:15] (03CR) 10Dzahn: [C: 03+2] "nice, confirmed working" [puppet] - 10https://gerrit.wikimedia.org/r/683700 (owner: 10Legoktm) [19:03:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:04:31] 10SRE, 10Datacenter-Switchover: June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Legoktm) [19:04:36] PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 276088 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [19:04:51] 10SRE, 10Datacenter-Switchover: June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Legoktm) [19:04:55] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Legoktm) [19:08:13] (03PS1) 10Bstorm: toolschecker: update the etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/683705 (https://phabricator.wikimedia.org/T279723) [19:10:26] (03CR) 10Bstorm: [C: 03+2] toolschecker: update the etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/683705 (https://phabricator.wikimedia.org/T279723) (owner: 10Bstorm) [19:11:03] (03CR) 10Dzahn: "thank you! :) this unblocked everything that wants to use envoy on bullseye.. 
which nowadays is a LOT" [puppet] - 10https://gerrit.wikimedia.org/r/683133 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond) [19:12:00] (03PS1) 10Dzahn: Revert "Revert "site: add peopleweb role to people1003"" [puppet] - 10https://gerrit.wikimedia.org/r/683631 [19:13:57] (03CR) 10Bstorm: [C: 03+2] cloudstore: Collapse drbd vs symlinks into one profile [puppet] - 10https://gerrit.wikimedia.org/r/683445 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [19:16:06] (03PS1) 10Herron: add kafka-main[12]00[45] to existing kafka-main egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) [19:17:00] (03PS2) 10Dzahn: Revert "Revert "site: add peopleweb role to people1003"" [puppet] - 10https://gerrit.wikimedia.org/r/683631 [19:17:23] (03CR) 10Herron: "> Patch Set 3: -Code-Review" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [19:19:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:21:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:26:04] RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10RobH) [19:32:15] !log dpifke@deploy1002 Started deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 [19:32:21] !log dpifke@deploy1002 Finished deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 (duration: 00m 05s) [19:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:11] 10SRE, 10Wikimedia-Planet: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (10Dzahn) On IRC you mentioned a github link that seemed promising. 
[19:43:29] (03PS1) 10Jeena Huneidi: group1 wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683710 [19:43:31] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683710 (owner: 10Jeena Huneidi) [19:44:11] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683710 (owner: 10Jeena Huneidi) [19:45:24] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 refs T278347 [19:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:32] T278347: 1.37.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T278347 [19:46:32] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 08s) [19:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:29] (03PS1) 10Jeena Huneidi: all wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683714 [19:57:31] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683714 (owner: 10Jeena Huneidi) [19:59:05] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683714 (owner: 10Jeena Huneidi) [20:00:20] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.3 refs T278347 [20:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:28] T278347: 1.37.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T278347 [20:01:00] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 113873 bytes in 3.515 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [20:10:56] (03PS1) 10Jdlrobson: Prepare for new 
configuration option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683720 (https://phabricator.wikimedia.org/T277951) [20:16:17] !log Restart tendril database - T281486 [20:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:25] T281486: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 [20:16:52] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [20:18:35] ^ that is expected [20:18:55] 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Krinkle) [20:20:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_internal_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:35:40] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "site: add peopleweb role to people1003"" [puppet] - 10https://gerrit.wikimedia.org/r/683631 (owner: 10Dzahn) [20:36:42] marostegui: thx for ACK [20:39:38] PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:04] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people1003), No backups: 5 (conf1004, ...), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [20:50:23] I will look at the backup alert, it's because I re-added a role to a new VM [20:50:35] that has a backup::set in it [20:51:21] (03PS2) 10BryanDavis: wikimedia.org: TXT entry for GitHub domain verified profile [dns] - 10https://gerrit.wikimedia.org/r/661180 (owner: 10Krinkle) [20:53:06] (03PS3) 10BBlack: Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 (https://phabricator.wikimedia.org/T278182) [20:53:40] (03PS1) 10RobH: adding in single cpu suport sku [software] - 10https://gerrit.wikimedia.org/r/683727 [20:53:57] (03CR) 10RobH: [C: 03+2] adding in single cpu suport sku [software] - 10https://gerrit.wikimedia.org/r/683727 (owner: 10RobH) [20:54:29] (03Merged) 10jenkins-bot: adding in single cpu suport sku [software] - 10https://gerrit.wikimedia.org/r/683727 (owner: 10RobH) [20:54:34] !log Stop mysql on tendril for the UTC night, dbtree and tendril will remain down for a few hours T281486 [20:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:42] T281486: Create a cronjob or an event to truncate/delete rows from tendril.general_log_sampled table - https://phabricator.wikimedia.org/T281486 [20:54:47] (03CR) 10BryanDavis: [C: 03+1] "Arbitrarily
adding mutante as reviewer. Feel free to remove yourself in favor of someone else! I'm just hoping to help Krinkle move this f" [dns] - 10https://gerrit.wikimedia.org/r/661180 (owner: 10Krinkle) [20:56:38] (03PS4) 10BBlack: Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 (https://phabricator.wikimedia.org/T278182) [20:58:42] ACKNOWLEDGEMENT - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service Marostegui known https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:42] ACKNOWLEDGEMENT - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui known https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [21:00:20] 10SRE, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) This needs to live again [21:00:36] PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:01:00] 10SRE, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) Technically in progress between @CDanis and me. 
[21:02:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:03:24] 10SRE, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) Quote opened in T281530 [21:03:29] 10SRE, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) a:03lmata [21:03:56] PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:01] ACKNOWLEDGEMENT - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service Marostegui known https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:01] ACKNOWLEDGEMENT - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service Marostegui known https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:05:21] ACKNOWLEDGEMENT - people.wikimedia.org requires authentication on people1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found daniel_zahn new VM https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:06:11] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people1003), No backups: 5 (conf1004, ...), Fresh: 97 jobs daniel_zahn people1003 just got the role again, investigating https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [21:15:16] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:18] PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:08] !log backup1001 - sudo check_bacula.py --icinga [21:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:19] (03PS1) 10Dzahn: bacula: add people1003 job to monitoring ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/683732 [21:24:42] (03CR) 10Dzahn: [C: 03+2] bacula: add people1003 job to monitoring ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/683732 (owner: 10Dzahn) [21:27:10] (03CR) 10Dzahn: "[backup1001:~] $ sudo check_bacula.py --icinga" [puppet] - 10https://gerrit.wikimedia.org/r/683732 (owner: 10Dzahn) [21:29:26] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) >>! In T280989#7038825, @jcrespo wrote: > No problem. Sadly, it is my job to bother people from time to time, making sure backups are working 0:-). 
I had removed the role with the backup::set and tod... [21:32:38] !log icinga - enabled notifications for checks on ms-backup1001 - they were all manually disabled but none of the checks had any status change since 50 days which indicates it was forgotten to turn them back on which is a common issue with disabling notifications [21:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:39] !log icinga - enabling disabled notifications for random an-worker nodes where mgmt interface had enabled alerts but the actual host didnt [21:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:41] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/29297/" [puppet] - 10https://gerrit.wikimedia.org/r/682769 (owner: 10Dzahn) [21:42:58] (03PS1) 10Bstorm: cloudstore: enable drbd on cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/683737 (https://phabricator.wikimedia.org/T224747) [21:47:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:47:15] (03PS2) 10Legoktm: mailman3: Puppetize Amir's migration script [puppet] - 10https://gerrit.wikimedia.org/r/683462 [21:47:17] (03PS3) 10Legoktm: mailman3: Improve migration script [puppet] - 10https://gerrit.wikimedia.org/r/683463 [21:47:19] (03PS2) 10Legoktm: mailman3: Fix URL postorius provides to fetch templates [puppet] - 10https://gerrit.wikimedia.org/r/683520 (https://phabricator.wikimedia.org/T281425) [21:47:21] (03PS5) 10Legoktm: mailman3: Properly import templates for admins [puppet] - 10https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425) [21:47:23] (03PS2) 10Legoktm: mailman3: Fork and improve upstream templates [puppet] - 10https://gerrit.wikimedia.org/r/683555 
(https://phabricator.wikimedia.org/T281425) [21:47:47] (03CR) 10jerkins-bot: [V: 04-1] mailman3: Puppetize Amir's migration script [puppet] - 10https://gerrit.wikimedia.org/r/683462 (owner: 10Legoktm) [21:48:15] (03CR) 10Legoktm: [V: 03+2 C: 03+2] mailman3: Puppetize Amir's migration script [puppet] - 10https://gerrit.wikimedia.org/r/683462 (owner: 10Legoktm) [21:48:25] (03CR) 10Legoktm: [C: 03+2] mailman3: Improve migration script [puppet] - 10https://gerrit.wikimedia.org/r/683463 (owner: 10Legoktm) [21:49:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:54:25] (03PS1) 10Bstorm: toolforge: remove spreadcheck for etcd [puppet] - 10https://gerrit.wikimedia.org/r/683739 [21:59:19] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10BBlack) Note - https://gerrit.wikimedia.org/r/c/operations/puppet/+/683026 has the production roles and config, but we'll need to reimage them into this rather than... 
[21:59:42] (PS1) Dzahn: rsync::quickdatacopy: ensure a 'passive' host gets an rsync client [puppet] - https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989)
[21:59:53] (PS2) Bstorm: cloudstore: enable drbd on cloudstore1008 [puppet] - https://gerrit.wikimedia.org/r/683737 (https://phabricator.wikimedia.org/T224747)
[22:00:10] (PS2) Dzahn: rsync::quickdatacopy: ensure a destination host gets an rsync client [puppet] - https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989)
[22:04:25] (PS1) Dzahn: peopleweb: ensure rsync is installed [puppet] - https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989)
[22:05:34] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:38] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:08:10] (CR) Dzahn: [C: -1] "Duplicate declaration: Package[rsync] is already declared at (file: /srv/jenkins-workspace/puppet-compiler/29300/change/src/modules/profil" [puppet] - https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn)
[22:09:54] (CR) Dzahn: [C: -1] "but ... if it's a duplicate declaration and there is:" [puppet] - https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn)
[22:13:02] (CR) Dzahn: "also see: https://gerrit.wikimedia.org/r/c/operations/puppet/+/683742 and the comment there.
that fails with a duplicate declaration becau" [puppet] - https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn)
[22:20:14] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
[22:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:22] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[22:20:49] (CR) Dzahn: "well.. this one compiles on a dest_host like releases2002" [puppet] - https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn)
[22:21:03] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
[22:21:06] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
[22:21:06] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
[22:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:22] (CR) Dzahn: [C: -1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/683741 compiles on releases2002 (another dest_host) though just fine" [puppet] - https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn)
[22:21:23] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:18] (PS3) Dzahn: wikimedia.org: TXT entry for GitHub domain verified profile [dns] - https://gerrit.wikimedia.org/r/661180 (https://phabricator.wikimedia.org/T207364) (owner: Krinkle)
[22:26:21] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563
[22:26:22] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563
[22:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:30] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[22:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:40] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:44] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:27:00] !log T280563 `urllib3.exceptions.NewConnectionError: : Failed to establish a new connection: [Errno -2] Name or service not known`
[22:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:08] log T280563 Spotted the issue; forgot to set `--without-lvs` for relforge reboot
[22:30:35] (PS1) Dzahn: Add miscweb namespace [deployment-charts] - https://gerrit.wikimedia.org/r/683743
[22:31:30] (CR) Legoktm: [C: +2] mailman3: Fix URL postorius provides to fetch
templates [puppet] - https://gerrit.wikimedia.org/r/683520 (https://phabricator.wikimedia.org/T281425) (owner: Legoktm)
[22:32:11] ryankemper: you missed a ! for that last log
[22:32:18] !log T280563 Spotted the issue; forgot to set `--without-lvs` for relforge reboot
[22:32:21] mutante: thanks :)
[22:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:26] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[22:32:54] we need a bot for that :D
[22:33:54] :) meta-logmsgbot
[22:35:46] It looks like you're logging
[22:36:52] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
[22:36:53] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
[22:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:32] (CR) Legoktm: [C: +2] mailman3: Properly import templates for admins [puppet] - https://gerrit.wikimedia.org/r/683477 (https://phabricator.wikimedia.org/T281425) (owner: Legoktm)
[22:40:23] SRE, Services, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn)
[22:44:15] !log T280563 Bleh, we never moved the new config into spicerack, so it's trying to talk to the old relforge hosts which no longer exist.
Will reboot relforge manually and use the cookbook for codfw/eqiad, and circle back later for the spicerack change
[22:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:22] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[22:46:56] !log T280563 Current master is `relforge1003-relforge-eqiad`, will reboot `1004` first then `1003` after
[22:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:19] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29305/console" [puppet] - https://gerrit.wikimedia.org/r/683555 (https://phabricator.wikimedia.org/T281425) (owner: Legoktm)
[23:00:04] brennen: That opportune time is upon us again. Time for a US Backport and Config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210429T2300).
[23:01:50] (PS3) Legoktm: mailman3: Fork and improve upstream templates [puppet] - https://gerrit.wikimedia.org/r/683555 (https://phabricator.wikimedia.org/T281425)
[23:02:15] oh boy another deployment training is upon us
[23:02:17] (CR) jerkins-bot: [V: -1] mailman3: Fork and improve upstream templates [puppet] - https://gerrit.wikimedia.org/r/683555 (https://phabricator.wikimedia.org/T281425) (owner: Legoktm)
[23:02:43] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29306/console" [puppet] - https://gerrit.wikimedia.org/r/683555 (https://phabricator.wikimedia.org/T281425) (owner: Legoktm)
[23:03:35] (PS4) Legoktm: mailman3: Fork and improve upstream templates [puppet] - https://gerrit.wikimedia.org/r/683555 (https://phabricator.wikimedia.org/T281425)
[23:04:05] (PS1) Thcipriani: DEMO: Add newline to README [mediawiki-config] - https://gerrit.wikimedia.org/r/683747
[23:05:25] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
[23:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:33] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[23:06:10] !log T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts`
[23:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:51] hi xSavitar
[23:06:55] hey xSavitar
[23:06:59] hi :)
[23:07:47] (CR) Thcipriani: [C: +2] "Backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/683747 (owner: Thcipriani)
[23:08:31] (Merged) jenkins-bot:
DEMO: Add newline to README [mediawiki-config] - https://gerrit.wikimedia.org/r/683747 (owner: Thcipriani)
[23:08:36] oops, slight difference between the command I logged and the command I actually ran
[23:08:44] !log T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` (amended command)
[23:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:09] (PS5) Legoktm: mailman3: Fork and improve upstream templates [puppet] - https://gerrit.wikimedia.org/r/683555 (https://phabricator.wikimedia.org/T281425)
[23:16:26] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:46] PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:52] !log thcipriani@deploy1002 Synchronized README: Config: [[gerrit:683747|DEMO: Add newline to README]] (duration: 00m 56s)
[23:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:07] !log T280563 successful reboot of `relforge100[3,4]`; `relforge` cluster is back to green status.
[23:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:14] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[23:18:44] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:44] (CR) Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29307/" [puppet] - https://gerrit.wikimedia.org/r/683737 (https://phabricator.wikimedia.org/T224747) (owner: Bstorm)
[23:19:12] RECOVERY - Check systemd state on elastic2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:21:08] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:59] (PS1) Thcipriani: Revert "DEMO: Add newline to README" [mediawiki-config] - https://gerrit.wikimedia.org/r/683749
[23:24:32] (CR) Thcipriani: [C: +2] "Config" [mediawiki-config] - https://gerrit.wikimedia.org/r/683749 (owner: Thcipriani)
[23:25:27] (Merged) jenkins-bot: Revert "DEMO: Add newline to README" [mediawiki-config] - https://gerrit.wikimedia.org/r/683749 (owner: Thcipriani)
[23:25:42] PROBLEM - Check systemd state on elastic2041 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:06] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:50] PROBLEM - Check systemd state on elastic2042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:34] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:54] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup) >>! In T280322#7045910, @Waihorace wrote: > I am a list admin for wikinews-zh and I guess we are ready for the upgrade. Thanks. I will migrate it on Monday. > > But...
[23:34:28] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:28] PROBLEM - Check systemd state on elastic2044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:29] (CR) Legoktm: [C: +2] mailman3: Fork and improve upstream templates [puppet] - https://gerrit.wikimedia.org/r/683555 (https://phabricator.wikimedia.org/T281425) (owner: Legoktm)
[23:36:19] !log thcipriani@deploy1002 Synchronized README: Config: [[gerrit:683749|Revert "DEMO: Add newline to README"]] (duration: 00m 56s)
[23:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:42] (Sorry about the `Check systemd state` noise, it looks like our cookbook is lifting that downtime a little bit too early)
[23:48:10]
RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:52:42] RECOVERY - Check systemd state on elastic2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:53:52] RECOVERY - Check systemd state on elastic2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:56:44] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:58:04] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:59:08] RECOVERY - Check systemd state on elastic2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state