[00:00:08] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (RobH) So these are chugging along just fine, and didn't fall to the manual partition menu. I suspect my change didn't hit apt server, just install host...
[00:01:57] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1014.eqiad.wmnet with reason: REIMAGE
[00:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:21] (CR) Dzahn: "for the record, I replaced this list with one generated on thumbor1002 instead of mw2300 but the fc-list content stayed the same" [mediawiki-config] - https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: Dzahn)
[00:03:38] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on snapshot1015.eqiad.wmnet with reason: REIMAGE
[00:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:00] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1011.eqiad.wmnet', 'snapshot1012.eqiad.wmnet', 'snapshot1013.eqiad.wmnet', 'snapshot101...
[00:11:35] (PS1) Dzahn: ci::master/deployment_server: add new k8s namespace for shellbox [puppet] - https://gerrit.wikimedia.org/r/683111 (https://phabricator.wikimedia.org/T260330)
[00:32:12] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (RobH)
[00:32:49] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (RobH) Open→Resolved @ArielGlenn These are all yours!
[00:38:59] (PS14) Mstyles: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006)
[00:39:01] (PS4) Mstyles: rdf-streaming-updater: enable HA capability [deployment-charts] - https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098)
[00:40:39] (CR) Mstyles: rdf-streaming-updater: enable HA capability (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: Mstyles)
[00:45:07] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[01:00:15] !log robh@cumin1001 START - Cookbook sre.dns.netbox
[01:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:10] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:51] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (dpifke) Alternatively, can we get identical results just by incrementing `grace` by `keep`? (And possibly setting `keep` to 0...
[01:40:57] (CR) Razzi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: Razzi)
[02:28:23] !log robh@cumin1001 START - Cookbook sre.dns.netbox
[02:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:47] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:43] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:28:14] SRE, Wikimedia-Mailing-lists: hyperkitty didn't import all wikitech-l messages - https://phabricator.wikimedia.org/T281070 (Legoktm) It seems like every email with an emoji in it from wikimedia-l was skipped: ERROR: https://lists.wikimedia.org/pipermail/wikimedia-l/2020-September/095629.html not in hype...
[03:32:06] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1013.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage`
[03:32:15] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2007.codfw.wmnet` on `ryankemper@cumin1001` tmux session `reimage`
[03:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:32:18] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[03:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:33:12] !log `sudo systemctl restart wdqs-blazegraph` on `wdqs1012` to clear the `WDQS SPARQL` warning
[03:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:36:03] SRE, Wikimedia-Mailing-lists: hyperkitty didn't import all wikitech-l messages - https://phabricator.wikimedia.org/T281070 (Legoktm) A bunch of older messages were also skipped, but those look like the archives are corrupt? e.g.
https://lists.wikimedia.org/pipermail/wikimedia-l/2005-October/064281.html...
[03:36:55] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:48:31] !log ryankemper@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1013.eqiad.wmnet
[03:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:49:12] !log ryankemper@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2007.codfw.wmnet
[03:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:50:03] (CR) Ryan Kemper: elasticsearch: refactor various rolling operations (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[03:51:17] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE
[03:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:53:16] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE
[03:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:18] (PS14) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[04:01:17] (CR) jerkins-bot: [V: -1] elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[04:03:50] In 1h I will be switching enwiki db master
[04:05:21] ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (RKemper)
[04:07:08] SRE, ops-codfw,
Discovery: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215 (RKemper) Made a ticket using the hardware failure template from the dc-ops group. In retrospect I probably should have just copied over the template to here but wasn't sure if the template does anythi...
[04:07:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1163 with weight 0 before the switchover T278214', diff saved to https://phabricator.wikimedia.org/P15598 and previous config saved to /var/cache/conftool/dbconfig/20210428-040718-marostegui.json
[04:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:07:28] T278214: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214
[04:08:01] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[04:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:13] !log Start replication changes, connect everything to db1163 T278214
[04:08:19] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage`
[04:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:30] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[04:09:34] (PS15) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[04:13:03] (CR) jerkins-bot: [V: -1] elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[04:13:22] SRE, Platform Engineering, Services,
Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (Legoktm) >>! In T278516#6949784, @Dzahn wrote: > Imho it should be part of offboarding workflows to check for lis...
[04:13:32] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[04:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:14:32] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[04:14:36] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage`
[04:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:14:47] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[04:20:00] (PS2) Marostegui: wmnet: Update s1-master to the right master [dns] - https://gerrit.wikimedia.org/r/682882 (https://phabricator.wikimedia.org/T278214)
[04:20:09] (PS4) Marostegui: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214)
[04:22:58] SRE, Platform Engineering, Services, Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (Legoktm) I proposed closing the list: https://lists.wikimedia.org/pipermail/services/2021-April/000195.html
[04:24:33] (CR) Marostegui: [C: +2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214) (owner: Marostegui)
[04:28:07] SRE, ops-codfw, DBA: codfw: Relocate servers in 10G racks -
https://phabricator.wikimedia.org/T281135 (Marostegui) I will get db2096 ready for you.
[04:31:08] (PS16) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[04:32:34] (PS1) Marostegui: db1167: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/683121 (https://phabricator.wikimedia.org/T258361)
[04:33:15] (CR) Marostegui: [C: +2] db1167: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/683121 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[04:33:44] (CR) jerkins-bot: [V: -1] elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[04:34:00] In 30 minutes I will be switching enwiki db master
[04:39:07] SRE, Wikimedia-Mailing-lists: Install mailman3 on lists1001.wikimedia.org - https://phabricator.wikimedia.org/T278610 (Legoktm)
[04:39:28] SRE, Wikimedia-Mailing-lists, Patch-For-Review, User-Ladsgroup: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (Legoktm) Open→Resolved a:Ladsgroup
[04:42:39] (PS17) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[04:43:22] (PS18) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[04:45:58] (CR) jerkins-bot: [V: -1] elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[04:48:47] (PS19) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701
(https://phabricator.wikimedia.org/T280221)
[04:52:41] (CR) jerkins-bot: [V: -1] elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[05:00:13] I am going to start enwiki switchover
[05:00:21] !log Starting s1 eqiad failover from db1083 to db1163 - T278214
[05:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:30] T278214: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214
[05:00:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s1 as read-only for maintenance T278214', diff saved to https://phabricator.wikimedia.org/P15599 and previous config saved to /var/cache/conftool/dbconfig/20210428-050041-marostegui.json
[05:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:56] RO confirmed
[05:01:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1163 to s1 master and remove read-only from s1 T278214', diff saved to https://phabricator.wikimedia.org/P15600 and previous config saved to /var/cache/conftool/dbconfig/20210428-050138-marostegui.json
[05:01:45] RO removed
[05:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:04] Test edit worked fine
[05:02:11] I can edit fine again yes
[05:02:13] same here
[05:02:18] hey sobanski and jynus o/
[05:02:27] recentchanges is moving
[05:02:34] tendril looks good
[05:02:48] no errors on log
[05:03:31] (PS1) Legoktm: mailman3: Make sure mailman-web uses utf8mb4 as well [puppet] - https://gerrit.wikimedia.org/r/683123
[05:03:38] (but we should really look to minimize those on topology changes)
[05:04:11] recentchanges keeps going fine and I can see edits happening on the master
[05:04:32] (CR) TsepoThoabala: [C: +1] Enable partial action blocks on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/683089
(https://phabricator.wikimedia.org/T280528) (owner: Tchanders)
[05:04:38] (CR) Marostegui: [C: +2] wmnet: Update s1-master to the right master [dns] - https://gerrit.wikimedia.org/r/682882 (https://phabricator.wikimedia.org/T278214) (owner: Marostegui)
[05:05:14] I think we are good sobanski and jynus. Thanks for the support :*
[05:05:15] (CR) TsepoThoabala: [C: +1] Enable partial action blocks on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/683088 (https://phabricator.wikimedia.org/T280528) (owner: Tchanders)
[05:05:48] last edit 5:00, first edit 05:01
[05:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1083 (old s1 master) for schema change', diff saved to https://phabricator.wikimedia.org/P15601 and previous config saved to /var/cache/conftool/dbconfig/20210428-050754-marostegui.json
[05:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:54] (CR) Legoktm: [C: +2] mailman3: Make sure mailman-web uses utf8mb4 as well [puppet] - https://gerrit.wikimedia.org/r/683123 (owner: Legoktm)
[05:15:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P15602 and previous config saved to /var/cache/conftool/dbconfig/20210428-051526-marostegui.json
[05:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15603 and previous config saved to /var/cache/conftool/dbconfig/20210428-052915-root.json
[05:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:43] (PS20) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[05:44:19] !log marostegui@cumin1001 dbctl commit (dc=all):
'db1112 (re)pooling @ 50%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15604 and previous config saved to /var/cache/conftool/dbconfig/20210428-054419-root.json
[05:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:05] (PS1) Marostegui: instances.yaml: Add db1167 to dbctl [puppet] - https://gerrit.wikimedia.org/r/683124 (https://phabricator.wikimedia.org/T258361)
[05:50:08] (CR) Marostegui: [C: +2] instances.yaml: Add db1167 to dbctl [puppet] - https://gerrit.wikimedia.org/r/683124 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[05:51:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1167 in s8 T258361', diff saved to https://phabricator.wikimedia.org/P15605 and previous config saved to /var/cache/conftool/dbconfig/20210428-055144-marostegui.json
[05:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:54] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[05:52:23] SRE, Wikimedia-Mailing-lists, Upstream, User-RhinosF1: Gravatar add link still shows in profile - https://phabricator.wikimedia.org/T278410 (Ladsgroup) Open→Resolved
[05:52:37] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:56:27] (PS1) Marostegui: install_server: Do not reimage db1156,db1167 [puppet] - https://gerrit.wikimedia.org/r/683125
[05:57:14] (CR) Marostegui: [C: +2] install_server: Do not reimage db1156,db1167 [puppet] - https://gerrit.wikimedia.org/r/683125 (owner: Marostegui)
[05:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15606 and previous config saved to
/var/cache/conftool/dbconfig/20210428-055922-root.json
[05:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:02] !log Stop MySQL on db2096 (x1 codfw) T281135
[06:00:04] Amir1 and legoktm: #bothumor I � Unicode. All rise for Mailman3 install deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T0600).
[06:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:10] T281135: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135
[06:00:14] the time has come
[06:00:18] all raise
[06:00:22] jouncebot is trolling us with that unicode joke
[06:01:17] SRE, ops-codfw, DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Marostegui) @Papaul db2096 is now off, so you can proceed as needed.
[06:02:04] (CR) Legoktm: [C: +2] mariadb: Allow lists1001.wikimedia.org to talk to m5 [puppet] - https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614) (owner: Legoktm)
[06:03:59] ok, lists1001 can talk to m5-master now
[06:04:28] marostegui: FYI we are deploying :D
[06:04:32] good
[06:04:33] I am here
[06:06:11] (PS1) Legoktm: lists: Enable Mailman3 on lists1001 [puppet] - https://gerrit.wikimedia.org/r/683147 (https://phabricator.wikimedia.org/T278610)
[06:07:08] (CR) Ladsgroup: [C: +1] lists: Enable Mailman3 on lists1001 [puppet] - https://gerrit.wikimedia.org/r/683147 (https://phabricator.wikimedia.org/T278610) (owner: Legoktm)
[06:07:15] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29241/console" [puppet] - https://gerrit.wikimedia.org/r/683147 (https://phabricator.wikimedia.org/T278610) (owner: Legoktm)
[06:07:51] That diff is wow https://puppet-compiler.wmflabs.org/compiler1002/29241/lists1001.wikimedia.org/index.html
[06:09:37] I'm going to change the default charset on
mailman3/mailman3web to utf8mb4 first
[06:09:56] SRE, ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (elukey) @Ottomata @razzi this task needs some follow up :)
[06:10:19] done
[06:10:34] Amir1: lgtm, all set?
[06:11:08] double checking. Are the passwords in private repo?
[06:11:44] (PS1) Legoktm: lists: Add mailman3-roots group to lists1001 [puppet] - https://gerrit.wikimedia.org/r/683148
[06:12:16] yes
[06:12:36] db_password, web::db_password, api_password, web::secret, archiver_key
[06:12:55] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[06:13:08] coool
[06:13:16] I'm double checking the charset in the config https://github.com/wikimedia/puppet/blob/production/modules/mailman3/templates/mailman.cfg.erb#L172
[06:13:24] (CR) Legoktm: [V: +1 C: +2] lists: Enable Mailman3 on lists1001 [puppet] - https://gerrit.wikimedia.org/r/683147 (https://phabricator.wikimedia.org/T278610) (owner: Legoktm)
[06:13:31] https://github.com/wikimedia/puppet/blob/production/modules/mailman3/templates/mailman-web.py.erb#L77
[06:13:35] yup there
[06:13:40] :D
[06:13:48] (CR) Ladsgroup: [C: +1] lists: Add mailman3-roots group to lists1001 [puppet] - https://gerrit.wikimedia.org/r/683148 (owner: Legoktm)
[06:14:10] running puppet
[06:14:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15607 and previous config saved to /var/cache/conftool/dbconfig/20210428-061426-root.json
[06:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:54] Error: Could not set 'file' on ensure: No such file or
directory - A directory component in /etc/mailman3/mailman.cfg20210428-21314-aanovg.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/mailman3/manifests/listserve.pp, line: 34)
[06:14:54] Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/mailman3/mailman.cfg20210428-21314-aanovg.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/mailman3/manifests/listserve.pp, line: 34)
[06:15:17] * Amir1 glues his eyes to https://grafana.wikimedia.org/d/nULM0E1Wk/mailman?orgId=1&from=now-3h&to=now
[06:15:25] I stopped puppet
[06:15:33] /etc/mailman3/ doesn't exist
[06:16:09] I guess normally the package creates it?
[06:16:26] (CR) Aaron Schulz: "The master fallback log events were change to WARNING." [mediawiki-config] - https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: Reedy)
[06:16:37] YES
[06:16:54] hmm, it's a chicken and egg
[06:17:00] ok, it'll get fixed on the second puppet run then
[06:17:34] resuming puppet
[06:17:39] for the first time, the file needs the package, the subsequent runs, the package needs directory
[06:18:12] python3-django-hyperkitty : Depends: python3-django (>= 2:2.2) but 1:1.11.29-1~deb10u1 is to be installed
[06:18:12] Depends: python3-django-mailman3 (>= 1.3.3) but it is not going to be installed
[06:18:18] * legoktm fixes
[06:19:14] we can have puppet create /etc/mailman3
[06:19:18] (later)
[06:19:29] yeah
[06:20:15] (CR) Elukey: [C: -1] "Precautionary -1 just to discuss some things!"
[puppet] - https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[06:20:21] marostegui: I think it's creating the tables now, fyi
[06:20:40] https://lists.wikimedia.org/postorius/
[06:20:50] ok
[06:20:51] https://lists.wikimedia.org/hyperkitty/
[06:21:19] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[06:21:20] example.com sigh
[06:21:24] yeah fixing
[06:21:56] legoktm: I can see the tables on mailman3 database
[06:22:58] same
[06:23:00] and mailman3web
[06:23:03] they look correct this time
[06:23:14] Amir1: fixed the default site
[06:23:20] !log legoktm@lists1001:~$ sudo mailman-web set_default_site --name lists.wikimedia.org --domain lists.wikimedia.org
[06:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:58] Awesome
[06:24:06] now superusers?
[06:24:14] (CR) Elukey: [C: -1] "Another thing - kafka-main200[4,5] don't have their AAAA records in the DNS, still due to https://phabricator.wikimedia.org/T271136" [puppet] - https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[06:24:16] let me send test emails
[06:24:21] https://lists.wikimedia.org/user-profile/ is a 404
[06:24:44] SRE, SRE-tools, IPv6, User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (elukey) @crusnov we are good to deploy the other AAAA records, can we proceed?
[06:25:15] the web for mailman2 is working fine
[06:26:18] hyperkitty is 404ing some times too
[06:26:27] I assume that's edge cache
[06:26:36] add ?urlfoo
[06:26:41] lists isn't behind varnish
[06:26:56] !log created mailman3 superusers for Administrator (noc@), Ladsgroup and Legoktm
[06:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:22] that's interesting
[06:27:24] wtf I'm getting 404s too
[06:27:37] I'm logged in so I don't think it should be cached
[06:27:51] maybe apache is weird
[06:27:57] shall we restart apache again?
[06:28:13] the 404 is at apache level
[06:28:25] restarted apache
[06:28:31] Majavah: logged into MM3?
[06:28:34] yes
[06:28:41] Amir1: btw I set a random pw for you and didn't save it so you'll need to reset it
[06:28:45] we haven't made anything yet ^^
[06:28:54] legoktm: cool. Noted
[06:29:18] it seems restarting apache fixed it
[06:29:44] .....
[06:29:50] user_id = 0 is Majavah
[06:30:03] I still see a 404 on https://lists.wikimedia.org/user-profile/
[06:30:21] lol
[06:30:36] yes, I had that url ready and everything to get a low user id :D
[06:30:45] sigh
[06:30:46] please hold on
[06:30:48] you got the lowest
[06:31:12] anyway, the only thing that 404s is user-account?
[06:31:29] (CR) Legoktm: [C: +2] lists: Add mailman3-roots group to lists1001 [puppet] - https://gerrit.wikimedia.org/r/683148 (owner: Legoktm)
[06:32:33] (CR) Volans: "As requested did a pass. In general looks good to me, nothing wrong with the class API.
I left some nit/question inline, none is a blocker" (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[06:32:52] I make a patch for user-profile
[06:32:55] (PS1) Legoktm: lists: Route /user-profile/ to Mailman3 [puppet] - https://gerrit.wikimedia.org/r/683149
[06:33:04] Amir1: ^
[06:33:07] or +1 lego's
[06:33:20] (CR) Ladsgroup: [C: +1] lists: Route /user-profile/ to Mailman3 [puppet] - https://gerrit.wikimedia.org/r/683149 (owner: Legoktm)
[06:33:21] so is hyperkitty 404ing for anyone else still?
[06:33:41] not to me
[06:34:05] I'm going to guess some apache worker didn't have mod_proxy_uwsgi enabled or something and it needed a clean restart
[06:34:25] apache works in mysterious ways
[06:34:42] Amir1: btw you should have root now
[06:34:54] YES
[06:34:55] Thanks
[06:35:00] ./hyperkitty/ works but not /hyperkitty - both do on lists-next
[06:35:01] yup
[06:35:23] do I need to escape the - in user-profile?
[06:35:38] let me check
[06:35:51] ty, I'll test on polymorphic in the meantime
[06:36:09] oh
[06:36:28] my regex compiler says it's not needed
[06:36:47] it even errors when i try to escape it
[06:37:03] (PS1) Legoktm: lists: Proxy /hyperkitty and co (no trailing /) to Mailman3 too [puppet] - https://gerrit.wikimedia.org/r/683150
[06:37:06] RhinosF1: ^
[06:37:47] (CR) RhinosF1: [C: +1] lists: Proxy /hyperkitty and co (no trailing /) to Mailman3 too [puppet] - https://gerrit.wikimedia.org/r/683150 (owner: Legoktm)
[06:37:55] Ty
[06:38:24] Technically you can just remove it, I don't know any mailman2 endpoint being like that but better safe than sorry I assume
[06:39:06] (PS1) ArielGlenn: bring up snapshot1001,12,13 as dumps testbed hosts [puppet] - https://gerrit.wikimedia.org/r/683151 (https://phabricator.wikimedia.org/T281330)
[06:39:29] https://polymorphic.lists.wmcloud.org/user-profile works now
[06:39:37] (CR) Legoktm: [C: +2] lists: Proxy /hyperkitty and co (no trailing /) to Mailman3 too [puppet] - https://gerrit.wikimedia.org/r/683150 (owner: Legoktm)
[06:39:41] (CR) Legoktm: [C: +2] lists: Route /user-profile/ to Mailman3 [puppet] - https://gerrit.wikimedia.org/r/683149 (owner: Legoktm)
[06:39:43] (CR) Elukey: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: Razzi)
[06:40:26] btw, exim4 on mailman2 works AFAICS
[06:40:38] I got your test mail
[06:40:42] yup
[06:41:21] Shall I migrate test to the new system now?
[06:41:33] go for it
[06:41:39] cool
[06:41:45] I'm going to step away for a minute to grab some snacks
[06:42:11] (PS3) Legoktm: lists: Backup /var/lib/mailman3 [puppet] - https://gerrit.wikimedia.org/r/681763
[06:42:21] go enjoy!
[06:47:22] legoktm: ModuleNotFoundError: No module named 'mailman_hyperkitty' when creating a list
[06:47:48] we need to install mailman3-kyperkitty plugin
[06:47:50] rip
[06:47:53] Give me a minute
[06:48:10] all good. Take your time
[06:49:15] looking now
[06:51:59] (PS1) Legoktm: mailman3: Explicitly install python3-mailman-hyperkitty [puppet] - https://gerrit.wikimedia.org/r/683226
[06:52:03] Amir1: ^
[06:52:33] legoktm: this is good to go but I think there is one else to do as well
[06:52:40] let me grab it from puppetmaster
[06:52:43] wip commits
[06:53:33] I might've accidentally cleaned that up, confusing it with the django-hyperkitty package
[06:54:06] yeah, it's gone
[06:54:12] we can fix it
[06:54:16] let's go with this
[06:54:17] (CR) Legoktm: [C: +2] mailman3: Explicitly install python3-mailman-hyperkitty [puppet] - https://gerrit.wikimedia.org/r/683226 (owner: Legoktm)
[06:54:22] my bad
[06:54:39] all good. All of these packages are REALLY confusing
[06:54:53] like two different hyperkitty
[06:54:57] hyperkitten?
[06:55:06] xD
[06:55:42] ok, installed and restart mailman3
[06:55:45] try again now?
[06:56:45] yup works like a charm
[06:56:49] now upgrading test
[06:57:25] :D
[06:58:06] https://lists.wikimedia.org/postorius/lists/test.lists.wikimedia.org/members/member/
[06:59:02] upgraded, archives and indexes
[06:59:23] https://lists.wikimedia.org/hyperkitty/list/test@lists.wikimedia.org/thread/K5TJFAEQ5IZMCQA4X2A3WN7VHZFQOOPE/
[06:59:50] logged out it says "This mailing list is private. You must be subscribed to view the archives. " SWEET
[07:00:29] search is also working (and not working logged out)
[07:00:52] legoktm: okay, time to migrate lgbt?
[07:01:49] fun question. How can I disable a mailing list?
[07:01:57] there's a script but one sec [07:02:03] I sent a message to test@, waiting for it to show [07:02:06] sure [07:02:21] I got the mail [07:02:32] $ sudo disable_list [07:02:32] Usage: /usr/local/sbin/disable_list [-e|--enable] [07:02:50] I didn't [07:03:00] oh ffs [07:03:05] I sent it to you [07:03:40] !log Deploy schema change on db2089:3316 and db1098:3316 T266486 T268392 T273360 [07:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:52] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 [07:03:52] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 [07:03:52] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 [07:03:58] %(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s [07:04:22] !log add AAAA record for kafka-main2002.codfw.wmnet [07:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:34] Amir1: do you see that in the mail footer? 
[07:04:44] Yup [07:05:18] legoktm: lol [07:05:18] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [07:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:51] ugh, I guess we need to fiddle with the templates thing [07:05:59] Amir1: but I think importing lgbt@ is fine [07:06:09] okay, going for it [07:06:21] oh also it's sudo disable_list [07:06:30] and I already said that >.> [07:07:27] lol [07:07:47] now I'm looking if there's a command to enable it back in mailman3 [07:07:53] but that's for later [07:08:01] for now I do it in the ui [07:08:10] disable_list is a custom script we wrote [07:08:21] it just enables emergency_moderation and bans everyone from sending to the list [07:09:20] that's scary [07:09:47] what do you expect, it's mm2 [07:09:57] all users now imported but as non-member [07:10:01] because they are banned [07:10:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:10:12] uhm [07:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:23] so maybe we should *not* disable the list before importing [07:11:27] it might take a minute or two to import a mailing list [07:11:40] maybe, we can not ban everyone? [07:11:52] as an option for example [07:12:39] !log add AAAA record for kafka-main200[3,4,5].codfw.wmnet [07:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:02] the disable script really doesn't do much then [07:13:17] I think we're better off just importing and accepting the race condition [07:13:42] in theory MM2 would deliver the mail? 
[07:14:52] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [07:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:01] (03CR) 10Hashar: [C: 03+1] Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [07:18:26] IRCcloud seems to have issues... [07:18:55] legoktm: I upgraded LGBT and disabled the old one [07:19:00] (03CR) 10Marostegui: [C: 03+2] mariadb: Reenable notifications on db1156 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:19:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:39] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) [07:20:31] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) 05Open→03Resolved a:03elukey Added the remaining AAAA records for kafka-main200[2-5]! [07:21:34] (03CR) 10Elukey: [C: 04-1] "Just added the AAAA records, all good :)" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [07:22:43] Amir189: awesome [07:22:56] Amir189: did you try emailing to it? [07:22:57] Got this An error occurred: Internal Server Error

[07:23:05] o.O where? [07:23:08] (03PS1) 10Marostegui: db1154.yaml: Clean up references [puppet] - 10https://gerrit.wikimedia.org/r/683231 [07:23:10] I'm updating the description [07:23:52] (03CR) 10Marostegui: [C: 03+2] db1154.yaml: Clean up references [puppet] - 10https://gerrit.wikimedia.org/r/683231 (owner: 10Marostegui) [07:23:53] haha, it just fails instead of telling me it's more than 1K character [07:23:56] beautiful [07:24:30] "Data too long for column 'info' at row 1" [07:24:31] wtf [07:24:49] maybe we can just increase its size? [07:24:55] (03CR) 10JMeybohm: rdf-streaming-updater: enable HA capability (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [07:25:02] (03PS2) 10Elukey: kafka-main: deploy kafka::main role to kafka-main[12]00[45] [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [07:25:04] (03PS1) 10Elukey: Add new kafka-main IPs to the kafka_brokers_main firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) [07:25:09] I don't want to make schema changes without coordinating with upstream [07:25:17] yeah [07:25:22] let me see if there's an option [07:26:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098 for schema change and kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15608 and previous config saved to /var/cache/conftool/dbconfig/20210428-072609-marostegui.json [07:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:24] (03CR) 10JMeybohm: rdf-streaming-updater: create helmfile.d structure (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [07:27:06] !log update php7.2 on appservers && rolling php7.2-fpm restarts [07:27:13] (eqiad) [07:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:26] legoktm: For now, I just used shorter url, etc. [07:29:57] TBH, I like this limit. Descriptions should be concise [07:30:15] so we are all set? [07:31:10] I should send an email to lgbt@ [07:31:48] Amir189: I think so! [07:33:18] (03CR) 10Volans: [C: 04-1] "one typo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [07:34:13] legoktm: Fun question, how can I verify the email went through? the only thing I can think of is exim4 logs [07:34:23]