[00:00:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH) So these are chugging along just fine, and didn't fall to the manual partition menu. I suspect my change didn't hit apt server, just install host... [00:01:57] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1014.eqiad.wmnet with reason: REIMAGE [00:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:21] (03CR) 10Dzahn: "for the record, I replaced this list with one generated on thumbor1002 instead of mw2300 but the fc-list content stayed the same" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: 10Dzahn) [00:03:38] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on snapshot1015.eqiad.wmnet with reason: REIMAGE [00:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1011.eqiad.wmnet', 'snapshot1012.eqiad.wmnet', 'snapshot1013.eqiad.wmnet', 'snapshot101... [00:11:35] (03PS1) 10Dzahn: ci::master/deployment_server: add new k8s namespace for shellbox [puppet] - 10https://gerrit.wikimedia.org/r/683111 (https://phabricator.wikimedia.org/T260330) [00:32:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH) [00:32:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH) 05Open→03Resolved @ArielGlenn These are all yours! 
[00:38:59] (03PS14) 10Mstyles: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [00:39:01] (03PS4) 10Mstyles: rdf-streaming-updater: enable HA capability [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) [00:40:39] (03CR) 10Mstyles: rdf-streaming-updater: enable HA capability (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [00:45:07] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [01:00:15] !log robh@cumin1001 START - Cookbook sre.dns.netbox [01:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:10] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:51] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10dpifke) Alternatively, can we get identical results just by incrementing `grace` by `keep`? (And possibly setting `keep` to 0... 
[01:40:57] (03CR) 10Razzi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [02:28:23] !log robh@cumin1001 START - Cookbook sre.dns.netbox [02:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:47] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:43] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:28:14] 10SRE, 10Wikimedia-Mailing-lists: hyperkitty didn't import all wikitech-l messages - https://phabricator.wikimedia.org/T281070 (10Legoktm) It seems like every email with an emoji in it from wikimedia-l was skipped: ERROR: https://lists.wikimedia.org/pipermail/wikimedia-l/2020-September/095629.html not in hype... [03:32:06] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1013.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [03:32:15] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2007.codfw.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [03:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:18] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [03:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:12] !log `sudo systemctl restart wdqs-blazegraph` on `wdqs1012` to clear the `WDQS SPARQL` warning [03:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:03] 10SRE, 10Wikimedia-Mailing-lists: hyperkitty didn't import all wikitech-l messages - https://phabricator.wikimedia.org/T281070 (10Legoktm) A bunch of older messages were also skipped, but those look like the archives are corrupt? e.g. 
https://lists.wikimedia.org/pipermail/wikimedia-l/2005-October/064281.html... [03:36:55] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:48:31] !log ryankemper@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1013.eqiad.wmnet [03:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:12] !log ryankemper@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2007.codfw.wmnet [03:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:03] (03CR) 10Ryan Kemper: elasticsearch: refactor various rolling operations (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [03:51:17] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE [03:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:16] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE [03:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:18] (03PS14) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [04:01:17] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [04:03:50] In 1h I will be switching enwiki db master [04:05:21] 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10RKemper) [04:07:08] 10SRE, 10ops-codfw, 
10Discovery: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215 (10RKemper) Made a ticket using the hardware failure template from the dc-ops group. In retrospect I probably should have just copied over the template to here but wasn't sure if the template does anythi... [04:07:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1163 with weight 0 before the switchover T278214', diff saved to https://phabricator.wikimedia.org/P15598 and previous config saved to /var/cache/conftool/dbconfig/20210428-040718-marostegui.json [04:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:28] T278214: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 [04:08:01] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [04:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:13] !log Start replication changes, connect everything to db1163 T278214 [04:08:19] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` [04:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:30] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [04:09:34] (03PS15) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [04:13:03] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [04:13:22] 10SRE, 10Platform Engineering, 10Services, 
10Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Legoktm) >>! In T278516#6949784, @Dzahn wrote: > Imho it should be part of offboarding workflows to check for lis... [04:13:32] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [04:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:32] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [04:14:36] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` [04:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:47] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [04:20:00] (03PS2) 10Marostegui: wmnet: Update s1-master to the right master [dns] - 10https://gerrit.wikimedia.org/r/682882 (https://phabricator.wikimedia.org/T278214) [04:20:09] (03PS4) 10Marostegui: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214) [04:22:58] 10SRE, 10Platform Engineering, 10Services, 10Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Legoktm) I proposed closing the list: https://lists.wikimedia.org/pipermail/services/2021-April/000195.html [04:24:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/682881 (https://phabricator.wikimedia.org/T278214) (owner: 10Marostegui) [04:28:07] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - 
https://phabricator.wikimedia.org/T281135 (10Marostegui) I will get db2096 ready for you. [04:31:08] (03PS16) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [04:32:34] (03PS1) 10Marostegui: db1167: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/683121 (https://phabricator.wikimedia.org/T258361) [04:33:15] (03CR) 10Marostegui: [C: 03+2] db1167: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/683121 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [04:33:44] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [04:34:00] In 30 minutes I will be switching enwiki db master [04:39:07] 10SRE, 10Wikimedia-Mailing-lists: Install mailman3 on lists1001.wikimedia.org - https://phabricator.wikimedia.org/T278610 (10Legoktm) [04:39:28] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10User-Ladsgroup: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Legoktm) 05Open→03Resolved a:03Ladsgroup [04:42:39] (03PS17) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [04:43:22] (03PS18) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [04:45:58] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [04:48:47] (03PS19) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 
(https://phabricator.wikimedia.org/T280221) [04:52:41] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [05:00:13] I am going to start enwiki switchover [05:00:21] !log Starting s1 eqiad failover from db1083 to db1163 - T278214 [05:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:30] T278214: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 [05:00:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s1 as read-only for maintenance T278214', diff saved to https://phabricator.wikimedia.org/P15599 and previous config saved to /var/cache/conftool/dbconfig/20210428-050041-marostegui.json [05:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:56] RO confirmed [05:01:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1163 to s1 master and remove read-only from s1 T278214', diff saved to https://phabricator.wikimedia.org/P15600 and previous config saved to /var/cache/conftool/dbconfig/20210428-050138-marostegui.json [05:01:45] RO removed [05:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:04] Test edit worked fine [05:02:11] I can edit fine again yes [05:02:13] same here [05:02:18] hey sobanski and jynus o/ [05:02:27] recentchanges is moving [05:02:34] tendril looks good [05:02:48] no errors on log [05:03:31] (03PS1) 10Legoktm: mailman3: Make sure mailman-web uses utf8mb4 as well [puppet] - 10https://gerrit.wikimedia.org/r/683123 [05:03:38] (but we should really look to minimize those on topology changes) [05:04:11] recentchanges keeps going fine and I can see edits happening on the master [05:04:32] (03CR) 10TsepoThoabala: [C: 03+1] Enable partial action blocks on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683089 
(https://phabricator.wikimedia.org/T280528) (owner: 10Tchanders) [05:04:38] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s1-master to the right master [dns] - 10https://gerrit.wikimedia.org/r/682882 (https://phabricator.wikimedia.org/T278214) (owner: 10Marostegui) [05:05:14] I think we are good sobanski and jynus. Thanks for the support :* [05:05:15] (03CR) 10TsepoThoabala: [C: 03+1] Enable partial action blocks on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683088 (https://phabricator.wikimedia.org/T280528) (owner: 10Tchanders) [05:05:48] last edit 5:00, first edit 05:01 [05:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1083 (old s1 master) for schema change', diff saved to https://phabricator.wikimedia.org/P15601 and previous config saved to /var/cache/conftool/dbconfig/20210428-050754-marostegui.json [05:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:54] (03CR) 10Legoktm: [C: 03+2] mailman3: Make sure mailman-web uses utf8mb4 as well [puppet] - 10https://gerrit.wikimedia.org/r/683123 (owner: 10Legoktm) [05:15:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P15602 and previous config saved to /var/cache/conftool/dbconfig/20210428-051526-marostegui.json [05:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15603 and previous config saved to /var/cache/conftool/dbconfig/20210428-052915-root.json [05:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:43] (03PS20) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [05:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db1112 (re)pooling @ 50%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15604 and previous config saved to /var/cache/conftool/dbconfig/20210428-054419-root.json [05:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:05] (03PS1) 10Marostegui: instances.yaml: Add db1167 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/683124 (https://phabricator.wikimedia.org/T258361) [05:50:08] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1167 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/683124 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [05:51:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1167 in s8 T258361', diff saved to https://phabricator.wikimedia.org/P15605 and previous config saved to /var/cache/conftool/dbconfig/20210428-055144-marostegui.json [05:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:54] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [05:52:23] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream, 10User-RhinosF1: Gravatar add link still shows in profile - https://phabricator.wikimedia.org/T278410 (10Ladsgroup) 05Open→03Resolved [05:52:37] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [05:56:27] (03PS1) 10Marostegui: install_server: Do not reimage db1156,db1167 [puppet] - 10https://gerrit.wikimedia.org/r/683125 [05:57:14] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1156,db1167 [puppet] - 10https://gerrit.wikimedia.org/r/683125 (owner: 10Marostegui) [05:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15606 and previous config saved to 
/var/cache/conftool/dbconfig/20210428-055922-root.json [05:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:02] !log Stop MySQL on db2096 (x1 codfw) T281135 [06:00:04] Amir1 and legoktm: #bothumor I � Unicode. All rise for Mailman3 install deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T0600). [06:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:10] T281135: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 [06:00:14] the time has come [06:00:18] all raise [06:00:22] jouncebot is trolling us with that unicode joke [06:01:17] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Marostegui) @Papaul db2096 is now off, so you can proceed as needed. [06:02:04] (03CR) 10Legoktm: [C: 03+2] mariadb: Allow lists1001.wikimedia.org to talk to m5 [puppet] - 10https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614) (owner: 10Legoktm) [06:03:59] ok, lists1001 can talk to m5-master now [06:04:28] marostegui: FYI we are deploying :D [06:04:32] good [06:04:33] I am here [06:06:11] (03PS1) 10Legoktm: lists: Enable Mailman3 on lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/683147 (https://phabricator.wikimedia.org/T278610) [06:07:08] (03CR) 10Ladsgroup: [C: 03+1] lists: Enable Mailman3 on lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/683147 (https://phabricator.wikimedia.org/T278610) (owner: 10Legoktm) [06:07:15] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29241/console" [puppet] - 10https://gerrit.wikimedia.org/r/683147 (https://phabricator.wikimedia.org/T278610) (owner: 10Legoktm) [06:07:51] That diff is wow https://puppet-compiler.wmflabs.org/compiler1002/29241/lists1001.wikimedia.org/index.html [06:09:37] I'm going to change the default charset on 
mailman3/mailman3web to utf8mb4 first [06:09:56] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10elukey) @Ottomata @razzi this task needs some follow up :) [06:10:19] done [06:10:34] Amir1: lgtm, all set? [06:11:08] double checking. Are the passwords in private repo? [06:11:44] (03PS1) 10Legoktm: lists: Add mailman3-roots group to lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/683148 [06:12:16] yes [06:12:36] db_password, web::db_password, api_password, web::secret, archiver_key [06:12:55] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [06:13:08] coool [06:13:16] I'm double checking the charset in the config https://github.com/wikimedia/puppet/blob/production/modules/mailman3/templates/mailman.cfg.erb#L172 [06:13:24] (03CR) 10Legoktm: [V: 03+1 C: 03+2] lists: Enable Mailman3 on lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/683147 (https://phabricator.wikimedia.org/T278610) (owner: 10Legoktm) [06:13:31] https://github.com/wikimedia/puppet/blob/production/modules/mailman3/templates/mailman-web.py.erb#L77 [06:13:35] yup there [06:13:40] :D [06:13:48] (03CR) 10Ladsgroup: [C: 03+1] lists: Add mailman3-roots group to lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/683148 (owner: 10Legoktm) [06:14:10] running puppet [06:14:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15607 and previous config saved to /var/cache/conftool/dbconfig/20210428-061426-root.json [06:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:54] Error: Could not set 'file' on ensure: No such file or 
directory - A directory component in /etc/mailman3/mailman.cfg20210428-21314-aanovg.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/mailman3/manifests/listserve.pp, line: 34) [06:14:54] Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/mailman3/mailman.cfg20210428-21314-aanovg.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/mailman3/manifests/listserve.pp, line: 34) [06:15:17] * Amir1 glues his eyes to https://grafana.wikimedia.org/d/nULM0E1Wk/mailman?orgId=1&from=now-3h&to=now [06:15:25] I stopped puppet [06:15:33] /etc/mailman3/ doesn't exist [06:16:09] I guess normally the package creates it? [06:16:26] (03CR) 10Aaron Schulz: "The master fallback log events were change to WARNING." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [06:16:37] YES [06:16:54] hmm, it's a chicken and egg [06:17:00] ok, it'll get fixed on the second puppet run then [06:17:34] resuming puppet [06:17:39] for the first time, the file needs the package, the subsequent runs, the package needs directory [06:18:12] python3-django-hyperkitty : Depends: python3-django (>= 2:2.2) but 1:1.11.29-1~deb10u1 is to be installed [06:18:12] Depends: python3-django-mailman3 (>= 1.3.3) but it is not going to be installed [06:18:18] * legoktm fixes [06:19:14] we can have puppet create /etc/mailman3 [06:19:18] (later) [06:19:29] yeah [06:20:15] (03CR) 10Elukey: [C: 04-1] "Precautionary -1 just to discuss some things!" 
[puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [06:20:21] marostegui: I think it's creating the tables now, fyi [06:20:40] https://lists.wikimedia.org/postorius/ [06:20:50] ok [06:20:51] https://lists.wikimedia.org/hyperkitty/ [06:21:19] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [06:21:20] example.com sigh [06:21:24] yeah fixing [06:21:56] legoktm: I can see the tables on mailman3 database [06:22:58] same [06:23:00] and mailman3web [06:23:03] they look correct this time [06:23:14] Amir1: fixed the default site [06:23:20] !log legoktm@lists1001:~$ sudo mailman-web set_default_site --name lists.wikimedia.org --domain lists.wikimedia.org [06:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:58] Awesome [06:24:06] now superusers? [06:24:14] (03CR) 10Elukey: [C: 04-1] "Another thing - kafka-main200[4,5] don't have their AAAA records in the DNS, still due to https://phabricator.wikimedia.org/T271136" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [06:24:16] let me send test emails [06:24:21] https://lists.wikimedia.org/user-profile/ is a 404 [06:24:44] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) @crusnov we are good to deploy the other AAAA records, can we proceed? 
[06:25:15] the web for mailman2 is working fine [06:26:18] hyperkitty is 404ing some times too [06:26:27] I assume that's edge cache [06:26:36] add ?urlfoo [06:26:41] lists isn't behind varnish [06:26:56] !log created mailman3 superusers for Administrator (noc@), Ladsgroup and Legoktm [06:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:22] that's interesting [06:27:24] wtf I'm getting 404s too [06:27:37] I'm logged in so I don't think it should be cached [06:27:51] maybe apache is weird [06:27:57] shall we restart apache again? [06:28:13] the 404 is at apache level [06:28:25] restarted apache [06:28:31] Majavah: logged into MM3? [06:28:34] yes [06:28:41] Amir1: btw I set a random pw for you and didn't save it so you'll need to reset it [06:28:45] we haven't made anything yet ^^ [06:28:54] legoktm: cool. Noted [06:29:18] it seems restarting apache fixed it [06:29:44] ..... [06:29:50] user_id = 0 is Majavah [06:30:03] I still see a 404 on https://lists.wikimedia.org/user-profile/ [06:30:21] lol [06:30:36] yes, I had that url ready and everything to get a low user id :D [06:30:45] sigh [06:30:46] please hold on [06:30:48] you got the lowest [06:31:12] anyway, the only thing that 404s is user-account? [06:31:29] (03CR) 10Legoktm: [C: 03+2] lists: Add mailman3-roots group to lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/683148 (owner: 10Legoktm) [06:32:33] (03CR) 10Volans: "As requested did a pass. In general looks good to me, nothing wrong with the class API. 
I left some nit/question inline, none is a blocker" (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [06:32:52] I make a patch for user-profile [06:32:55] (03PS1) 10Legoktm: lists: Route /user-profile/ to Mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/683149 [06:33:04] Amir1: ^ [06:33:07] or +1 lego's [06:33:20] (03CR) 10Ladsgroup: [C: 03+1] lists: Route /user-profile/ to Mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/683149 (owner: 10Legoktm) [06:33:21] so is hyperkitty 404ing for anyone else still? [06:33:41] not to me [06:34:05] I'm going to guess some apache worker didn't have mod_proxy_uwsgi enabled or something and it needed a clean restart [06:34:25] apache works in mysterious ways [06:34:42] Amir1: btw you should have root now [06:34:54] YES [06:34:55] Thanks [06:35:00] ./hyperkitty/ works but not /hyperkitty - both do on lists-next [06:35:01] yup [06:35:23] do I need to escape the - in user-profile? 
[06:35:38] let me check [06:35:51] ty, I'll test on polymorphic in the meantime [06:36:09] oh [06:36:28] my regex compiler says it's not needed [06:36:47] it even errors when i try to escape it [06:37:03] (03PS1) 10Legoktm: lists: Proxy /hyperkitty and co (no trailing /) to Mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/683150 [06:37:06] RhinosF1: ^ [06:37:47] (03CR) 10RhinosF1: [C: 03+1] lists: Proxy /hyperkitty and co (no trailing /) to Mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/683150 (owner: 10Legoktm) [06:37:55] Ty [06:38:24] Technically you can just remove it, I don't know any mailman2 endpoint being like that but better safe than sorry I assume [06:39:06] (03PS1) 10ArielGlenn: bring up snapshot1001,12,13 as dumps testbed hosts [puppet] - 10https://gerrit.wikimedia.org/r/683151 (https://phabricator.wikimedia.org/T281330) [06:39:29] https://polymorphic.lists.wmcloud.org/user-profile works now [06:39:37] (03CR) 10Legoktm: [C: 03+2] lists: Proxy /hyperkitty and co (no trailing /) to Mailman3 too [puppet] - 10https://gerrit.wikimedia.org/r/683150 (owner: 10Legoktm) [06:39:41] (03CR) 10Legoktm: [C: 03+2] lists: Route /user-profile/ to Mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/683149 (owner: 10Legoktm) [06:39:43] (03CR) 10Elukey: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [06:40:26] btw, exim4 on mailman2 works AFAICS [06:40:38] I got your test mail [06:40:42] yup [06:41:21] Shall I migrate test to the new system now? [06:41:33] go for it [06:41:39] cool [06:41:45] I'm going to step away for a minute to grab some snacks [06:42:11] (03PS3) 10Legoktm: lists: Backup /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681763 [06:42:21] go enjoy! 
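Editor's sketch on the escaping question above: the log does not show the actual Apache rule, so the pattern below is hypothetical and uses Python's `re` module rather than Apache's regex engine. It only illustrates two points from the discussion: a `-` outside a character class is a plain literal, and an optional trailing segment lets both `/hyperkitty` and `/hyperkitty/` match.

```python
import re

# Hypothetical route pattern, loosely modelled on the paths discussed above.
# This is NOT the real Apache configuration from the puppet patches.
ROUTE = re.compile(r"^/(hyperkitty|postorius|user-profile)(/.*)?$")


def routes_to_mailman3(path: str) -> bool:
    """Return True if the path is matched by the sketch pattern."""
    return ROUTE.match(path) is not None


# Outside a character class, "-" is an ordinary literal and needs no escape.
print(routes_to_mailman3("/user-profile/"))
# The optional "(/.*)?" group covers the path with and without a trailing slash.
print(routes_to_mailman3("/hyperkitty"))
print(routes_to_mailman3("/hyperkitty/"))
# Paths that merely share a prefix are not matched.
print(routes_to_mailman3("/hyperkittyfoo"))
```

Whether the production rule should anchor the pattern this strictly is a judgment call; the sketch errs on the side of not capturing unrelated paths.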
[06:47:22] legoktm: ModuleNotFoundError: No module named 'mailman_hyperkitty' when creating a list [06:47:48] we need to install the mailman3-hyperkitty plugin [06:47:50] rip [06:47:53] Give me a minute [06:48:10] all good. Take your time [06:49:15] looking now [06:51:59] (03PS1) 10Legoktm: mailman3: Explicitly install python3-mailman-hyperkitty [puppet] - 10https://gerrit.wikimedia.org/r/683226 [06:52:03] Amir1: ^ [06:52:33] legoktm: this is good to go but I think there is one more thing to do as well [06:52:40] let me grab it from puppetmaster [06:52:43] wip commits [06:53:33] I might've accidentally cleaned that up, confusing it with the django-hyperkitty package [06:54:06] yeah, it's gone [06:54:12] we can fix it [06:54:16] let's go with this [06:54:17] (03CR) 10Legoktm: [C: 03+2] mailman3: Explicitly install python3-mailman-hyperkitty [puppet] - 10https://gerrit.wikimedia.org/r/683226 (owner: 10Legoktm) [06:54:22] my bad [06:54:39] all good. All of these packages are REALLY confusing [06:54:53] like two different hyperkitty [06:54:57] hyperkitten? [06:55:06] xD [06:55:42] ok, installed and restarted mailman3 [06:55:45] try again now? [06:56:45] yup works like a charm [06:56:49] now upgrading test [06:57:25] :D [06:58:06] https://lists.wikimedia.org/postorius/lists/test.lists.wikimedia.org/members/member/ [06:59:02] upgraded, archives and indexes [06:59:23] https://lists.wikimedia.org/hyperkitty/list/test@lists.wikimedia.org/thread/K5TJFAEQ5IZMCQA4X2A3WN7VHZFQOOPE/ [06:59:50] logged out it says "This mailing list is private. You must be subscribed to view the archives. " SWEET [07:00:29] search is also working (and not working logged out) [07:00:52] legoktm: okay, time to migrate lgbt? [07:01:49] fun question. How can I disable a mailing list? 
[07:01:57] there's a script but one sec [07:02:03] I sent a message to test@, waiting for it to show [07:02:06] sure [07:02:21] I got the mail [07:02:32] $ sudo disable_list [07:02:32] Usage: /usr/local/sbin/disable_list [-e|--enable] [07:02:50] I didn't [07:03:00] oh ffs [07:03:05] I sent it to you [07:03:40] !log Deploy schema change on db2089:3316 and db1098:3316 T266486 T268392 T273360 [07:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:52] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 [07:03:52] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 [07:03:52] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 [07:03:58] %(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s [07:04:22] !log add AAAA record for kafka-main2002.codfw.wmnet [07:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:34] Amir1: do you see that in the mail footer? 
[07:04:44] Yup [07:05:18] legoktm: lol [07:05:18] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [07:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:51] ugh, I guess we need to fiddle with the templates thing [07:05:59] Amir1: but I think importing lgbt@ is fine [07:06:09] okay, going for it [07:06:21] oh also it's sudo disable_list [07:06:30] and I already said that >.> [07:07:27] lol [07:07:47] now I'm looking if there's a command to enable it back in mailman3 [07:07:53] but that's for later [07:08:01] for now I do it in the ui [07:08:10] disable_list is a custom script we wrote [07:08:21] it just enables emergency_moderation and bans everyone from sending to the list [07:09:20] that's scary [07:09:47] what do you expect, it's mm2 [07:09:57] all users now imported but as non-member [07:10:01] because they are banned [07:10:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:10:12] uhm [07:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:23] so maybe we should *not* disable the list before importing [07:11:27] it might take a minute or two to import a mailing list [07:11:40] maybe, we can not ban everyone? [07:11:52] as an option for example [07:12:39] !log add AAAA record for kafka-main200[3,4,5].codfw.wmnet [07:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:02] the disable script really doesn't do much then [07:13:17] I think we're better off just importing and accepting the race condition [07:13:42] in theory MM2 would deliver the mail? 
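Note on the footer discussed above: the literal text `%(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s` showing up in a delivered mail looks like a Python %-style mapping template that was never expanded after the import. A minimal sketch of how such a template expands, assuming made-up values (the URL and list name below are hypothetical, `expand_footer` is an illustrative helper, and none of this is Mailman's own code):

```python
# Illustration only: the placeholder string is the one quoted in the log;
# the values are hypothetical and not taken from any real list.
FOOTER_TEMPLATE = "%(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s"


def expand_footer(template: str, values: dict) -> str:
    """Expand %(name)s placeholders with Python's %-style mapping formatting."""
    return template % values


example_values = {
    "web_page_url": "https://lists.example.org/mailman/",  # hypothetical base URL
    "cgiext": "",                                           # hypothetical CGI suffix
    "_internal_name": "test",                               # hypothetical list name
}

print(expand_footer(FOOTER_TEMPLATE, example_values))
# -> https://lists.example.org/mailman/listinfo/test
```

If the new system does not apply this kind of expansion to imported footers, the placeholders would be delivered as-is, which would match what was seen in the test mail.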
[07:14:52] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [07:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:01] (03CR) 10Hashar: [C: 03+1] Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [07:18:26] IRCcloud seems to have issues... [07:18:55] legoktm: I upgraded LGBT and disabled the old one [07:19:00] (03CR) 10Marostegui: [C: 03+2] mariadb: Reenable notifications on db1156 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:19:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:39] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) [07:20:31] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) 05Open→03Resolved a:03elukey Added the remaining AAAA records for kafka-main200[2-5]! [07:21:34] (03CR) 10Elukey: [C: 04-1] "Just added the AAAA records, all good :)" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [07:22:43] Amir189: awesome [07:22:56] Amir189: did you try emailing to it? [07:22:57] Got this An error occurred: Internal Server Error

[07:23:05] o.O where? [07:23:08] (03PS1) 10Marostegui: db1154.yaml: Clean up references [puppet] - 10https://gerrit.wikimedia.org/r/683231 [07:23:10] I'm updating the description [07:23:52] (03CR) 10Marostegui: [C: 03+2] db1154.yaml: Clean up references [puppet] - 10https://gerrit.wikimedia.org/r/683231 (owner: 10Marostegui) [07:23:53] haha, it just fails instead of telling me it's more than 1K character [07:23:56] beautiful [07:24:30] "Data too long for column 'info' at row 1" [07:24:31] wtf [07:24:49] maybe we can just increase its size? [07:24:55] (03CR) 10JMeybohm: rdf-streaming-updater: enable HA capability (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [07:25:02] (03PS2) 10Elukey: kafka-main: deploy kafka::main role to kafka-main[12]00[45] [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [07:25:04] (03PS1) 10Elukey: Add new kafka-main IPs to the kafka_brokers_main firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) [07:25:09] I don't want to make schema changes without coordinating with upstream [07:25:17] yeah [07:25:22] let me see if there's an option [07:26:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098 for schema change and kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15608 and previous config saved to /var/cache/conftool/dbconfig/20210428-072609-marostegui.json [07:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:24] (03CR) 10JMeybohm: rdf-streaming-updater: create helmfile.d structure (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [07:27:06] !log update php7.2 on appservers && rolling php7.2-fpm restarts [07:27:13] (eqiad) [07:27:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:26] legoktm: For now, I just used shorter url, etc. [07:29:57] TBH, I like this limit. Descriptions should be concise [07:30:15] so we are all set? [07:31:10] I should send an email to lgbt@ [07:31:48] Amir189: I think so! [07:33:18] (03CR) 10Volans: [C: 04-1] "one typo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [07:34:13] legoktm: Fun question, how can I verify the email went through? the only thing I can think of is exim4 logs [07:34:23] it doesn't have archive [07:34:26] don't you receive copies of your own mail? [07:34:33] no [07:34:38] I can enable it [07:35:07] (03CR) 10Elukey: Add new kafka-main IPs to the kafka_brokers_main firewall rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [07:36:30] (03PS2) 10Elukey: Add new kafka-main IPs to the kafka_brokers_main firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) [07:36:32] (03PS3) 10Elukey: kafka-main: deploy kafka::main role to kafka-main[12]00[45] [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [07:36:32] oh it's enabled already [07:36:38] probably in the new account [07:38:14] (03CR) 10JMeybohm: [C: 03+2] configcluster: No longer include zookeeper in old configcluster role [puppet] - 10https://gerrit.wikimedia.org/r/682669 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:38:21] sent [07:38:24] let's see [07:40:24] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [07:40:27] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:40:47] !log Deploy 
schema change on db1098:3316 and db1098:3316 T266486 T268392 T273360 [07:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:58] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 [07:40:58] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 [07:40:58] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 [07:41:12] finally [07:41:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: Repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15609 and previous config saved to /var/cache/conftool/dbconfig/20210428-074114-root.json [07:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:36] I didn't receive my own email (maybe I have a filter for it?) but smtp says that it sent the mail [07:45:21] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:45:49] (03CR) 10Muehlenhoff: [C: 03+2] Switch to failoid2002 [puppet] - 10https://gerrit.wikimedia.org/r/682107 (owner: 10Muehlenhoff) [07:46:22] check with another list subscriber? [07:46:27] I do think gmail deduplicates mails or something [07:48:00] legoktm: I just checked, it's there [07:48:11] sending the announcement now [07:49:03] we need to adjust the part for "disabled mailing list" [07:50:08] Amir1: can we just remove that line? [07:50:19] done [07:50:44] sent [07:51:02] okay, shall I create new mailing lists? 
[07:51:09] I shall pass [07:51:41] Amir1: I'm tired too :p let's do that tomorrow [07:51:59] Sure [07:52:04] (03CR) 10Jon Harald Søby: "recheck" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 (owner: 10Jon Harald Søby) [07:52:15] Do that when you wake up [07:52:25] I go play starwars [07:52:42] o/ enjoy [07:52:47] !log update php7.2 on api servers && rolling php7.2-fpm restarts [07:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:46] (03CR) 10JMeybohm: [C: 03+1] "IPs and names match with what's in netbox" [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [07:54:59] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Ladsgroup) You can now access mailman3: https://lists.wikimedia.org/postorius/lists/ LGBT is now upgraded. More mailing lists will follow soon. More info: https://lists.... [07:55:29] (03CR) 10David Caro: [C: 03+2] ceph.eqiad: enable octopus repositories [puppet] - 10https://gerrit.wikimedia.org/r/681296 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [07:56:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: Repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15610 and previous config saved to /var/cache/conftool/dbconfig/20210428-075618-root.json [07:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging, as running puppet on rdb20[09|10] without this patch will break changeprop. 
We can revert if needed" [puppet] - 10https://gerrit.wikimedia.org/r/682975 (owner: 10Alexandros Kosiaris) [07:57:38] (03CR) 10Elukey: [C: 03+2] Add new kafka-main IPs to the kafka_brokers_main firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [07:58:47] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Legoktm) [07:58:50] 10SRE, 10Wikimedia-Mailing-lists: Install mailman3 on lists1001.wikimedia.org - https://phabricator.wikimedia.org/T278610 (10Legoktm) 05Open→03Resolved a:03Legoktm [07:59:11] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [07:59:12] elukey: dcaro: should I merge your changes as well ? [07:59:19] akosiaris: +1 [08:02:16] afk for a bit [08:04:53] (03PS1) 10Muehlenhoff: ldap replicas: Only include the OpenLDAP exporter up to buster, not ported to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/683234 (https://phabricator.wikimedia.org/T266147) [08:05:21] (03CR) 10jerkins-bot: [V: 04-1] ldap replicas: Only include the OpenLDAP exporter up to buster, not ported to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/683234 (https://phabricator.wikimedia.org/T266147) (owner: 10Muehlenhoff) [08:08:42] (03CR) 10Elukey: [C: 04-1] "Keith: the firewall changes should be rolled out, the last bit is about the /srv partition, but not a big priority." 
[puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [08:11:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: Repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15611 and previous config saved to /var/cache/conftool/dbconfig/20210428-081121-root.json [08:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:11] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [08:14:34] (03PS2) 10Muehlenhoff: ldap replicas: Only include the OpenLDAP exporter up to buster [puppet] - 10https://gerrit.wikimedia.org/r/683234 (https://phabricator.wikimedia.org/T266147) [08:16:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:16:06] I'm looking into the uploads alerts [08:21:05] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:19] (03PS1) 10Muehlenhoff: python-poolcounter: Build for bullseye [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/683235 [08:23:33] (03PS1) 10Filippo Giunchedi: swift: bump thresholds for swift originals uploads [puppet] - 10https://gerrit.wikimedia.org/r/683236 [08:25:12] !log update php7.2 on jobrunners and parsoid servers && rolling php7.2-fpm restarts [08:25:32] !log update php7.2 on jobrunners and parsoid servers && rolling php7.2-fpm restarts [08:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:05] dcaro: FYI CI is failing on puppet with errors related to the ceph repos https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/24642/console [08:26:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: Repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15612 and previous config saved to /var/cache/conftool/dbconfig/20210428-082625-root.json [08:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:04] godog: looking [08:28:02] (03CR) 10jerkins-bot: [V: 04-1] swift: bump thresholds for swift originals uploads [puppet] - 10https://gerrit.wikimedia.org/r/683236 (owner: 10Filippo Giunchedi) [08:28:13] (03CR) 10jerkins-bot: [V: 04-1] python-poolcounter: Build for bullseye [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/683235 (owner: 10Muehlenhoff) [08:28:15] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: ERROR ferm input drop default policy not 
set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:28:16] (03CR) 10Filippo Giunchedi: "CI failures are unrelated to this change:" [puppet] - 10https://gerrit.wikimedia.org/r/683236 (owner: 10Filippo Giunchedi) [08:28:23] (03CR) 10Muehlenhoff: [C: 03+2] ldap replicas: Only include the OpenLDAP exporter up to buster [puppet] - 10https://gerrit.wikimedia.org/r/683234 (https://phabricator.wikimedia.org/T266147) (owner: 10Muehlenhoff) [08:29:12] godog: that's me yes, I created a task to follow up, I'm in the middle of an upgrade: T281335, the issue is that the new value in hiera changed [08:29:12] T281335: puppet.ceph: current repo tests are failing - https://phabricator.wikimedia.org/T281335 [08:29:22] feel free to ignore [08:29:48] dcaro: for sure -- thank you for taking a look immediately ! [08:30:19] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10fgiunchedi) [08:32:55] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] swift: bump thresholds for swift originals uploads [puppet] - 10https://gerrit.wikimedia.org/r/683236 (owner: 10Filippo Giunchedi) [08:33:18] (03PS1) 10Alexandros Kosiaris: check_raid: Make python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/683237 [08:34:09] (03CR) 10jerkins-bot: [V: 04-1] check_raid: Make python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/683237 (owner: 10Alexandros Kosiaris) [08:34:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P15613 and previous config saved to /var/cache/conftool/dbconfig/20210428-083458-root.json [08:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 25%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P15614 and previous config 
saved to /var/cache/conftool/dbconfig/20210428-083552-root.json [08:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15615 and previous config saved to /var/cache/conftool/dbconfig/20210428-083625-marostegui.json [08:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:06] jouncebot: now [08:39:06] No deployments scheduled for the next 2 hour(s) and 20 minute(s) [08:40:01] (03CR) 10Urbanecm: [C: 03+2] Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 (owner: 10Jon Harald Søby) [08:40:27] (03Merged) 10jenkins-bot: Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 (owner: 10Jon Harald Søby) [08:41:08] !log urbanecm@deploy1002 sync-file aborted: 96ad0d4ad294c442b4936a63ae1cd9de9c098aa9: Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders (duration: 00m 02s) [08:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:46] (03CR) 10Filippo Giunchedi: "LGTM overall (see inline), make sure to remove entries from site.yml as well" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682992 (https://phabricator.wikimedia.org/T279601) (owner: 10Herron) [08:42:15] (03CR) 10Filippo Giunchedi: "Ditto, same comments as Ibfaa9f19f405" [puppet] - 10https://gerrit.wikimedia.org/r/682999 (https://phabricator.wikimedia.org/T279602) (owner: 10Herron) [08:42:24] !log urbanecm@deploy1002 Synchronized wmf-config/InterwikiSortOrders.php: 96ad0d4ad294c442b4936a63ae1cd9de9c098aa9: Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders (duration: 01m 08s) [08:42:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:38] (03CR) 10Muehlenhoff: "There's already a patch out there: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670972/" [puppet] - 10https://gerrit.wikimedia.org/r/683237 (owner: 10Alexandros Kosiaris) [08:43:44] (03CR) 10Ayounsi: [C: 04-1] "v6 missing." [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [08:44:48] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: migrate logstash2001 broker to kafka-logging2001 [puppet] - 10https://gerrit.wikimedia.org/r/683012 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [08:44:52] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: migrate logstash2002 broker to kafka-logging2002 [puppet] - 10https://gerrit.wikimedia.org/r/683013 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [08:44:57] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: migrate logstash2003 broker to kafka-logging2003 [puppet] - 10https://gerrit.wikimedia.org/r/683014 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [08:47:03] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 50%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P15616 and previous config saved to /var/cache/conftool/dbconfig/20210428-085056-root.json [08:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] check-raid.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:51:47] (03Abandoned) 10Alexandros Kosiaris: check_raid: Make python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/683237 (owner: 10Alexandros Kosiaris) 
[08:54:32] (03PS2) 10DCausse: rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [08:55:08] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [08:56:12] CI is going to be restarted entirely due to hardware reboot [08:57:13] (03PS3) 10DCausse: rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [08:58:35] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:59:12] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host contint2001.wikimedia.org [08:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 75%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P15617 and previous config saved to /var/cache/conftool/dbconfig/20210428-090559-root.json [09:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint2001.wikimedia.org [09:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] thumbor: add a timer that writes the output of fc-list to /srv (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [09:08:21] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [09:09:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:09:29] !log restarting jenkins* on releases to pick up Java security updates [09:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:30] (03PS25) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [09:10:37] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [09:10:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:11:00] (03PS26) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [09:11:12] CI is fully back [09:12:13] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host contint1001.wikimedia.org [09:12:14] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [09:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint1001.wikimedia.org [09:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P15618 and previous config saved to /var/cache/conftool/dbconfig/20210428-092103-root.json [09:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:58] (03PS27) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [09:28:39] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [09:29:15] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [09:31:01] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [09:32:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Assuming my reading of kube-apiserver --help is correct, this will also enable (aside from the old controllers)" [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [09:32:46] (03PS1) 10JMeybohm: Move codfw etcd clients to new cluster [dns] - 10https://gerrit.wikimedia.org/r/683244 (https://phabricator.wikimedia.org/T271573) [09:33:47] (03PS1) 10David Caro: ceph: fix tests to use the new values. [puppet] - 10https://gerrit.wikimedia.org/r/683245 (https://phabricator.wikimedia.org/T281335) [09:36:28] (03CR) 10David Caro: [C: 03+2] ceph: fix tests to use the new values. [puppet] - 10https://gerrit.wikimedia.org/r/683245 (https://phabricator.wikimedia.org/T281335) (owner: 10David Caro) [09:37:03] (03CR) 10Alexandros Kosiaris: "I guess these should also make it to the deployment-charts for the egress rules." [puppet] - 10https://gerrit.wikimedia.org/r/683232 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [09:37:24] 10SRE, 10Patch-For-Review: Updated java security policy in OpenJDK 11.9 - https://phabricator.wikimedia.org/T266782 (10MoritzMuehlenhoff) New version from 11.0.11: ` --- /etc/java-11-openjdk/security/java.security 2020-08-19 15:34:58.966241818 +0000 +++ /etc/java-11-openjdk/security/java.security.dpkg-new 20... [09:37:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall we're getting there, with a few rough edges." 
(037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [09:38:04] godog: tests fixed, thanks a lot for the ping, let me know if you see any other issues :) [09:39:31] dcaro: for sure -- thanks for the fix [09:39:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] ci::master/deployment_server: add new k8s namespace for shellbox [puppet] - 10https://gerrit.wikimedia.org/r/683111 (https://phabricator.wikimedia.org/T260330) (owner: 10Dzahn) [09:40:42] !log restarting Tomcat on idp-test1001 to pick up Java security updates [09:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:36] (03Abandoned) 10Phuedx: Rename RelatedArticlesFooterWhitelistedSkins to RelatedArticlesFooterAllowedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681598 (https://phabricator.wikimedia.org/T277958) (owner: 10Phuedx) [09:46:04] (03PS1) 10JMeybohm: common: Remove old zookeer hosts [puppet] - 10https://gerrit.wikimedia.org/r/683246 (https://phabricator.wikimedia.org/T271573) [09:55:29] (03PS2) 10Hnowlan: api-gateway: Create individual cluster definitions for read and write [deployment-charts] - 10https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) [09:55:34] (03PS1) 10JMeybohm: Switch conf200[1-3] to conf200[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/683248 (https://phabricator.wikimedia.org/T271573) [09:57:26] (03CR) 10Elukey: [C: 03+1] common: Remove old zookeer hosts [puppet] - 10https://gerrit.wikimedia.org/r/683246 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [09:57:34] (03PS1) 10Arturo Borrero Gonzalez: cloudgw-dev: enable prometheus ports [puppet] - 10https://gerrit.wikimedia.org/r/683249 [09:58:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw-dev: enable prometheus ports [puppet] - 10https://gerrit.wikimedia.org/r/683249 (owner: 10Arturo Borrero Gonzalez) [09:58:14] (03PS2) 10JMeybohm: common: Remove old zookeeper hosts [puppet] - 
10https://gerrit.wikimedia.org/r/683246 (https://phabricator.wikimedia.org/T271573) [09:58:31] (03CR) 10Alexandros Kosiaris: "This has been tested on mw1338 btw" [puppet] - 10https://gerrit.wikimedia.org/r/682619 (https://phabricator.wikimedia.org/T279100) (owner: 10Alexandros Kosiaris) [10:00:41] (03CR) 10Alexandros Kosiaris: [C: 04-1] modules::conftool add safe-service-restart scap option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) (owner: 10Effie Mouzeli) [10:00:45] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) >>! In T265138#6826732, @jbond wrote: >>>! In T265138#6817957, @jbond wrote: >> @Ladsgroup This could well be to do with how puppetlabs de... [10:00:50] 10SRE, 10Platform Engineering, 10Services, 10Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Mvolz) I've replied on the list, but we use this in citoid as a contact point for crossref to report traffic issu... 
[10:01:17] (CR) Giuseppe Lavagetto: [C: -1] Move codfw etcd clients to new cluster (1 comment) [dns] - https://gerrit.wikimedia.org/r/683244 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:02:39] (CR) JMeybohm: [C: +2] common: Remove old zookeeper hosts [puppet] - https://gerrit.wikimedia.org/r/683246 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:03:10] (CR) Giuseppe Lavagetto: [C: -1] Switch conf200[1-3] to conf200[4-6] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/683248 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:04:14] (CR) Hnowlan: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) (owner: Hnowlan) [10:05:12] (CR) Hnowlan: "> Patch Set 2:" [deployment-charts] - https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) (owner: Hnowlan) [10:06:24] (PS2) JMeybohm: Move codfw etcd clients to new cluster [dns] - https://gerrit.wikimedia.org/r/683244 (https://phabricator.wikimedia.org/T271573) [10:09:16] (CR) JMeybohm: Move codfw etcd clients to new cluster (1 comment) [dns] - https://gerrit.wikimedia.org/r/683244 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:09:55] (PS2) JMeybohm: Switch conf200[1-3] to conf200[4-6] [puppet] - https://gerrit.wikimedia.org/r/683248 (https://phabricator.wikimedia.org/T271573) [10:10:46] (CR) JMeybohm: Switch conf200[1-3] to conf200[4-6] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/683248 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:11:50] (CR) Giuseppe Lavagetto: [C: +1] Move codfw etcd clients to new cluster (1 comment) [dns] - https://gerrit.wikimedia.org/r/683244 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:12:14] (CR) Giuseppe Lavagetto: [C: +1] Switch conf200[1-3] to conf200[4-6] [puppet] - 
https://gerrit.wikimedia.org/r/683248 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:14:47] PROBLEM - SSH on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:15:37] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:15:51] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:16:15] PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 414 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:16:37] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service,wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:25] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:19:28] SRE, CAS-SSO: Tomcat/CAS fails to start with OpenJDK 11.0.11 - https://phabricator.wikimedia.org/T281345 (MoritzMuehlenhoff) [10:22:00] SRE, ops-codfw, DC-Ops, observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (fgiunchedi) Thank you @papaul, today I poked a little at librenms chatsworth support and it looks like the current support is not complete (for sure not as complete as sentry3/sentry... 
[10:22:04] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:22:21] (PS2) Muehlenhoff: python-poolcounter: Build for bullseye [software/python-poolcounter] - https://gerrit.wikimedia.org/r/683235 [10:24:58] (CR) jerkins-bot: [V: -1] python-poolcounter: Build for bullseye [software/python-poolcounter] - https://gerrit.wikimedia.org/r/683235 (owner: Muehlenhoff) [10:27:50] (PS2) Jbond: x509-bundle.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [10:35:46] SRE, Jenkins: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (MoritzMuehlenhoff) [10:36:19] (CR) Muehlenhoff: "The CI failure can be ignored, this appears to be a bug in our setup, opened https://phabricator.wikimedia.org/T281347 for it." 
[software/python-poolcounter] - https://gerrit.wikimedia.org/r/683235 (owner: Muehlenhoff) [10:38:46] (CR) Jbond: [C: +2] x509-bundle.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [10:42:15] (CR) Effie Mouzeli: [C: -1] "from my tests it appears that verification is completely skipped" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682619 (https://phabricator.wikimedia.org/T279100) (owner: Alexandros Kosiaris) [10:43:37] (CR) Jbond: [C: +2] check-raid.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [10:44:20] !log updated the check-raid nrpe script to python3 [10:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:38] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02531 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:46:50] looking [10:48:29] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29244/console" [puppet] - https://gerrit.wikimedia.org/r/682971 (owner: Giuseppe Lavagetto) [10:50:24] (CR) Effie Mouzeli: modules::conftool add safe-service-restart scap option (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) (owner: Effie Mouzeli) [10:50:32] (PS1) Jbond: Revert "x509-bundle.py: Port to Python 3" [puppet] - https://gerrit.wikimedia.org/r/683132 [10:53:21] (CR) Jbond: [C: +2] Revert "x509-bundle.py: Port to Python 3" [puppet] - https://gerrit.wikimedia.org/r/683132 (owner: Jbond) [10:55:31] (PS1) Jbond: x509-bundle.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/683133 (https://phabricator.wikimedia.org/T247364) [10:56:15] (PS2) 
Jbond: x509-bundle.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/683133 (https://phabricator.wikimedia.org/T247364) [10:56:47] (PS1) Volans: Upstream release v0.2.7 [software/homer/deploy] - https://gerrit.wikimedia.org/r/683256 [10:57:42] (CR) Jbond: [C: +2] x509-bundle.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/683133 (https://phabricator.wikimedia.org/T247364) (owner: Jbond) [10:58:06] jouncebot: next [10:58:06] In 0 hour(s) and 1 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T1100) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T1100). [11:00:04] Tchanders and CFisch_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] I can deploy today [11:00:14] o/ [11:00:16] Hi Tchanders and CFisch_WMDE :) [11:00:16] Hi [11:00:21] Hi! [11:00:40] Would be cool if you could do my backport again Urbanecm! 
[11:00:50] (CR) Urbanecm: [C: +2] Separate reference preview settings in beta & non-beta [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682819 (https://phabricator.wikimedia.org/T281235) (owner: WMDE-Fisch) [11:00:55] sure sure :) [11:01:03] thx :-) [11:01:15] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: REIMAGE [11:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:37] (CR) Urbanecm: [C: +2] Enable partial action blocks on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/683088 (https://phabricator.wikimedia.org/T280528) (owner: Tchanders) [11:01:41] (CR) Urbanecm: [C: +2] Enable partial action blocks on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/683089 (https://phabricator.wikimedia.org/T280528) (owner: Tchanders) [11:02:06] Tchanders: fyi, the beta patch will land within 30 minutes. Do let me know if it doesn't for any reason :) [11:02:21] Thanks Urbanecm! [11:02:50] (Merged) jenkins-bot: Enable partial action blocks on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/683088 (https://phabricator.wikimedia.org/T280528) (owner: Tchanders) [11:02:53] (Merged) jenkins-bot: Enable partial action blocks on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/683089 (https://phabricator.wikimedia.org/T280528) (owner: Tchanders) [11:03:15] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: REIMAGE [11:03:18] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: REIMAGE [11:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:47] Tchanders: can you test your testwiki patch on mwdebug1001, please? 
[11:04:03] Having a look... [11:04:30] (PS1) Jbond: nrpe:chgeck_raid: use universal_newlines vs text [puppet] - https://gerrit.wikimedia.org/r/683258 [11:04:50] (CR) Jbond: [V: +2 C: +2] nrpe:chgeck_raid: use universal_newlines vs text [puppet] - https://gerrit.wikimedia.org/r/683258 (owner: Jbond) [11:05:23] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: REIMAGE [11:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:51] (PS1) Jbond: check_raid: use universal_newlines vs text [puppet] - https://gerrit.wikimedia.org/r/683259 [11:08:11] (CR) Jbond: [V: +2 C: +2] check_raid: use universal_newlines vs text [puppet] - https://gerrit.wikimedia.org/r/683259 (owner: Jbond) [11:08:57] (CR) Ayounsi: [C: +1] Upstream release v0.2.7 [software/homer/deploy] - https://gerrit.wikimedia.org/r/683256 (owner: Volans) [11:09:54] Urbanecm: Not seeing changes in testwiki, but just realised the config is only added in DefaultSettings in the next train - so that'll be why [11:10:04] aha, okay then. syncing [11:10:11] Thanks! [11:10:59] (Merged) jenkins-bot: Separate reference preview settings in beta & non-beta [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/682819 (https://phabricator.wikimedia.org/T281235) (owner: WMDE-Fisch) [11:13:27] Urbanecm: Change reached beta. Thank you! 
[11:13:33] cool [11:16:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ddbc378e41783356e28cd90bbefa08624ea2844c: Enable partial action blocks on testwiki (T280528) (duration: 01m 07s) [11:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:30] T280528: Enable partial action blocks on beta and testwiki - https://phabricator.wikimedia.org/T280528 [11:17:18] (PS2) ArielGlenn: bring up snapshot1001,12,13 as dumps testbed hosts [puppet] - https://gerrit.wikimedia.org/r/683151 (https://phabricator.wikimedia.org/T281330) [11:17:54] CFisch_WMDE: your backport is pulled to mwdebug1001, please test [11:18:35] Urbanecm: all good thanks! [11:18:42] syncing [11:20:33] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/Popups/: 8d0ae5e8fedefa911fc216bfc810d7a6169ea7e5: Separate reference preview settings in beta & non-beta (T281235) (duration: 01m 08s) [11:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:42] T281235: Separate beta and non-beta user setting for reference previews - https://phabricator.wikimedia.org/T281235 [11:21:30] (CR) ArielGlenn: [C: +2] bring up snapshot1001,12,13 as dumps testbed hosts [puppet] - https://gerrit.wikimedia.org/r/683151 (https://phabricator.wikimedia.org/T281330) (owner: ArielGlenn) [11:21:40] CFisch_WMDE: done. Anything else? [11:21:57] Urbanecm: Nice, thanks. I'm good! 
[11:22:01] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0053 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:22:02] cool :) [11:22:05] then we're done i guess [11:22:11] !log EU B&C window done [11:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:05] moritzm: FYI PHP upgrades (partially) broke CX: https://phabricator.wikimedia.org/T281346 [11:24:22] if there are whines about snapshot1011,12,13 it's because they need mcrouter secrets [11:24:42] if anyone happens to know how to generate those, please let me know, otherwise I will be hunting around on wikitech etc [11:29:29] done but still needs to happen for labs secrets [11:31:58] Nikerabbit: ack, thanks. this didn't show up in the initial phase when only canaries were upgraded,but by it's fully rolled out [11:36:13] Puppet, SRE-tools, Patch-For-Review, Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (jbond) I merged a couple of changes today, just a note that subprocess.Popen (and all the functions which inturn call this) d... [11:38:25] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (BBlack) >>! In T264398#7040458, @dpifke wrote: > Alternatively, can we get identical results just by incrementing `grace` by... [11:40:22] (CR) Jbond: [V: +1 C: +2] P:tlsproxy::envoy: refactor ssl configuertion [puppet] - https://gerrit.wikimedia.org/r/682982 (owner: Jbond) [11:40:23] (CR) ArielGlenn: [V: +2 C: +2] Add fake mcrouter secrets for snapshot1011,12,13 [labs/private] - https://gerrit.wikimedia.org/r/683262 (https://phabricator.wikimedia.org/T281330) (owner: ArielGlenn) [11:41:23] they should all be happy now. 
[11:42:51] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (BBlack) Continuing the thought above: varnishlog data may infer that most of the perf impact could be restored just by extendi... [11:46:56] (CR) JMeybohm: [C: +2] Move codfw etcd clients to new cluster (1 comment) [dns] - https://gerrit.wikimedia.org/r/683244 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [11:49:19] (PS1) ArielGlenn: Add snapshot1011,12,13 to scap targets for the dumps repo [dumps/scap] - https://gerrit.wikimedia.org/r/683263 (https://phabricator.wikimedia.org/T281330) [11:50:21] (CR) ArielGlenn: [V: +2 C: +2] Add snapshot1011,12,13 to scap targets for the dumps repo [dumps/scap] - https://gerrit.wikimedia.org/r/683263 (https://phabricator.wikimedia.org/T281330) (owner: ArielGlenn) [11:51:10] (CR) Effie Mouzeli: modules::conftool add safe-service-restart scap option (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) (owner: Effie Mouzeli) [11:53:01] !log switching SRV record _etcd._tcp to new etcd cluster (for codfw, eqsin, ulsfo) [11:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:55:34] (CR) Reedy: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: Reedy) [11:56:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:56:22] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2402 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:26] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2389 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:26] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2403 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:26] PROBLEM - MediaWiki EtcdConfig up-to-date on parse2014 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:34] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2361 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:34] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2400 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:34] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2374 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:40] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2325 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:46] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2316 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:46] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2304 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) 
https://wikitech.wikimedia.org/wiki/Etcd [11:56:46] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2308 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:48] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2324 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:50] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2255 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:54] PROBLEM - MediaWiki EtcdConfig up-to-date on parse2013 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:54] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2283 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:56:56] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2369 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:57:10] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2354 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:57:16] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2319 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:57:16] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2352 is CRITICAL: etcd last index (8978) is outdated compared to the master one (646954) https://wikitech.wikimedia.org/wiki/Etcd [11:57:20] noisy chatterbox isn't it [11:57:36] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2402 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:57:38] RECOVERY - MediaWiki EtcdConfig up-to-date on 
mw2389 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:57:38] RECOVERY - MediaWiki EtcdConfig up-to-date on parse2014 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:57:38] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2403 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:57:48] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2361 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:57:48] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2400 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:57:48] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2374 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:57:56] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2325 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:02] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2304 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:02] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2308 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:06] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2324 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:06] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2255 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:12] RECOVERY - MediaWiki EtcdConfig up-to-date on parse2013 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:14] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2369 is OK: 
etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:28] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2354 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:34] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2319 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:58:34] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2352 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:59:16] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2316 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [11:59:26] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2283 is OK: etcd last index (8978) matches the master one (8978) https://wikitech.wikimedia.org/wiki/Etcd [12:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T1200) [12:02:00] RECOVERY - Disk space on mwlog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [12:06:02] (PS1) KartikMistry: Fix CX token cookie [extensions/ContentTranslation] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/683134 (https://phabricator.wikimedia.org/T281346) [12:06:28] (PS1) KartikMistry: Fix CX token cookie [extensions/ContentTranslation] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/683135 (https://phabricator.wikimedia.org/T281346) [12:14:02] (CR) Alexandros Kosiaris: safe-service-restart: Only verify in scope services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682619 (https://phabricator.wikimedia.org/T279100) (owner: Alexandros Kosiaris) [12:14:53] (PS4) Ayounsi: cloudsw: policy-options [homer/public] - 
https://gerrit.wikimedia.org/r/682957 [12:14:55] (PS4) Ayounsi: cloudsw: loopback firewall filter [homer/public] - https://gerrit.wikimedia.org/r/682972 [12:15:11] !log jmm@cumin2001 START - Cookbook sre.cassandra.roll-restart [12:15:11] !log jmm@cumin2001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) [12:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:36] !log jmm@cumin2001 START - Cookbook sre.cassandra.roll-restart [12:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:01] !log rolling restart of cassandra in restbase-dev to pick up Java security updates [12:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:02] SRE, ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (Ottomata) a:Cmjohnson→razzi [12:19:22] (PS1) Jbond: P:tlsproxy::envoy: Add support for cfssl ssl provider [puppet] - https://gerrit.wikimedia.org/r/683266 [12:19:43] (PS1) Majavah: Unset public_rewrites:false for beta loginwiki [puppet] - https://gerrit.wikimedia.org/r/683267 [12:20:34] (CR) jerkins-bot: [V: -1] P:tlsproxy::envoy: Add support for cfssl ssl provider [puppet] - https://gerrit.wikimedia.org/r/683266 (owner: Jbond) [12:22:25] SRE, Security-Team: Request to Join Security Mailing List - https://phabricator.wikimedia.org/T281357 (Reedy) I thought all SRE automagically get security@ as they also go to one of the SRE aliases? 
[12:25:27] (PS2) Jbond: P:tlsproxy::envoy: Add support for cfssl ssl provider [puppet] - https://gerrit.wikimedia.org/r/683266 [12:26:11] (CR) Filippo Giunchedi: "recheck" [alerts] - https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi) [12:26:16] (PS2) Filippo Giunchedi: Sketch of Performance team alerts [alerts] - https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T272977) [12:26:38] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:28:23] !log manually edited /srv/deployment/dumps/dumps-cache/config on snapshots1011,12,13 to change deploy1001 to deploy1002 (where did it get the old value from? these are new installs!) [12:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:04] (PS1) Arturo Borrero Gonzalez: openstack: neutron: topology changes for cloudgw [puppet] - https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) [12:29:40] (CR) Arturo Borrero Gonzalez: [C: -1] "DON'T MERGE THIS." 
[puppet] - https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez) [12:35:48] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 36): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29246/console" [puppet] - https://gerrit.wikimedia.org/r/683266 (owner: Jbond) [12:36:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [12:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:35] (PS3) Filippo Giunchedi: Sketch of Performance team alerts [alerts] - https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) [12:37:39] (PS1) Filippo Giunchedi: Return pathlib objects from file-listing functions [alerts] - https://gerrit.wikimedia.org/r/683269 (https://phabricator.wikimedia.org/T272977) [12:39:21] !log restarting pybal on lvs2010 - T271573 [12:39:28] (PS2) Filippo Giunchedi: Return pathlib objects from file-listing functions [alerts] - https://gerrit.wikimedia.org/r/683269 (https://phabricator.wikimedia.org/T272977) [12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:30] T271573: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 [12:40:50] (PS2) Arturo Borrero Gonzalez: openstack: neutron: topology changes for cloudgw [puppet] - https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) [12:42:03] !log restarting pybal on lvs5003,lvs4007 - T271573 [12:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:22] !log jmm@cumin2001 START - Cookbook sre.cassandra.roll-restart [12:43:25] (CR) Filippo Giunchedi: [C: +2] Return pathlib objects from file-listing functions [alerts] - https://gerrit.wikimedia.org/r/683269 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi) [12:43:31] SRE, CommRel-Specialists-Support 
(Apr-Jun-2021), Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (Trizek-WMF) [12:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:57] (PS2) Effie Mouzeli: modules::conftool add safe-service-restart scap option [puppet] - https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) [12:45:13] (CR) jerkins-bot: [V: -1] modules::conftool add safe-service-restart scap option [puppet] - https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) (owner: Effie Mouzeli) [12:45:43] (CR) Ottomata: [V: +1 C: +2] test/data_purge - add drop_event job [puppet] - https://gerrit.wikimedia.org/r/683053 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata) [12:45:50] !log upgrading labweb to PHP 7.4.32 [12:45:51] SRE, ops-codfw, DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul) @Marostegui @jcrespo thanks [12:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:53] SRE, CommRel-Specialists-Support (Apr-Jun-2021), Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (Trizek-WMF) [12:48:07] !log restarting pybal on lvs2009 - T271573 [12:48:12] (PS1) Jbond: O:debmonitor::server: request cert using cfssl [puppet] - https://gerrit.wikimedia.org/r/683270 [12:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:16] T271573: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 [12:50:55] (PS3) Jbond: P:tlsproxy::envoy: Add support for cfssl ssl provider [puppet] - https://gerrit.wikimedia.org/r/683266 [12:51:12] (CR) JMeybohm: [C: +2] Switch conf200[1-3] to conf200[4-6] [puppet] - https://gerrit.wikimedia.org/r/683248 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) 
[12:51:16] (PS2) Jbond: O:debmonitor::server: request cert using cfssl [puppet] - https://gerrit.wikimedia.org/r/683270 [12:53:48] (PS3) Effie Mouzeli: modules::conftool add safe-service-restart scap option [puppet] - https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) [12:55:03] SRE, Scap, Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (hnowlan) This issue reoccured when we moved to deploy1002 for a bunch of services that use `deploy-local` for initial deployments vi... [12:55:22] SRE, Scap, Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (ArielGlenn) Just ran into this today on an install of new snapshot1011,12,13: got the dreaded ` Apr 28 11:47:52 snapshot1011 puppet... [12:55:36] !log upgrading snapshot hosts to PHP 7.4.32 [12:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:42] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:58:36] PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [12:59:01] this is me [12:59:30] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=57) https://wikitech.wikimedia.org/wiki/PyBal [13:00:04] liw and longma: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T1300). 
[13:00:28] PROBLEM - PyBal connections to etcd on lvs5001 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:01:48] PROBLEM - PyBal connections to etcd on lvs5002 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [13:01:56] (03PS1) 10Lars Wirzenius: group1 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683272 [13:01:58] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683272 (owner: 10Lars Wirzenius) [13:02:04] !log upgrading deployment servers to PHP 7.4.32 [13:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:38] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683272 (owner: 10Lars Wirzenius) [13:03:51] !log liw@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 [13:03:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [13:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:58] !log liw@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.3 (duration: 01m 07s) [13:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:17] (03PS4) 10Effie Mouzeli: modules::conftool add safe-service-restart scap option [puppet] - 10https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) [13:06:48] PROBLEM - DPKG on cloudweb2001-dev is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:07:13] ^ that's me, should recover soon [13:07:40] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is 
CRITICAL: 268 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:07:56] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 57 connections established with conf2004.codfw.wmnet:4001 (min=57) https://wikitech.wikimedia.org/wiki/PyBal [13:09:23] (03PS3) 10Jbond: O:debmonitor::server: request cert using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683270 [13:09:24] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 11 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:09:55] (03PS1) 10David Caro: wmcs.openstack: unpin cloudvirt1039 [puppet] - 10https://gerrit.wikimedia.org/r/683274 (https://phabricator.wikimedia.org/T261137) [13:10:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:10:41] !log restarting pybal on lvs5002,lvs4006,lvs2008 - T271573 [13:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:50] T271573: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 [13:11:14] (03CR) 10David Caro: [C: 03+2] wmcs.openstack: unpin cloudvirt1039 [puppet] - 10https://gerrit.wikimedia.org/r/683274 (https://phabricator.wikimedia.org/T261137) (owner: 10David Caro) [13:11:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:12:12] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 655 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:12:17] rolling back from group1 [13:12:50] RECOVERY - PyBal connections to etcd on lvs5002 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [13:13:21] (03PS4) 10Jbond: P:tlsproxy::envoy: Add support for cfssl ssl provider [puppet] - 10https://gerrit.wikimedia.org/r/683266 [13:13:36] (03PS4) 10Jbond: O:debmonitor::server: request cert using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683270 [13:14:10] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 38 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:14:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29253/console" [puppet] - 10https://gerrit.wikimedia.org/r/683270 (owner: 10Jbond) [13:14:49] !log liw@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.3" [13:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:00] !log restarting pybal on lvs5001,lvs4005,lvs2007 - T271573 [13:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:48] RECOVERY - PyBal connections to etcd on lvs2007 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:15:58] (03PS1) 10Lars Wirzenius: 
Revert "group1 wikis to 1.37.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683276 (https://phabricator.wikimedia.org/T281361) [13:16:00] (03CR) 10Lars Wirzenius: [C: 03+2] Revert "group1 wikis to 1.37.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683276 (https://phabricator.wikimedia.org/T281361) (owner: 10Lars Wirzenius) [13:16:50] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.37.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683276 (https://phabricator.wikimedia.org/T281361) (owner: 10Lars Wirzenius) [13:17:00] RECOVERY - PyBal connections to etcd on lvs5001 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:21:10] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#7037012, @Gilles wrote: > The change's claimed behaviour is definitely consistant with the change we obser... [13:22:06] 10SRE, 10Scap, 10Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10hnowlan) I also see the above with `--refresh-config` added: ` root@snapshot1011:~# sudo -u dumpsgen /usr/bin/scap deploy-local --r... 
[13:25:44] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Trizek-WMF) [13:26:12] (03PS1) 10Jbond: O:pki::root: Add debmonitor intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/683277 [13:26:32] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:26:43] (03CR) 10Jbond: [C: 03+2] O:pki::root: Add debmonitor intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/683277 (owner: 10Jbond) [13:28:25] (03PS1) 10David Caro: wmcs.openstack: unpin cloudvirts to continue upgrade to victoria [puppet] - 10https://gerrit.wikimedia.org/r/683278 (https://phabricator.wikimedia.org/T261137) [13:28:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:29:08] (03CR) 10Andrew Bogott: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/683278 (https://phabricator.wikimedia.org/T261137) (owner: 10David Caro) [13:31:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:31:37] (03CR) 10David Caro: [C: 03+2] wmcs.openstack: unpin cloudvirts to continue upgrade to victoria [puppet] - 10https://gerrit.wikimedia.org/r/683278 (https://phabricator.wikimedia.org/T261137) (owner: 10David Caro) [13:37:22] RECOVERY - DPKG on cloudweb2001-dev is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:38:45] (03PS1) 10Jbond: O:pki::multirootca: add debmonito ca [puppet] - 10https://gerrit.wikimedia.org/r/683282 [13:39:03] 10SRE, 10Wikimedia-Mailing-lists: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10Ladsgroup) We are making this mailing list but the naming is wrong. https://meta.wikimedia.org/wiki/Mailing_lists/Standardization says the naming should be wikimedia-ha and also is it a registered us... [13:39:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29254/console" [puppet] - 10https://gerrit.wikimedia.org/r/683282 (owner: 10Jbond) [13:40:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:pki::multirootca: add debmonito ca [puppet] - 10https://gerrit.wikimedia.org/r/683282 (owner: 10Jbond) [13:41:56] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] python-poolcounter: Build for bullseye [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/683235 (owner: 10Muehlenhoff) [13:42:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['clo... [13:42:07] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) 05Stalled→03Open We can create it now. 
Just double checking https://meta.wikimedia.org/wiki/Mailing_lists/Standardization is not very clear on user groups. I can still go with wikisul though. Just a... [13:44:47] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Create a mailing list for ptwikinews - https://phabricator.wikimedia.org/T280408 (10Ladsgroup) 05Stalled→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/wikinews-pt.lists.wikimedia.org/ Done. You are admin now, please create an a... [13:45:19] (03PS1) 10Jbond: P:debmonitor::client: use debmonitor CA [puppet] - 10https://gerrit.wikimedia.org/r/683284 [13:48:10] (03PS2) 10Jbond: P:debmonitor::client: use debmonitor CA [puppet] - 10https://gerrit.wikimedia.org/r/683284 [13:48:12] (03PS1) 10Jbond: sretest1002: test new debmon CA [puppet] - 10https://gerrit.wikimedia.org/r/683286 [13:48:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Unset public_rewrites:false for beta loginwiki [puppet] - 10https://gerrit.wikimedia.org/r/683267 (owner: 10Majavah) [13:48:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Unset public_rewrites:false for beta loginwiki [puppet] - 10https://gerrit.wikimedia.org/r/683267 (owner: 10Majavah) [13:48:55] (03PS3) 10Jbond: P:debmonitor::client: use debmonitor CA [puppet] - 10https://gerrit.wikimedia.org/r/683284 [13:49:11] (03CR) 10Jbond: [C: 03+2] sretest1002: test new debmon CA [puppet] - 10https://gerrit.wikimedia.org/r/683286 (owner: 10Jbond) [13:50:55] (03CR) 10Jbond: [C: 03+2] P:debmonitor::client: use debmonitor CA [puppet] - 10https://gerrit.wikimedia.org/r/683284 (owner: 10Jbond) [13:52:03] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01001 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:52:27] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10JMeybohm) DNS SRV records, pybal's and confd instances in codfw, eqsin, ulsfo moved to the new cluster. 
navtiming.service on webperf needed a restart as well. [13:52:54] 10SRE, 10Gerrit, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Release-Engineering-Team: Disable Gerrit user Mdhollo - https://phabricator.wikimedia.org/T281291 (10Mholloway) Ah, that Gerrit account was associated with the LDAP account `hollowlog`, which can also be disabled. Thanks again! [13:54:00] 10SRE: Revoke debmonitor.discovery.wmnet - https://phabricator.wikimedia.org/T281366 (10jbond) p:05Triage→03Medium [13:55:25] (03PS1) 10Muehlenhoff: zookeeper: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/683288 [13:55:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/683288 (owner: 10Muehlenhoff) [13:57:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: REIMAGE [13:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:49] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: REIMAGE [13:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:00] (03PS1) 10Muehlenhoff: Add ldap-replica2005 as new replica with Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/683290 [14:00:29] (03PS5) 10David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682099 [14:00:31] (03PS2) 10David Caro: wmcs.ceph: add cookbook to upgrade all osds [cookbooks] - 10https://gerrit.wikimedia.org/r/682106 (https://phabricator.wikimedia.org/T280641) [14:02:58] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Create a mailing list for ptwikinews - https://phabricator.wikimedia.org/T280408 (10Edu) @Ladsgroup thank you so much [14:04:21] (03CR) 10jerkins-bot: [V: 04-1] wmcs.ceph: add cookbook to upgrade all osds [cookbooks] - 10https://gerrit.wikimedia.org/r/682106 (https://phabricator.wikimedia.org/T280641) 
(owner: 10David Caro) [14:04:40] (03CR) 10jerkins-bot: [V: 04-1] wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682099 (owner: 10David Caro) [14:10:57] 10SRE: Revoke debmonitor.discovery.wmnet - https://phabricator.wikimedia.org/T281366 (10jbond) [14:12:17] (03PS1) 10ArielGlenn: make snapshot1011 the new wikidata dumper and snapshot1012 the new enwiki dumper [puppet] - 10https://gerrit.wikimedia.org/r/683293 (https://phabricator.wikimedia.org/T281330) [14:16:15] (03CR) 10Jbond: [C: 03+2] P:tlsproxy::envoy: Add support for cfssl ssl provider [puppet] - 10https://gerrit.wikimedia.org/r/683266 (owner: 10Jbond) [14:17:57] !log milimetric@deploy1002 Started deploy [analytics/refinery@559d98d]: Regular analytics weekly train [analytics/refinery@559d98d] [14:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:23] 10SRE: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [14:22:40] 10SRE: Revoke debmonitor.discovery.wmnet - https://phabricator.wikimedia.org/T281366 (10jbond) [14:22:42] 10SRE: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [14:23:59] (03PS4) 10Giuseppe Lavagetto: kubernetes::deployment_server: also add kafka broker, pass CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) [14:25:06] 10SRE: Create a discover CA - https://phabricator.wikimedia.org/T281370 (10jbond) [14:25:25] 10SRE: Create a discover CA - https://phabricator.wikimedia.org/T281370 (10jbond) p:05Triage→03Medium [14:25:39] 10SRE: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) p:05Triage→03Medium [14:26:26] !log installing net-snmp updates from buster point release [14:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:32] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Thanks. 
[14:30:00] (03CR) 10Ottomata: "Cool!" [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [14:30:29] !log milimetric@deploy1002 Finished deploy [analytics/refinery@559d98d]: Regular analytics weekly train [analytics/refinery@559d98d] (duration: 12m 31s) [14:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:39] (03PS5) 10Giuseppe Lavagetto: kubernetes::deployment_server: also add kafka broker, pass CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) [14:30:44] !log milimetric@deploy1002 Started deploy [analytics/refinery@559d98d]: - [14:30:44] !log milimetric@deploy1002 deploy aborted: - (duration: 00m 00s) [14:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:18] !log milimetric@deploy1002 Started deploy [analytics/refinery@559d98d]: Regular analytics weekly train [analytics/refinery@559d98d] [14:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:04] (03Abandoned) 10Jbond: O:debmonitor::server: request cert using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683270 (owner: 10Jbond) [14:32:48] !log installing iproute2 updates from buster point release [14:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:26] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [14:34:25] !log milimetric@deploy1002 Finished deploy [analytics/refinery@559d98d]: Regular analytics weekly train [analytics/refinery@559d98d] (duration: 03m 07s) [14:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:47] !log milimetric@deploy1002 Started deploy [analytics/refinery@559d98d] (thin): Regular analytics weekly train THIN 
[analytics/refinery@559d98d] [14:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:54] !log milimetric@deploy1002 Finished deploy [analytics/refinery@559d98d] (thin): Regular analytics weekly train THIN [analytics/refinery@559d98d] (duration: 00m 06s) [14:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:35] !log milimetric@deploy1002 Started deploy [analytics/refinery@559d98d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@559d98d] [14:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:15] (03CR) 10Ppchelko: "Ok, if you create a ticket to try to upstream metric splitting by request method and assign it to me, I think I can do it. Compiling envoy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan) [14:36:19] (03CR) 10Ppchelko: [C: 03+1] api-gateway: Create individual cluster definitions for read and write [deployment-charts] - 10https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan) [14:38:22] PROBLEM - Host db2096.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:40:35] !log milimetric@deploy1002 Finished deploy [analytics/refinery@559d98d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@559d98d] (duration: 04m 59s) [14:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:09] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10JMeybohm) [14:42:44] (03CR) 10Ottomata: [C: 03+2] Revert "Revert "sqoop: switch to single grouped_wikis.csv"" [puppet] - 10https://gerrit.wikimedia.org/r/682791 (https://phabricator.wikimedia.org/T279564) (owner: 10Razzi) [14:44:10] !log imported gitlab-ce 13.9.7-ce.0 to apt.wikimedia.org [14:44:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 
10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` and were **ALL** succe... [14:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:34] RECOVERY - Host db2096.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.04 ms [14:48:35] 10SRE: Add PKI root CA to ca-certificates via puppet - https://phabricator.wikimedia.org/T281376 (10jbond) [14:49:02] (03PS2) 10Herron: add kafka-logging200[123] to kafka term [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) [14:49:48] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [14:50:11] (03CR) 10Herron: "> Patch Set 1: Code-Review-1" [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [14:50:41] 10SRE: debmonitor.discovery.wmnet: Generate server cetificate via cfssl - https://phabricator.wikimedia.org/T281377 (10jbond) [14:51:11] (03PS1) 10Jbond: P:debmonitor::server: add support for CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/683348 (https://phabricator.wikimedia.org/T281377) [14:52:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29255/console" [puppet] - 10https://gerrit.wikimedia.org/r/683348 (https://phabricator.wikimedia.org/T281377) (owner: 10Jbond) [14:52:33] (03CR) 10Hnowlan: [C: 03+2] reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [14:53:35] !log jayme@cumin1001 START - Cookbook sre.hosts.decommission for hosts conf[2001-2003].codfw.wmnet [14:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:58] (03PS1) 10Arturo Borrero 
Gonzalez: nftables: restart command should be reload [puppet] - 10https://gerrit.wikimedia.org/r/683349 [14:54:41] (03PS2) 10Jbond: P:debmonitor::server: add support for CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/683348 (https://phabricator.wikimedia.org/T281377) [14:55:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29256/console" [puppet] - 10https://gerrit.wikimedia.org/r/683348 (https://phabricator.wikimedia.org/T281377) (owner: 10Jbond) [14:56:08] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: add support for CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/683348 (https://phabricator.wikimedia.org/T281377) (owner: 10Jbond) [14:56:18] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [14:56:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/29257/" [puppet] - 10https://gerrit.wikimedia.org/r/683349 (owner: 10Arturo Borrero Gonzalez) [14:57:08] (03PS5) 10Hnowlan: reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 [14:57:30] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [14:57:49] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [14:59:27] hnowlan: hello is it poosible to depool and powerdown sessionstore2001 for me? 
I have to relocate the server https://phabricator.wikimedia.org/T281135 thanks [14:59:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={etcd,etcdmirror,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:00:05] !log imported python-poolcounter 0.0.2-1+deb11u1 to apt.wikimedia.org T275873 [15:00:12] (03CR) 10BBlack: [C: 03+1] P:rsyslog: send pybal logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/682566 (owner: 10Jbond) [15:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:14] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [15:01:28] (03CR) 10Muehlenhoff: reboot-single: allow specification of ticket and reason (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [15:03:32] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:36] papaul: yep, should be - when would suit? [15:04:43] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Eevans) Almost too late to the party here, but it's fine to relocate sessionstore2001. Will it be put under maintenance in Icinga? 
[15:05:40] hnowlan: it was set to move it at 10:30am CT if you can now it will be great thanks [15:09:04] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on sessionstore2001.codfw.wmnet with reason: Server relocation [15:09:05] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on sessionstore2001.codfw.wmnet with reason: Server relocation [15:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:48] papaul: should be powered down now [15:10:42] (03CR) 10Jbond: [C: 03+2] P:rsyslog: send pybal logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/682566 (owner: 10Jbond) [15:10:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:55] (03PS1) 10Jbond: hiera: move cloud debmon to pki [puppet] - 10https://gerrit.wikimedia.org/r/683352 [15:11:14] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: move cloud debmon to pki [puppet] - 10https://gerrit.wikimedia.org/r/683352 (owner: 10Jbond) [15:11:19] hnowlan: thanks will ping you when i have it back up [15:11:38] thanks! 
[15:12:19] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:03] (03CR) 10Herron: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:14:14] (03PS1) 10Jbond: hiera - cloud: add cfssl lable [puppet] - 10https://gerrit.wikimedia.org/r/683353 [15:14:46] 10SRE: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [15:15:04] (03CR) 10Jbond: [C: 03+2] hiera - cloud: add cfssl lable [puppet] - 10https://gerrit.wikimedia.org/r/683353 (owner: 10Jbond) [15:15:07] (03PS1) 10Muehlenhoff: package_builder: Added python3-pytest-runner [puppet] - 10https://gerrit.wikimedia.org/r/683354 [15:15:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera - cloud: add cfssl lable [puppet] - 10https://gerrit.wikimedia.org/r/683353 (owner: 10Jbond) [15:19:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:19:17] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts conf[2001-2003].codfw.wmnet [15:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:45] (03CR) 10Muehlenhoff: [C: 03+2] package_builder: Added python3-pytest-runner [puppet] - 10https://gerrit.wikimedia.org/r/683354 (owner: 10Muehlenhoff) [15:20:17] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [15:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:24] PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:42] PROBLEM - Host sessionstore2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:22:50] (03PS1) 10Muehlenhoff: profile::java: Remove jessie, add bullseye [puppet] - 10https://gerrit.wikimedia.org/r/683357 [15:24:00] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:15] PROBLEM - Host sessionstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on sessionstore2001.codfw.wmnet with reason: Server relocation [15:25:30] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on sessionstore2001.codfw.wmnet with reason: Server relocation [15:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:37] RECOVERY - Host sessionstore2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.90 ms [15:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:10] (03CR) 10Gehel: elasticsearch: refactor various 
rolling operations (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [15:30:03] hnowlan: sessionstore2001 is back online [15:31:34] (03PS1) 10JMeybohm: Remove references to conf200[1-3] after decom [puppet] - 10https://gerrit.wikimedia.org/r/683358 (https://phabricator.wikimedia.org/T271573) [15:31:39] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:32:14] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:33:25] (03CR) 10Muehlenhoff: "hieradata/role/common/configcluster.yaml can also go away." [puppet] - 10https://gerrit.wikimedia.org/r/683358 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [15:34:19] (03CR) 10JMeybohm: Remove references to conf200[1-3] after decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683358 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [15:34:57] PROBLEM - Long running screen/tmux on kubestagemaster1001 is CRITICAL: CRIT: Long running tmux process. (user: jayme PID: 6039, 1731846s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [15:35:34] oopsie [15:36:24] (03CR) 10JMeybohm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/683358 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [15:38:27] (03PS1) 10Jbond: cloud - debmon: add root ca to trust store [puppet] - 10https://gerrit.wikimedia.org/r/683362 [15:38:36] papaul: it's back up but it appears to be having network issues - is there any reconfig on my end required? 
[15:39:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove references to conf200[1-3] after decom [puppet] - 10https://gerrit.wikimedia.org/r/683358 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [15:39:32] (03CR) 10JMeybohm: [C: 03+2] Remove references to conf200[1-3] after decom [puppet] - 10https://gerrit.wikimedia.org/r/683358 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [15:40:05] hnowlan: yes i change the networtk port on the switch only it shouldn't have any ip address change [15:40:08] (03CR) 10Jbond: [C: 03+2] cloud - debmon: add root ca to trust store [puppet] - 10https://gerrit.wikimedia.org/r/683362 (owner: 10Jbond) [15:40:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] cloud - debmon: add root ca to trust store [puppet] - 10https://gerrit.wikimedia.org/r/683362 (owner: 10Jbond) [15:40:26] hnowlan: let me double check on my end [15:41:28] (03CR) 10Elukey: [C: 03+1] profile::java: Remove jessie, add bullseye [puppet] - 10https://gerrit.wikimedia.org/r/683357 (owner: 10Muehlenhoff) [15:41:37] hnowlan: yes i had it on port 27 and not 16 fixing it now [15:42:18] thanks! 
[15:44:08] 10SRE, 10SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (10jbond) 05Open→03Resolved This should be fixed now, please reopen if you are still seeing issues [15:44:16] 10SRE, 10SRE-tools: debmonitor-client.service stays in failed state in case of server errors - https://phabricator.wikimedia.org/T280484 (10jbond) 05Open→03Resolved [15:47:29] PROBLEM - Host sessionstore2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:47:52] (03CR) 10Ayounsi: [C: 03+1] add kafka-logging200[123] to kafka term [homer/public] - 10https://gerrit.wikimedia.org/r/683050 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [15:49:03] (03PS1) 10Jbond: hiera - clod: enable pki in sso project [puppet] - 10https://gerrit.wikimedia.org/r/683364 [15:49:22] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera - clod: enable pki in sso project [puppet] - 10https://gerrit.wikimedia.org/r/683364 (owner: 10Jbond) [15:49:55] (03PS1) 10Volans: customscript: fix VIP assignment in PuppetDB import [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683365 [15:51:41] RECOVERY - Host sessionstore2001 is UP: PING OK - Packet loss = 0%, RTA = 31.73 ms [15:51:45] hnowlan: good now [15:51:51] (03CR) 10Volans: "This should fix the issue Arturo got." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683365 (owner: 10Volans) [15:52:10] (03PS1) 10Jbond: P::debmonitor::server: pass hosts instead of names [puppet] - 10https://gerrit.wikimedia.org/r/683366 [15:52:21] PROBLEM - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:52:37] RECOVERY - Long running screen/tmux on aqs1011 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [15:52:54] papaul: nice, thank you! 
[15:53:13] PROBLEM - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is CRITICAL: connect to address 10.192.16.95 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:53:23] hnowlan: you welcome [15:53:45] (03CR) 10Jbond: [C: 03+2] P::debmonitor::server: pass hosts instead of names [puppet] - 10https://gerrit.wikimedia.org/r/683366 (owner: 10Jbond) [15:53:53] RECOVERY - Host sessionstore2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms [15:54:39] RECOVERY - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is OK: SSL OK - Certificate sessionstore2001-a valid until 2023-02-22 11:12:13 +0000 (expires in 664 days) https://phabricator.wikimedia.org/T120662 [15:55:31] RECOVERY - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is OK: TCP OK - 0.032 second response time on 10.192.16.95 port 9042 https://phabricator.wikimedia.org/T93886 [15:56:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [15:56:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [16:00:45] PROBLEM - ensure kvm processes are running on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:02:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['clo... 
[16:04:23] (03CR) 10Ayounsi: customscript: fix VIP assignment in PuppetDB import (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683365 (owner: 10Volans) [16:04:48] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott work in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:09:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:40] (03PS1) 10Addshore: Fix incorrect ItemId typehint in Lua bindings [extensions/Wikibase] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683143 (https://phabricator.wikimedia.org/T281361) [16:12:12] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:29] (03CR) 10Jakob: [C: 03+1] Fix incorrect ItemId typehint in Lua bindings [extensions/Wikibase] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683143 (https://phabricator.wikimedia.org/T281361) (owner: 10Addshore) [16:17:45] (03PS1) 10David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683370 (https://phabricator.wikimedia.org/T280641) [16:17:47] (03PS1) 10David Caro: WIP wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683371 [16:19:16] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: REIMAGE [16:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:45] (03CR) 10jerkins-bot: [V: 
04-1] WIP wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683371 (owner: 10David Caro) [16:20:47] (03CR) 10jerkins-bot: [V: 04-1] wmcs: add cloudvirt drain cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683370 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [16:21:39] (03PS21) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [16:21:41] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: REIMAGE [16:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: REIMAGE [16:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:41] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: REIMAGE [16:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:24] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: REIMAGE [16:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:58] (03PS1) 10Giuseppe Lavagetto: Remove records for the old etcd cluster raft consensus [dns] - 10https://gerrit.wikimedia.org/r/683372 [16:25:41] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: REIMAGE [16:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:53] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan 
Kemper) [16:26:02] (03CR) 10Papaul: [C: 03+1] Remove records for the old etcd cluster raft consensus [dns] - 10https://gerrit.wikimedia.org/r/683372 (owner: 10Giuseppe Lavagetto) [16:26:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove records for the old etcd cluster raft consensus [dns] - 10https://gerrit.wikimedia.org/r/683372 (owner: 10Giuseppe Lavagetto) [16:26:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: REIMAGE [16:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:09] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:33] (03CR) 10Hnowlan: [C: 03+2] reboot-single: allow specification of ticket and reason (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:27:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: REIMAGE [16:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:20] (03PS22) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [16:28:31] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: REIMAGE [16:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:59] addshore: hello! 
Let's unblock the train :-) [16:29:00] Urbanecm: you should be able to verify the fix with https://test-commons.wikimedia.org/wiki/File:Godward_Idleness_1900-dupe!.jpg [16:29:05] jouncebot: now [16:29:05] No deployments scheduled for the next 1 hour(s) and 30 minute(s) [16:29:06] which currently fatals :)_ [16:29:11] (03CR) 10Urbanecm: [C: 03+2] Fix incorrect ItemId typehint in Lua bindings [extensions/Wikibase] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683143 (https://phabricator.wikimedia.org/T281361) (owner: 10Addshore) [16:29:23] ack, thanks. That looks like a simple verification 🙂 [16:29:32] should be! :) [16:29:35] yup yup :) [16:29:41] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: REIMAGE [16:29:42] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:51] im arpund so ping me if you need anything, but ill be looking at other things right now :) [16:29:53] *around [16:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:15] okay, I'll ping if needed :) [16:30:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: REIMAGE [16:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:36] (03Merged) 10jenkins-bot: reboot-single: allow specification of ticket and reason [cookbooks] - 10https://gerrit.wikimedia.org/r/681426 (owner: 10Hnowlan) [16:32:22] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: REIMAGE [16:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:51] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10wiki_willy) Hi 
@fgiunchedi - Chatsworth has been pretty flexible with the amount of time we have for testing it, so I think we should be ok keeping it for a longer duration. Just le... [16:34:56] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [16:37:36] (03CR) 10LMata: [C: 03+1] "lgtm moving back to +1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [16:38:14] I'm upgrading a bunch of mailing lists now [16:39:59] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01002 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:43:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [16:43:24] (03PS1) 10Ottomata: Update refine_sanitize job to use event_sanitized_analytics_allowlist [puppet] - 10https://gerrit.wikimedia.org/r/683375 (https://phabricator.wikimedia.org/T273789) [16:44:51] (03PS1) 10Ottomata: canary_events and refine_sanitize - use refinery 0.1.9 [puppet] - 10https://gerrit.wikimedia.org/r/683376 (https://phabricator.wikimedia.org/T273789) [16:45:13] (03CR) 10Ottomata: [C: 03+2] Update refine_sanitize job to use event_sanitized_analytics_allowlist [puppet] - 10https://gerrit.wikimedia.org/r/683375 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [16:47:38] Urbanecm: hi, can you check if everything works fine with https://lists.wikimedia.org/postorius/lists/wikimediacz-l.lists.wikimedia.org/ [16:47:41] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01001 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:47:45] (03CR) 10Ottomata: [C: 03+2] canary_events 
and refine_sanitize - use refinery 0.1.9 [puppet] - 10https://gerrit.wikimedia.org/r/683376 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [16:48:22] (03PS23) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [16:48:23] The search index is now being built, so the search shouldn't be working well but the rest should [16:48:24] Amir1: hi, certainly. First, a question: How do I log in? :D [16:48:34] Urbanecm: create a new account :D [16:48:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:48:52] if you're fast you can get a low user id! :D [16:48:56] Amir1: but that would not be subscribed? :D [16:49:06] it's a private list, i hope new mailman honors that :D [16:49:15] as long as it's connected to your email address, it should be fine [16:49:30] aha, cool [16:49:33] you can have multiple email addresses in one account [16:49:47] Majavah: not everyone is as fast as you :D [16:49:54] they're private now, but I thought they were public a few moments ago when playing around [16:50:29] yeah it seems for a minute or two it becomes public. I need to find a solution for that [16:50:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:51:00] please do before importing things like arbocom-en@ or ops@ or similar [16:51:13] most of those don't have an archive [16:51:19] arbcom-en is a google group AFAIK [16:51:41] I think that was on mailman previously before moving to google [16:52:19] !log powerdown logstash2034 for relocation [16:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:36] so far so good Amir1 [16:53:12] Amir1: I assume I should tell WMCZ people to use the new interface, right? [16:53:21] yes [16:53:32] okay, will do [16:55:21] Urbanecm: can you use the moderation tools? https://lists.wikimedia.org/postorius/lists/wikimediacz-l.lists.wikimedia.org/settings/ [16:55:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,pdu_sentry4} site={eqiad,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:55:26] (03PS1) 10Giuseppe Lavagetto: networkpolicy: add autogenerated egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/683379 (https://phabricator.wikimedia.org/T253058) [16:56:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:57:10] Amir1: I see them, but right now, there is nothing _to_ moderate [16:57:34] nod [16:57:41] let me know if you encounter any issues [16:57:46] will do. 
[16:58:00] and thanks for all the mailman work :) [17:00:49] PROBLEM - Host logstash2034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:00:49] ^^ [17:03:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:04:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:53] (03PS2) 10Volans: customscript: fix VIP assignment in PuppetDB import [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683365 [17:05:06] (03CR) 10Volans: customscript: fix VIP assignment in PuppetDB import (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/683365 (owner: 10Volans) [17:05:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:05:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1042.eqiad.wmnet', 'cloudvirt1041.eqiad.wmne... 
[17:06:26] (03Merged) 10jenkins-bot: Fix incorrect ItemId typehint in Lua bindings [extensions/Wikibase] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/683143 (https://phabricator.wikimedia.org/T281361) (owner: 10Addshore) [17:06:37] RECOVERY - Host logstash2034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [17:07:37] (03PS5) 10Hnowlan: site: eventlog1003 role to eventlogging, allow access to kafka [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) [17:09:33] (03PS1) 10Jbond: P:debmonitor::server: fix parameter order [puppet] - 10https://gerrit.wikimedia.org/r/683383 [17:10:55] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: fix parameter order [puppet] - 10https://gerrit.wikimedia.org/r/683383 (owner: 10Jbond) [17:11:58] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/Wikibase/client/includes/DataAccess/Scribunto/WikibaseLanguageIndependentLuaBindings.php: b392dba0d77904d7de819043e51d8c3fbf003873: Fix incorrect ItemId typehint in Lua bindings (T281361) (duration: 01m 09s) [17:12:03] addshore: liw: synced [17:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:07] T281361: TypeError: Argument 2 passed to Wikibase\Client\DataAccess\Scribunto\WikibaseLanguageIndependentLuaBindings::trackUsageForSitelink() must be an instance of Wikibase\DataModel\Entity\ItemId, instance of Wikibase\MediaInfo\DataMo - https://phabricator.wikimedia.org/T281361 [17:12:51] (03CR) 10Hnowlan: [C: 03+2] site: eventlog1003 role to eventlogging, allow access to kafka [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [17:13:42] (03CR) 10Giuseppe Lavagetto: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [17:14:10] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 
(10Papaul) [17:14:28] Urbanecm, thank you kindly! [17:14:34] longma, backport done [17:14:34] any time :) [17:14:44] * Urbanecm afk, ping if needed [17:14:59] (03PS6) 10Giuseppe Lavagetto: kubernetes::deployment_server: also add kafka broker, pass CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) [17:16:53] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29259/console" [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [17:19:58] (03PS7) 10Giuseppe Lavagetto: kubernetes::deployment_server: also add kafka broker, pass CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) [17:21:14] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29260/console" [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [17:24:52] (03PS1) 10Jbond: P:pki::get_cert: Support multiple certs with different profiles [puppet] - 10https://gerrit.wikimedia.org/r/683385 [17:25:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29261/console" [puppet] - 10https://gerrit.wikimedia.org/r/683385 (owner: 10Jbond) [17:25:53] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) I migrated a bunch of mailing lists now. Wrote a bash script (you can find it in my home) to automate it. It should probably shorten out if the mailing list already ex... 
[17:26:10] (03CR) 10jerkins-bot: [V: 04-1] P:pki::get_cert: Support multiple certs with different profiles [puppet] - 10https://gerrit.wikimedia.org/r/683385 (owner: 10Jbond) [17:27:15] (03PS2) 10Jbond: P:pki::get_cert: Support multiple certs with different profiles [puppet] - 10https://gerrit.wikimedia.org/r/683385 [17:28:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29262/console" [puppet] - 10https://gerrit.wikimedia.org/r/683385 (owner: 10Jbond) [17:29:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::get_cert: Support multiple certs with different profiles [puppet] - 10https://gerrit.wikimedia.org/r/683385 (owner: 10Jbond) [17:33:14] (03PS1) 10Jbond: P:debmonitor::client: Add provide chain to certificate request [puppet] - 10https://gerrit.wikimedia.org/r/683406 [17:34:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29263/console" [puppet] - 10https://gerrit.wikimedia.org/r/683406 (owner: 10Jbond) [17:34:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:36:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:40:45] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::client: Add provide chain to certificate request [puppet] - 10https://gerrit.wikimedia.org/r/683406 (owner: 10Jbond) [17:42:49] (03PS3) 10Dzahn: thumbor: add a timer that writes the output of fc-list to /srv [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) [17:43:13] (03CR) 10Dzahn: thumbor: add a timer that writes the output of fc-list to /srv (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [17:43:25] (03CR) 10jerkins-bot: [V: 04-1] thumbor: add a timer that writes the output of fc-list to /srv [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [17:47:40] (03PS1) 10Jbond: O:debmonitor::server: request cert using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/683408 [17:48:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29264/console" [puppet] - 10https://gerrit.wikimedia.org/r/683408 (owner: 10Jbond) [17:56:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [17:56:08] (03PS4) 10Dzahn: thumbor: add a timer that writes the output of fc-list to a file [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) [17:57:13] RECOVERY - ensure kvm processes are running on cloudvirt1040 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:57:27] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 67 probes of 637 (alerts on 65) - 
https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:00:04] liw and longma: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:03:47] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 50 probes of 637 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:12:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10Andrew) These servers are now installed and running; 1040, 1041 and 1042 are now active in the 'ceph' host agg... [18:13:09] Thanks for unblocking the train Urbanecm addshore ! 
[18:13:23] * addshore just poked people and buttons :) [18:13:50] I just pressed few deployment buttons :-) [18:13:54] But glad to see it done :9 [18:20:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:21:05] !log added mvolz as listadmin for services@ and reset admin pw (T278516) [18:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:14] T278516: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 [18:21:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:25:14] 10SRE, 10ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (10hashar) [18:25:22] 10SRE, 10Platform Engineering, 10Services, 10Wikimedia-Mailing-lists: Decide on future of public services@ mailing list (which has no maintainers) - https://phabricator.wikimedia.org/T278516 (10Legoktm) >>! In T278516#7041242, @Mvolz wrote: > I've replied on the list, but we use this in citoid as a contact... 
[18:31:04] (03CR) 10Dzahn: "thinking about the next step after this: first I thought we'd use rsync (quickdatacopy or maybe better directly using rsync::module, that " [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [18:32:49] 10SRE, 10ops-eqiad, 10procurement, 10cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Andrew) [18:34:54] 10SRE, 10ops-eqiad, 10procurement, 10cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Andrew) [18:35:53] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) [18:49:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,routinator} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:00:04] liw and longma: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T1900). [19:01:14] (03PS24) 10Gehel: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [19:02:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:07:35] (03PS2) 10Herron: remove all references to icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/682992 (https://phabricator.wikimedia.org/T279601) [19:09:04] (03PS2) 10Herron: remove all references to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/682999 (https://phabricator.wikimedia.org/T279602) [19:09:52] (03CR) 10Ottomata: "> The way to do so is to add:" [puppet] - 10https://gerrit.wikimedia.org/r/682971 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [19:10:12] (03CR) 10Ryan Kemper: "flake8 is really annoying me. I want to allow my docstring to be more than 120 characters for the help message so I can include actual com" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [19:10:16] (03PS1) 10Jeena Huneidi: group1 wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683419 [19:10:18] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683419 (owner: 10Jeena Huneidi) [19:11:21] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.3 refs T278347 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683419 (owner: 10Jeena Huneidi) [19:11:50] (03CR) 10Ryan Kemper: "> Patch Set 24:" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [19:12:33] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 refs T278347 [19:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:44] T278347: 1.37.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T278347 [19:12:52] (03PS1) 10Herron: remove icinga[12]001 addresses from firewall 
rules [homer/public] - 10https://gerrit.wikimedia.org/r/683420 (https://phabricator.wikimedia.org/T279601) [19:13:09] (03PS2) 10Herron: remove icinga[12]001 addresses from firewall rules [homer/public] - 10https://gerrit.wikimedia.org/r/683420 (https://phabricator.wikimedia.org/T279601) [19:13:15] (03CR) 10Southparkfan: Add WMCS specific cloud role for syslog server (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [19:13:41] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 07s) [19:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:41] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:15:50] (03CR) 10Herron: remove all references to icinga1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682992 (https://phabricator.wikimedia.org/T279601) (owner: 10Herron) [19:16:44] jouncebot: next [19:16:44] In 0 hour(s) and 43 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T2000) [19:18:48] (03CR) 10Herron: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/682999 (https://phabricator.wikimedia.org/T279602) (owner: 10Herron) [19:18:55] (03CR) 10Majavah: Add WMCS specific cloud role for syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [19:19:50] (03PS1) 10Eevans: cqlshrc.erb: Use TLSv1.2 for cqlsh client connections [puppet] - 10https://gerrit.wikimedia.org/r/683422 (https://phabricator.wikimedia.org/T281404) [19:20:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:21:09] Majavah: I'll wait for a WMCS review. thanks :-) [19:21:41] (afk) [19:23:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:30:59] PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 261857 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [19:31:20] (03PS1) 10RobH: cp501[3-6] base install params [puppet] - 10https://gerrit.wikimedia.org/r/683425 (https://phabricator.wikimedia.org/T278182) [19:31:56] (03PS2) 10RobH: cp501[3-6] base install params [puppet] - 10https://gerrit.wikimedia.org/r/683425 (https://phabricator.wikimedia.org/T278182) [19:32:26] (03CR) 10RobH: [C: 03+2] cp501[3-6] base install params [puppet] - 10https://gerrit.wikimedia.org/r/683425 (https://phabricator.wikimedia.org/T278182) (owner: 10RobH) [19:36:23] RECOVERY - Long running screen/tmux on kubestagemaster1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [19:37:01] space should be better tomorrow [19:37:35] (on mwlog1001) [19:38:21] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005889 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:40:14] 10SRE, 10ops-eqiad, 10procurement, 10cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Andrew) this recovered after a reboot. We'll see if it holds... 
[19:42:29] 10SRE, 10Commons: DB Error when attempting to go to user watchlist - https://phabricator.wikimedia.org/T281407 (10Languageseeker) [19:44:45] longma: ^ is a dupe of https://phabricator.wikimedia.org/T281405, right? [19:45:23] looks like it, thanks [19:46:22] 10SRE, 10Commons: DB Error when attempting to go to user watchlist - https://phabricator.wikimedia.org/T281407 (10Majavah) [19:46:44] you beat me to it :P [19:48:13] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01178 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:51:46] longma: second duplicate now closed :/ [19:52:11] (03PS1) 10Dwisehaupt: Add cname for fundraising.frdev -> frdata-eqiad [dns] - 10https://gerrit.wikimedia.org/r/683426 (https://phabricator.wikimedia.org/T260571) [19:52:16] hmm, maybe I should roll back? [19:52:27] might be a good idea [19:53:22] alright [19:53:57] I think I found the patch causing this and left a comment, but I'm heading into bed now and Tim (patch author) is likely asleep at this time [19:54:44] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005889 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:54:46] thanks for your help! 
[19:56:00] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) [19:56:22] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:47] 10SRE, 10ops-eqiad, 10procurement, 10cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Andrew) ...and it's offline again [19:57:57] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.1" [19:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:08] (03PS1) 10Jeena Huneidi: Revert "group1 wikis to 1.37.0-wmf.3 refs T278347" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683427 [19:59:11] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group1 wikis to 1.37.0-wmf.3 refs T278347" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683427 (owner: 10Jeena Huneidi) [19:59:54] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.37.0-wmf.3 refs T278347" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683427 (owner: 10Jeena Huneidi) [20:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T2000). 
[20:00:49] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:14] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:35:48] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cp5013.eqsin.wmnet', 'cp5014.eqsin.wmnet', 'cp5015.eqsin.wmn... [20:49:22] (03CR) 10Legoktm: [C: 03+2] lists: Backup /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681763 (owner: 10Legoktm) [20:52:52] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Legoktm) >>! In T278614#7025271, @jcrespo wrote: > Backups have been enabled and access seems correct. I saw the dbs are right now empty, but please pi... 
[20:54:40] (03CR) 10Jgreen: [C: 03+2] Add cname for fundraising.frdev -> frdata-eqiad [dns] - 10https://gerrit.wikimedia.org/r/683426 (https://phabricator.wikimedia.org/T260571) (owner: 10Dwisehaupt) [21:00:15] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29266/console" [puppet] - 10https://gerrit.wikimedia.org/r/683111 (https://phabricator.wikimedia.org/T260330) (owner: 10Dzahn) [21:00:43] (03PS1) 10Urbanecm: Set wgGEMentorshipMigrationStage to SCHEMA_COMPAT_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683430 (https://phabricator.wikimedia.org/T279853) [21:01:15] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [21:01:18] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [21:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:58] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [21:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:14] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [21:03:20] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [21:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:04] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5014.eqsin.wmnet', 'cp5015.eqsin.wmnet', 
'cp5013.eqsin.wmnet', 'cp5016.eqsin.wmnet'] ` Of which those **F... [21:04:10] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [21:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [21:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:16] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [21:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:35] (03PS1) 10Legoktm: Add k8s dummy tokens for shellbox [labs/private] - 10https://gerrit.wikimedia.org/r/683432 [21:13:29] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add k8s dummy tokens for shellbox [labs/private] - 10https://gerrit.wikimedia.org/r/683432 (owner: 10Legoktm) [21:16:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:17:26] 10SRE, 10ops-codfw, 10Discovery: elastic2043 doesn't power up - https://phabricator.wikimedia.org/T281215 (10Papaul) 05Open→03Declined [21:18:58] (03CR) 10Dzahn: [C: 03+2] ci::master/deployment_server: add new k8s namespace for shellbox [puppet] - 10https://gerrit.wikimedia.org/r/683111 (https://phabricator.wikimedia.org/T260330) (owner: 10Dzahn) [21:20:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:21:31] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cp5013.eqsin.wmnet ` The log can be found in `/var/log/wmf-aut... [21:24:13] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1013.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (previous reimage timed out, instance appears to have rebooted) [21:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:21] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [21:26:08] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10Papaul) [21:27:31] (03PS1) 10Legoktm: Add shellbox namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/683435 [21:28:06] (03CR) 10Dzahn: [C: 03+1] Add shellbox namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/683435 (owner: 10Legoktm) [21:31:55] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10Papaul) [21:32:16] !log T280382 [WDQS] `wdqs2007` ssh is unreachable; power cycling via `racadm>>racadm serveraction powercycle` [21:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:24] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [21:32:58] PROBLEM - Host wdqs2007 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:34] RECOVERY - Host wdqs2007 is UP: PING OK - Packet loss = 0%, RTA = 31.75 ms [21:34:34] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:48] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:35:14] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:35:40] RECOVERY - SSH on wdqs2007 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:35:40] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:35:40] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:35:44] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:37:31] !log T280382 `wdqs2007` is reachable again; glancing at `/srv/wdqs` its `wikidata.jnl` is `839G` when it should be `975G` so I'll re-do the wikidata journal transfer [21:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:39] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [21:38:32] upgrading another mailing list [21:38:55] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet 
--dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` [21:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:18] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [21:39:24] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:43] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE [21:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:27] (03CR) 10Legoktm: [C: 03+2] Add shellbox namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/683435 (owner: 10Legoktm) [21:41:49] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE [21:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:15] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5013.eqsin.wmnet'] ` Of which those **FAILED**: ` ['cp5013.eqsin.wmnet'] ` [21:44:23] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [21:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:45] !log legoktm@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[21:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:15] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [21:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:47] 10SRE, 10Jade, 10TechCom, 10Epic, and 3 others: Deploy pilot of Jade to a small set of wikis. - https://phabricator.wikimedia.org/T183381 (10Ladsgroup) [21:47:46] !log legoktm@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [21:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:35] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Peachey88) [21:48:37] 10SRE, 10Security-Team: Request to Join Security Mailing List - https://phabricator.wikimedia.org/T281357 (10Peachey88) [21:49:20] !log legoktm@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [21:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:43] !log legoktm@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [21:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:43] 10SRE, 10Jade, 10TechCom, 10Epic, and 3 others: Deploy pilot of Jade to a small set of wikis. 
- https://phabricator.wikimedia.org/T183381 (10Ladsgroup) 05Resolved→03Declined It's not deployed to production and it's going to be undeployed from beta too {T281418} [21:51:07] (03PS1) 10RobH: updating cp501[3456] role [puppet] - 10https://gerrit.wikimedia.org/r/683438 (https://phabricator.wikimedia.org/T278182) [21:51:19] (03PS2) 10RobH: updating cp501[3456] role [puppet] - 10https://gerrit.wikimedia.org/r/683438 (https://phabricator.wikimedia.org/T278182) [21:52:11] (03CR) 10RobH: [C: 03+2] updating cp501[3456] role [puppet] - 10https://gerrit.wikimedia.org/r/683438 (https://phabricator.wikimedia.org/T278182) (owner: 10RobH) [21:53:07] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cp5013.eqsin.wmnet ` The log can be found in `/var/log/wmf-aut... [22:02:49] 10SRE, 10Services, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm) [22:04:11] (03PS24) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [22:04:13] (03CR) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [22:05:01] 10SRE, 10Services, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Dzahn) [22:08:05] 10SRE, 10Services, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm) [22:10:29] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 
(https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [22:10:44] RECOVERY - MegaRAID on an-worker1100 is OK: OK: optimal, 23 logical, 23 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:15:57] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [22:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:08] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [22:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:18] (03PS25) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [22:18:24] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:18:28] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` [22:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:37] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [22:23:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:05] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:24:10] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [22:26:36] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:26:44] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` [22:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:57] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [22:29:57] 10SRE, 10Security-Team: Request to Join Security Mailing List - https://phabricator.wikimedia.org/T281357 (10Dzahn) Not anymore, it used to be that way but now security@wikimedia.org is a Google inbox. (since T243446) So this isn't really an SRE thing anymore. [22:33:29] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) [22:33:59] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) Welcome to SRE! Confirmed your Google inbox / email exists now. This should unblock the next steps. 
[22:34:21] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) [22:34:41] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5013.eqsin.wmnet'] ` and were **ALL** successful. [22:38:02] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) [22:38:28] mutante: ^ thanks :) [22:38:45] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) - added permissions to Google 'Ops vendor maintenance & contracts' calendar - added to Google 'maint-announce' shared mail inbox [22:39:06] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/wikisul.lists.wikimedia.org/ Please create an account https://lists.wikimedia.org/accounts/signup/?... [22:39:23] sukhe: ah! np. and 'welcome back' to you [22:39:31] :D [22:42:04] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) - added to Phabricator "WMF-NDA-requests" (not same as WMF-NDA): https://phabricator.wikimedia.org/project/members/974/ [22:44:30] 10SRE, 10Services, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Dzahn) [22:44:33] 10SRE, 10MW-on-K8s, 10Shellbox, 10serviceops, and 4 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Dzahn) [22:47:34] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Ladsgroup, thanks. Is the list open or closed? 
We wanted to request an open list (public) and a closed list (just for coordinators) if possible. [22:50:22] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) The task description requests a private mailing list. I created a private mailing list. You can request a public one later. [22:51:14] 10SRE, 10ops-eqiad, 10Analytics-Kanban: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10razzi) [22:52:40] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Thank you, I just wasn't sure if that was possible. [22:55:55] 10SRE, 10Services, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Dzahn) A new namespace "shellbox" has been created in environments "staging-codfw" and "staging-eqiad". ` root@deploy1002:~# kube_env admin staging-eqiad root@deploy1002:~# kubectl... [22:56:08] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10razzi) a:05razzi→03Cmjohnson For simplicity, I'll create a new task, and this one can stay resolved. Thanks @Cmjohnson! [22:57:44] mutante: what does it mean to be a member of wmf-nda-requests? [22:58:35] is that an acl* group for controlling who can handle wmf-nda requests? [23:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210428T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:01:33] Krinkle: it's a group of people who are asking to be added to the actual WMF-NDA [23:02:15] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman3 "info" field is small (only 255 characters) - https://phabricator.wikimedia.org/T281426 (10Ladsgroup) https://gitlab.com/mailman/mailman/-/issues/886 [23:02:27] mutante: hm.. ok. that's in addition to the task being tagged with it. is it easy to explain why? (just curious) [23:02:42] it gives access to sign the NDA [23:02:56] comes from https://phabricator.wikimedia.org/T84994 [23:03:52] it's possible it is not making much sense anymore since legal moved to a different system to handle NDAs [23:04:09] It is the ACL group for the L2 or L3 doc right? [23:04:28] * bd808 can't remember which is which [23:04:50] L2 I think [23:06:27] !log dpifke@deploy1002 Started deploy [performance/navtiming@cf8b2e9]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/682886 [23:06:32] !log dpifke@deploy1002 Finished deploy [performance/navtiming@cf8b2e9]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/682886 (duration: 00m 05s) [23:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:54] (03PS26) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [23:08:43] bd808: yes, confirmed. there is a custom policy on L2 that lets members of that group see it [23:09:17] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cp5014.eqsin.wmnet', 'cp5015.eqsin.wmnet', 'cp5016.eqsin.wmn... [23:09:28] Krinkle: what he said, this lets them even view L2. 
the other question is if L2 still means anything because the "actual NDA" is handled not in legalpad [23:09:42] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) [23:09:51] and then there is always that "you can expect it to be included in hiring nowadays" [23:10:05] but then.. how would I know from that ticket if they were hired [23:10:38] mutante: bd808: thx, I wish there was a "what links here" in phab to see that kind of use [23:10:42] makes sense :) [23:11:29] also "why cant it be public info what is inside L2" - dont know [23:11:36] I feel like the Legal team changed their mind about using phab as the repo of signatures about a month after Chase got it all working for them [23:11:42] this [23:13:08] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [23:18:39] (03CR) 10Mstyles: rdf-streaming-updater: create helmfile.d structure (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [23:20:00] (03CR) 10Mstyles: rdf-streaming-updater: use session mode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [23:23:56] (03PS1) 10Bstorm: cloudstore: Collapse drbd vs symlinks into one profile [puppet] - 10https://gerrit.wikimedia.org/r/683445 (https://phabricator.wikimedia.org/T224747) [23:25:08] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: Collapse drbd vs symlinks into one profile [puppet] - 10https://gerrit.wikimedia.org/r/683445 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [23:28:15] (03PS2) 10Bstorm: cloudstore: Collapse drbd vs symlinks into one profile [puppet] - 10https://gerrit.wikimedia.org/r/683445 
(https://phabricator.wikimedia.org/T224747) [23:29:32] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: Collapse drbd vs symlinks into one profile [puppet] - 10https://gerrit.wikimedia.org/r/683445 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [23:31:22] (03PS3) 10Bstorm: cloudstore: Collapse drbd vs symlinks into one profile [puppet] - 10https://gerrit.wikimedia.org/r/683445 (https://phabricator.wikimedia.org/T224747) [23:32:00] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [23:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:16] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [23:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:02] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [23:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:15] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (8) node(s) change every puppet run: cp5015.eqsin.wmnet, wdqs1010.eqiad.wmnet, maps1009.eqiad.wmnet, wdqs1013.eqiad.wmnet, cp5016.eqsin.wmnet, cp5014.eqsin.wmnet, webperf1001.eqiad.wmnet, db1115.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [23:36:07] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [23:36:12] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [23:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:04] (03PS4) 10Bstorm: cloudstore: Collapse drbd vs symlinks into one 
profile [puppet] - 10https://gerrit.wikimedia.org/r/683445 (https://phabricator.wikimedia.org/T224747) [23:38:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [23:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:10] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:27] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.096 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:45:05] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/29268/" [puppet] - 10https://gerrit.wikimedia.org/r/683445 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [23:47:15] (03PS5) 10Bstorm: cloudstore: Collapse drbd vs symlinks into one profile [puppet] - 10https://gerrit.wikimedia.org/r/683445 (https://phabricator.wikimedia.org/T224747) [23:52:29] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5014.eqsin.wmnet', 'cp5015.eqsin.wmnet', 'cp5016.eqsin.wmnet'] ` and were **ALL** successful. [23:52:43] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Hi. I think there is an error in the list, or I am not sure exactly how to manage it. I received a pending approval notice by email, but in the email, under "Pending Approval",... [23:52:46] 10SRE, 10Wikimedia-Mailing-lists: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10The_Living_love) Thank you Ladsgroup for the update. Yes, exactly, Wikimedia-ha should be the name. 
We are not registered yet but are working to register as a legal entity in Nigeria. [23:54:16] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) These are all set to staged with the insetup_noferm role applied. [23:54:39] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) 05Open→03Resolved