[00:06:24] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
[00:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1315.eqiad.wmnet
[00:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single
[00:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[00:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single
[00:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
[00:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:07] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
[00:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:38] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1339.eqiad.wmnet
[00:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:52] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1340.eqiad.wmnet
[00:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:13] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1341.eqiad.wmnet
[00:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:07] (PS1) Catrope: Enable static maps on testwiki, disable on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/620810
[00:39:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single
[00:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single
[00:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[00:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single
[00:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[00:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:26] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
[01:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1344.eqiad.wmnet
[01:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:31] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1343.eqiad.wmnet
[01:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:23] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1376.eqiad.wmnet
[01:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:51] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.5 [core] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/620813
[03:29:46] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 57 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:35:42] RECOVERY - IPv6 ping to eqsin on 
ripe-atlas-eqsin IPv6 is OK: OK - failed 45 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:10:02] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:11:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:32:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1088', diff saved to https://phabricator.wikimedia.org/P12279 and previous config saved to /var/cache/conftool/dbconfig/20200818-043241-marostegui.json
[04:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:40] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.5 [core] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/620813 (https://phabricator.wikimedia.org/T257973) (owner: TrainBranchBot)
[04:43:46] (CR) Marostegui: dbproxy1016,dbproxy1020: Change m3 master [puppet] - https://gerrit.wikimedia.org/r/620664 (https://phabricator.wikimedia.org/T259589) (owner: Marostegui)
[04:43:51] (PS2) Marostegui: dbproxy1016,dbproxy1020: Change m3 master [puppet] - https://gerrit.wikimedia.org/r/620664 (https://phabricator.wikimedia.org/T259589)
[04:46:43] (PS3) Marostegui: dbproxy1016,dbproxy1020: Change m3 master [puppet] - https://gerrit.wikimedia.org/r/620664 (https://phabricator.wikimedia.org/T259589)
[04:48:05] (PS4) Marostegui: dbproxy1016,dbproxy1020: Change m3 master [puppet] - https://gerrit.wikimedia.org/r/620664 (https://phabricator.wikimedia.org/T259589)
[04:48:26] In 10 minutes we'll failover phabricator 
master, so we'll set phabricator in read only for a few minutes
[04:49:41] (CR) Marostegui: [C: +2] dbproxy1016,dbproxy1020: Change m3 master [puppet] - https://gerrit.wikimedia.org/r/620664 (https://phabricator.wikimedia.org/T259589) (owner: Marostegui)
[05:00:04] marostegui and twentyafterfour: May I have your attention please! m3 (phabricator) database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T0500)
[05:00:11] !log Failover m3 (phabricator) database master from db1128 to db1132 - T259589
[05:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:15] T259589: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589
[05:00:39] !log phabricator is now read-only
[05:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:27] twentyafterfour: done, you can revert
[05:01:50] !log phabricator read-only ended
[05:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:08] !log phabricator appears to be fully functional
[05:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:11] I can edit
[05:02:47] looks good!
[05:02:54] Operations, DBA, Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (Marostegui)
[05:02:54] twentyafterfour: indeed! we are done! thank you
[05:03:05] marostegui: awesome, that went smoothly
[05:03:12] and you're welcome
[05:03:19] happy to help
[05:04:12] thank you very much
[05:09:13] (PS1) Marostegui: db1128: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/620818 (https://phabricator.wikimedia.org/T260324)
[05:09:39] Operations, DBA, Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (Marostegui) Open→Resolved This was done successfully. 
m3 fully runs Buster and MariaDB 10.4 db1128 will be moved to m5, and that will be tracked at T260324 Thanks @mmodell for hel...
[05:23:03] (CR) Marostegui: [C: +2] db1128: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/620818 (https://phabricator.wikimedia.org/T260324) (owner: Marostegui)
[05:23:38] (PS1) Marostegui: db1128: Specify db1128 shard [puppet] - https://gerrit.wikimedia.org/r/620820
[05:25:02] Operations, serviceops: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (Joe) Since no analysis on the incoming keys was done, there is no way to know what the problem was. I'll uncordon mc1020, and monitor the situation, but I assume there isn't much else to do at this point.
[05:25:21] (CR) Marostegui: [C: +2] db1128: Specify db1128 shard [puppet] - https://gerrit.wikimedia.org/r/620820 (owner: Marostegui)
[05:29:56] (PS1) Marostegui: site.pp: Specify the correct role for db1132 [puppet] - https://gerrit.wikimedia.org/r/620821 (https://phabricator.wikimedia.org/T259589)
[05:32:18] (PS2) Marostegui: site.pp: Specify the correct role for db1132 [puppet] - https://gerrit.wikimedia.org/r/620821 (https://phabricator.wikimedia.org/T259589)
[05:35:28] (CR) Marostegui: [C: +2] site.pp: Specify the correct role for db1132 [puppet] - https://gerrit.wikimedia.org/r/620821 (https://phabricator.wikimedia.org/T259589) (owner: Marostegui)
[05:52:22] <_joe_> !log running puppet on mc1020 T260622
[05:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:26] T260622: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622
[05:52:55] (PS1) 20after4: Phabricator: use a separate mysql user for phd daemons [puppet] - https://gerrit.wikimedia.org/r/620822 (https://phabricator.wikimedia.org/T146055)
[06:01:33] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single
[06:01:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:30] RECOVERY - Memcached on mc1020 is OK: TCP OK - 0.000 second response time on 10.64.0.81 port 11211 https://wikitech.wikimedia.org/wiki/Memcached
[06:06:04] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[06:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:32] (PS2) 20after4: Phabricator: use a separate mysql user for phd daemons [puppet] - https://gerrit.wikimedia.org/r/620822 (https://phabricator.wikimedia.org/T146055)
[06:19:36] (CR) Jcrespo: [C: +2] Phabricator: use a separate mysql user for phd daemons [puppet] - https://gerrit.wikimedia.org/r/620822 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[06:21:13] !log deploy password change to phabricator service T146055
[06:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:16] T146055: Improve privilege separation for phabricator's config files and mysql credentials - https://phabricator.wikimedia.org/T146055
[06:41:19] !log add cloudflare PNI IPs in eqiad - T259036
[06:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:02] in 5 minutes we will do maintenance on phabricator, some minutes of unavailability could happen
[06:45:43] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:48:39] !log deploy another password change to phabricator service (potentially disruptive) T250361
[06:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:42] T250361: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361
[06:48:47] some phab errors could happen
[06:49:26] consider saving comments you are writing/posts you are creating
[06:49:40] will ping when finished
[06:52:57] ah yeah came here to say phab error
[06:53:06] ...and its back! 
[06:53:13] ah indeed!! :)
[06:53:29] phaaaaaaaaabulous
[06:53:48] !log phabricator maintenance successful
[06:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:03] (PS2) Giuseppe Lavagetto: sre.hosts.reboot-cluster: set the servers to state=inactive [cookbooks] - https://gerrit.wikimedia.org/r/620742
[06:55:05] (PS1) Giuseppe Lavagetto: sre.hosts.reboot-cluster: change icinga bypass logic [cookbooks] - https://gerrit.wikimedia.org/r/620873
[06:56:01] if you still see some error please report
[06:56:04] (CR) jerkins-bot: [V: -1] sre.hosts.reboot-cluster: change icinga bypass logic [cookbooks] - https://gerrit.wikimedia.org/r/620873 (owner: Giuseppe Lavagetto)
[06:56:20] *some phabricator error
[07:01:30] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[07:01:30] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[07:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:47] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[07:01:48] Operations, ops-eqiad, DC-Ops: (Due By: 2020-07-17) rack/setup/install - https://phabricator.wikimedia.org/T255520 (ayounsi) asw2-c-eqiad:ge-5/0/36 has been flapping a lot. I disabled the port feel free to re-enable it when you're working on it.
[07:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:53] Operations, DBA, Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (jcrespo)
[07:05:18] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Jan Eissfeldt (jeissfeldt) - https://phabricator.wikimedia.org/T260555 (fgiunchedi) @drochford I'm still having troubles locating that user in wikitech, the user page mentions that wikitech username isn't registered: https://wikitech.wiki...
[07:07:28] !log prometheus eqiad: add 100G to prometheus/global
[07:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:06] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[07:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:55] (CR) Filippo Giunchedi: [C: +2] hieradata: limit queries to Thanos sidecar / Prometheus to last 15d [puppet] - https://gerrit.wikimedia.org/r/620656 (https://phabricator.wikimedia.org/T260241) (owner: Filippo Giunchedi)
[07:11:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P12281 and previous config saved to /var/cache/conftool/dbconfig/20200818-071121-marostegui.json
[07:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:41] !log update rest of phabricator passwords T250361
[07:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:44] T250361: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361
[07:19:31] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: name=mw213[5-9].codfw.wmnet
[07:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:42] addshore: o/ would you be able to add me to the 
operations-software-wmfmariadbpy gerrit group? (https://gerrit.wikimedia.org/r/admin/groups/f948dea7f1f871e879aacb863838a9bcf4e17793)
[07:20:25] Operations, DBA, Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (jcrespo)
[07:20:41] Operations, fundraising-tech-ops, netops: Automate diff and commit of frack ACL - https://phabricator.wikimedia.org/T260655 (ayounsi) p:Triage→Low
[07:22:55] PROBLEM - mediawiki-installation DSH group on mw2137 is CRITICAL: Host mw2137 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:23:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P12282 and previous config saved to /var/cache/conftool/dbconfig/20200818-072349-marostegui.json
[07:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:26] Operations, Packaging, serviceops, Platform Team Initiatives (Session Management Service (CDP2)): Need help to create and deploy Debian-packaged Python 3 app - https://phabricator.wikimedia.org/T229980 (jijiki) Open→Resolved a:jijiki Closing this task due to inactivity, please reopen...
[07:24:42] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/620742 (owner: Giuseppe Lavagetto)
[07:32:51] Operations, observability, Patch-For-Review: Grafana/Thanos serves 503s for long-time-window requests - https://phabricator.wikimedia.org/T260241 (fgiunchedi) With the latest change I'm able to query data for the last 90d via Thanos on the host dashboard! Performance still could be better, and for th...
[07:33:49] (CR) Volans: [C: +1] "Code looks good, it's really not ideal that we have to accept this risk of repooling with potential failures because of Icinga latency." 
(2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/620873 (owner: Giuseppe Lavagetto)
[07:33:51] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi Telia PWIC114315 - The acknowledgement expires at: 2020-08-18 11:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:52] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi Telia PWIC114315 - The acknowledgement expires at: 2020-08-18 11:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:33] (PS2) Giuseppe Lavagetto: sre.hosts.reboot-cluster: change icinga bypass logic [cookbooks] - https://gerrit.wikimedia.org/r/620873
[07:35:33] (CR) jerkins-bot: [V: -1] sre.hosts.reboot-cluster: change icinga bypass logic [cookbooks] - https://gerrit.wikimedia.org/r/620873 (owner: Giuseppe Lavagetto)
[07:38:27] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[07:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:45] Operations, DBA, Parsoid, serviceops, Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (Kormat) Hi, i've created the new grants. Please test and let me know if there are any issues. Cheers.
[07:42:48] <_joe_> !log performing rolling reboot of all codfw api servers
[07:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P12283 and previous config saved to /var/cache/conftool/dbconfig/20200818-074325-marostegui.json
[07:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:09] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1)
[07:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:19] !log VictorOps ack'd incidents will re-trigger after 24h if not resolved - T259465
[07:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:22] T259465: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465
[07:48:01] (PS8) Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033)
[07:48:03] (PS1) Jcrespo: phabricator: Add new db user for the daemon, separate from web request user [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055)
[07:48:15] (PS2) Jcrespo: phabricator: Add new db user for the daemon, separate from web request user [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055)
[07:48:20] (CR) Kormat: [C: +1] mariadb: Allow the installation of clouddb hosts [puppet] - https://gerrit.wikimedia.org/r/620529 (https://phabricator.wikimedia.org/T260441) (owner: Marostegui)
[07:49:00] (PS2) Marostegui: mariadb: Allow the installation of clouddb hosts [puppet] - https://gerrit.wikimedia.org/r/620529 (https://phabricator.wikimedia.org/T260441)
[07:50:17] (PS3) Jcrespo: phabricator: Add new db user for the daemon, separate from web 
request user [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055)
[07:50:23] (CR) Marostegui: [C: +2] mariadb: Allow the installation of clouddb hosts [puppet] - https://gerrit.wikimedia.org/r/620529 (https://phabricator.wikimedia.org/T260441) (owner: Marostegui)
[07:50:27] (PS4) Jcrespo: phabricator: Add new db user for the daemon, separate from web request user [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055)
[07:50:29] Operations, observability, User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (fgiunchedi) Open→Stalled The change is active now for the 'wikimedia' organization, stalling the task while waiting to see how this pans out!
[07:52:03] (CR) Giuseppe Lavagetto: sre.hosts.reboot-cluster: change icinga bypass logic (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/620873 (owner: Giuseppe Lavagetto)
[07:53:09] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[07:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:52] (CR) Giuseppe Lavagetto: sre.hosts.reboot-cluster: change icinga bypass logic (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/620873 (owner: Giuseppe Lavagetto)
[07:58:50] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[07:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:23] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Marostegui) I have merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529, so the new hosts will get in...
[08:01:44] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Marostegui)
[08:02:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1089', diff saved to https://phabricator.wikimedia.org/P12284 and previous config saved to /var/cache/conftool/dbconfig/20200818-080256-marostegui.json
[08:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:02] (CR) JMeybohm: [C: -1] Modify api-gateway access logging to conform to schema (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: Ppchelko)
[08:06:57] (PS6) Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web [puppet] - https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948)
[08:06:59] (PS5) Filippo Giunchedi: Enable profile::alertmanager::web on alerting_host [puppet] - https://gerrit.wikimedia.org/r/620727 (https://phabricator.wikimedia.org/T258948)
[08:07:44] (CR) Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web (1 comment) [puppet] - https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[08:09:25] (PS3) Giuseppe Lavagetto: sre.hosts.reboot-cluster: change icinga bypass logic [cookbooks] - https://gerrit.wikimedia.org/r/620873
[08:09:27] (CR) JMeybohm: [C: -1] ratelimit: crash on startup if config is invalid (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/620068 (owner: Ppchelko)
[08:10:38] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[08:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:43] (CR) Filippo Giunchedi: "Thanks John for the feedback, all comments should be addressed now" [puppet] - 
https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[08:11:10] <_joe_> jouncebot: next
[08:11:10] In 2 hour(s) and 48 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1100)
[08:15:11] (PS1) Kormat: mariadb: Group m5 testreduce grants together [puppet] - https://gerrit.wikimedia.org/r/620880
[08:15:47] (PS2) Kormat: mariadb: Group m5 testreduce grants together [puppet] - https://gerrit.wikimedia.org/r/620880
[08:15:56] (CR) Filippo Giunchedi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/619739 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[08:16:18] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[08:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:08] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[08:17:08] (PS3) Kormat: mariadb: Group m5 testreduce grants together [puppet] - https://gerrit.wikimedia.org/r/620880
[08:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:13] (PS1) Marostegui: mariadb: Allow install new es hosts in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/620881 (https://phabricator.wikimedia.org/T260373)
[08:17:51] (CR) Jcrespo: "Not too concerned about this, but note that grouping by host will make per-service change harder and grouping by service will make host ch" [puppet] - https://gerrit.wikimedia.org/r/620880 (owner: Kormat)
[08:17:54] (CR) Giuseppe Lavagetto: [C: +2] sre.hosts.reboot-cluster: set the servers to state=inactive [cookbooks] - https://gerrit.wikimedia.org/r/620742 (owner: Giuseppe Lavagetto)
[08:18:57] (Merged) jenkins-bot: sre.hosts.reboot-cluster: set the servers to state=inactive [cookbooks] - https://gerrit.wikimedia.org/r/620742 (owner: Giuseppe Lavagetto)
[08:20:37] (CR) Kormat: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/620880 (owner: Kormat)
[08:21:22] (PS2) Kormat: mariadb: Fix referencing of wrong variable. [puppet] - https://gerrit.wikimedia.org/r/620699
[08:21:50] (CR) Marostegui: [C: -1] phabricator: Add new db user for the daemon, separate from web request user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055) (owner: Jcrespo)
[08:22:40] (PS1) Volans: icinga: fix bug for recheck_all_services() [software/spicerack] - https://gerrit.wikimedia.org/r/620882
[08:22:52] (CR) Marostegui: [C: +1] mariadb: Fix referencing of wrong variable. [puppet] - https://gerrit.wikimedia.org/r/620699 (owner: Kormat)
[08:23:19] (PS1) Ayounsi: Add alert[12]001 to existing icinga ACL terms [homer/public] - https://gerrit.wikimedia.org/r/620883
[08:23:21] (CR) Giuseppe Lavagetto: [C: +1] icinga: fix bug for recheck_all_services() [software/spicerack] - https://gerrit.wikimedia.org/r/620882 (owner: Volans)
[08:23:52] RECOVERY - mediawiki-installation DSH group on mw2137 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[08:24:03] !log oblivian@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
[08:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:45] (PS1) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: I4ef592c4e61a3aa55ba5f6a8ba060c935644a52c [mediawiki-config] - https://gerrit.wikimedia.org/r/620884
[08:27:47] (PS1) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: Ib7b64451e6800712cd04698801c5f2f77e3d160b [mediawiki-config] - https://gerrit.wikimedia.org/r/620885
[08:27:49] (PS1) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: 
Id31faad923e4cabddf918fd42822fbe757b2daac [mediawiki-config] - https://gerrit.wikimedia.org/r/620886
[08:27:51] (PS1) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: I4e391d89c9db64de27d7f8f9be381a8aefcbae3e [mediawiki-config] - https://gerrit.wikimedia.org/r/620887
[08:27:53] (PS1) Evrifaessa: Add Wikisource wordmark for trwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/620888 (https://phabricator.wikimedia.org/T260658)
[08:27:55] (CR) Volans: [C: +2] icinga: fix bug for recheck_all_services() [software/spicerack] - https://gerrit.wikimedia.org/r/620882 (owner: Volans)
[08:28:40] (CR) Kormat: [C: +2] mariadb: Fix referencing of wrong variable. [puppet] - https://gerrit.wikimedia.org/r/620699 (owner: Kormat)
[08:29:22] (CR) Jcrespo: ":-)" [puppet] - https://gerrit.wikimedia.org/r/620880 (owner: Kormat)
[08:30:19] can someone get jenkins-bot to verify this, please? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/620888/
[08:30:39] (Merged) jenkins-bot: icinga: fix bug for recheck_all_services() [software/spicerack] - https://gerrit.wikimedia.org/r/620882 (owner: Volans)
[08:34:21] (PS5) Jcrespo: phabricator: Add new db user for the daemon, separate from web request user [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055)
[08:34:56] (CR) Marostegui: [C: +1] mariadb: Group m5 testreduce grants together [puppet] - https://gerrit.wikimedia.org/r/620880 (owner: Kormat)
[08:35:20] (CR) Jcrespo: "Done?" 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055) (owner: Jcrespo)
[08:35:27] (PS1) JMeybohm: Update patch: Detect kubeconfig as known argument in plugin invocations [debs/helm] - https://gerrit.wikimedia.org/r/620890
[08:36:04] (CR) Kormat: [C: +2] mariadb: Group m5 testreduce grants together [puppet] - https://gerrit.wikimedia.org/r/620880 (owner: Kormat)
[08:36:38] (CR) Marostegui: [C: +1] phabricator: Add new db user for the daemon, separate from web request user [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055) (owner: Jcrespo)
[08:36:49] (PS6) Jcrespo: phabricator: Add new db user for the daemon, separate from web request user [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055)
[08:37:03] Operations, DBA, Parsoid, serviceops, and 2 others: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (Kormat)
[08:44:10] (CR) Jcrespo: [C: +2] phabricator: Add new db user for the daemon, separate from web request user [puppet] - https://gerrit.wikimedia.org/r/620879 (https://phabricator.wikimedia.org/T146055) (owner: Jcrespo)
[08:44:13] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.39 [software/spicerack] - https://gerrit.wikimedia.org/r/620892
[08:44:32] !log restart ats-tls on cp5006
[08:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:35] Operations, SRE-tools, serviceops: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (Joe)
[08:44:44] Operations, SRE-tools, serviceops: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (Joe) p:Triage→Medium
[08:44:51] (PS2) Volans: CHANGELOG: add changelogs for release v0.0.39 [software/spicerack] - 
10https://gerrit.wikimedia.org/r/620892 [08:45:29] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.39 [software/spicerack] - 10https://gerrit.wikimedia.org/r/620892 (owner: 10Volans) [08:46:00] (03PS1) 10Kormat: mariadb: Add m5 testreduce grants for testreduce1001 [puppet] - 10https://gerrit.wikimedia.org/r/620893 (https://phabricator.wikimedia.org/T260627) [08:46:55] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5006 is OK: HTTP OK: HTTP/1.1 200 Ok - 31945 bytes in 1.213 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:46:58] (03PS1) 10Filippo Giunchedi: pontoon: stack-specific hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/620894 [08:47:22] 10Operations, 10DBA, 10Phabricator, 10Patch-For-Review: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10jcrespo) 05Open→03Resolved a:03jcrespo [08:47:31] RECOVERY - Ensure traffic_server is running for instance tls on cp5006 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:47:51] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.39 [software/spicerack] - 10https://gerrit.wikimedia.org/r/620892 (owner: 10Volans) [08:47:57] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5006 is OK: HTTP OK: HTTP/1.0 200 OK - 23368 bytes in 0.738 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:49:47] (03PS1) 10Volans: Upstream release v0.0.39 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/620895 [08:51:16] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10Joe) [08:53:43] (03CR) 10Filippo Giunchedi: [C: 03+1] 
"LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/620883 (owner: 10Ayounsi) [08:54:05] (03PS2) 10Volans: Upstream release v0.0.39 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/620895 [08:55:12] (03CR) 10Ayounsi: [C: 03+2] Add alert[12]001 to existing icinga ACL terms [homer/public] - 10https://gerrit.wikimedia.org/r/620883 (owner: 10Ayounsi) [08:55:18] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook for applying an apache config change safely - https://phabricator.wikimedia.org/T260664 (10Joe) [08:55:36] (03Merged) 10jenkins-bot: Add alert[12]001 to existing icinga ACL terms [homer/public] - 10https://gerrit.wikimedia.org/r/620883 (owner: 10Ayounsi) [08:56:12] PROBLEM - mediawiki-installation DSH group on mw2142 is CRITICAL: Host mw2142 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:56:40] (03CR) 10Kormat: [C: 03+1] "This is awesome :)" [puppet] - 10https://gerrit.wikimedia.org/r/620894 (owner: 10Filippo Giunchedi) [08:56:43] 10Operations, 10DBA, 10Sustainability (Incident Followup): Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10jcrespo) [08:57:13] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.39 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/620895 (owner: 10Volans) [08:58:23] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666 (10Joe) p:05Triage→03Medium [08:59:51] (03Merged) 10jenkins-bot: Upstream release v0.0.39 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/620895 (owner: 10Volans) [09:00:03] PROBLEM - mediawiki-installation DSH group on mw2140 is CRITICAL: Host mw2140 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:00:07] (03PS2) 10KartikMistry: Update cxserver to 2020-08-17-090424-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/620692 (https://phabricator.wikimedia.org/T259980) [09:01:39] 10Operations, 10netops, 10observability, 10Patch-For-Review: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10ayounsi) 05Open→03Resolved a:03ayounsi Deployed! [09:05:30] !log Restarting CI Jenkins [09:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:49] _joe_: mw2250.codfw.wmnet is complaining, is it your guinea ping ? [09:06:50] pig* [09:06:53] *sigh* [09:07:09] <_joe_> mw2140 you mean? [09:07:31] you set mw2250 as inactive yesterday [09:07:35] <_joe_> yes [09:07:44] <_joe_> because it had multiple cases of memory issues [09:07:48] <_joe_> see phab history [09:07:56] <_joe_> probably should just be decommed [09:08:10] <_joe_> I just rebooted it so it lost the ack on the dsh alert [09:08:30] I plan to deploy cxserver in a few minutes. Anything going on deploy1001 or should I postpone? [09:08:53] I only found the RAID problems phab tasks [09:09:08] unless phab search is trolling me [09:11:19] the other two should be decommed anyway [09:11:51] (03CR) 10Marostegui: mariadb: Setup section->port assignment on puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [09:12:34] 10Operations, 10netops, 10observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10fgiunchedi) 05Resolved→03Open Thanks @ayounsi ! Reopening since we'll need to add these hosts to pfw devices as well, cc @Jgreen and @Dwisehaupt could you help with that ? Thanks! [09:12:45] _joe_: can you show me the task just to add it in the ack message?
[09:13:04] (03CR) 10Marostegui: [C: 03+1] mariadb: Add m5 testreduce grants for testreduce1001 [puppet] - 10https://gerrit.wikimedia.org/r/620893 (https://phabricator.wikimedia.org/T260627) (owner: 10Kormat) [09:13:28] (03CR) 10Kormat: [C: 03+2] mariadb: Add m5 testreduce grants for testreduce1001 [puppet] - 10https://gerrit.wikimedia.org/r/620893 (https://phabricator.wikimedia.org/T260627) (owner: 10Kormat) [09:13:59] 10Operations, 10fundraising-tech-ops, 10netops, 10observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10fgiunchedi) a:05ayounsi→03None [09:14:44] <_joe_> effie: https://phabricator.wikimedia.org/T227547 [09:15:01] but that is Jul 31 2019, [09:15:22] (03PS9) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) [09:15:39] ok, thanks [09:16:37] <_joe_> jouncebot: next [09:16:37] In 1 hour(s) and 43 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1100) [09:16:39] <_joe_> sigh [09:16:59] (03CR) 10Jcrespo: mariadb: Setup section->port assignment on puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [09:17:59] I'll just go ahead and update cxserver. Small change. [09:18:21] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-08-17-090424-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/620692 (https://phabricator.wikimedia.org/T259980) (owner: 10KartikMistry) [09:19:34] (03Merged) 10jenkins-bot: Update cxserver to 2020-08-17-090424-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/620692 (https://phabricator.wikimedia.org/T259980) (owner: 10KartikMistry) [09:19:49] (03CR) 10Marostegui: "thanks for fixing, let's get a PCC?" 
[puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [09:20:15] (03Abandoned) 10Reedy: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: I4ef592c4e61a3aa55ba5f6a8ba060c935644a52c [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620884 (owner: 10Evrifaessa) [09:20:17] (03Abandoned) 10Reedy: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: Ib7b64451e6800712cd04698801c5f2f77e3d160b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620885 (owner: 10Evrifaessa) [09:20:19] (03Abandoned) 10Reedy: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: Id31faad923e4cabddf918fd42822fbe757b2daac [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620886 (owner: 10Evrifaessa) [09:20:24] (03Abandoned) 10Reedy: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: I4e391d89c9db64de27d7f8f9be381a8aefcbae3e [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620887 (owner: 10Evrifaessa) [09:20:26] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Jan Eissfeldt (jeissfeldt) - https://phabricator.wikimedia.org/T260555 (10JanWMF) Thanks @fgiunchedi it's https://wikitech.wikimedia.org/wiki/User:JEissfeldt :) [09:20:29] (03PS2) 10Reedy: Add Wikisource wordmark for trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620888 (https://phabricator.wikimedia.org/T260658) (owner: 10Evrifaessa) [09:21:02] !log uploaded spicerack_0.0.39-1+deb10u1 to apt.wikimedia.org buster-wikimedia [09:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:15] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . 
[09:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:07] !log upgraded spicerack to v0.0.39 on cumin hosts [09:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:42] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=codfw,name=mw214[02].* [09:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:57] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [09:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:02] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [09:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:37] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [09:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:59] !log Update cxserver to 2020-08-17-090424-production (T259980) [09:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:02] T259980: Make Google Translate the default for Ukrainian instead of Apertium - https://phabricator.wikimedia.org/T259980 [09:37:25] (03CR) 10Kormat: [C: 03+1] mariadb: Allow install new es hosts in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/620881 (https://phabricator.wikimedia.org/T260373) (owner: 10Marostegui) [09:37:42] (03CR) 10Marostegui: [C: 03+2] mariadb: Allow install new es hosts in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/620881 (https://phabricator.wikimedia.org/T260373) (owner: 10Marostegui) [09:39:10] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) I have merged 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881 so the hosts will get installed with RAID10... [09:39:13] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Marostegui) I have merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881 so the hosts will get installed with RAID10... [09:39:28] (03PS1) 10Jcrespo: mariadb: Apply the list of port to the core::multiinstance class [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) [09:40:15] !log oblivian@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [09:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:38] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) [09:40:47] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=codfw,name=mw214[234].* [09:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:29] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Marostegui) [09:41:42] (03PS2) 10Jcrespo: mariadb: Apply the list of ports to the core::multiinstance class [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) [09:42:23] 10Operations, 10fundraising-tech-ops, 10netops, 10observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10fgiunchedi) @ayounsi looks like mgmt access isn't permitted yet, can't ping `mgmt.eqiad.wmnet` e.g. ` alert1001# ping ps1-c1-eqiad.mgmt.eqiad.wmnet PING ps1-c1-e... 
[09:42:33] (03CR) 10Jcrespo: [C: 04-2] "Proof of concept, only db1090 converted, would not work as is." [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [09:42:42] (03PS1) 10Giuseppe Lavagetto: Revert "sre.hosts.reboot-cluster: set the servers to state=inactive" [cookbooks] - 10https://gerrit.wikimedia.org/r/620684 [09:42:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "sre.hosts.reboot-cluster: set the servers to state=inactive" [cookbooks] - 10https://gerrit.wikimedia.org/r/620684 (owner: 10Giuseppe Lavagetto) [09:43:03] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet [09:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:15] (03CR) 10Jcrespo: [C: 04-2] "It should also fail if the section configured is not on the list of ports." [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [09:44:27] (03Merged) 10jenkins-bot: Revert "sre.hosts.reboot-cluster: set the servers to state=inactive" [cookbooks] - 10https://gerrit.wikimedia.org/r/620684 (owner: 10Giuseppe Lavagetto) [09:45:26] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [09:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:20] 10Operations, 10Gerrit-Privilege-Requests, 10User-Kormat: Request for Gerrit Managers permissions - https://phabricator.wikimedia.org/T260342 (10MarcoAurelio) Gerrit Managers lets you basically create other repos and manage a couple of things. Maybe @Kormat is thinking on Gerrit adminship instead (`ldap/gerr... 
[09:54:27] (03PS3) 10Jcrespo: mariadb: Apply the list of ports to the core::multiinstance class [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) [09:58:29] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1003/24528/" [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [10:01:02] RECOVERY - mediawiki-installation DSH group on mw2140 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:01:51] PROBLEM - Host db2125 is DOWN: PING CRITICAL - Packet loss = 100% [10:03:46] ^ marostegui at meeting, need help? [10:03:58] nope, checking [10:04:29] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Jan Eissfeldt (jeissfeldt) - https://phabricator.wikimedia.org/T260555 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi >>! In T260555#6392465, @JanWMF wrote: > Thanks @fgiunchedi it's https://wikitech.wikimedia.org/wiki/User:JEissfeld... [10:06:43] (03CR) 10Jbond: [C: 03+1] "LGTM missed a minor issue with the user header but can be fixed later" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:06:56] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) [10:07:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2125 - host down T260670', diff saved to https://phabricator.wikimedia.org/P12288 and previous config saved to /var/cache/conftool/dbconfig/20200818-100718-marostegui.json [10:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:22] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [10:07:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:07:52] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:13] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) p:05Triage→03High Setting this to high to make sure the host is back online before 1st of September, which is when the DC switchover is happening [10:08:46] <_joe_> jouncebot: next [10:08:47] In 0 hour(s) and 51 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1100) [10:09:14] <_joe_> uhmmm not 100% sure the reboots will be done by then, but at least now it's easy to pick up where we left [10:09:43] (03PS1) 10Marostegui: db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/620903 (https://phabricator.wikimedia.org/T260670) [10:10:36] (03CR) 10Marostegui: [C: 03+2] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/620903 (https://phabricator.wikimedia.org/T260670) (owner: 10Marostegui) [10:14:53] (03PS7) 10Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) [10:14:55] (03PS6) 10Filippo Giunchedi: Enable profile::alertmanager::web on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/620727 (https://phabricator.wikimedia.org/T258948) [10:14:57] (03CR) 10Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:18:30] RECOVERY - mediawiki-installation DSH group on mw2250 is OK: OK 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:26:54] (03CR) 10Filippo Giunchedi: [C: 03+2] Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:27:04] 10Operations, 10Platform Engineering, 10serviceops: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) p:05Medium→03Triage [10:27:06] (03CR) 10Filippo Giunchedi: [C: 03+2] Enable profile::alertmanager::web on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/620727 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:30:56] 10Operations, 10Platform Engineering, 10serviceops: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) a:03tstarling Assigning to myself since implementation work is underway. [10:33:25] (03PS1) 10Filippo Giunchedi: profile: fix alertmanager apache config [puppet] - 10https://gerrit.wikimedia.org/r/620906 [10:35:56] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: fix alertmanager apache config [puppet] - 10https://gerrit.wikimedia.org/r/620906 (owner: 10Filippo Giunchedi) [10:36:56] (03CR) 10Hashar: [C: 03+2] Branch commit for wmf/1.36.0-wmf.5 [core] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620813 (https://phabricator.wikimedia.org/T257973) (owner: 10TrainBranchBot) [10:38:06] RECOVERY - mediawiki-installation DSH group on mw2142 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:38:44] (03PS1) 10Filippo Giunchedi: idp: add alerts.w.o policy [puppet] - 10https://gerrit.wikimedia.org/r/620907 (https://phabricator.wikimedia.org/T258948) [10:41:52] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) The mgmt interface became responsive again, maybe switch issue? 
@ayounsi could you help checking? [10:43:43] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) These are the HW logs of the host: ` ------------------------------------------------------------------------------- Record: 3 Date/Time: 08... [10:45:44] (03PS3) 10Kormat: Move RemoteExecution library to wmfmariadbpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/619959 (https://phabricator.wikimedia.org/T259516) [10:46:19] !log Powercycle db2125 from the idrac T260670 [10:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:24] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [10:49:15] (03CR) 10Jbond: [C: 03+2] idp: add alerts.w.o policy [puppet] - 10https://gerrit.wikimedia.org/r/620907 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:52:46] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) I cannot see anything on the console, however the idrac says that the host is on, so I have to issued a power cycle, which didn't work and I had to... 
[10:53:06] RECOVERY - Host db2125 is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms [10:54:03] !log Reboot db2125 after running a full upgrade - T260670 [10:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:06] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [10:54:33] <_joe_> I'm going to stop the cluster reboot now, given there is a deploy window [10:55:17] !log oblivian@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [10:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:03] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=codfw,name=mw229.* [10:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:42] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.5 [core] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620813 (https://phabricator.wikimedia.org/T257973) (owner: 10TrainBranchBot) [10:59:16] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ayounsi) >>! In T260670#6392699, @Marostegui wrote: > The mgmt interface became responsive again, maybe switch issue? @ayounsi could you help checking? Those... [10:59:52] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) Thank you! [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1100). [11:00:05] Lucas_WMDE and Evrifaessa: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:32] hello [11:00:39] o/ [11:00:43] (looks like I joined just after the jouncebot ping?) [11:00:44] (03PS7) 10Lucas Werkmeister (WMDE): Enable Data Bridge on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) [11:01:43] yeah lol [11:02:07] ^^ [11:02:19] I’ll deploy my own config change first and then yours :) [11:02:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Data Bridge on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) (owner: 10Lucas Werkmeister (WMDE)) [11:02:41] \o/ [11:02:43] \o/ [11:03:25] (03Merged) 10jenkins-bot: Enable Data Bridge on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) (owner: 10Lucas Werkmeister (WMDE)) [11:03:49] pulled onto mwdebug1001 [11:03:53] testing [11:04:47] oh I need to purge the test page, that’s why it’s not working [11:05:02] hm, still no [11:05:06] (03PS1) 10Volans: debian: update release and versions [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/620908 [11:05:38] hm, the data bridge init module is registered in RL, but not loaded [11:06:20] !log deploy net-snmp update to buster [11:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:20] I don’t get it [11:10:21] Lucas_WMDE: maybe it requires purging RL cache in your system? [11:10:48] how would I do that? [11:12:33] data bridge isn’t in the RLPAGEMODULES, by the looks of it [11:13:11] if I disable X-Wikimedia-Debug, the module is completely gone, as it should be [11:13:31] when I enable X-Wikimedia-Debug, the module is registered, but not loaded by default [11:13:45] is it possible that the purge request doesn’t go to mwdebug1001? 
[11:14:02] so the content / parser cache output / whatever is still built without mwdebug1001, and with data bridge disabled [11:14:11] no, I get XHgui purges with mwdebug [11:14:21] Lucas_WMDE: try with ?debug=1 perhaps? [11:14:37] no change [11:14:39] That should use unminimalized RL modules, and as such, bypass cache [11:14:55] I don’t think the RL cache is the problem [11:15:00] the problem seems to be the list of RL modules in the cached parser output [11:15:07] which is missing the Bridge module, for some reason [11:17:06] oh! [11:17:13] sudo $handler->namespaceChecker->isWikibaseEnabled( NS_USER ) [11:17:14] false [11:17:22] Hehe [11:17:43] grrrr [11:17:49] what does that even mean [11:17:55] “Checks if a namespace in Wikibase Client shall have wikibase links, etc., based on settings” [11:17:55] “etc.” [11:18:20] maybe it’s correct that the User namespace does not get some of the other modules that the BeforePageDisplayHandler adds, but it should still have Data Bridge? [11:18:25] but anyways [11:18:28] let me find a better namespace to test with [11:18:41] NS_PROJECT should work [11:18:51] and I think cawiki has a data bridge project page [11:18:55] so maybe I can move my sandbox there [11:19:21] Lmk if you need my stew powah to clean something up :) [11:20:28] (03PS1) 10Jbond: netmon2001: update netmon librenms to use apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/620910 (https://phabricator.wikimedia.org/T256958) [11:20:30] (03PS1) 10Jbond: netmon - librenms: make apereo cass sso authentication the default [puppet] - 10https://gerrit.wikimedia.org/r/620911 (https://phabricator.wikimedia.org/T256958) [11:22:49] https://www.wikidata.org/wiki/Special:Diff/1259648103 \o/ [11:22:52] correctly tagged as well [11:22:54] syncing [11:24:25] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:595543|Enable Data Bridge on Catalan Wikipedia (T232584)]] (duration: 01m 01s) [11:24:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:29] T232584: Step 1: Production deployment checklist - https://phabricator.wikimedia.org/T232584 [11:24:44] (03CR) 10Ayounsi: "Feel free to merge/test it anytime. Ping me before just in case I'm using it." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/620911 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [11:24:50] (03CR) 10Ayounsi: [C: 03+1] netmon - librenms: make apereo cass sso authentication the default [puppet] - 10https://gerrit.wikimedia.org/r/620911 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [11:24:54] and now it works without mwdebug too, amazing [11:25:16] Cool; [11:25:23] And thanks for the awesome feature! [11:25:59] * Urbanecm cannot wait for it to be everywhere [11:26:33] :) [11:27:11] (03CR) 10Lucas Werkmeister (WMDE): "Uh, Gerrit won’t let me rebase this for some reason?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620888 (https://phabricator.wikimedia.org/T260658) (owner: 10Evrifaessa) [11:27:17] (03PS3) 10Lucas Werkmeister (WMDE): Add Wikisource wordmark for trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620888 (https://phabricator.wikimedia.org/T260658) (owner: 10Evrifaessa) [11:27:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "There we go… 🤷" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620888 (https://phabricator.wikimedia.org/T260658) (owner: 10Evrifaessa) [11:28:20] (03Merged) 10jenkins-bot: Add Wikisource wordmark for trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620888 (https://phabricator.wikimedia.org/T260658) (owner: 10Evrifaessa) [11:29:15] Evrifaessa: new wordmark is on mwdebug101 [11:29:37] works [11:29:47] *1001 :D [11:29:49] ok, syncing [11:29:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:30:16] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:32:10] !log lucaswerkmeister-wmde@deploy1001 Synchronized static/images/mobile/copyright/wikisource-wordmark-tr.svg: Config: [[gerrit:620888|Add Wikisource wordmark for trwikisource (T260658)]], part 1 (duration: 00m 55s) [11:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:14] T260658: Add localized wordmark to trwikisource mobile frontend - https://phabricator.wikimedia.org/T260658 [11:32:26] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf '%s\n' 'https://en.wikipedia.org/static/images/mobile/copyright/wikisource-wordmark-tr.svg' | mwscript purgeList.php # T260658 [11:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:43] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620888|Add Wikisource wordmark for trwikisource (T260658)]], part 2 (duration: 00m 55s) [11:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:04] \o/ [11:34:08] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10jcrespo) Does this host need reprovisioning? [11:34:48] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) Replication came back clean and with no errors on startup (or after mysql_upgrade), so I think we are good [11:36:00] I think that’s all for this deployment window? 
[11:36:39] !log EU backport&config done [11:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:22] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10faidon) Ping? Besides the issues identified by @ayounsi just above, I see that in another comment above @ayounsi mentioned "wipe the switch" but then I saw the sw... [11:40:39] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:54] <_joe_> jouncebot: next [11:42:54] In 4 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1600) [11:53:56] !log add new icinga hosts to mr policies - T260533 [11:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:00] T260533: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 [12:04:55] (03PS1) 10Jbond: profile::idp::client::httpd: retire currently class [puppet] - 10https://gerrit.wikimedia.org/r/620920 [12:06:20] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::client::httpd: retire currently class [puppet] - 10https://gerrit.wikimedia.org/r/620920 (owner: 10Jbond) [12:15:34] (03CR) 10Volans: [C: 03+2] debian: update release and versions [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/620908 (owner: 10Volans) [12:18:15] (03PS1) 10Filippo Giunchedi: hieradata: add alertmanagers variable [puppet] - 10https://gerrit.wikimedia.org/r/620922 (https://phabricator.wikimedia.org/T258948) [12:18:21] (03Merged) 10jenkins-bot: debian: update release and versions [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/620908 (owner: 10Volans) [12:20:25] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24531/prometheus1003.eqiad.wmnet/index.html" [puppet] - 
10https://gerrit.wikimedia.org/r/620922 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [12:21:24] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-cluster [12:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:10] (03PS2) 10Jbond: profile::idp::client::httpd: retire currently class [puppet] - 10https://gerrit.wikimedia.org/r/620920 (https://phabricator.wikimedia.org/T260677) [12:25:45] 10Operations: generate-debdeploy-spec breaks when trying to use the transition feature - https://phabricator.wikimedia.org/T260680 (10Kormat) [12:25:58] 10Operations, 10User-Kormat: generate-debdeploy-spec breaks when trying to use the transition feature - https://phabricator.wikimedia.org/T260680 (10Kormat) [12:34:25] !log deploying wmfmariadbpy 0.4 [12:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:44] (03PS1) 10Jbond: profile::idp::client::httpd: create new httpd class [puppet] - 10https://gerrit.wikimedia.org/r/620923 (https://phabricator.wikimedia.org/T260677) [12:35:46] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::client::httpd: create new httpd class [puppet] - 10https://gerrit.wikimedia.org/r/620923 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [12:36:35] is.. there some problem with puppetdb? `cumin 'db-all-codfw'` is matching nothing [12:36:47] aaagh. 
i keep forgetting to supply A: [12:37:00] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [12:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:00] (03PS1) 10Jbond: profile::puppetboard: convert puppetboard to new idp class [puppet] - 10https://gerrit.wikimedia.org/r/620926 (https://phabricator.wikimedia.org/T260677) [12:42:43] (03PS2) 10Jbond: profile::idp::client::httpd: create new httpd class [puppet] - 10https://gerrit.wikimedia.org/r/620923 (https://phabricator.wikimedia.org/T260677) [12:44:44] (03PS3) 10Jbond: profile::idp::client::httpd: create new httpd class [puppet] - 10https://gerrit.wikimedia.org/r/620923 (https://phabricator.wikimedia.org/T260677) [12:45:06] (03PS2) 10Jbond: profile::puppetboard: convert puppetboard to new idp class [puppet] - 10https://gerrit.wikimedia.org/r/620926 (https://phabricator.wikimedia.org/T260677) [12:46:06] (03PS3) 10Jbond: profile::puppetboard: convert puppetboard to new idp class [puppet] - 10https://gerrit.wikimedia.org/r/620926 (https://phabricator.wikimedia.org/T260677) [12:53:02] (03PS4) 10Jbond: profile::puppetboard: convert puppetboard to new idp class [puppet] - 10https://gerrit.wikimedia.org/r/620926 (https://phabricator.wikimedia.org/T260677) [12:55:33] (03PS2) 10Hashar: doc: switch to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) [12:55:35] (03PS1) 10Hashar: DO NOT MERGE apache tweak for doc.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/620928 [12:56:20] (03PS5) 10Kormat: mariadb: Drop check_mariadb.py in favour of packaged version [puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) [12:56:30] (03PS1) 10Filippo Giunchedi: icinga: make sure update-etcd-mw-config-lastindex is enabled [puppet] - 10https://gerrit.wikimedia.org/r/620929 (https://phabricator.wikimedia.org/T247966) [12:57:02] (03CR) 10jerkins-bot: [V: 04-1] doc: 
switch to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [12:57:04] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE apache tweak for doc.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/620928 (owner: 10Hashar) [12:57:06] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [12:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:33] <_joe_> !log rebooting appservers in eqiad, 3 at a time [12:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:39] (03PS1) 10Marostegui: production-m2.sql: Remove CREATE and DELETE from xhgui [puppet] - 10https://gerrit.wikimedia.org/r/620930 (https://phabricator.wikimedia.org/T260640) [12:59:44] (03PS5) 10Jbond: profile::puppetboard: convert puppetboard to new idp class [puppet] - 10https://gerrit.wikimedia.org/r/620926 (https://phabricator.wikimedia.org/T260677) [13:00:24] (03CR) 10Kormat: [C: 03+2] "Ready to deploy now, packaging no longer pulls in cumin on non-cumin hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/619443 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [13:03:22] (03PS6) 10Jbond: profile::puppetboard: convert puppetboard to new idp class [puppet] - 10https://gerrit.wikimedia.org/r/620926 (https://phabricator.wikimedia.org/T260677) [13:04:20] !log disabling puppet on all db machines T259516 [13:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:23] T259516: DBA python layout - https://phabricator.wikimedia.org/T259516 [13:04:36] (03CR) 10Jbond: [C: 03+2] profile::idp::client::httpd: retire currently class [puppet] - 10https://gerrit.wikimedia.org/r/620920 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [13:04:39] (03CR) 10Jbond: [C: 03+2] profile::idp::client::httpd: create new httpd class [puppet] - 10https://gerrit.wikimedia.org/r/620923 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [13:04:51] (03CR) 10Jbond: [C: 03+2] profile::puppetboard: convert puppetboard to new idp class [puppet] - 10https://gerrit.wikimedia.org/r/620926 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [13:08:05] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi Zayo TTN-0004342017 - The acknowledgement expires at: 2020-08-19 13:07:41. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:08:06] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi Zayo TTN-0004342017 - The acknowledgement expires at: 2020-08-19 13:07:41. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:09:47] (03Abandoned) 10Hashar: Scap: git_fat -> git_binary_manager [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/404222 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [13:13:30] hurm. it's possible i've broken puppet on cumin hosts. checking. [13:14:08] yes, yes i have. fixing. [13:16:15] (03PS1) 10Kormat: mariadb: Update name of wmfmariadbpy package [puppet] - 10https://gerrit.wikimedia.org/r/620931 (https://phabricator.wikimedia.org/T259516) [13:17:12] (03PS1) 10Jbond: profile::puppetboard: drop require_packages [puppet] - 10https://gerrit.wikimedia.org/r/620932 (https://phabricator.wikimedia.org/T260677) [13:17:45] (03CR) 10CDanis: [C: 03+1] mariadb: Update name of wmfmariadbpy package [puppet] - 10https://gerrit.wikimedia.org/r/620931 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [13:18:20] (03CR) 10Kormat: [C: 03+2] mariadb: Update name of wmfmariadbpy package [puppet] - 10https://gerrit.wikimedia.org/r/620931 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [13:19:22] (03CR) 10Jbond: [C: 03+2] profile::puppetboard: drop require_packages [puppet] - 10https://gerrit.wikimedia.org/r/620932 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [13:25:44] (03PS1) 10Jbond: profile::puppetboard: use include instead of require to avoid dep loops [puppet] - 10https://gerrit.wikimedia.org/r/620933 [13:26:53] (03CR) 10Jbond: [C: 03+2] profile::puppetboard: use include instead of require to avoid dep loops [puppet] - 10https://gerrit.wikimedia.org/r/620933 (owner: 10Jbond) [13:27:04] (03CR) 10CDanis: [C: 03+1] "It's probably prudent to force a puppet run on all cps after this happens, otherwise URL-based chashing will probably make things weird fo" [puppet] - 10https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [13:34:04] (03PS1) 10Giuseppe Lavagetto: Test 
deployments with helmfile lint [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 [13:34:06] (03PS1) 10Giuseppe Lavagetto: Switch all charts from "stable" to "wmf-stable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/620935 [13:35:46] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24541/stat1008.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [13:37:36] (03PS1) 10Andrew Bogott: Nova/Neutron: set dhcp_domain and tld to eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614) [13:37:36] !log move v4 cr1-eqiad from peering to transit bgp group [13:37:38] (03PS1) 10Andrew Bogott: designate: stop creating 'legacy' entries (that is, things under wmflabs) [puppet] - 10https://gerrit.wikimedia.org/r/620937 (https://phabricator.wikimedia.org/T260614) [13:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:48] (03CR) 10Andrew Bogott: [C: 04-1] "This will be merged by appointment after proper notice" [puppet] - 10https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [13:38:57] (03CR) 10Andrew Bogott: [C: 04-1] "This will be merged by appointment after proper notice" [puppet] - 10https://gerrit.wikimedia.org/r/620937 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [13:40:56] (03PS1) 10Jbond: profile::hue: migrate hue to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/620938 (https://phabricator.wikimedia.org/T260677) [13:41:03] (03PS1) 10Andrew Bogott: backy2: permit cleanup of images after 3 days [puppet] - 10https://gerrit.wikimedia.org/r/620939 (https://phabricator.wikimedia.org/T259192) [13:41:43] (03CR) 10Andrew Bogott: [C: 03+2] backy2: permit cleanup of images after 3 days [puppet] - 10https://gerrit.wikimedia.org/r/620939 
(https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [13:41:48] (03CR) 10Kormat: [C: 03+1] production-m2.sql: Remove CREATE and DELETE from xhgui [puppet] - 10https://gerrit.wikimedia.org/r/620930 (https://phabricator.wikimedia.org/T260640) (owner: 10Marostegui) [13:42:08] (03PS2) 10Giuseppe Lavagetto: helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 [13:43:27] (03PS2) 10Jbond: profile::hue: migrate hue to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/620938 (https://phabricator.wikimedia.org/T260677) [13:45:40] (03PS18) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [13:46:14] (03CR) 10Ppchelko: Modify api-gateway access logging to conform to schema (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [13:46:41] (03PS3) 10Giuseppe Lavagetto: helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) [13:46:57] (03CR) 10Jbond: [C: 03+2] profile::hue: migrate hue to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/620938 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [13:48:50] (03PS1) 10Jbond: correct resource name [puppet] - 10https://gerrit.wikimedia.org/r/620940 [13:49:37] !log move v4 HE on cr2-eqord from peering to transit bgp group [13:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:20] (03CR) 10Jbond: [C: 03+2] correct resource name [puppet] - 10https://gerrit.wikimedia.org/r/620940 (owner: 10Jbond) [13:50:25] (03PS2) 10Giuseppe Lavagetto: Test deployments with helmfile lint [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) [13:50:27] (03PS2) 10Giuseppe Lavagetto: Switch 
all charts from "stable" to "wmf-stable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/620935 (https://phabricator.wikimedia.org/T258572) [13:50:29] (03PS4) 10Giuseppe Lavagetto: helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) [13:52:02] (03CR) 10Marostegui: [C: 03+2] production-m2.sql: Remove CREATE and DELETE from xhgui [puppet] - 10https://gerrit.wikimedia.org/r/620930 (https://phabricator.wikimedia.org/T260640) (owner: 10Marostegui) [13:53:26] !log bump Zayo v4 BGP session in eqiad [13:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:45] XioNoX: ITYM s/bump/give up on/ [13:54:00] haha :) [13:54:54] !log Revoke DELETE and CREATE from xhgui user on m2 T260640 [13:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:58] T260640: Additional database user for XHGui administration - https://phabricator.wikimedia.org/T260640 [13:55:51] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10wkandek) [13:57:16] (03PS1) 10Ottomata: jupyterhub_config.py.erb - check type before rendering [puppet] - 10https://gerrit.wikimedia.org/r/620942 (https://phabricator.wikimedia.org/T224658) [13:58:01] (03PS4) 10Ppchelko: ratelimit: crash on startup if config is invalid [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 [13:58:12] (03PS5) 10Ppchelko: ratelimit: crash on startup if config is invalid [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 [13:58:22] (03PS1) 10Jbond: turnilo: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620943 (https://phabricator.wikimedia.org/T260677) [13:59:03] (03CR) 10Ppchelko: ratelimit: crash on startup if config is invalid (031 comment) 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 (owner: 10Ppchelko) [13:59:24] (03CR) 10jerkins-bot: [V: 04-1] turnilo: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620943 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [13:59:37] (03PS2) 10Jbond: turnilo: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620943 (https://phabricator.wikimedia.org/T260677) [14:00:35] (03PS3) 10Jbond: turnilo: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620943 (https://phabricator.wikimedia.org/T260677) [14:00:37] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24546/stat1008.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/620942 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:00:40] (03CR) 10jerkins-bot: [V: 04-1] turnilo: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620943 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [14:02:48] (03PS4) 10Jbond: turnilo: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620943 (https://phabricator.wikimedia.org/T260677) [14:03:55] (03CR) 10Jbond: [C: 03+2] turnilo: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620943 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [14:05:09] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/620935 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:06:20] (03PS1) 10Ottomata: Fix typo ni jupyterhub::server [puppet] - 10https://gerrit.wikimedia.org/r/620945 (https://phabricator.wikimedia.org/T224658) [14:08:57] (03PS1) 10Jbond: graphite: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620946 (https://phabricator.wikimedia.org/T260677) [14:08:59] (03PS2) 10Ottomata: Fix typo in 
jupyterhub::server [puppet] - 10https://gerrit.wikimedia.org/r/620945 (https://phabricator.wikimedia.org/T224658) [14:09:29] (03CR) 10Ottomata: [C: 03+2] Fix typo in jupyterhub::server [puppet] - 10https://gerrit.wikimedia.org/r/620945 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:12:35] (03PS2) 10Jbond: graphite: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620946 (https://phabricator.wikimedia.org/T260677) [14:15:42] (03PS1) 10Esanders: Use VE's new Beta Feature preference instead of wgHiddenPrefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620953 (https://phabricator.wikimedia.org/T254349) [14:18:02] (03PS3) 10Jbond: graphite: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620946 (https://phabricator.wikimedia.org/T260677) [14:20:33] (03PS2) 10Filippo Giunchedi: hieradata: add alertmanagers variable [puppet] - 10https://gerrit.wikimedia.org/r/620922 (https://phabricator.wikimedia.org/T258948) [14:20:35] (03PS1) 10Filippo Giunchedi: prometheus: add alertmanager::web jobs [puppet] - 10https://gerrit.wikimedia.org/r/620955 (https://phabricator.wikimedia.org/T258948) [14:20:37] (03PS1) 10Filippo Giunchedi: prometheus: add alerts to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/620956 (https://phabricator.wikimedia.org/T258948) [14:20:39] (03PS1) 10Filippo Giunchedi: prometheus: export icinga service problems as metrics [puppet] - 10https://gerrit.wikimedia.org/r/620957 (https://phabricator.wikimedia.org/T258948) [14:20:41] (03CR) 10Jbond: [C: 03+2] graphite: update to use new apereo cas define [puppet] - 10https://gerrit.wikimedia.org/r/620946 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [14:21:16] (03CR) 10JMeybohm: [C: 04-1] helmfile.d: refactor eventgate-main (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:21:42] (03PS2) 
10Esanders: Prepare for VE's new Beta Feature preference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620953 (https://phabricator.wikimedia.org/T254349) [14:21:44] (03PS1) 10Esanders: Drop wgHiddenPrefs hack for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349) [14:22:41] (03CR) 10Jforrester: [C: 03+1] Prepare for VE's new Beta Feature preference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620953 (https://phabricator.wikimedia.org/T254349) (owner: 10Esanders) [14:22:43] (03PS2) 10Filippo Giunchedi: prometheus: add alertmanager::web jobs [puppet] - 10https://gerrit.wikimedia.org/r/620955 (https://phabricator.wikimedia.org/T258948) [14:23:25] (03PS1) 10Ottomata: jupyterhub-conda.systemd.erb - allow to write into @data_path [puppet] - 10https://gerrit.wikimedia.org/r/620959 (https://phabricator.wikimedia.org/T224658) [14:23:47] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) If it is switch issue we should know today since msw-c1 is set to be replaced today. 
[14:24:08] (03CR) 10Jforrester: "(Not until wmf.7 is everywhere and won't roll back.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349) (owner: 10Esanders) [14:24:23] (03CR) 10Ottomata: [C: 03+2] jupyterhub-conda.systemd.erb - allow to write into @data_path [puppet] - 10https://gerrit.wikimedia.org/r/620959 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:24:54] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add alertmanager::web jobs [puppet] - 10https://gerrit.wikimedia.org/r/620955 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [14:26:14] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) @Papaul there's definitely also something going on with the host, as the CPU errors reported on the HW logs do match the alert time [14:26:49] (03PS1) 10Jbond: analytics_cluster/superset: use updated mod_auth_cas client [puppet] - 10https://gerrit.wikimedia.org/r/620960 (https://phabricator.wikimedia.org/T260677) [14:27:48] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) @Marostegui can you depool it so i can do some maintenance on? 
[14:28:20] !log Stop MYSQL on db2125 for on-site maintenance - T260670 [14:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:24] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [14:28:27] (03PS1) 10Ottomata: jupyterhub-conda - Always set proxy_pid_file to a writable path [puppet] - 10https://gerrit.wikimedia.org/r/620962 (https://phabricator.wikimedia.org/T224658) [14:28:43] (03CR) 10Jbond: [C: 03+2] analytics_cluster/superset: use updated mod_auth_cas client [puppet] - 10https://gerrit.wikimedia.org/r/620960 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [14:28:58] (03CR) 10Ottomata: [C: 03+2] jupyterhub-conda - Always set proxy_pid_file to a writable path [puppet] - 10https://gerrit.wikimedia.org/r/620962 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:29:15] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) @Papaul mysql stopped, you can proceed as needed. Thank you! 
[14:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P12290 and previous config saved to /var/cache/conftool/dbconfig/20200818-142937-marostegui.json [14:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:58] (03CR) 10JMeybohm: [C: 04-1] "All helmfile deployment tests fail currently but jenkins still gave V+2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:32:26] (03CR) 10Andrew Bogott: [C: 03+2] domainproxy: enforce TLS by default [puppet] - 10https://gerrit.wikimedia.org/r/620122 (https://phabricator.wikimedia.org/T120486) (owner: 10BryanDavis) [14:32:40] (03PS1) 10Ottomata: jupterhub_config.py.erb - fix missing comma [puppet] - 10https://gerrit.wikimedia.org/r/620965 (https://phabricator.wikimedia.org/T224658) [14:32:51] <_joe_> jayme: that's sadly expected [14:32:55] (03PS2) 10Ottomata: jupterhub_config.py.erb - fix missing comma [puppet] - 10https://gerrit.wikimedia.org/r/620965 (https://phabricator.wikimedia.org/T224658) [14:32:57] (03CR) 10jerkins-bot: [V: 04-1] jupterhub_config.py.erb - fix missing comma [puppet] - 10https://gerrit.wikimedia.org/r/620965 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:33:00] <_joe_> we had some deployments to fix :) [14:33:31] but jenkins should downvote, no? [14:33:52] <_joe_> no, because we had several unfixed tests [14:33:57] <_joe_> but [14:34:26] <_joe_> with the following patch, only termbox/staging and echostore+sessionstore fail. 
I will fix them soon(TM) [14:34:46] <_joe_> it's stuff that needs things that are in the private repo [14:34:55] (03PS1) 10Jbond: yarn: move ui to new mod_auth_cas define [puppet] - 10https://gerrit.wikimedia.org/r/620986 (https://phabricator.wikimedia.org/T260677) [14:35:23] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [14:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:32] (03CR) 10Ottomata: [C: 03+2] jupterhub_config.py.erb - fix missing comma [puppet] - 10https://gerrit.wikimedia.org/r/620965 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:36:20] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) [14:36:34] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) [14:37:23] (03CR) 10Jbond: [C: 03+2] yarn: move ui to new mod_auth_cas define [puppet] - 10https://gerrit.wikimedia.org/r/620986 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [14:37:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P12291 and previous config saved to /var/cache/conftool/dbconfig/20200818-143758-marostegui.json [14:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:25] (03PS1) 10Ottomata: jupyterhub - make jupyterhub-singleuser-conda-env.sh executable [puppet] - 10https://gerrit.wikimedia.org/r/620988 [14:39:47] (03CR) 10jerkins-bot: [V: 04-1] jupyterhub - make jupyterhub-singleuser-conda-env.sh executable [puppet] - 10https://gerrit.wikimedia.org/r/620988 (owner: 10Ottomata) [14:39:54] (03PS2) 10Ppchelko: api-gateway: jwks.json.yaml -> jwks.json.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/620803 [14:39:57] (03CR) 10Clarakosi: [C: 03+2] api-gateway: jwks.json.yaml -> jwks.json.tpl 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/620803 (owner: 10Ppchelko) [14:40:34] (03PS2) 10Ottomata: jupyterhub - make jupyterhub-singleuser-conda-env.sh executable [puppet] - 10https://gerrit.wikimedia.org/r/620988 (https://phabricator.wikimedia.org/T224658) [14:40:57] _joe_: okay, will revisit then :-) [14:41:12] <_joe_> jayme: actually, let me fix that [14:41:13] (03CR) 10Ottomata: [C: 03+2] jupyterhub - make jupyterhub-singleuser-conda-env.sh executable [puppet] - 10https://gerrit.wikimedia.org/r/620988 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:41:14] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:33] _joe_: sure, take your time :D [14:43:05] (03PS1) 10Jbond: thanos: move ui to new mod_auth_cas define [puppet] - 10https://gerrit.wikimedia.org/r/620989 (https://phabricator.wikimedia.org/T260677) [14:44:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P12292 and previous config saved to /var/cache/conftool/dbconfig/20200818-144415-marostegui.json [14:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:26] (03CR) 10Jbond: [C: 03+2] thanos: move ui to new mod_auth_cas define [puppet] - 10https://gerrit.wikimedia.org/r/620989 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [14:46:59] !log move v4 HE on cr3-ulsfo from peering to transit bgp group [14:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:43] (03PS1) 10Ppchelko: Add jwt and ratelimiter fixtures to gateway for more validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/620992 [14:48:53] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: name=mw13(55|64|65).* [14:48:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:11] (03PS1) 10Ottomata: jupterhub [puppet] - 10https://gerrit.wikimedia.org/r/620993 (https://phabricator.wikimedia.org/T224658) [14:49:39] (03CR) 10jerkins-bot: [V: 04-1] jupterhub [puppet] - 10https://gerrit.wikimedia.org/r/620993 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:49:44] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:50:02] (03PS1) 10Jbond: people: move ui to new mod_auth_cas define [puppet] - 10https://gerrit.wikimedia.org/r/620994 (https://phabricator.wikimedia.org/T260677) [14:50:17] (03PS2) 10Ottomata: juputerhub - Create config files explicitly with proper modes [puppet] - 10https://gerrit.wikimedia.org/r/620993 (https://phabricator.wikimedia.org/T224658) [14:50:44] (03PS3) 10Ottomata: juputerhub - Create config files explicitly with proper modes [puppet] - 10https://gerrit.wikimedia.org/r/620993 (https://phabricator.wikimedia.org/T224658) [14:51:08] (03CR) 10jerkins-bot: [V: 04-1] people: move ui to new mod_auth_cas define [puppet] - 10https://gerrit.wikimedia.org/r/620994 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [14:51:11] (03PS4) 10Ottomata: jupyterhub - Create config files explicitly with proper modes [puppet] - 10https://gerrit.wikimedia.org/r/620993 (https://phabricator.wikimedia.org/T224658) [14:51:52] (03PS2) 10Jbond: people: move ui to new mod_auth_cas define [puppet] - 10https://gerrit.wikimedia.org/r/620994 (https://phabricator.wikimedia.org/T260677) [14:52:00] (03CR) 10jerkins-bot: [V: 04-1] jupyterhub - Create config files explicitly with proper modes [puppet] - 10https://gerrit.wikimedia.org/r/620993 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:52:25] (03PS1) 10Giuseppe Lavagetto: sre.hosts.reboot-cluster: pass a list of hosts to Results [cookbooks] - 10https://gerrit.wikimedia.org/r/620995 [14:52:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre.hosts.reboot-cluster: pass a list of 
hosts to Results [cookbooks] - 10https://gerrit.wikimedia.org/r/620995 (owner: 10Giuseppe Lavagetto) [14:53:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1104', diff saved to https://phabricator.wikimedia.org/P12293 and previous config saved to /var/cache/conftool/dbconfig/20200818-145337-marostegui.json [14:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:41] (03CR) 10Jbond: [C: 03+2] people: move ui to new mod_auth_cas define [puppet] - 10https://gerrit.wikimedia.org/r/620994 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [14:54:35] (03Merged) 10jenkins-bot: sre.hosts.reboot-cluster: pass a list of hosts to Results [cookbooks] - 10https://gerrit.wikimedia.org/r/620995 (owner: 10Giuseppe Lavagetto) [14:54:49] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [14:54:51] (03PS5) 10Ottomata: jupyterhub - Create config files explicitly with proper modes [puppet] - 10https://gerrit.wikimedia.org/r/620993 (https://phabricator.wikimedia.org/T224658) [14:55:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:55:32] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [14:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:07] (03CR) 10Ottomata: [C: 03+2] jupyterhub - Create config files explicitly with proper modes [puppet] - 10https://gerrit.wikimedia.org/r/620993 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:56:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:56:42] !log replacing msw-c1,c2 and c4 [14:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:19] (03PS1) 10Ottomata: jupyterhub - make sure $config_path is created [puppet] - 10https://gerrit.wikimedia.org/r/620999 (https://phabricator.wikimedia.org/T224658) [14:58:51] (03CR) 10Ottomata: [C: 03+2] jupyterhub - make sure $config_path is created [puppet] - 10https://gerrit.wikimedia.org/r/620999 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:59:32] PROBLEM - Host kubernetes2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:59:47] (03PS1) 10Jbond: piwiki: migrate hue to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/621000 (https://phabricator.wikimedia.org/T260677) [15:01:15] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:29] (03PS2) 10Jbond: piwik: migrate hue to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/621000 (https://phabricator.wikimedia.org/T260677) [15:02:39] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [15:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:13] (03CR) 10Jbond: [C: 03+2] piwik: migrate hue to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/621000 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [15:04:57] (03CR) 10Clarakosi: [C: 03+2] api-gateway: jwks.json.yaml -> jwks.json.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/620803 (owner: 10Ppchelko) [15:05:04] RECOVERY - Host kubernetes2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.03 ms [15:05:12] _joe_: I broke it https://phabricator.wikimedia.org/P12294 [15:06:22] <_joe_> jayme: no I did [15:06:34] <_joe_> but I fixed it I 
think [15:06:42] <_joe_> a patch I already submitted [15:06:49] <_joe_> and merged :P [15:07:24] but you did not manage to replace the script in memory...dare you :) [15:07:58] PROBLEM - staging-graphite.wikimedia.org requires authentication on graphite1004 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://staging-graphite.wikimedia.org:443/ - 574 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:08:41] ^^ this is me, will fix presently [15:10:01] (03PS1) 10Jbond: tendril: migrate hue to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/621005 (https://phabricator.wikimedia.org/T260677) [15:12:12] (03CR) 10Jbond: [C: 03+2] tendril: migrate hue to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/621005 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [15:13:04] (03Merged) 10jenkins-bot: api-gateway: jwks.json.yaml -> jwks.json.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/620803 (owner: 10Ppchelko) [15:13:07] <_joe_> jayme: you can restart adding --exclude mw2[200-367].codfw.wmnet [15:13:44] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:05] (03PS1) 10Jbond: graphite: disable monitoring for stagiung site [puppet] - 10https://gerrit.wikimedia.org/r/621006 [15:18:42] (03CR) 10BryanDavis: [C: 03+2] jessie-ssd: Fetch base image from docker-registry.tools.wmflabs.org [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/617288 (owner: 10BryanDavis) [15:18:47] _joe_: mw2[200-360].codfw.wmnet you mean, no? 
As "Hosts in these groups" is mw[2361,2363,2365,2367].codfw.wmnet (printed twice) [15:18:48] (03CR) 10Jbond: [C: 03+2] graphite: disable monitoring for stagiung site [puppet] - 10https://gerrit.wikimedia.org/r/621006 (owner: 10Jbond) [15:19:23] <_joe_> well, those hosts were rebooted, right? [15:19:26] <_joe_> they're probably ok [15:19:33] <_joe_> you just need to repool them [15:19:42] (03Merged) 10jenkins-bot: jessie-ssd: Fetch base image from docker-registry.tools.wmflabs.org [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/617288 (owner: 10BryanDavis) [15:20:04] Ah, well. Sure. Misinterpreted that [15:20:57] (03CR) 10BryanDavis: [C: 03+2] Pywikibot container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) (owner: 10BryanDavis) [15:22:02] (03Merged) 10jenkins-bot: Pywikibot container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) (owner: 10BryanDavis) [15:22:05] (03PS1) 10Jbond: librenms: migrate to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/621007 (https://phabricator.wikimedia.org/T260677) [15:23:29] (03CR) 10Jbond: [C: 03+2] librenms: migrate to new idp define [puppet] - 10https://gerrit.wikimedia.org/r/621007 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [15:24:42] PROBLEM - Check systemd state on backup2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:11] 10Operations, 10ops-codfw, 10DBA: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) a:05Papaul→03Marostegui Drained the power from the server and did FW upgrade Before FW upgrade BIOS Version 2.4.7 iDRAC Firmware Version 3.36.36.36 After FW upgrad... 
[15:25:28] _joe_: but a lot of mw21* hosts are fine as well...I'll just feed the output of the cookbook plus mcrouter back as exclude [15:25:50] RECOVERY - Check systemd state on backup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:49] <_joe_> yeah that's why I was giving you the whole range [15:26:59] <_joe_> --exclude says what NOT to do :) [15:27:31] [18.08.20 17:13] <_joe_> jayme: you can restart adding --exclude mw2[200-367].codfw.wmnet [15:27:38] what am I missing here... [15:27:48] <_joe_> ig fat fingers [15:27:51] <_joe_> *oh [15:28:04] <_joe_> I meant 2[100-367] [15:28:09] ok, happy then :D [15:28:38] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.89 ms [15:29:54] (03PS1) 10Ottomata: jupyterhub - pass conda_env_path to jupyterhub_singleuser_conda_env_script [puppet] - 10https://gerrit.wikimedia.org/r/621009 (https://phabricator.wikimedia.org/T224658) [15:30:30] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:37] (03CR) 10Ottomata: [C: 03+2] jupyterhub - pass conda_env_path to jupyterhub_singleuser_conda_env_script [puppet] - 10https://gerrit.wikimedia.org/r/621009 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:31:00] 10Operations, 10fundraising-tech-ops, 10netops: Automate diff and commit of frack ACL - https://phabricator.wikimedia.org/T260655 (10Jgreen) Is this a process that would be prompted over ssh at the same time as we push the policy to /var/tmp? Or would there be a separate process that watches for a new policy... [15:31:05] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] member ge-1/0/22 { ... 
} + member ge-1/0/2; [edit interfaces interface-range disabled] - member ge-1/0/2; [edit i... [15:31:12] jbond42: ok to puppet-merge? [15:31:21] librenms: migrate to new idp define [15:31:21] ? [15:31:51] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) [15:32:26] jbond42: i hope so! merging. [15:32:42] ottomata: oh sorry missed, yes should be fine thanks [15:33:30] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [15:35:18] PROBLEM - Host elastic2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:19] PROBLEM - Host elastic2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:20] PROBLEM - Host ms-be2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:30] PROBLEM - Host lvs2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:44] PROBLEM - Host ms-be2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:45] PROBLEM - Host ms-be2034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:16] PROBLEM - Host cp2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:17] PROBLEM - Host cp2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:17] C2 issues? 
[15:36:18] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:36:26] PROBLEM - Host cp2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:27] XioNoX, papaul ^^^ [15:36:36] PROBLEM - Host ms-be2055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:38] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [15:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:41] C2 mgmt ofc I meant [15:37:29] PROBLEM - Host dns2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:37:29] (03PS1) 10Ottomata: jupyterhub - don't allow named servers [puppet] - 10https://gerrit.wikimedia.org/r/621011 (https://phabricator.wikimedia.org/T224658) [15:37:56] PROBLEM - Host elastic2045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:01] PROBLEM - Host ms-be2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:01] PROBLEM - Host ms-be2048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:02] PROBLEM - Host ms-fe2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:15] yes, confirmed they are all in C2 [15:38:20] 10Operations, 10Traffic: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (10Vgutierrez) [15:38:21] see https://netbox.wikimedia.org/dcim/racks/60/ [15:38:41] 10Operations, 10Traffic: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (10Vgutierrez) p:05Triage→03Medium [15:38:57] (03CR) 10Ottomata: [C: 03+2] jupyterhub - don't allow named servers [puppet] - 10https://gerrit.wikimedia.org/r/621011 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:39:48] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:38] (03PS1) 10Ottomata: jupyterhub - Subscribe jupyterhub to spawners.py [puppet] 
- 10https://gerrit.wikimedia.org/r/621012 (https://phabricator.wikimedia.org/T224658) [15:41:17] (03CR) 10Ottomata: [C: 03+2] jupyterhub - Subscribe jupyterhub to spawners.py [puppet] - 10https://gerrit.wikimedia.org/r/621012 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:43:16] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.04 ms [15:43:18] RECOVERY - Host dns2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.89 ms [15:43:44] RECOVERY - Host elastic2045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.97 ms [15:43:51] RECOVERY - Host ms-be2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.49 ms [15:43:51] RECOVERY - Host ms-be2048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.91 ms [15:43:52] RECOVERY - Host ms-fe2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [15:44:42] RECOVERY - Host elastic2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.98 ms [15:45:32] (03PS1) 10Jbond: alerting_host: updated to use new apero cas define [puppet] - 10https://gerrit.wikimedia.org/r/621013 (https://phabricator.wikimedia.org/T260677) [15:47:08] RECOVERY - Host elastic2046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.94 ms [15:47:29] volans: yes [15:47:44] RECOVERY - Host ms-be2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.45 ms [15:47:48] RECOVERY - Host cp2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.85 ms [15:47:53] RECOVERY - Host cp2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.07 ms [15:47:53] RECOVERY - Host cp2036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.19 ms [15:48:01] (03PS1) 10Vgutierrez: Add Origin and Description headers for every debian patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621014 (https://phabricator.wikimedia.org/T260702) [15:48:01] RECOVERY - Host ms-be2055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.64 ms [15:48:02] (03PS1) 10Vgutierrez: Remove unused patches [debs/varnish4] (debian-wmf) - 
10https://gerrit.wikimedia.org/r/621015 (https://phabricator.wikimedia.org/T260702) [15:48:37] ah ok, thx [15:48:39] RECOVERY - Host lvs2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.97 ms [15:48:44] RECOVERY - Host ms-be2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [15:48:44] RECOVERY - Host ms-be2034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [15:53:32] (03PS1) 10Volans: cameras: remove leftover record [dns] - 10https://gerrit.wikimedia.org/r/621016 (https://phabricator.wikimedia.org/T207965) [15:54:36] (03CR) 10Volans: [C: 03+2] "Self merging as this record just escaped my previous grep somehow in the related change." [dns] - 10https://gerrit.wikimedia.org/r/621016 (https://phabricator.wikimedia.org/T207965) (owner: 10Volans) [15:54:57] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:55:15] PROBLEM - Host lvs2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:16] (03PS2) 10Jbond: alerting_host: updated to use new apero cas define [puppet] - 10https://gerrit.wikimedia.org/r/621013 (https://phabricator.wikimedia.org/T260677) [15:55:21] PROBLEM - Host ms-be2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:22] PROBLEM - Host ms-be2034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:22] PROBLEM - Host ms-be2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:24] PROBLEM - Host dns2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:24] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:55:31] PROBLEM - Host elastic2045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:42] PROBLEM - Host ms-be2048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:42] PROBLEM - Host ms-be2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:45] PROBLEM - Host ms-fe2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:49] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.68 ms [15:58:52] RECOVERY - Host 
asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [15:59:13] (03PS3) 10Jbond: alerting_host: updated to use new apero cas define [puppet] - 10https://gerrit.wikimedia.org/r/621013 (https://phabricator.wikimedia.org/T260677) [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1600). [16:00:05] Urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:39] RECOVERY - Host lvs2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.10 ms [16:00:45] RECOVERY - Host ms-be2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [16:00:46] RECOVERY - Host ms-be2034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [16:00:46] RECOVERY - Host ms-be2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [16:00:47] RECOVERY - Host dns2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.99 ms [16:00:55] RECOVERY - Host elastic2045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.99 ms [16:01:07] RECOVERY - Host ms-be2048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.95 ms [16:01:08] RECOVERY - Host ms-be2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms [16:01:11] RECOVERY - Host ms-fe2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [16:02:13] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24563/" [puppet] - 10https://gerrit.wikimedia.org/r/621007 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [16:03:15] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:03:49] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [16:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] (03CR) 10jerkins-bot: [V: 04-1] Add 
Origin and Description headers for every debian patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621014 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [16:05:34] (03PS1) 10Jbond: profile::idp::client::httpd_legacy: drop unused class [puppet] - 10https://gerrit.wikimedia.org/r/621017 (https://phabricator.wikimedia.org/T260677) [16:06:06] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/621013 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [16:07:39] (03PS4) 10Jbond: alerting_host: updated to use new apero cas define [puppet] - 10https://gerrit.wikimedia.org/r/621013 (https://phabricator.wikimedia.org/T260677) [16:09:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/539/" [puppet] - 10https://gerrit.wikimedia.org/r/621013 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [16:09:32] (03CR) 10Jbond: [C: 03+2] alerting_host: updated to use new apero cas define [puppet] - 10https://gerrit.wikimedia.org/r/621013 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [16:09:52] (03CR) 10jerkins-bot: [V: 04-1] Remove unused patches [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621015 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [16:12:08] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: name=mw14(09|11|13).* [16:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:13] (03PS2) 10Jbond: profile::idp::client::httpd_legacy: drop unused class [puppet] - 10https://gerrit.wikimedia.org/r/621017 (https://phabricator.wikimedia.org/T260677) [16:16:22] (03CR) 10Jbond: [C: 03+2] profile::idp::client::httpd_legacy: drop unused class [puppet] - 10https://gerrit.wikimedia.org/r/621017 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [16:18:49] (03PS1) 10Jbond: profile::idp::client::httpd: drop old hiera keys [puppet] - 
10https://gerrit.wikimedia.org/r/621019 (https://phabricator.wikimedia.org/T260677) [16:20:28] 10Operations, 10Traffic, 10Patch-For-Review: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (10Vgutierrez) |patch|backport/custom|available on varnish 6.0 |available on varnish 6.4 |can be removed? |0002-exp-thread-realtime.patch| custom... [16:20:46] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10AMooney) p:05Triage→03Medium [16:22:35] (03PS2) 10Vgutierrez: Add Origin and Description headers for every debian patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621014 (https://phabricator.wikimedia.org/T260702) [16:22:37] (03PS2) 10Vgutierrez: Remove unused patches [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621015 (https://phabricator.wikimedia.org/T260702) [16:22:39] (03CR) 10Jbond: [C: 03+2] profile::idp::client::httpd: drop old hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/621019 (https://phabricator.wikimedia.org/T260677) (owner: 10Jbond) [16:26:24] (03PS2) 10Dzahn: ci::jenkins: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/619889 [16:26:46] (03CR) 10jerkins-bot: [V: 04-1] ci::jenkins: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/619889 (owner: 10Dzahn) [16:27:25] (03PS3) 10Dzahn: ci::jenkins: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/619889 [16:27:50] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) [16:28:03] (03PS1) 10Catrope: Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/620967 (https://phabricator.wikimedia.org/T258021) [16:28:19] (03PS1) 10Catrope: Only fetch task 
card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620968 (https://phabricator.wikimedia.org/T258021) [16:35:06] (03PS1) 10Dzahn: profile::ci: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/621021 [16:35:25] (03CR) 10jerkins-bot: [V: 04-1] profile::ci: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/621021 (owner: 10Dzahn) [16:37:22] (03CR) 10jerkins-bot: [V: 04-1] Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620968 (https://phabricator.wikimedia.org/T258021) (owner: 10Catrope) [16:37:50] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24565/" [puppet] - 10https://gerrit.wikimedia.org/r/619889 (owner: 10Dzahn) [16:38:50] (03CR) 10jerkins-bot: [V: 04-1] Remove unused patches [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621015 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [16:40:20] (03PS1) 10Andrew Bogott: Revert "backy2: temporarily hack data dir to /var/lib/nova/instances" [puppet] - 10https://gerrit.wikimedia.org/r/621022 (https://phabricator.wikimedia.org/T260692) [16:40:22] (03PS1) 10Andrew Bogott: wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) [16:40:24] (03PS1) 10Andrew Bogott: backy2: remove some unused hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/621024 (https://phabricator.wikimedia.org/T260692) [16:40:26] (03PS1) 10Andrew Bogott: backy2: hack in a fix to an upstream bug in 'backy2 du' [puppet] - 10https://gerrit.wikimedia.org/r/621025 (https://phabricator.wikimedia.org/T260692) [16:41:01] (03PS2) 10Dzahn: profile::ci: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/621021 [16:41:21] (03CR) 10jerkins-bot: [V: 04-1] backy2: 
hack in a fix to an upstream bug in 'backy2 du' [puppet] - 10https://gerrit.wikimedia.org/r/621025 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [16:41:47] (03CR) 10jerkins-bot: [V: 04-1] backy2: remove some unused hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/621024 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [16:44:01] (03CR) 10jerkins-bot: [V: 04-1] Add Origin and Description headers for every debian patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621014 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [16:44:03] (03PS1) 10Jgreen: nsca_frack.cfg.erb nsca collection for several stray checks [puppet] - 10https://gerrit.wikimedia.org/r/621026 (https://phabricator.wikimedia.org/T260659) [16:44:18] (03PS2) 10Andrew Bogott: backy2: remove some unused hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/621024 (https://phabricator.wikimedia.org/T260692) [16:44:20] (03PS2) 10Andrew Bogott: backy2: hack in a fix to an upstream bug in 'backy2 du' [puppet] - 10https://gerrit.wikimedia.org/r/621025 (https://phabricator.wikimedia.org/T260692) [16:54:05] (03PS3) 10Dzahn: profile::ci: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/621021 [16:59:13] (03CR) 10Hashar: Add profile::analytics::jupyterhub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [17:00:04] halfak and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1700). 
[17:02:12] (03PS1) 10Hashar: Restore http_proxy broken by be4a47b [puppet] - 10https://gerrit.wikimedia.org/r/621029 [17:02:41] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10Pcoombe) Hi all, I've copied across the Thank You pages content to the new wiki. The pag... [17:02:47] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb nsca collection for several stray checks [puppet] - 10https://gerrit.wikimedia.org/r/621026 (https://phabricator.wikimedia.org/T260659) (owner: 10Jgreen) [17:03:05] (03CR) 10Jgreen: [V: 03+2 C: 03+2] nsca_frack.cfg.erb nsca collection for several stray checks [puppet] - 10https://gerrit.wikimedia.org/r/621026 (https://phabricator.wikimedia.org/T260659) (owner: 10Jgreen) [17:04:54] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10DStrine) Thanks @Pcoombe for all your help!!!one!1 I want to over-communicate this with... 
[17:05:56] (03PS3) 10Andrew Bogott: backy2: remove some unused hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/621024 (https://phabricator.wikimedia.org/T260692) [17:05:58] (03PS3) 10Andrew Bogott: backy2: hack in a fix to an upstream bug in 'backy2 du' [puppet] - 10https://gerrit.wikimedia.org/r/621025 (https://phabricator.wikimedia.org/T260692) [17:06:00] (03PS2) 10Andrew Bogott: wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) [17:09:38] (03CR) 10Hashar: ":)" [puppet] - 10https://gerrit.wikimedia.org/r/621029 (owner: 10Hashar) [17:10:28] (03PS3) 10Andrew Bogott: wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) [17:15:19] (03PS4) 10Andrew Bogott: wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) [17:16:36] (03CR) 10jerkins-bot: [V: 04-1] wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [17:17:07] (03PS5) 10Andrew Bogott: wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) [17:18:20] (03CR) 10jerkins-bot: [V: 04-1] wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [17:21:33] PROBLEM - Host ripe-atlas-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:49] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 253 probes of 656 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:25:19] (03PS6) 10Andrew Bogott: wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) [17:26:32] (03CR) 10jerkins-bot: [V: 04-1] wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [17:28:23] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 8 probes of 656 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:29:04] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10Cmjohnson) 05Open→03Resolved fixed [17:33:01] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10Cmjohnson) [17:34:50] (03PS1) 10Jgreen: nsca_frack.cfg.erb switch from check_http to check_endpoints b/c of TLS error [puppet] - 10https://gerrit.wikimedia.org/r/621035 [17:38:43] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb switch from check_http to check_endpoints b/c of TLS error [puppet] - 10https://gerrit.wikimedia.org/r/621035 (owner: 10Jgreen) [17:49:06] (03CR) 10CDanis: [C: 03+2] Restore http_proxy broken by be4a47b [puppet] - 10https://gerrit.wikimedia.org/r/621029 (owner: 10Hashar) [17:49:13] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24576/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621029 (owner: 10Hashar) [17:51:01] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:37] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 
- https://phabricator.wikimedia.org/T260694 (10Papaul) p:05Triage→03Medium [17:56:03] (03CR) 10JMeybohm: [C: 04-1] "This also renames the helm release in environment staging from production to staging (which I think is a good thing, but I wanted to menti" [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [17:56:46] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1800) [18:00:31] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Jan Eissfeldt (jeissfeldt) - https://phabricator.wikimedia.org/T260555 (10drochford) Thank you @fgiunchedi [18:01:01] (03PS7) 10Andrew Bogott: wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) [18:01:19] (03CR) 10Andrew Bogott: [C: 03+2] Revert "backy2: temporarily hack data dir to /var/lib/nova/instances" [puppet] - 10https://gerrit.wikimedia.org/r/621022 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [18:01:52] (03CR) 10Andrew Bogott: [C: 03+2] backy2: remove some unused hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/621024 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [18:02:24] (03CR) 10Andrew Bogott: [C: 03+2] backy2: hack in a fix to an upstream bug in 'backy2 du' [puppet] - 10https://gerrit.wikimedia.org/r/621025 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [18:02:41] (03PS4) 10Andrew Bogott: backy2: hack in a fix to an upstream bug in 'backy2 du' [puppet] - 10https://gerrit.wikimedia.org/r/621025 (https://phabricator.wikimedia.org/T260692) [18:04:24] (03CR) 10Andrew Bogott: [C: 03+2] wmcs/ceph/backy: move backup engine to 
cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [18:04:46] (03PS8) 10Andrew Bogott: wmcs/ceph/backy: move backup engine to cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/621023 (https://phabricator.wikimedia.org/T260692) [18:05:54] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [18:08:41] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [18:22:41] (03PS1) 10Mholloway: [BETA] EchoPush: Set max subscriptions per user to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621036 (https://phabricator.wikimedia.org/T259150) [18:36:22] (03PS1) 10Dzahn: decom helium and heze [puppet] - 10https://gerrit.wikimedia.org/r/621038 (https://phabricator.wikimedia.org/T245161) [18:42:37] (03CR) 10C. Scott Ananian: "James, ping. I know you don't like it, but I just changed a bunch of code in Parsoid/extension/src..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. 
Scott Ananian) [18:44:49] 10Operations: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10Dzahn) [18:45:06] 10Operations: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10Dzahn) [18:45:34] 10Operations, 10DC-Ops: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10Dzahn) [18:47:08] (03PS2) 10Dzahn: decom helium and heze [puppet] - 10https://gerrit.wikimedia.org/r/621038 (https://phabricator.wikimedia.org/T260717) [18:47:35] (03PS1) 10Dzahn: profile::backup: remove helium from ferm directors [puppet] - 10https://gerrit.wikimedia.org/r/621042 (https://phabricator.wikimedia.org/T260717) [18:57:00] (03PS4) 10Dzahn: profile::ci: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/621021 [18:58:31] (03PS1) 10Andrew Bogott: Revert "wmcs/ceph/backy: move backup engine to cloudstore1009" [puppet] - 10https://gerrit.wikimedia.org/r/621043 [18:59:49] (03CR) 10Andrew Bogott: [C: 03+2] Revert "wmcs/ceph/backy: move backup engine to cloudstore1009" [puppet] - 10https://gerrit.wikimedia.org/r/621043 (owner: 10Andrew Bogott) [19:00:05] twentyafterfour and marxarelli: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T1900). 
[19:00:23] twentyafterfour: o/ [19:01:23] (03PS1) 10Ppchelko: Move rate_limits configration to the VH level [deployment-charts] - 10https://gerrit.wikimedia.org/r/621044 (https://phabricator.wikimedia.org/T260591) [19:02:13] once again, bye bye irc cloud [19:03:04] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24578/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621021 (owner: 10Dzahn) [19:12:25] marxarelli: here [19:13:04] cool cool [19:13:24] !log Promote testwikis from 1.36.0-wmf.4 to 1.36.0-wmf.5 refs T257973 [19:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:32] T257973: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 [19:13:41] (03PS1) 1020after4: testwikis wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621049 [19:13:47] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621049 (owner: 1020after4) [19:14:36] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621049 (owner: 1020after4) [19:14:44] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.5 refs T257973 [19:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:59] (03CR) 10Dzahn: "now just needs some indentation fixes to make jenkins-bot happy" [puppet] - 10https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:16:19] (03CR) 10Ottomata: "Huh, cool! So, now there are 3 releases: eqiad, codfw, staging, each of which has 2 environments: production and staging?" 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [19:28:46] (03CR) 10Hashar: [C: 04-1] "Sorry this patch is not ready at all. That will actually break doc.wikimedia.org and I need to find a nice solution :]" [puppet] - 10https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:29:52] mutante: doc.wikimedia.org document root can not be changed as is. I need to fiddle with apache config and the php code in integration/docroot ;] [19:34:53] hashar: oh, ok. no problem. enough other stuff to do [19:35:07] mutante: but the integration.wm.o one can ;] [19:35:14] hashar: doing releases-jenkins switch [19:35:34] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:35:42] hashar: btw, did you expect this size? [19:35:45] 57G /var/lib/jenkins [19:35:49] on releases* [19:35:52] no idea [19:36:39] 16G in a directory named REL1_27 [19:36:52] lets dish that out [19:37:18] those are a bunch of snapshot tarballs [19:39:03] ok, it's not an immediate issue but reducing it won't hurt [19:39:43] btw the regular releases.wm.org files are now synced with --delete so stuff on secondary servers gets deleted when they are deleted from the primary server. before it would grow forever on the secondary [19:40:04] rsync::quickdatacopy got a new parameter to do that and have actual mirrors [19:40:41] releases.wm.org files get auto-synced. 
jenkins data is allowed but not auto [19:40:46] screw that [19:40:50] I am going to mass delete stuff [19:41:26] ok [19:41:30] !log releases1001: deleting old legacy mediawiki snapshots under /var/lib/jenkins/{REL1_27,REL1_29,REL1_30} # T256164 [19:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:36] T256164: Clear out releases jenkins legacy jobs and move remaining ones to JJB - https://phabricator.wikimedia.org/T256164 [19:42:19] that is deleting [19:42:35] thanks! i will sync with --delete on 1002 in a bit [19:42:38] and 2002 [19:42:52] 2001 never got synced (bad!) but now it doesn't matter anymore because we will just replace it [19:43:20] and I am going to delete the old jobs [19:44:39] duh, i need to switch it to 1001 as the "primary" one last time to be able to rsync [19:44:49] !log Deleting old jobs from https://releases-jenkins.wikimedia.org/ # T256164 [19:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:52] (03PS1) 10Dzahn: Revert "Revert "releases: set releases1001 as primary to sync jenkins config"" [puppet] - 10https://gerrit.wikimedia.org/r/620970 [19:45:56] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:46:07] (03PS2) 10Dzahn: Revert "Revert "releases: set releases1001 as primary to sync jenkins config"" [puppet] - 10https://gerrit.wikimedia.org/r/620970 (https://phabricator.wikimedia.org/T256164) [19:48:44] mutante: so /var/lib/jenkins is down to 1.1G now ;] [19:50:37] hashar: confirmed. thanks! 
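The difference described above between a plain copy and a sync with --delete can be sketched in a few lines of Python. This is a minimal model only, not how rsync or rsync::quickdatacopy is implemented: the function name mirror_plan and the file names are hypothetical, and the sketch compares file names only, ignoring changed contents.

```python
def mirror_plan(primary_files, secondary_files, delete=False):
    """Return (files to copy, files to delete) for a primary -> secondary sync.

    Hypothetical helper for illustration; it only models file presence.
    """
    primary = set(primary_files)
    secondary = set(secondary_files)
    # Files that exist on the primary but not yet on the secondary get copied.
    to_copy = sorted(primary - secondary)
    # Without delete, leftovers on the secondary are never removed,
    # so the secondary can only grow over time.
    to_delete = sorted(secondary - primary) if delete else []
    return to_copy, to_delete


if __name__ == "__main__":
    primary = ["a.tar.gz", "b.tar.gz"]
    secondary = ["b.tar.gz", "old.tar.gz"]
    print(mirror_plan(primary, secondary))
    print(mirror_plan(primary, secondary, delete=True))
```

With delete left off, old.tar.gz stays on the secondary indefinitely. With delete on, the secondary becomes an actual mirror of the primary.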
[19:50:43] (03PS3) 10Dzahn: Revert "Revert "releases: set releases1001 as primary to sync jenkins config"" [puppet] - 10https://gerrit.wikimedia.org/r/620970 (https://phabricator.wikimedia.org/T256164) [19:51:43] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "releases: set releases1001 as primary to sync jenkins config"" [puppet] - 10https://gerrit.wikimedia.org/r/620970 (https://phabricator.wikimedia.org/T256164) (owner: 10Dzahn) [19:52:11] (03PS1) 10Dzahn: Revert "Revert "Revert "releases: set releases1001 as primary to sync jenkins config""" [puppet] - 10https://gerrit.wikimedia.org/r/620971 [19:54:37] !log rsyncing /var/lib/jenkins from releases1001 to releases1002/2002 with --delete T256164 [19:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:41] T256164: Clear out releases jenkins legacy jobs and move remaining ones to JJB - https://phabricator.wikimedia.org/T256164 [19:56:43] (03PS1) 10Dzahn: Revert "ATS: temp. set backend for releases-jenkins to releases1001" [puppet] - 10https://gerrit.wikimedia.org/r/620972 [19:56:57] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "Revert "releases: set releases1001 as primary to sync jenkins config""" [puppet] - 10https://gerrit.wikimedia.org/r/620971 (owner: 10Dzahn) [19:58:37] it's possible now to have more than one failover and still rsync data - without it trying to rsync to itself [19:59:18] (03PS2) 10Dzahn: Revert "ATS: temp. set backend for releases-jenkins to releases1001" [puppet] - 10https://gerrit.wikimedia.org/r/620972 [20:00:43] !log releases1001 rm /etc/rsync.d/frag* & run puppet [20:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:17] (03CR) 10Dzahn: [C: 03+2] Revert "ATS: temp. 
set backend for releases-jenkins to releases1001" [puppet] - 10https://gerrit.wikimedia.org/r/620972 (owner: 10Dzahn) [20:07:56] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.5 refs T257973 (duration: 53m 12s) [20:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:59] T257973: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 [20:11:20] !log running puppet on cp-ats-ulsfo and switching releases-jenkins backend [20:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:47] (03PS1) 1020after4: group0 wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621057 [20:11:48] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621057 (owner: 1020after4) [20:11:56] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621057 (owner: 1020after4) [20:12:48] hashar: twentyafterfour: so i switched releases-jenkins and the data was rsynced but it's not "just working" unfortunately [20:12:49] (03PS1) 10Andrew Bogott: cloudvirt1024: move to Buster and make a ceph cloudvirt [puppet] - 10https://gerrit.wikimedia.org/r/621058 (https://phabricator.wikimedia.org/T260692) [20:12:55] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10eprodromou) We're tracking this, but unsure as to next steps. Let us know if more active investigation from Platform team... [20:13:02] the good part.. we know it was already done for today, so there is a week to fix it [20:13:36] mutante: not working? 
[20:13:40] oh heh, let me properly start it:) [20:13:52] (03CR) 10Clarakosi: Move rate_limits configration to the VH level (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/621044 (https://phabricator.wikimedia.org/T260591) (owner: 10Ppchelko) [20:13:58] that migrates from releases1001 to releases1002 isn't it ? [20:13:59] that was my own safety-net to make sure it is not running by accident again [20:14:00] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.5 refs T257973 [20:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:04] T257973: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 [20:16:38] (03PS1) 10Dzahn: releases: stop jenkins on releases1001, start it on releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/621059 (https://phabricator.wikimedia.org/T247652) [20:16:48] (03PS2) 10Andrew Bogott: cloudvirt1024: move to Buster and make a ceph cloudvirt [puppet] - 10https://gerrit.wikimedia.org/r/621058 (https://phabricator.wikimedia.org/T260692) [20:17:02] (03PS2) 10Dzahn: releases: stop jenkins on releases1001, start it on releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/621059 (https://phabricator.wikimedia.org/T247652) [20:17:19] (03CR) 10jerkins-bot: [V: 04-1] cloudvirt1024: move to Buster and make a ceph cloudvirt [puppet] - 10https://gerrit.wikimedia.org/r/621058 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [20:17:22] mutante: hey jenkins appears to be working now [20:18:14] twentyafterfour: i think you are seeing a cached version [20:18:20] give me 30 sec [20:18:57] (03CR) 10Dzahn: [C: 03+2] releases: stop jenkins on releases1001, start it on releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/621059 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [20:20:27] (03PS3) 10Andrew Bogott: cloudvirt1024: move to Buster and make a ceph cloudvirt [puppet] - 
10https://gerrit.wikimedia.org/r/621058 (https://phabricator.wikimedia.org/T260692) [20:21:22] twentyafterfour: hashar: jenkins stopped and masked on 1001 and started on 1002 [20:21:32] i see the UI now after a brief " Please wait while Jenkins is getting ready to work ..." [20:21:48] I got a 503 but that could be dns related maybe [20:21:53] it seems to be an issue though that it still shows 1001 [20:21:57] as the executor [20:22:40] so the config must be edited .. since we rsynced everything but the hostname is in there [20:22:41] yeah [20:22:46] that has to be changed in the config [20:22:55] where do i change the config? [20:23:26] https://releases-jenkins.wikimedia.org/computer/releases1001.eqiad.wmnet/configure [20:23:33] but that does not work for me cause I got plenty of 503 [20:23:39] I guess due to some cache [20:23:39] Dzahn is missing the Agent/ExtendedRead permission [20:23:49] :] [20:24:03] (03CR) 10Ppchelko: Move rate_limits configration to the VH level (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/621044 (https://phabricator.wikimedia.org/T260591) (owner: 10Ppchelko) [20:24:03] hashar: oh. i know why. 
i only ran puppet on ulsfo caching [20:25:02] * mutante runs puppet on cp-esams [20:25:37] hashar: i got the "view" permission to be able to check it but that is not "extended read" which is a weird term for "write" :) [20:25:38] PROBLEM - HTTP releases-jenkins.wikimedia.org on releases1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 553 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [20:25:51] PROBLEM - jenkins_service_running on releases1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [20:26:00] well, that should not have happened if i ran puppet on icinga early enough [20:26:10] monitoring already switched with the same Hiera change [20:26:36] hashar: 503 gone for you? [20:27:14] !log https://releases-jenkins.wikimedia.org/ changed agent from releases1001 to releases1002 [20:27:15] ACKNOWLEDGEMENT - HTTP releases-jenkins.wikimedia.org on releases1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 553 bytes in 0.002 second response time daniel_zahn migration in progress https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [20:27:15] ACKNOWLEDGEMENT - jenkins_service_running on releases1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn migration in progress https://wikitech.wikimedia.org/wiki/Jenkins [20:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:47] mutante: looks good now [20:28:33] hashar: great! 
:) [20:28:49] this should conclude the whole "switch releases* to buster" [20:29:15] well, except we should do a test of the reprepro upload [20:29:31] i think that's only done for parsoid releases [20:30:56] 10Operations, 10Machine Learning Platform, 10ORES: ORES icinga alerts - https://phabricator.wikimedia.org/T260732 (10Dzahn) [20:33:10] 10Operations, 10Machine Learning Platform, 10ORES: ORES icinga alerts - https://phabricator.wikimedia.org/T260732 (10Dzahn) [20:36:58] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1024: move to Buster and make a ceph cloudvirt [puppet] - 10https://gerrit.wikimedia.org/r/621058 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [20:38:03] mutante: well done ;) I am heading to bed now [20:41:40] hashar: merci. bonne nuit [20:59:33] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [21:01:35] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:46] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:14] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10Joe) [21:10:40] (03PS1) 10Bstorm: wikireplicas: create cumin aliases for wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) [21:16:40] 10Puppet, 10Beta-Cluster-Infrastructure: puppetdb on deployment-puppetdb03 keeps getting OOMKilled 
- https://phabricator.wikimedia.org/T248041 (10dpifke) Restarted again: https://sal.toolforge.org/log/aexZA3QBj_Bg1xd3wLvx (I'll reference this task next time, now that I know it exists.) Adding `Restart=always`... [21:18:58] (03CR) 10Ppchelko: [C: 04-2] "Let's hold horses until we get some input on https://github.com/envoyproxy/envoy/issues/12714" [deployment-charts] - 10https://gerrit.wikimedia.org/r/621044 (https://phabricator.wikimedia.org/T260591) (owner: 10Ppchelko) [21:24:14] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-cluster [21:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:15] (03CR) 10Cwhite: [C: 03+1] "loki LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620317 (owner: 10Addshore) [21:26:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:26:49] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:27:07] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/620701 (https://phabricator.wikimedia.org/T247966) (owner: 10Filippo Giunchedi) [21:27:30] (03PS1) 10Mholloway: Update wikifeeds to 2020-08-18-175056-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621070 [21:30:54] (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2020-08-18-175056-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621070 (owner: 10Mholloway) [21:31:19] (03CR) 10Mholloway: [C: 03+2] [BETA] EchoPush: Set max subscriptions per user to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621036 (https://phabricator.wikimedia.org/T259150) (owner: 10Mholloway) [21:32:03] (03Merged) 10jenkins-bot: [BETA] EchoPush: 
Set max subscriptions per user to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621036 (https://phabricator.wikimedia.org/T259150) (owner: 10Mholloway) [21:32:20] (03Merged) 10jenkins-bot: Update wikifeeds to 2020-08-18-175056-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621070 (owner: 10Mholloway) [21:33:15] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) a:05tstarling→03None [21:34:02] (03CR) 10Cwhite: [C: 03+1] "Looks to me like something around the state synchronization process. This looks like a reasonable approach." [puppet] - 10https://gerrit.wikimedia.org/r/620710 (https://phabricator.wikimedia.org/T260521) (owner: 10Filippo Giunchedi) [21:34:37] (03CR) 10Cwhite: [C: 03+1] hieradata: add alertmanagers variable [puppet] - 10https://gerrit.wikimedia.org/r/620922 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [21:34:57] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [21:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:38] (03CR) 10Cwhite: [C: 03+1] prometheus: add alerts to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/620956 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [21:36:35] (03PS1) 10Andrew Bogott: cloudvirt1024: update nic labels [puppet] - 10https://gerrit.wikimedia.org/r/621075 [21:36:47] (03CR) 10Cwhite: [C: 03+1] prometheus: export icinga service problems as metrics [puppet] - 10https://gerrit.wikimedia.org/r/620957 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [21:37:13] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[21:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:18] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1024: update nic labels [puppet] - 10https://gerrit.wikimedia.org/r/621075 (owner: 10Andrew Bogott) [21:38:14] (03CR) 10Cwhite: hieradata: switch grafana.w.o to Grafana 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [21:39:43] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [21:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:58] (03PS1) 10Andrew Bogott: cloudvirt1024: move to new role, 'virt_ceph_and_backy' [puppet] - 10https://gerrit.wikimedia.org/r/621077 (https://phabricator.wikimedia.org/T260692) [21:42:01] (03PS2) 10BryanDavis: cloud: Remove legacy conditionals from profile::base::labs [puppet] - 10https://gerrit.wikimedia.org/r/617539 [21:44:22] (03CR) 10Cwhite: [C: 03+1] icinga: make sure update-etcd-mw-config-lastindex is enabled [puppet] - 10https://gerrit.wikimedia.org/r/620929 (https://phabricator.wikimedia.org/T247966) (owner: 10Filippo Giunchedi) [21:44:24] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1024: move to new role, 'virt_ceph_and_backy' [puppet] - 10https://gerrit.wikimedia.org/r/621077 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [21:45:22] (03CR) 10Cwhite: [C: 03+1] profile: switch Grafana plugins to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [21:46:11] PROBLEM - Persistent high iowait on cloudstore1008 is CRITICAL: 28.49 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [21:46:32] 10Operations, 10Machine Learning Platform, 10ORES: ORES icinga alerts - https://phabricator.wikimedia.org/T260732 
(10Krenair) `modules/icinga/manifests/monitor/ores_labs_web_node.pp` has `check_command => "check_ores_workers!oresweb/node/${title}",` which would be e.g. `check_ores_workers!oresweb/node/ores-w... [21:48:21] (03PS1) 10Andrew Bogott: Added hiera config for the virt_ceph_and_backy role [puppet] - 10https://gerrit.wikimedia.org/r/621078 [21:49:24] (03CR) 10Andrew Bogott: [C: 03+2] Added hiera config for the virt_ceph_and_backy role [puppet] - 10https://gerrit.wikimedia.org/r/621078 (owner: 10Andrew Bogott) [21:49:29] RECOVERY - Persistent high iowait on cloudstore1008 is OK: (C)10 ge (W)5 ge 2.447 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [21:51:33] remote: If you are using git-review, update to at least git-review 1.27. Otherwise: [21:51:34] remote: You need 'Create' rights to create new references. [21:51:34] oof [21:51:40] how long hast that one been I wonder [21:51:55] (03PS1) 10Alex Monk: Fix hostname for labs ORES monitoring [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) [21:54:25] 10Operations, 10Machine Learning Platform, 10ORES, 10Patch-For-Review: ORES icinga alerts - https://phabricator.wikimedia.org/T260732 (10Dzahn) Yea, i can confirm that and came to the same conclusion but the host here is ores.wmflabs.org and the "oresweb" part is just the URL being sent to that host. And t... 
[21:56:18] (03CR) 10Cwhite: [C: 03+2] prometheus: remove unnecessary define and split mediawiki queries by channel [puppet] - 10https://gerrit.wikimedia.org/r/619574 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [21:56:25] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [21:56:29] (03CR) 10Dzahn: [C: 04-1] "I don't think this is right. What the monitoring check does is ask the host ores.wmflabs.org but for a different URL for each backend. It'" [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) (owner: 10Alex Monk) [21:57:22] (03CR) 10Alex Monk: "Well, you can't. You're hitting the labs novaproxy and the only way to tell it to go to ORES is to specify ORES's hostname." [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) (owner: 10Alex Monk) [21:59:55] (03CR) 10Dzahn: [C: 04-1] "we ARE giving it ores.wmflabs.org as host name, the change is in the -u parameter" [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) (owner: 10Alex Monk) [22:00:11] (03PS1) 10Mholloway: Update wikifeeds to 2020-08-18-215651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621083 [22:01:23] (03CR) 10Dzahn: [C: 04-1] "see the check command definition. 
it is "/check_ores_workers $HOSTADDRESS$ '$ARG1$'" so first parameter is host name and second is URL on " [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) (owner: 10Alex Monk) [22:03:37] (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2020-08-18-215651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621083 (owner: 10Mholloway) [22:04:19] PROBLEM - Host mw1301 is DOWN: PING CRITICAL - Packet loss = 100% [22:04:54] (03Merged) 10jenkins-bot: Update wikifeeds to 2020-08-18-215651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621083 (owner: 10Mholloway) [22:06:00] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [22:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:58] (03CR) 10Alex Monk: "when you specify a URL like that and override the host header separately, it's only going to use the host in the URL for DNS lookups and m" [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) (owner: 10Alex Monk) [22:07:25] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [22:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:13] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
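The review comments above about the host argument versus the URL argument of the check rest on a general HTTP client behaviour: the host in the URL decides where the client looks up and connects, while a separately supplied Host header only changes what the server is told. A minimal sketch of that split, assuming a typical HTTP client; describe_request and the example hostnames are hypothetical and are not part of check_ores_workers.

```python
from urllib.parse import urlsplit


def describe_request(url, host_header=None):
    """Model which host a typical HTTP client resolves and which Host it sends.

    Hypothetical helper for illustration; it performs no network access.
    """
    parts = urlsplit(url)
    return {
        # The DNS lookup and the TCP connection use the host from the URL.
        "connect_to": parts.hostname,
        # The Host header defaults to the URL's host unless overridden.
        "host_header": host_header or parts.netloc,
        "path": parts.path or "/",
    }


if __name__ == "__main__":
    print(describe_request("http://backend.example/scores"))
    print(describe_request("http://backend.example/scores", "frontend.example"))
```

A name-based proxy routes on the Host header it receives, so the header value matters even when the connection target stays the same.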
[22:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:25] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10Dzahn) [22:19:38] 10Operations, 10serviceops: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10Dzahn) [22:21:32] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) both https://releases.wikimedia.org and https://releases-jenkins.wikimedia.org have been switched to the... [22:25:04] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) 05Open→03Resolved [22:25:08] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [22:38:19] (03PS1) 10Bstorm: wikireplicas: add wikireplica cookbook to add a wiki [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) [22:40:42] (03Abandoned) 10Dave Pifke: [WIP] webperf: Enable prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/608973 (https://phabricator.wikimedia.org/T215740) (owner: 10Dave Pifke) [22:41:33] !log wkandek@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [22:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:29] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:42:36] jouncebot: next [22:42:37] In 0 hour(s) and 17 minute(s): Evening 
backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T2300) [22:44:15] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:48:00] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-cluster [22:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:40] (03PS1) 10Dzahn: decom releases1001 and releases2001 [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200818T2300). Please do the needful. [23:00:05] RoanKattouw and Urbanecm: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:08] * Urbanecm here [23:00:34] RoanKattouw: I'm starting with my patch [23:01:08] OK go ahead, ping me when you're done [23:01:27] (03PS2) 10Urbanecm: Enable subpages in NS:0 in techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620001 (https://phabricator.wikimedia.org/T260350) [23:01:30] (03CR) 10Urbanecm: [C: 03+2] Enable subpages in NS:0 in techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620001 (https://phabricator.wikimedia.org/T260350) (owner: 10Urbanecm) [23:01:35] sure [23:02:13] (03Merged) 10jenkins-bot: Enable subpages in NS:0 in techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620001 (https://phabricator.wikimedia.org/T260350) (owner: 10Urbanecm) [23:04:06] (03PS1) 10Dzahn: icinga: fix ORES monitoring after domainproxy now enforces https [puppet] - 10https://gerrit.wikimedia.org/r/621093 (https://phabricator.wikimedia.org/T260732) [23:04:35] !log wkandek@cumin1001 conftool action : set/pooled=yes; selector: name=mw1300.eqiad.wmnet [23:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:16] (03CR) 10Catrope: [C: 03+2] Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/620967 (https://phabricator.wikimedia.org/T258021) (owner: 10Catrope) [23:07:09] (03PS1) 10Catrope: Normalize parserTests.txt format [extensions/Graph] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620978 (https://phabricator.wikimedia.org/T260676) [23:07:19] (03CR) 10Catrope: [C: 03+2] Normalize parserTests.txt format [extensions/Graph] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620978 (https://phabricator.wikimedia.org/T260676) (owner: 10Catrope) [23:07:21] (03CR) 10jerkins-bot: [V: 04-1] Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/620967 (https://phabricator.wikimedia.org/T258021) 
(owner: 10Catrope) [23:07:24] (03CR) 10Catrope: [C: 03+2] Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620968 (https://phabricator.wikimedia.org/T258021) (owner: 10Catrope) [23:07:43] ssh: connect to host mw1301.eqiad.wmnet port 22: Connection timed out [23:08:03] (03PS2) 10Dzahn: icinga: fix ORES monitoring after domainproxy now enforces https [puppet] - 10https://gerrit.wikimedia.org/r/621093 (https://phabricator.wikimedia.org/T260732) [23:08:13] Urbanecm: it's because wkandek is rebooting hosts [23:08:14] (03PS2) 10Catrope: Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/620967 (https://phabricator.wikimedia.org/T258021) [23:08:22] mutante: guessed that, should I do something? [23:08:25] (03CR) 10Catrope: Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/620967 (https://phabricator.wikimedia.org/T258021) (owner: 10Catrope) [23:08:28] (re-sync? scap pull there?) [23:08:29] (03CR) 10Catrope: [C: 03+2] Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/620967 (https://phabricator.wikimedia.org/T258021) (owner: 10Catrope) [23:08:46] can't find their IRC handle, would ping them otherwise :/ [23:09:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ac34f7274823e40d0c79752eb5ffe74c76856d04: Enable subpages in NS:0 in techconductwiki (T260350) (duration: 05m 14s) [23:09:04] Urbanecm: it's normally wkandek and i would have done the same and asked to scap pull [23:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:06] T260350: Enable subpages in NS:0 in techconductwiki - https://phabricator.wikimedia.org/T260350 [23:10:14] thanks mutante. 
I do think the rebooting should be paused, or window adjourned - those two things don't like each other :/ [23:10:32] Urbanecm: yes, it should not happen at the same time. i am trying to contact [23:10:39] mw1305 also timeouted, fwiw [23:10:44] RoanKattouw: I'm done, but see ^^^ [23:11:01] (well, I'm done except fixing the timeouted hosts) [23:11:14] Thanks [23:11:15] thanks mutante [23:12:03] (03Merged) 10jenkins-bot: Normalize parserTests.txt format [extensions/Graph] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620978 (https://phabricator.wikimedia.org/T260676) (owner: 10Catrope) [23:12:13] PROBLEM - Host ripe-atlas-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:12:21] PROBLEM - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:12:27] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:12:41] Urbanecm: i can do the scap pull [23:13:06] mutante: I think I can do that too (technically-speaking), but I thought someone else's on it. 
[23:13:29] pulled on mw1305 [23:13:41] (03PS2) 10Catrope: Enable static maps on testwiki, disable on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620810 [23:15:13] (03CR) 10Catrope: [C: 03+2] Enable static maps on testwiki, disable on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620810 (owner: 10Catrope) [23:15:15] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 83 probes of 658 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:15:58] (03Merged) 10jenkins-bot: Enable static maps on testwiki, disable on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620810 (owner: 10Catrope) [23:19:14] (03Merged) 10jenkins-bot: Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/620968 (https://phabricator.wikimedia.org/T258021) (owner: 10Catrope) [23:19:17] (03Merged) 10jenkins-bot: Only fetch task card data for users in variant C and D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/620967 (https://phabricator.wikimedia.org/T258021) (owner: 10Catrope) [23:20:19] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 7 probes of 658 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:20:51] <_joe_> mutante: if you want to kill reboot-cluster, this is the time to do it :) [23:22:19] !log killed reboot-cluster on cumin1001 [23:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:27] _joe_: done [23:23:28] <_joe_> mutante: I can continue tomorrow :) [23:23:41] RECOVERY - Host ripe-atlas-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.57 ms 
[23:25:59] (03PS3) 10Dave Pifke: profiler: Update XHGui SERVER/GET key filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620139 (owner: 10Krinkle) [23:26:01] (03PS1) 10Dave Pifke: [WIP] profiler: remove MongoDB client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621095 (https://phabricator.wikimedia.org/T180761) [23:26:22] (03CR) 10Dzahn: [C: 03+2] icinga: fix ORES monitoring after domainproxy now enforces https [puppet] - 10https://gerrit.wikimedia.org/r/621093 (https://phabricator.wikimedia.org/T260732) (owner: 10Dzahn) [23:26:32] _joe_: ack, thanks [23:27:53] Urbanecm: so the script is stopped. i can't ssh to 1301 right now to run the scap pull yet.. but it won't be any more [23:28:12] thanks mutante [23:28:25] scap sync-file timeouted at mw1301 when I last ran it [23:28:45] let me check via mgmt [23:29:10] (what i mean is, I want to make sure the server is out of the way for both scap and traffic) [23:31:13] Urbanecm: it is not getting traffic right now. so if you can ignore the scap warning then it's ok. also looking to get it back up [23:31:28] yes, it just warns [23:31:38] (03CR) 10Dave Pifke: "To avoid a merge conflict, this depends on your patch to update SERVER/GET dictionaries." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621095 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [23:32:08] !log rebooting mw1301 via mgmt [23:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:43] RoanKattouw: per mutante, seems you can deploy your stuff now :) ^^ [23:33:00] Hmm it's still hanging at 99% [23:33:03] (I had already starte [23:33:07] aha [23:33:12] it's definitely not pooled.. 
but it's also booting right now [23:33:18] it will be back in moments [23:33:35] RoanKattouw: that happened for me too, just before it timeouted at the host that's unreachable [23:33:51] Yeah timeout for mw1301 [23:33:56] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable static maps on testwiki, disable them on test2wiki (duration: 03m 22s) [23:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:09] !log Run scap pull at mw1301 [23:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:21] RECOVERY - Host mw1301 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [23:34:22] Running ... oh you already did it [23:34:29] there it is [23:35:01] checking icinga for 1301 [23:35:19] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:38:44] after rescheduling all icinga checks 1301 is all green now and will pool it again [23:38:53] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1301.eqiad.wmnet [23:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:00] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.4/extensions/GrowthExperiments: Only fetch task card data for users in variant C/D (T258021) (duration: 01m 06s) [23:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:03] T258021: Variant C/D: mobile preview for suggested edits module with first suggested edit - https://phabricator.wikimedia.org/T258021 [23:45:05] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.5/extensions/GrowthExperiments: Only fetch task card data for users in variant C/D (T258021) (duration: 01m 05s) [23:45:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:03] (03PS1) 10Dzahn: icinga: fix typo in ORES check command [puppet] - 10https://gerrit.wikimedia.org/r/621097 (https://phabricator.wikimedia.org/T260732) [23:49:36] (03CR) 10Dzahn: [C: 03+2] icinga: fix typo in ORES check command [puppet] - 10https://gerrit.wikimedia.org/r/621097 (https://phabricator.wikimedia.org/T260732) (owner: 10Dzahn) [23:51:24] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:57:05] (03PS1) 10Cwhite: prometheus: use aggs to consolidate mediawiki logging metrics [puppet] - 10https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) [23:58:29] (03PS1) 10Ebernhardson: Correct CirrusSearchUserTesting configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621099 (https://phabricator.wikimedia.org/T254388) [23:58:59] (03CR) 10BryanDavis: wikireplicas: add wikireplica cookbook to add a wiki (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm)