[00:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:01:06] (CR) Dzahn: [C: +2] DHCP: remove deploy2001 [puppet] - https://gerrit.wikimedia.org/r/667040 (https://phabricator.wikimedia.org/T275832) (owner: Dzahn)
[00:01:13] (PS2) Dzahn: DHCP: remove deploy2001 [puppet] - https://gerrit.wikimedia.org/r/667040 (https://phabricator.wikimedia.org/T275832)
[00:01:23] (different thing that does not affect you at all)
[00:02:09] mutante: ack. Just three syncs, I'll ping you once done.
[00:02:17] there's nothing else in the window
[00:05:49] scap is running really slowly
[00:08:12] !log urbanecm@deploy1002 sync-file aborted: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 1/3) (duration: 06m 45s)
[00:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:19] T259024: Deploy Growth experiments at Indonesian Wikipedia - https://phabricator.wikimedia.org/T259024
[00:08:41] trying again with verbose
[00:10:14] Urbanecm: it's possible I am actually affecting you
[00:10:21] because i did not merge that earlier
[00:10:43] it's getting stuck on something in cat /etc/dsh/group/scap-masters
[00:10:44] one host is in dsh groups that doesnt have the firewall rule anymore
[00:11:10] then it's me and i should merge that
[00:11:11] aha
[00:11:24] okay, killed it again
[00:11:27] i didn't think about the firewall already being gone for a moment
[00:11:32] sorry, one sec
[00:11:46] ping me when i should re-run :)
[00:13:00] mutante: and if you can, please restart logmsgbot for me, https://wikitech.wikimedia.org/wiki/Logmsgbot has some docs, dunno if up2date
[00:15:10] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01007 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:16:15] (CR) Dzahn: [C: +2] common/scap/DHCP: remove deploy1001 from scap hosts and DHCP [puppet] - https://gerrit.wikimedia.org/r/635111 (https://phabricator.wikimedia.org/T275831) (owner: Dzahn)
[00:16:42] (CR) Dzahn: [C: +2] "also removes deploy2001 from dsh group" [puppet] - https://gerrit.wikimedia.org/r/635111 (https://phabricator.wikimedia.org/T275831) (owner: Dzahn)
[00:16:52] (CR) Cwhite: "Is it your intention to keep all Icinga logs for an extended period? Or just the alert notifications?" [puppet] - https://gerrit.wikimedia.org/r/667917 (owner: Herron)
[00:17:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 71 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:17:10] Urbanecm: the offending scap master has been removed on deploy1002
[00:17:15] thanks
[00:17:20] I'll try again
[00:17:29] and both old deployment servers are removed from dsh group
[00:18:30] RECOVERY - reading-web-client-errors grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[00:19:30] mutante: can you also start logmsgbot again?
[00:20:52] yea, i was looking at the "widespread puppet failure" thing
[00:21:18] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 94 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:21:37] !log alert1001 systemctl restart tcpircbot-logmsgbot
[00:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:54] Active: active (running) since Wed 2021-03-03 00:21:25 UTC; 22s ago
[00:22:05] sure, thanks
[00:22:18] 2021-03-03 00:21:34,904 verne.freenode.net [u'*** No Ident response'
[00:22:42] what the hell is going with deployment today... this is the new error https://www.irccloud.com/pastebin/3UK3MxIU/
[00:23:15] (PS1) Cwhite: kibana: add systemctl full path for valid sudoers config [puppet] - https://gerrit.wikimedia.org/r/667968 (https://phabricator.wikimedia.org/T272655)
[00:23:28] now that part I don't know :/
[00:24:12] oh.. scap-master-sync
[00:24:14] maybe I do know
[00:24:26] (CR) Cwhite: [C: +2] kibana: add systemctl full path for valid sudoers config [puppet] - https://gerrit.wikimedia.org/r/667968 (https://phabricator.wikimedia.org/T272655) (owner: Cwhite)
[00:24:31] Urbanecm: could be caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/667919
[00:24:36] that changed scap-master-sync today
[00:25:32] hmm, that's pretty recent and touches the right part of the code. Can we try to revert it and see?
[00:25:44] yes
[00:26:04] (PS1) Dzahn: Revert "scap-master-sync: Don't exclude CDB files" [puppet] - https://gerrit.wikimedia.org/r/667821
[00:26:18] inserting the irccloud link as reason for revert
[00:26:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 53 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:27:19] mutante: https://phabricator.wikimedia.org/P14572 is less likely to disappear, if you want
[00:27:41] (PS2) Dzahn: Revert "scap-master-sync: Don't exclude CDB files" [puppet] - https://gerrit.wikimedia.org/r/667821
[00:27:45] thanks, using that one
[00:27:46] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 51 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:28:00] (CR) Dzahn: [V: +2 C: +2] Revert "scap-master-sync: Don't exclude CDB files" [puppet] - https://gerrit.wikimedia.org/r/667821 (owner: Dzahn)
[00:30:00] Urbanecm: reverted on deploy1002 and deploy2002, hopefully that is enough
[00:30:11] trying
[00:31:03] 00:30:44 Finished sync-masters (duration: 00m 07s)
[00:31:07] sounds that did the trick
[00:31:16] pheew, i prefer that over "other unknown issue" right now
[00:31:27] * Urbanecm too
[00:31:38] !log 00:31:26 Synchronized wmf-config/InitialiseSettings.php: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 1/3) (duration: 01m 11s)
[00:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:45] T259024: Deploy Growth experiments at Indonesian Wikipedia - https://phabricator.wikimedia.org/T259024
[00:33:44] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:07] (PS3) Dzahn: DHCP: remove deploy2001 [puppet] - https://gerrit.wikimedia.org/r/667040 (https://phabricator.wikimedia.org/T275832)
[00:35:34] ACKNOWLEDGEMENT - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn not active server anymore https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:41] mutante: I'd prefer getting the logging bot up if possible, but I can finish this deployment logging manually
[00:35:52] is there anything else than 2021-03-03 00:21:34,904 verne.freenode.net [u'*** No Ident response'?
[00:36:56] no, there isn't
[00:37:06] :(
[00:38:44] !log 00:38:12 Synchronized dblists/growthexperiments.dblist: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 2/3) (duration: 01m 10s)
[00:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:50] T259024: Deploy Growth experiments at Indonesian Wikipedia - https://phabricator.wikimedia.org/T259024
[00:40:13] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (wkandek) Resolved→Open Please recreate the 2 VMs in the VLAN that allows for direct external IP addresses. After speaking to Brandon it is clear that we should treat gitlab...
[00:40:34] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (wkandek) a:Dzahn
[00:41:37] !log 00:40:16 Synchronized wmf-config/config/idwiki.yaml: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 3/3) (duration: 01m 09s)
[00:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:04] !log Finished deployment in Evening B&C window; logmsgbot is currently down, and a simple restart did not bring it back up
[00:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:29] (CR) Volans: smokeping: replace mwmaint2001 with cumin2001 as D5 target (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667957 (https://phabricator.wikimedia.org/T275905) (owner: Dzahn)
[00:44:31] (PS1) Dzahn: Revert "tcpircbot: allow deploy1002/2002, do not allow deploy1001/2001" [puppet] - https://gerrit.wikimedia.org/r/667822
[00:45:51] (CR) Dzahn: [C: +2] Revert "tcpircbot: allow deploy1002/2002, do not allow deploy1001/2001" [puppet] - https://gerrit.wikimedia.org/r/667822 (owner: Dzahn)
[00:46:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:51] Urbanecm: how can the logging be tested
[00:49:16] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005332 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:52:09] I reverted the last change to it and restarted it again
[00:53:13] but my food is burning on the stove, be back soon
[00:54:30] mutante: probably by doing a no-op deployment. But as long as the bot is in this channel, it should work IMO.
[00:58:21] mutante: `dologmsg ....` from deploy1002 would test tcpircbot
[00:58:35] Good to know bd808
[00:58:44] But right now I don't even see it here :(
[01:01:24] (CR) Andrew Bogott: [C: +2] "On Buster the old code worked because Buster defines /usr/bin/lsblk and Stretch does not. Good news, though, Buster defines both /bin/lsb" [puppet] - https://gerrit.wikimedia.org/r/667815 (https://phabricator.wikimedia.org/T276241) (owner: Zppix)
[01:01:36] bd808: thank you, the issue seems to be it does not even get on the channel as Urbanecm says
[01:02:05] It seems that identification fails for whatever reason
[01:02:35] It is connected to freenode, but not authenticated, and thus it ended up in -overflow
[01:05:40] Did we not get it using SASL yet? That would be my guess on being kicked into -overflow
[01:06:40] Perhaps?
[01:11:50] Urbanecm: eh.. 01:11 -!- logmsgbot [~logmsgbot@nat.openstack.eqiad1.wikimediacloud.org]
[01:12:12] that is not the one from the icinga servers
[01:14:45] should puppet set it to running automatically?
[01:14:48] shouldnt*
[01:15:25] yea, it does, the service is running on icinga servers in prod
[01:15:34] that logmsgbot above is connected from cloud though
[01:16:51] SRE, Instrument-ClientError, MediaWiki-extensions-WikimediaEvents, observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (Jdlrobson) Open→Resolved a:Jdlrobson Thanks so much for you...
[01:19:14] mutante: could it just be the ip its coming from is somehow resolving to the hostname of cloud?
[01:20:36] Zppix: nah, if it is wikimediacloud.org it is definitely not running in production
[01:21:28] is it strange that i can't even resolve wikimediacloud.org to an IP?
[01:22:47] 185.15.56.1
[01:22:49] is what i get
[01:23:29] it's that NAT IP
[01:23:46] in the past it wasnt like everything was outgoing via that IP
[01:24:11] could be related
[01:24:16] nat.openstack.eqiad1.wikimediacloud.org is the made up name for the Cloud VPS network's public egress IP
[01:25:04] PROBLEM - mediawiki-installation DSH group on deploy2001 is CRITICAL: Host deploy2001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[01:26:34] ACKNOWLEDGEMENT - mediawiki-installation DSH group on deploy2001 is CRITICAL: Host deploy2001 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[01:28:47] mutante: the zone file is at https://github.com/wikimedia/operations-dns/blob/master/templates/wikimediacloud.org. We have never set up an A record for the bare domain. Probably not a horrible idea though to make it redirect to some page on wikitech.
[01:29:56] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS would be a reasonable place for https://wikimediacloud.org/ to take you to.
[01:30:47] bd808: ACK, i see. yea, it was just about the bare domain
[01:31:46] but it still doesnt explain why it didnt auth
[01:49:02] SRE, ops-codfw, netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (Papaul) @faidon do we have some documentation on the console configuration for the RIPE? - console baud rate - Type of cable to use to connect to the console I tried to use a Cisco console cable and a DB9 t...
[01:54:53] SRE, ops-codfw, netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (CDanis) I'm pretty sure the baud rate is 19200 Not sure about the cable type
[01:59:36] SRE, ops-codfw, netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (Papaul) I tried both 9600 and 19200 on both cable it didn't work
[02:06:38] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (Papaul)
[03:17:08] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:19:56] PROBLEM - tcpircbot_service_running on alert1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot
[03:22:18] RECOVERY - tcpircbot_service_running on alert1001 is OK: PROCS OK: 1 process with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot
[03:24:04] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.064 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:24:40] !log `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph` ~2 mins ago
[03:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:10] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:40:56] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:02] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:03] !log Depooled `wdqs1012` until I've got its updater back online
[03:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:58] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 3 (contint2001, ...), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:07:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:10:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:11:52] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:44] !log T275345 `ryankemper@elastic2045:~$ sudo apt-get upgrade wmf-elasticsearch-search-plugins`
[05:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:53] T275345: Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345
[05:18:58] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:28] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:27:43] !log Downtime `wdqs1012` until `2021-03-03 19:25:40` (~14 hours from now). Its `wdqs-updater` is failing; ultimately its blazegraph journal is probably in a bad state meaning we'd have to copy one over from a healthy node, but not kicking that off right now so that we can investigate a little bit first
[05:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:10] SRE, ops-codfw, Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts: ` elastic2054.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[05:31:30] !log T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet`
[05:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:37] T274555: elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555
[05:31:37] (CR) Marostegui: "> Patch Set 7:" [puppet] - https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: Razzi)
[05:32:36] !log T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` on `ryankemper@cumin2001` tmux session `elastic_reimage_elastic2054`
[05:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:25] (PS1) Marostegui: instances.yaml: Add db1164 to dbctl [puppet] - https://gerrit.wikimedia.org/r/667975 (https://phabricator.wikimedia.org/T258361)
[05:40:00] (CR) Marostegui: [C: +2] instances.yaml: Add db1164 to dbctl [puppet] - https://gerrit.wikimedia.org/r/667975 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[05:42:35] (PS1) Marostegui: db1164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/667976 (https://phabricator.wikimedia.org/T258361)
[05:47:09] (CR) Marostegui: [C: +2] db1164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/667976 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:02:26] SRE, ops-codfw, Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2054.codfw.wmnet'] ` and were **ALL** successful.
[06:15:07] !log T274555 Removed downtime for `elastic2054`
[06:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:14] T274555: elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555
[06:17:13] !log T275345 T274555 Unbanning `elastic2045` and `elastic2054` from our cluster now that both hosts have been re-imaged and are running without errors (commands follow)
[06:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:22] T275345: Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345
[06:18:19] !log T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}`
[06:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:27] !log T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}`
[06:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:38] T274555: elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555
[06:21:13] !log T275345 T274555 Re-pooling `elastic2045` and `elastic2054` (commands follow)
[06:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:54] !log T275345 T274555 `sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001`
[06:26:57] !log T275345 T274555 `sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001`
[06:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:02] T274555: elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555
[06:27:02] T275345: Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345
[06:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:30] (PS2) Brennen Bearnes: logspam-watch: histograms, helptext, and utf-8 handling [puppet] - https://gerrit.wikimedia.org/r/667310
[06:41:20] !log Testing log
[06:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:18] SRE, conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (Marostegui)
[06:45:17] there's something above about logmsgbot not working properly
[06:48:36] Urbanecm: mutante: logmsgbot code (https://github.com/wikimedia/puppet/blob/production/modules/tcpircbot/files/tcpircbot.py) doesn't seem to even try to auth with nickserv
[06:49:17] still not sure why someone with that nick is connecting from WMCS
[06:52:44] SRE, conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (Majavah) `logmsgbot` should be running from alert* servers, for some reason it's connected from a WMCS address: ` 06:51:24 -- | [logmsgbot] (~logmsgbot@nat.openstack.eqiad1.wikimediacloud.org): logmsgbot 06:51:24...
[06:54:57] SRE, conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (Majavah) >>! In T276299#6877466, @Majavah wrote: > [[ https://github.com/wikimedia/puppet/blob/production/modules/tcpircbot/files/tcpircbot.py | Its code ]] doesn't seem to authenticate at all with NickServ correc...
[06:59:59] is python-ib3 available on the production cluster? https://github.com/bd808/python-ib3
[07:04:54] SRE, conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (CDanis) I believe this is related to https://sal.toolforge.org/log/YZV29XcBa_6PSCT9KHrZ cc'ing @Dzahn and @bd808 who were investigating
[07:05:19] SRE, conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (Zppix) This is an issue that mutante had tried to figure out earlier, IIRC is not connected from cloud, it just appears to be that way due to NAT... (there was some discussion in -operations about it but i forget) i...
[07:06:12] SRE, conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (Marostegui) If this helps in anyways, I just got this error when pushing: ` root@cumin1001:/home/marostegui# dbctl instance db1164 pool -p5 root@cumin1001:/home/marostegui# dbctl config commit -m "Increase weight f...
[07:06:30] Majavah: it auths https://github.com/wikimedia/puppet/blob/9bb5f3e1a816ffd61c72e23667ea5948c2dcdb6d/modules/tcpircbot/files/tcpircbot.py#L15
[07:06:37] it uses serverpassword to auth
[07:06:59] Zppix: yeah, realized that and added it on the task
[07:07:52] marostegui: that looks like it may be blocked then by a firewall or something based on the connection refused
[07:08:14] zpapierski: could be yeah, though the error isn't showing up on every push
[07:08:55] I just would like to see if something in cloud is accidentally using the same name or if it's just a prod server somehow using that wmcs nat ip
[07:09:56] <_joe_> Majavah: I doubt cloud has anything to do with it.
[07:10:08] Majavah: cloud has nothing to do with it, cloud can't access backend prod IIRC
[07:10:58] looking at -overflow logs the bot connected to that at ~00:10Z today, mutante restarted the service at ~00:20Z but it didn't cycle, then at ~01:00Z it disconnected and came back
[07:11:14] if you stop the service does it disconnect?
[07:11:39] marostegui: i wonder if you stop the service let it sit for a min then start
[07:12:00] Zppix: which service?
[07:12:02] <_joe_> can we please stop guessing without any base?
[07:12:09] <_joe_> let's go in order.
[07:12:12] marostegui: tcpircbot
[07:12:29] <_joe_> marostegui: so how often do you see errors like the one in https://phabricator.wikimedia.org/T276299#6877477 ?
[07:12:34] _joe_: my thought is using systemctl restart is too fast for it to cycle connections
[07:12:37] <_joe_> on every dbctl command?
[07:12:44] _joe_: No, I just saw it after a bunch of runs
[07:12:48] So this is the first time
[07:12:53] definitely not showing up every time
[07:13:03] <_joe_> ok so that seems like a different problem than the one you reported originally
[07:13:25] yep, not sure if related
[07:13:34] I just ran another dbctl change and it didn't show the error
[07:13:42] <_joe_> definitely different, that's a connection error, which is weird in itself
[07:14:08] <_joe_> so, is there one category of dbctl messages that consistently don't get logged?
[07:14:35] earlier today scap also didn't log anything
[07:14:53] I tried a manual !log and it did work, but none of the ones coming from dbctl are being logged
[07:14:57] <_joe_> ok so it's a more general problem, tcpircbot not accepting messages it seems
[07:15:51] <_joe_> ok, I tried to send a message to tcpircbot from cumin1001 and it didn't work indeed
[07:16:39] <_joe_> tcpircbot just restarted sigh
[07:17:18] <_joe_> !log test log
[07:17:19] the bot is connected on #wikimedia-overflow, it did not just disconnect and reconnect
[07:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:34] _joe_: Do you want me to try with dbctl?
[07:17:38] <_joe_> no
[07:17:46] <_joe_> did anyone bother to look at tcpircbot logs?
[07:17:55] stashbot (which records !logs) is a different bot than logmsgbot/tcpircbot which relays the messages
[07:17:55] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[07:18:02] <_joe_> Mar 03 07:17:18 alert1001 python[17693]: Traceback (most recent call last):
[07:18:04] <_joe_> Mar 03 07:17:18 alert1001 python[17693]: File "tcpircbot.py", line 147, in <module>
[07:18:06] <_joe_> Mar 03 07:17:18 alert1001 python[17693]: readable, _, _ = select.select([bot.connection.socket] + files, [], [])
[07:18:08] <_joe_> Mar 03 07:17:18 alert1001 python[17693]: TypeError: argument must be an int, or have a fileno() method
[07:18:10] <_joe_> Mar 03 07:17:18 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Main process exited, code=exited, status=1/FAILURE
[07:18:12] <_joe_> this happens every time someone logs a message
[07:18:40] <_joe_> did someone change anything in the bot recently?
[07:19:16] no commits to the python script since 2017
[07:20:19] SRE, conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (Joe) The issue is more general: tcpircbot crashes on every invocation in the following way: ` Mar 03 07:18:00 alert1001 python[2522]: Traceback (most recent call last): Mar 03 07:18:00 alert1001 python[2522]: Fil...
[07:22:33] <_joe_> marostegui: can you try logging something from dbctl? in like 10 seconds
[07:22:38] sure
[07:22:55] _joe_: done
[07:23:09] <_joe_> marostegui: ok so something is *really* wrong
[07:23:11] <_joe_> I see
[07:23:13] <_joe_> Mar 03 07:22:51 alert1001 python[21384]: 2021-03-03 07:22:51,708 TCP ('::ffff:10.64.32.25', 47634, 0, 0): "!log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14586 and previous config saved to /var/cache/conftool/dbconfig/20210303-072251-marostegui.json"
[07:23:13] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:23:29] <_joe_> and then the crash
[07:24:16] <_joe_> -r--r--r-- 1 root root 773 Mar 3 00:47 tcpircbot-logmsgbot.json
[07:24:22] <_joe_> this is what changed
[07:24:26] were any related packages updated recently?
[07:24:37] _joe_: :-/
[07:24:38] <_joe_> Majavah: no it's the config that was changed, I'm ready to bet
[07:24:52] maybe deploy* server change related?
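Editor's note on the traceback pasted at 07:18: the `TypeError` is what `select.select()` raises when an item in its read list is neither an integer file descriptor nor an object with a `fileno()` method. The sketch below reproduces that error class in isolation. It assumes, without confirmation from the log, that `bot.connection.socket` was `None` at the time (for example because the IRC connection had not been established); `try_select` is a hypothetical helper and not part of tcpircbot.

```python
import select
import socket


def try_select(sock):
    """Return None if select() accepts sock, else the TypeError message."""
    try:
        # Zero timeout: poll once and return immediately.
        select.select([sock], [], [], 0)
    except TypeError as exc:
        return str(exc)
    return None


# A real socket has fileno(), so select() accepts it.
left, right = socket.socketpair()
ok = try_select(left)
left.close()
right.close()

# None stands in for a connection whose socket was never set up (assumption).
err = try_select(None)
print("real socket:", ok)
print("None in read list:", err)
```

If that assumption holds, a guard that skips the socket while it is unset would keep the loop alive, but whether that is the right fix for the bot depends on why the connection was missing in the first place.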
[07:25:13] <_joe_> that was the day before
[07:26:00] so https://gerrit.wikimedia.org/r/c/operations/puppet/+/667238, https://gerrit.wikimedia.org/r/c/operations/puppet/+/667953, https://gerrit.wikimedia.org/r/c/operations/puppet/+/667042 or https://gerrit.wikimedia.org/r/c/operations/puppet/+/635108
[07:26:23] those are the recent config changes
[07:26:33] <_joe_> yeah, frankly no idea
[07:26:53] <_joe_> to solve this I need to read the code of that bot again
[07:27:16] <_joe_> I'm not exactly happy people would leave a fundamental auditing tool in such a dire state and just leave
[07:28:13] <_joe_> I'm stopping tcpircbot
[07:29:42] it hasn't disconnected from #wikimedia-overflow
[07:29:56] <_joe_> I can assure you that's a freenode problem :)
[07:30:03] <_joe_> the service is not running
[07:30:06] <_joe_> !log test
[07:30:10] <_joe_> as you can see
[07:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:15] <_joe_> uhhh
[07:30:22] <_joe_> ok that's stashbot
[07:30:36] <_joe_> separated
[07:31:45] <_joe_> Majavah: any idea why it's in #overflow?
[07:31:56] no, that's what I've been wondering yet
[07:32:04] it's also not authed
[07:32:20] I guess that's why its in overflow, but no idea why it has not authed
[07:32:21] <_joe_> and not coming from alert1001 either
[07:32:26] (CR) Thiemo Kreuz (WMDE): [C: +1] ReferenceTooltips and other gadget names for ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: Thiemo Kreuz (WMDE))
[07:33:02] just making sure, is the password specified in it in the username:password format?
[07:33:08] <_joe_> yes
[07:33:18] <_joe_> Majavah: ok I'll try to join with an IRC client and authenticate as logmsgbot
[07:33:34] this channel is +rf #wikimedia-overflow so that explains why it's on overflow if it does not auth
[07:33:42] PROBLEM - tcpircbot_service_running on alert1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot
[07:34:02] make sure to try the server password auth, not sasl as the bot uses server passwords
[07:35:19] SRE, Analytics: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (elukey) WOW
[07:35:56] if it was that process just not closing the connection, I would guess that it would have had to time out by now, but it hasn't
[07:37:20] Yeah I can't get my python bot to connect to freenode either
[07:37:37] It does the auth and then something crashes
[07:37:42] <_joe_> ok that bot is NOT the actual logmsgbot
[07:37:48] <_joe_> and has the correct password
[07:38:00] <_joe_> unless it's connecting from another alerting server
[07:38:19] the only place its set to connect from on any public repo is from alert*
[07:38:19] <_joe_> but I doubt it, at this point
[07:38:22] how do you know that the other/fake bot has the correct password?
[07:38:38] <_joe_> Majavah: I dunno, but I just ghosted it and it reconnected immediately
[07:38:43] try changing the password and see if it goes to normal?
[07:39:44] <_joe_> no.
[07:39:57] <_joe_> so the user from cloud is NOT identifying correctly
[07:40:04] if it's behind NAT, does some router know which hosts are connected to card.freenode.net?
[07:40:24] <_joe_> ok, how do I enable nick protection?
[07:40:27] <_joe_> I don't remember :P [07:40:33] nickserv set guard on [07:40:40] <_joe_> thanks Zppix [07:40:57] isn't it enforce and not guard [07:40:57] tbh that should be enabled for all bots [07:41:02] maybe its enforce [07:41:10] if that doesnt work try enforce [07:41:19] i get those two mixed sometimes [07:41:24] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:26] ENFORCE Enables or disables automatic protection of a nickname. [07:41:41] <_joe_> yeah done it right now Majavah [07:42:19] note that it won't prevent it from connecting, the network will just switch it to a random nick (GuestXX I think) if it does not auth within like 30 secodns [07:42:19] _joe_: the next question, is this a security breach? or just a misconfiguration? [07:42:29] <_joe_> the latter [07:42:40] I see. The server only allows SASL connections only oO [07:42:44] -only [07:42:50] <_joe_> Bsadowski1: oh sigh [07:43:02] RECOVERY - tcpircbot_service_running on alert1001 is OK: PROCS OK: 1 process with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot [07:43:08] i dont think so [07:43:24] Bsadowski1: which server? [07:43:26] card [07:43:30] <_joe_> marostegui: can you please stop using dbctl for a bit? [07:43:42] ok, let me do the last repool I have to do [07:44:00] Wireshark told me this: "Trailer: *** Notice -- You need to identify via SASL to use this server" [07:44:02] <_joe_> the actual bot is trying to connect to various servers for the record [07:44:27] <_joe_> but I see no login attempts by it [07:44:28] _joe_: done [07:44:47] <_joe_> so let's see now that marostegui will stop crashing it [07:44:53] haha [07:44:54] :P [07:44:56] <_joe_> it with more time it will do it [07:45:02] I just connected to card. 
without SASL [07:45:02] <_joe_> or if login never really worked [07:45:15] Oh idk then, Majavah [07:45:20] 07:41:36 -- | logmsgbot is now known as Guest39824 [07:45:25] so the enforce just kicked in [07:45:37] Bsadowski1: that's probably based on your ip tbh [07:45:52] yeah I am on a tether right now [07:46:02] :( [07:46:12] <_joe_> please stop randomguessing, or I won't be able to get to the bottom of this [07:47:10] <_joe_> the bot seems generally unable to connect, but I need to look at the bot code at this point [07:47:35] so i wonder what is connecting [07:47:44] <_joe_> that's irrelevant right now [07:54:16] <_joe_> it looks like freenode is blocking the service from logging in [07:59:32] PROBLEM - tcpircbot_service_running on alert1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot [08:01:41] 10SRE, 10Traffic: Sudden surge of requests to https://wikipedia.org/ from Telus customers - https://phabricator.wikimedia.org/T276213 (10elukey) [08:06:35] 10SRE, 10Traffic: Sudden surge of requests to https://wikipedia.org/ from Telus customers - https://phabricator.wikimedia.org/T276213 (10elukey) > There is a similar spike at around the same time during the Pacific night, between 8 and 9 UTC on January 7th, but I'm not sure if we can still verify what was in t... 
[08:06:50] RECOVERY - tcpircbot_service_running on alert1001 is OK: PROCS OK: 1 process with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot [08:06:55] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (10JMeybohm) p:05Triage→03Medium [08:07:50] 10SRE, 10Analytics, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) p:05Triage→03Medium [08:08:20] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10JMeybohm) p:05High→03Medium Lowering prioiry to medium as of discussion with @Joe [08:08:45] saw the task on tcpircbot, please let me know if I can help [08:08:58] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:09:02] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10JMeybohm) p:05Triage→03Medium [08:11:14] (03CR) 10Elukey: Allow changing the IP of the fcgi server (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667885 (owner: 10Giuseppe Lavagetto) [08:11:36] _joe_: FYI ^ [08:11:57] (03CR) 10Elukey: [C: 03+1] "From my limited understanding of docker images, LGTM :)" (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667885 (owner: 10Giuseppe Lavagetto) [08:12:46] <_joe_> godog: yeah I have no idea why it's failing to log in to freenode [08:13:02] how exactly it is failing? [08:13:07] <_joe_> I just tried to change username to wmf-logmsgbot (just registered and enforced) [08:13:18] <_joe_> Majavah: no apparent error message [08:13:39] any ideas at what stage? 
[08:13:46] <_joe_> no [08:13:50] :/ [08:14:37] _joe_: ack [08:15:11] (03PS1) 10JMeybohm: Switch staging.svc.ediad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/667982 [08:15:13] (03PS1) 10JMeybohm: Change kubestagemaster.svc.equiad.wmnet to point to new master [dns] - 10https://gerrit.wikimedia.org/r/667983 [08:17:33] did the script leave any log messages? [08:18:56] (03CR) 10Elukey: "LGTM, again I am super ignorant about Docker images so I checked quickly the config for httpd. Left some notes about mod_access_compact an" (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 (owner: 10Giuseppe Lavagetto) [08:22:00] <_joe_> is there any way to remove the requirement to be identified before joining this channel temporarily? [08:22:14] ./mode #wikimedia-operations -r [08:22:31] (03PS1) 10JMeybohm: Allow RunAsAny in the restricted PSP as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/667986 (https://phabricator.wikimedia.org/T274262) [08:22:39] 10SRE, 10vm-requests: EQIAD and CODFW : 5of VMs requested for kubernetes master - https://phabricator.wikimedia.org/T276204 (10akosiaris) 05Open→03Resolved VMs up and running, resolving. 
[08:22:49] <_joe_> yeah somehow the bot, even with the new nick that correctly identified it, somehow ended in #wikimedia-overflow [08:23:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow RunAsAny in the restricted PSP as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/667986 (https://phabricator.wikimedia.org/T274262) (owner: 10JMeybohm) [08:23:30] I ran the script locally, it still takes a second or two after connecting before it identifies [08:23:42] so maybe the bot should wait a few seconds before trying to join channels [08:23:54] <_joe_> it's not crashing anymore, at least [08:23:56] if you do /mode #wikimedia-operations -r while opped it should allow you [08:32:15] !log stop/mask tcpircbot-logmsgbot on pontoon-icinga-01 - T276299 [08:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:25] T276299: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 [08:32:39] I realized that service must have been trying to connect, hence the cloud NAT ip [08:32:46] without the right password of course [08:32:56] pontoon? 
[08:33:22] PROBLEM - tcpircbot_service_running on alert1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot [08:33:41] Zppix: yeah, a testing environment [08:33:50] ah [08:35:07] (03PS1) 10Muehlenhoff: Update email address [puppet] - 10https://gerrit.wikimedia.org/r/667987 [08:35:50] RECOVERY - tcpircbot_service_running on alert1001 is OK: PROCS OK: 1 process with command name python, args tcpircbot.py https://wikitech.wikimedia.org/wiki/Logmsgbot [08:37:09] <_joe_> please stop doing actions that would log to tcpircbot for some time, including any use of cumin, else I'll never get to solve the issue [08:38:57] <_joe_> ebernhardson: ^^ [08:39:05] <_joe_> err sorry elukey ^^ [08:39:14] <_joe_> ok domne [08:39:18] it came back authed [08:39:44] im afraid leaving -r might cause spam isseus [08:40:01] <_joe_> yes reverting now [08:40:20] <_joe_> ok, done [08:40:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [08:40:32] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [08:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:16] <_joe_> ok, now lemme try to relogin with +r [08:41:24] * Zppix crosses fingers [08:42:03] <_joe_> ok all good [08:42:03] It worked [08:42:03] \o/ [08:42:30] <_joe_> now the problem as I understand it is if we send a message to tcpircbot while it's not properly connected, it crashes [08:42:41] <_joe_> and it takes a while to reconnect [08:42:50] do we know why it wasnt connected though? [08:42:52] <_joe_> so the pontoon one connected instead, without auth [08:42:54] cumin doean't ! log... 
just cookbooks [08:42:58] <_joe_> it was restarted [08:43:19] <_joe_> volans: yeah ok, elukey was using a cookbook, sorry if I wasn't precise [08:43:35] so after all it was a cloud host accidentally using the prod module [08:43:39] (03CR) 10David Caro: [C: 03+2] Use the correct package name/path everywhere [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 (owner: 10David Caro) [08:44:23] 10SRE, 10conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (10Joe) We found a few other issues: - The nick has no enforce, thus a random instance running in labs is connecting (obviously without password) - Nickserv says it saw the user the last time at when the issues started... [08:45:02] (03CR) 10David Caro: [V: 03+2 C: 03+2] Use the correct package name/path everywhere [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/667887 (owner: 10David Caro) [08:45:08] (03CR) 10Muehlenhoff: [C: 03+2] Update email address [puppet] - 10https://gerrit.wikimedia.org/r/667987 (owner: 10Muehlenhoff) [08:47:59] _joe_: happy to give dbctl a test if you need it for testing [08:48:18] <_joe_> it works, please go ahead with your work [08:48:33] !log test tcpircbot --joe [08:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:38] <_joe_> see? [08:48:55] 10SRE, 10conftool: dbctl not sending !log to IRC - https://phabricator.wikimedia.org/T276299 (10Joe) 05Open→03Resolved p:05Triage→03High a:03Joe After more analysis, this is my understanding of the outstanding problems: - pontoon was connecting without password as logmsgbot, and given the nick has no... 
[08:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14590 and previous config saved to /var/cache/conftool/dbconfig/20210303-085014-marostegui.json [08:50:19] \o/ _joe_ ^ [08:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:22] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [08:53:18] (03CR) 10Muehlenhoff: [C: 03+2] profile::zuul::server: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/667102 (owner: 10Muehlenhoff) [08:54:26] !log zpapierski@deploy1002 Started deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 [08:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:43] 10SRE, 10Desktop Improvements, 10Traffic, 10Bengali-Sites, and 4 others: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10ovasileva) >>! In T274784#6875627, @BBlack wrote: > @ovasileva Yes, that plan seems reasonable! > > Just... 
[08:55:13] I'll followup in a task to observability [08:55:37] (03CR) 10JMeybohm: [C: 03+2] Allow RunAsAny in the restricted PSP as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/667986 (https://phabricator.wikimedia.org/T274262) (owner: 10JMeybohm) [08:56:11] _joe_: kinda of a similar task, both have to do with bots not authing to IRC T275594 (its private, so idk if you can see it) [08:56:19] (03Merged) 10jenkins-bot: Allow RunAsAny in the restricted PSP as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/667986 (https://phabricator.wikimedia.org/T274262) (owner: 10JMeybohm) [08:56:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14591 and previous config saved to /var/cache/conftool/dbconfig/20210303-085658-root.json [08:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:32] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [08:57:48] <_joe_> Zppix: completely different issue. [08:57:55] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1164 is now slowly automatically being pooled in s1 (running 10.4.18) [08:58:17] _joe_: oh i guess i interpreted your closing comment differently then [08:58:49] anyways good luck guys, i've off to bed. 
[08:59:37] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [08:59:48] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [09:00:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P14592 and previous config saved to /var/cache/conftool/dbconfig/20210303-090030-marostegui.json [09:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:43] !log zpapierski@deploy1002 Finished deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 (duration: 08m 17s) [09:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:09] (03CR) 10Muehlenhoff: [C: 03+2] profile::tlsproxy::envoy: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/667861 (owner: 10Muehlenhoff) [09:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14593 and previous config saved to /var/cache/conftool/dbconfig/20210303-090840-root.json [09:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 15%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14594 and previous config saved to /var/cache/conftool/dbconfig/20210303-091201-root.json [09:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:13] (03CR) 10Kormat: [C: 03+1] Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [09:13:47] (03PS1) 
10Muehlenhoff: debian: remove jessie from spec tests [puppet] - 10https://gerrit.wikimedia.org/r/667994 [09:16:05] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [09:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:42] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [09:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:19] (03PS1) 10Muehlenhoff: aptrepo: Update spec test [puppet] - 10https://gerrit.wikimedia.org/r/667995 [09:17:39] (03PS1) 10JMeybohm: deployment_server: Switch k8s staging to codfw [puppet] - 10https://gerrit.wikimedia.org/r/667996 [09:19:50] (03PS1) 10Gehel: cirrus/query_service/icinga: migrate tests to use buster [puppet] - 10https://gerrit.wikimedia.org/r/667998 [09:21:43] (03PS1) 10Muehlenhoff: prometheus::haproxy_exporter: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668000 [09:22:49] (03PS1) 10Elukey: [WIP] Add profile to deploy the 5.10 Linux kernel on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/668001 [09:23:21] ���� [09:23:34] uh? [09:23:40] (03PS5) 10Kormat: mariadb: Convert pt-heartbeat to a systemd service. 
[puppet] - 10https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528) [09:23:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14595 and previous config saved to /var/cache/conftool/dbconfig/20210303-092343-root.json [09:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:54] (03PS20) 10Kormat: mariadb: Add section parameters [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) [09:24:57] (03PS2) 10Elukey: [WIP] Add profile to deploy the 5.10 Linux kernel on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/668001 [09:25:56] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [09:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14596 and previous config saved to /var/cache/conftool/dbconfig/20210303-092705-root.json [09:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:39] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10Joe) Can I ask how do we intend to perform the transition from non-tls to tls in detail? I see a series of pitfalls with our current setup and the code I see in puppet, but please be explicit about the... 
[09:28:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [09:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:09] (03PS2) 10Gehel: cirrus/query_service/icinga: migrate tests to use buster [puppet] - 10https://gerrit.wikimedia.org/r/667998 [09:28:11] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [09:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:59] (03PS3) 10Gehel: cirrus/query_service/icinga: remove references to jessie [puppet] - 10https://gerrit.wikimedia.org/r/667998 [09:30:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [09:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/667998 (owner: 10Gehel) [09:30:41] (03CR) 10Gehel: [C: 03+2] cirrus/query_service/icinga: remove references to jessie [puppet] - 10https://gerrit.wikimedia.org/r/667998 (owner: 10Gehel) [09:31:14] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet [09:35:50] (03PS1) 10Muehlenhoff: labs_bootstrapvz: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668005 [09:35:52] (03PS3) 10Elukey: [WIP] Add profile to deploy the 5.10 Linux kernel on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/668001 [09:35:54] (03CR) 10Elukey: "Hi everybody, I kicked off this WIP patch to see if we can discuss what it is needed to puppetize 5.10 for Buster nodes. 
Lemme know :)" [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [09:35:56] (03PS1) 10Muehlenhoff: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/668006 [09:35:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/667928 (https://phabricator.wikimedia.org/T276208) (owner: 10Andrew Bogott) [09:36:44] (03CR) 10Klausman: "The 4.19-Stretch equivalent of this also installs rasdaemon and masks the mcelog systemd unit. Is that relevant here?" [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [09:37:33] (03CR) 10Kormat: [C: 03+1] "Good catch, thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/668006 (owner: 10Muehlenhoff) [09:38:19] (03CR) 10Muehlenhoff: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [09:38:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14597 and previous config saved to /var/cache/conftool/dbconfig/20210303-093847-root.json [09:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:07] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1132,1135-1138].eqiad.wmnet [09:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:13] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1132,1135-1138].eqiad.wmnet [09:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:45] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:05] (03PS1) 10Vgutierrez: ATS: Enable parent proxies on ats-tls@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/668007 (https://phabricator.wikimedia.org/T274888) [09:42:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 30%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14598 and previous config saved to /var/cache/conftool/dbconfig/20210303-094208-root.json [09:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:32] (03CR) 10Jcrespo: [C: 03+1] Update comment [puppet] - 10https://gerrit.wikimedia.org/r/668006 (owner: 10Muehlenhoff) [09:43:28] (03PS1) 10Muehlenhoff: profile::wmcs::nfs::primary: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668008 [09:45:40] (03PS1) 10Muehlenhoff: striker: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668009 [09:46:11] (03CR) 10Klausman: [C: 03+1] [WIP] Add profile to deploy the 5.10 Linux kernel on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [09:47:18] (03CR) 10Volans: [C: 03+2] puppetdb microservice: refactor prior to expand it [puppet] - 10https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [09:47:22] (03PS1) 10Muehlenhoff: service::monitoring: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668010 [09:48:20] (03CR) 10Volans: [C: 03+2] puppetdb microservice: add support for cumin [puppet] - 
10https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [09:48:37] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:49:02] (03PS2) 10JMeybohm: deployment_server: Switch k8s staging to codfw [puppet] - 10https://gerrit.wikimedia.org/r/667996 (https://phabricator.wikimedia.org/T276305) [09:49:04] (03PS1) 10JMeybohm: Remove old kubernetes saging master neon.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668011 (https://phabricator.wikimedia.org/T276305) [09:49:08] (03PS6) 10Volans: puppetdb microservice: add support for cumin [puppet] - 10https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) [09:49:26] (03CR) 10Marostegui: mariadb: Add section parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [09:49:29] 10SRE, 10vm-requests: EQIAD and CODFW : 5of VMs requested for kubernetes master - https://phabricator.wikimedia.org/T276204 (10JMeybohm) [09:53:22] (03PS1) 10JMeybohm: Set role on new kubernetes staging master in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/668012 (https://phabricator.wikimedia.org/T276305) [09:53:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14599 and previous config saved to /var/cache/conftool/dbconfig/20210303-095351-root.json [09:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:06] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10aborrero) Rebooting cloudnet1003 into the new kernel failed to bring interfaces up: `lang=shell-session 
root@cloudnet1003:~... [09:54:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P14600 and previous config saved to /var/cache/conftool/dbconfig/20210303-095417-marostegui.json [09:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:36] (03CR) 10Muehlenhoff: [WIP] Add profile to deploy the 5.10 Linux kernel on Buster hosts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [09:54:55] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudnet1003.eqiad.wmnet with reason: HW issue [09:54:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudnet1003.eqiad.wmnet with reason: HW issue [09:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14601 and previous config saved to /var/cache/conftool/dbconfig/20210303-095712-root.json [09:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:42] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28331/console" [puppet] - 10https://gerrit.wikimedia.org/r/667907 (owner: 10Vgutierrez) [09:57:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14602 and previous config saved to /var/cache/conftool/dbconfig/20210303-095751-root.json [09:57:54] (03CR) 10Volans: [C: 03+2] Add cuminunpriv1001 to allowed hosts for puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/667600 (owner: 
10Muehlenhoff) [09:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:24] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:00:26] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet [10:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:35] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet [10:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:53] (03PS1) 10Muehlenhoff: git::lfs: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668013 [10:01:39] (03PS1) 10Elukey: profile::hadoop::worker: add explicit require for profile::java [puppet] - 10https://gerrit.wikimedia.org/r/668014 (https://phabricator.wikimedia.org/T274795) [10:01:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={etherpad,netbox_device_statistics,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:01:45] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::haproxy_exporter: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668000 (owner: 10Muehlenhoff) [10:02:34] (03PS2) 10JMeybohm: Remove old kubernetes saging master neon.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668011 (https://phabricator.wikimedia.org/T276305) [10:02:36] (03PS2) 10JMeybohm: Set role on new kubernetes staging master in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/668012 (https://phabricator.wikimedia.org/T276305) [10:02:38] (03PS1) 10JMeybohm: Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet [puppet] - 
10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) [10:02:50] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28332/console" [puppet] - 10https://gerrit.wikimedia.org/r/668014 (https://phabricator.wikimedia.org/T274795) (owner: 10Elukey) [10:03:29] jayme: "saging" typo spotted in 668011 while passing by :D [10:03:46] elukey: pssst! :D [10:03:52] (03PS1) 10Muehlenhoff: profile::puppetmaster::common: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668016 [10:04:19] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::hadoop::worker: add explicit require for profile::java [puppet] - 10https://gerrit.wikimedia.org/r/668014 (https://phabricator.wikimedia.org/T274795) (owner: 10Elukey) [10:04:37] (03CR) 10Filippo Giunchedi: [C: 03+1] assign mwlog2002 role::logging::mediawiki::udp2log [puppet] - 10https://gerrit.wikimedia.org/r/667911 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [10:04:39] (03PS3) 10JMeybohm: Remove old kubernetes staging master neon.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668011 (https://phabricator.wikimedia.org/T276305) [10:04:41] (03PS3) 10JMeybohm: Set role on new kubernetes staging master in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/668012 (https://phabricator.wikimedia.org/T276305) [10:04:43] (03PS2) 10JMeybohm: Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) [10:05:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:05:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet [10:05:51] (03PS21) 10Kormat: mariadb: Add section parameters [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) [10:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:48] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28333/console" [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [10:06:50] (03PS1) 10Volans: puppetdb microservice: fix API paths [puppet] - 10https://gerrit.wikimedia.org/r/668017 (https://phabricator.wikimedia.org/T244840) [10:07:48] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10aborrero) 05Open→03Resolved ok, updating the firmware-bnx2x + upgrading the kernel did the trick apparently. I'm closin... 
[10:07:54] (03PS1) 10Elukey: Add an-worker113[2,5-8] to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/668018 (https://phabricator.wikimedia.org/T274795) [10:08:02] (03CR) 10Filippo Giunchedi: [C: 04-1] "I can't seem to find the log_mediawiki_servergroup_level_channel_doc_count metric:" [puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) (owner: 10Effie Mouzeli) [10:08:06] (03CR) 10jerkins-bot: [V: 04-1] puppetdb microservice: fix API paths [puppet] - 10https://gerrit.wikimedia.org/r/668017 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [10:08:18] (03PS1) 10Muehlenhoff: profile::conftool::client: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668019 [10:08:39] (03CR) 10Elukey: [C: 03+2] Add an-worker113[2,5-8] to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/668018 (https://phabricator.wikimedia.org/T274795) (owner: 10Elukey) [10:08:53] (03PS2) 10Volans: puppetdb microservice: fix API paths [puppet] - 10https://gerrit.wikimedia.org/r/668017 (https://phabricator.wikimedia.org/T244840) [10:12:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 60%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14604 and previous config saved to /var/cache/conftool/dbconfig/20210303-101215-root.json [10:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:22] (03PS1) 10Elukey: Correct typo in site.pp related to new analytics worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/668020 [10:12:48] jayme: speaking of typos right :D --^ [10:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14605 and previous config saved to /var/cache/conftool/dbconfig/20210303-101255-root.json [10:13:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:03] (03CR) 10Elukey: [C: 03+2] Correct typo in site.pp related to new analytics worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/668020 (owner: 10Elukey) [10:14:42] (03PS2) 10JMeybohm: Switch staging.svc.ediad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/667982 (https://phabricator.wikimedia.org/T276305) [10:14:44] (03PS2) 10JMeybohm: Change kubestagemaster.svc.equiad.wmnet to point to new master [dns] - 10https://gerrit.wikimedia.org/r/667983 (https://phabricator.wikimedia.org/T276305) [10:15:52] elukey: eheh :) [10:16:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/668017 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [10:16:53] (03CR) 10Volans: [C: 03+2] puppetdb microservice: fix API paths [puppet] - 10https://gerrit.wikimedia.org/r/668017 (https://phabricator.wikimedia.org/T244840) (owner: 10Volans) [10:18:01] (03PS22) 10Kormat: mariadb: Add section parameters [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) [10:18:22] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See the comments: basically I think the better approach is to pass the role to the outer functions, or alternatively you need to safeguard" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [10:20:11] (03CR) 10Jbond: [C: 03+1] "lgtm and i just noticed that i never hit send on the below response that said the issue it referes to has been fixed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [10:21:21] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28334/console" [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) 
[10:22:56] (03PS23) 10Kormat: mariadb: Add section parameters [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) [10:23:53] (03CR) 10Kormat: "> Patch Set 19:" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [10:24:04] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 8 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28335/console" [puppet] - 10https://gerrit.wikimedia.org/r/668007 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [10:24:41] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] ATS: Enable parent proxies on ats-tls@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/668007 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [10:25:08] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [10:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:54] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 75%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14607 and previous config saved to /var/cache/conftool/dbconfig/20210303-102719-root.json [10:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:32] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Pass entire list of dbproxies to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/668023 (https://phabricator.wikimedia.org/T276268) [10:27:34] (03PS1) 10Muehlenhoff: pcc: Remove config for puppet 3->4 migration [puppet] - 10https://gerrit.wikimedia.org/r/668024 [10:27:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14608 and previous config saved to /var/cache/conftool/dbconfig/20210303-102758-root.json [10:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:14] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [10:29:58] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28336/console" [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [10:30:09] (03PS1) 10Muehlenhoff: varnish::setup_filesystem: Remove old jessie check [puppet] - 10https://gerrit.wikimedia.org/r/668026 [10:31:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Pass entire list of dbproxies to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/668023 (https://phabricator.wikimedia.org/T276268) (owner: 10Alexandros Kosiaris) [10:32:32] (03PS1) 10Muehlenhoff: cassandra: Remove support for jessie 
[puppet] - 10https://gerrit.wikimedia.org/r/668027 [10:32:46] (03Merged) 10jenkins-bot: linkrecommendation: Pass entire list of dbproxies to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/668023 (https://phabricator.wikimedia.org/T276268) (owner: 10Alexandros Kosiaris) [10:33:07] (03PS24) 10Kormat: mariadb: Add section parameters [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) [10:33:22] (03CR) 10jerkins-bot: [V: 04-1] cassandra: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668027 (owner: 10Muehlenhoff) [10:34:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:19] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:26] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[10:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:27] !log rolling restart of ats-tls on eqiad [10:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:01] (03CR) 10Muehlenhoff: [C: 03+2] prometheus::haproxy_exporter: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668000 (owner: 10Muehlenhoff) [10:38:19] (03PS1) 10JMeybohm: admin_ng: Depend coredns on calico and others on coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/668030 [10:38:27] !log upload new wmf-laptop 0.5.0 package [10:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:50] (03PS2) 10JMeybohm: admin_ng: Depend coredns on calico and others on coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/668030 [10:40:53] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mariadb: Add section parameters [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [10:42:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 90%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14609 and previous config saved to /var/cache/conftool/dbconfig/20210303-104223-root.json [10:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:58] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10jbond) [10:43:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14610 and previous config saved to /var/cache/conftool/dbconfig/20210303-104302-root.json [10:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175 for schema change', diff saved 
to https://phabricator.wikimedia.org/P14611 and previous config saved to /var/cache/conftool/dbconfig/20210303-104522-marostegui.json [10:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:49] (03PS1) 10Kormat: mariadb: Add section parameters: core::multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/668031 [10:45:54] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10jbond) >>! In T275722#6877063, @Sergey.Trofimovsky.SF wrote: > @jbond It's an outcome of me trying to separate personal and S&F accounts here, s... [10:46:01] (03CR) 10Muehlenhoff: [C: 03+2] Update comment [puppet] - 10https://gerrit.wikimedia.org/r/668006 (owner: 10Muehlenhoff) [10:46:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/667994 (owner: 10Muehlenhoff) [10:47:14] (03CR) 10Jbond: aptrepo: Update spec test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667995 (owner: 10Muehlenhoff) [10:48:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668024 (owner: 10Muehlenhoff) [10:48:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14612 and previous config saved to /var/cache/conftool/dbconfig/20210303-104836-root.json [10:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] Change kubestagemaster.svc.equiad.wmnet to point to new master [dns] - 10https://gerrit.wikimedia.org/r/667983 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [10:50:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch staging.svc.ediad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/667982 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [10:50:44] (03CR) 10Muehlenhoff: [C: 03+2] pcc: 
Remove config for puppet 3->4 migration [puppet] - 10https://gerrit.wikimedia.org/r/668024 (owner: 10Muehlenhoff) [10:51:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: Depend coredns on calico and others on coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/668030 (owner: 10JMeybohm) [10:52:08] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Depend coredns on calico and others on coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/668030 (owner: 10JMeybohm) [10:52:51] (03Merged) 10jenkins-bot: admin_ng: Depend coredns on calico and others on coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/668030 (owner: 10JMeybohm) [10:53:05] (03CR) 10Muehlenhoff: [C: 03+2] debian: remove jessie from spec tests [puppet] - 10https://gerrit.wikimedia.org/r/667994 (owner: 10Muehlenhoff) [10:56:45] PSA: swift codfw is going to be repooled in 10/15 mins (cfr T267338) [10:56:46] T267338: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 [10:57:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 100%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14613 and previous config saved to /var/cache/conftool/dbconfig/20210303-105726-root.json [10:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, some minor comments inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [10:59:22] (03CR) 10Muehlenhoff: aptrepo: Update spec test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667995 (owner: 10Muehlenhoff) [10:59:26] (03PS2) 10Muehlenhoff: aptrepo: Update spec test [puppet] - 10https://gerrit.wikimedia.org/r/667995 [10:59:43] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, and 2 others: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) [11:00:22] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, and 2 others: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) [11:00:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/667995 (owner: 10Muehlenhoff) [11:03:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14614 and previous config saved to /var/cache/conftool/dbconfig/20210303-110339-root.json [11:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:17] (03CR) 10Muehlenhoff: [WIP] Add profile to deploy the 5.10 Linux kernel on Buster hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [11:07:48] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [11:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:40] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: Elukey, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:12:30] (03CR) 10Effie Mouzeli: [C: 03+2] profile::templates::services_proxy: switch to ::1 when listen_ipv6 is true [puppet] - 10https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [11:13:06] 
moritzm: do I merge yours too ? [11:13:25] Muehlenhoff: debian: remove jessie from spec tests (db6b319a52) [11:13:46] effie that is safe to merge [11:13:52] tx [11:14:00] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Update spec test [puppet] - 10https://gerrit.wikimedia.org/r/667995 (owner: 10Muehlenhoff) [11:14:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] deployment_server: Switch k8s staging to codfw [puppet] - 10https://gerrit.wikimedia.org/r/667996 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [11:15:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove old kubernetes staging master neon.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668011 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [11:15:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "hieradata stuff is missing. But I think I got most of it in 5651e1b0c1e4901c4e1f809f887a2a775cdbe035" [puppet] - 10https://gerrit.wikimedia.org/r/668012 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [11:16:08] (03PS6) 10Effie Mouzeli: hieradata: enable ipv6 on envoy services proxy on 2 servers [puppet] - 10https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) [11:17:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [11:18:20] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable ipv6 on envoy services proxy on 2 servers [puppet] - 10https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [11:18:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14615 and previous config saved to /var/cache/conftool/dbconfig/20210303-111843-root.json [11:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:46] 
(03Abandoned) 10JMeybohm: Set role on new kubernetes staging master in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/668012 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [11:22:12] (03PS3) 10JMeybohm: staging-eqiad: Apply role/hiera to new master [puppet] - 10https://gerrit.wikimedia.org/r/667867 (https://phabricator.wikimedia.org/T276305) (owner: 10Alexandros Kosiaris) [11:22:28] ack, sorry [11:24:54] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected [11:24:54] ting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 504 [11:24:54] : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:02] (03PS1) 10Volans: dhcp: specify owner and group for option 82 config [puppet] - 10https://gerrit.wikimedia.org/r/668041 (https://phabricator.wikimedia.org/T221388) [11:28:34] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:28:46] (03PS2) 10Kosta Harlan: [WIP] Use Envoy for requests to MediaWiki API 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) [11:31:58] (03PS1) 10Muehlenhoff: Also add /etc/cumin/aliases for unpriv Cumin master [puppet] - 10https://gerrit.wikimedia.org/r/668043 [11:33:23] (03CR) 10Volans: [C: 03+1] "LGTM if compiler is happy" [puppet] - 10https://gerrit.wikimedia.org/r/668043 (owner: 10Muehlenhoff) [11:33:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14616 and previous config saved to /var/cache/conftool/dbconfig/20210303-113349-root.json [11:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:28] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 504 (expecting: 2 [11:38:28] a.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 504 [11:38:28] : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:39:12] (03PS3) 10Kosta Harlan: linkrecommendation: Use Envoy for requests to MediaWiki API [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 
(https://phabricator.wikimedia.org/T276217) [11:39:22] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200): /en.wiki [11:39:22] /mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/ [11:39:22] le} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:40:49] (03PS2) 10Muehlenhoff: Also add /etc/cumin/aliases for unpriv Cumin master [puppet] - 10https://gerrit.wikimedia.org/r/668043 [11:43:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668043 (owner: 10Muehlenhoff) [11:43:49] 10SRE, 10ops-codfw, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10faidon) I believe the Atlas is a PCEngines APU, so you'll need a null modem cable or adapter (RXD->TXD, TXD->RXD, etc.) If this is a Cisco rollover cable, it would do the trick, but your DB9<->RJ45 adapter s... 
[11:46:57] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/28338/" [puppet] - 10https://gerrit.wikimedia.org/r/668043 (owner: 10Muehlenhoff) [11:47:02] (03CR) 10Muehlenhoff: [C: 03+2] Also add /etc/cumin/aliases for unpriv Cumin master [puppet] - 10https://gerrit.wikimedia.org/r/668043 (owner: 10Muehlenhoff) [11:55:56] (03CR) 10Kosta Harlan: "When Ic161d00bcf5545 is merged I can update the image version in this patch as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [11:56:51] (03CR) 10Kosta Harlan: linkrecommendation: Use Envoy for requests to MediaWiki API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [11:57:41] (03PS7) 10WMDE-Fisch: ReferenceTooltips and other gadget names for ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: 10Thiemo Kreuz (WMDE)) [11:57:55] (03PS1) 10Kosta Harlan: Do not open DB connections during service initialization [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668046 (https://phabricator.wikimedia.org/T276307) [11:58:20] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1014.eqiad.wmnet [11:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T1200). [12:00:05] CFisch_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:51] \o jouncebot forgot me :( [12:04:11] \o [12:04:36] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1014.eqiad.wmnet [12:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:52] Hey, is anyone deploying? [12:05:03] If not I can in 5 minutes [12:05:30] I'm too distracted at home, so it's a free-for-all to merge my patch! [12:06:00] (03CR) 10Urbanecm: [C: 03+2] Do not open DB connections during service initialization [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668046 (https://phabricator.wikimedia.org/T276307) (owner: 10Kosta Harlan) [12:07:44] Urbanecm: thanks [12:07:51] okay, it's now official, I can deploy today :) [12:08:23] (03CR) 10Urbanecm: [C: 03+2] ReferenceTooltips and other gadget names for ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: 10Thiemo Kreuz (WMDE)) [12:08:33] CFisch_WMDE: are you able to check it works correctly once it's on the debug server? [12:08:53] kostajh: btw, if you're scheduling close to the window start, you can do `jouncebot: refresh` to update jouncebot's cache ;) [12:08:56] Urbanecm: yes I can check--sorry I spaced out as the window started! [12:09:03] (03PS1) 10Jbond: cfssl::db: make notify_service optional [puppet] - 10https://gerrit.wikimedia.org/r/668067 [12:09:10] awight: cool [12:09:40] awight: great, I just started to figure out where to go for that. But go ahead then!
:-) [12:09:48] (03Merged) 10jenkins-bot: ReferenceTooltips and other gadget names for ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: 10Thiemo Kreuz (WMDE)) [12:09:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28339/console" [puppet] - 10https://gerrit.wikimedia.org/r/668067 (owner: 10Jbond) [12:09:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl::db: make notify_service optional [puppet] - 10https://gerrit.wikimedia.org/r/668067 (owner: 10Jbond) [12:11:06] CFisch_WMDE: want a quick guide for mwdebug servers? [12:11:38] awight: CFisch_WMDE: pulled onto mwdebug1001, please tet [12:11:40] *test [12:11:49] Urbanecm: I suspect he meant which sites and pages to visit to exercise our feature. [12:11:52] Urbanecm: ack! [12:11:56] docs are at https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug, if anyone's interested [12:12:15] awight: ah, I thought it meant "I don't know how debug servers work". Sorry then :) [12:12:35] :-D [12:12:43] but thanks anyways Urbanecm ;-) [12:13:13] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10dcausse) p:05Triage→03High Triaging to high as this can cause serious problems. The cause seems to be in elastic itself but I could not spot the exact problem looking at... [12:13:20] (03PS1) 10Muehlenhoff: Configure unprivileged Cumin to use the Puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/668068 [12:15:33] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668068 (owner: 10Muehlenhoff) [12:16:24] Urbanecm: confirmed working, thanks!
[12:16:31] thanks, syncing [12:16:32] +1 [12:17:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/668068 (owner: 10Muehlenhoff) [12:18:09] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 90a205ffd60c95c4e87b0da46d9d85138ab74455: Add ReferenceTooltips and other gadget names for ReferencePreviews (T274353) (duration: 01m 10s) [12:18:17] awight: CFisch_WMDE: should be live :) [12:18:22] any other config changes? [12:18:32] \o/ thanks Urbanecm [12:18:37] happy to help [12:18:41] poor stashbot :( [12:18:49] (03Merged) 10jenkins-bot: Do not open DB connections during service initialization [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668046 (https://phabricator.wikimedia.org/T276307) (owner: 10Kosta Harlan) [12:19:03] can someone bring stashbot back? [12:19:50] kostajh: pulled to mwdebug1001, please test [12:19:55] https://ldap.toolforge.org/group/tools.stashbot [12:20:05] Urbanecm: having a look [12:20:25] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1016.eqiad.wmnet [12:20:25] hashar isn't around, and the others look US-based [12:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:30] Majavah: yeah, it doesn't seem to have any European-friendly user :( [12:20:31] ah there it is [12:20:33] but it came back! [12:21:04] should it have?
wmcs/toolforge roots can restart it anyways [12:22:14] Majavah: yeah, it's a good bot to have when deploying [12:23:33] toolforge should have a group like "trusted enough to restart things when they break" [12:23:45] or just add more people to stashbot [12:23:53] that would work too :P [12:26:06] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 90a205ffd60c95c4e87b0da46d9d85138ab74455: Add ReferenceTooltips and other gadget names for ReferencePreviews (T274353) (duration: 01m 10s) [12:26:09] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1016.eqiad.wmnet [12:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:13] T274353: Check for the ReferenceTooltips gadget having a non-canonical name - https://phabricator.wikimedia.org/T274353 [12:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:24] relogged my sync from earlier, as it was affected by stashbot being gone [12:26:38] Urbanecm: having trouble reproducing the error, TBH. [12:26:52] kostajh: you mean, outside of mwdebug? [12:27:11] (03PS2) 10Muehlenhoff: Configure unprivileged Cumin to use the Puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/668068 [12:27:23] if GE features don't break, i think we can sync and judge by traffic [12:27:24] Urbanecm: yeah. I'd like to be able to trace a request without mwdebug to verify that it's causing the log spam, then trace another request with mwdebug to show that no log spam results [12:27:32] i see [12:27:33] Urbanecm: yeah, agreed on that. 
[12:27:36] okay, syncing [12:28:07] Urbanecm: sounds good [12:29:17] canaries succeeded, a good sign :) [12:29:18] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/ServiceWiring.php: cf635b46a112433f78979b04eb21729783cb2033: Do not open DB connections during service initialization (T276307) (duration: 01m 11s) [12:29:22] kostajh: ^^ [12:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:25] T276307: Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met (actual: 1):[connect to 10.64.48.35 (testwiki)] - https://phabricator.wikimedia.org/T276307 [12:29:25] anything else? [12:31:16] Majavah: if around, want to test https://phabricator.wikimedia.org/T276306 ? [12:31:34] you deployed that already? [12:32:07] not sure if I have a test subject on the real wikis [12:32:17] (03CR) 10Muehlenhoff: [C: 03+2] Configure unprivileged Cumin to use the Puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/668068 (owner: 10Muehlenhoff) [12:32:28] Majavah: I did not, but I can apply it to mwdebug, and make a subject for you. 
[12:32:36] sure [12:33:46] Urbanecm: thanks, I'll have a look at the log volume shortly [12:35:00] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 115253784 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:26] (03PS1) 10Jbond: pki::root_ca: add new profile for offline root ca [puppet] - 10https://gerrit.wikimedia.org/r/668071 [12:37:14] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 774344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:29] Urbanecm: I think WMCS are doing restarts so that'll be why stashbot nipped out [12:41:55] Is it only me that thinks meta is super slow [12:42:01] !log Deploy a security patch for T276306 [12:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:39] Eh fixed itself [12:43:32] (03PS1) 10Arturo Borrero Gonzalez: openstack: wmcs-drain-hypervisor: add more information to log msg entries [puppet] - 10https://gerrit.wikimedia.org/r/668073 [12:43:39] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10MoritzMuehlenhoff) >>! In T276198#6878244, @dcausse wrote: > We might want to workaround the issue by always running `systemd-tmpfiles --create` from the elasticsearch system... 
[12:44:38] (03PS2) 10Arturo Borrero Gonzalez: openstack: wmcs-drain-hypervisor: add more information to log msg entries [puppet] - 10https://gerrit.wikimedia.org/r/668073 [12:45:04] (03CR) 10Jbond: [C: 03+2] pki::root_ca: add new profile for offline root ca [puppet] - 10https://gerrit.wikimedia.org/r/668071 (owner: 10Jbond) [12:48:27] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1017.eqiad.wmnet [12:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:53] (03PS1) 10Effie Mouzeli: hieradata: disable ipv6 on envoy services proxy in restbase-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/668074 [12:52:22] (03PS1) 10Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 [12:52:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: wmcs-drain-hypervisor: add more information to log msg entries [puppet] - 10https://gerrit.wikimedia.org/r/668073 (owner: 10Arturo Borrero Gonzalez) [12:54:34] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1017.eqiad.wmnet [12:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:41] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1001/28340/restbase-dev1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/668074 (owner: 10Effie Mouzeli) [12:54:46] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: disable ipv6 on envoy services proxy in restbase-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/668074 (owner: 10Effie Mouzeli) [12:58:50] (03PS1) 10Jbond: P:pki::root_ca: fix typo db_ vs dn_ [puppet] - 10https://gerrit.wikimedia.org/r/668076 [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T1300) [13:00:32] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last 
run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:01:55] (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: fix typo db_ vs dn_ [puppet] - 10https://gerrit.wikimedia.org/r/668076 (owner: 10Jbond) [13:03:52] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:06:26] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10fgiunchedi) >>! In T272209#6823608, @Cmjohnson wrote: > @fgiunchedi with ms-be1034 going down and out, I can use a disk from that server to fix this issue. Let me know if you want to do that? Yes please, let's do tha... [13:07:16] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:07:24] (03PS1) 10Jbond: P:pki:root_ca: add auth_keys [puppet] - 10https://gerrit.wikimedia.org/r/668077 [13:09:29] !log swift eqiad-prod: remove ssd weight for ms-be1034 - T276193 [13:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:37] T276193: Decom ms-be1034 - https://phabricator.wikimedia.org/T276193 [13:09:42] (03CR) 10Jbond: [C: 03+2] P:pki:root_ca: add auth_keys [puppet] - 10https://gerrit.wikimedia.org/r/668077 (owner: 10Jbond) [13:13:36] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10fgiunchedi) [13:14:07] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, and 2 others: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This has happened! 
We're back to swift active/active, tentatively resolving [13:17:50] (03PS1) 10Jbond: cfssl:::signer: Only notify Service if we actually manage Service [puppet] - 10https://gerrit.wikimedia.org/r/668079 [13:17:52] (03CR) 10JMeybohm: [C: 04-1] linkrecommendation: Use Envoy for requests to MediaWiki API (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [13:18:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28341/console" [puppet] - 10https://gerrit.wikimedia.org/r/668079 (owner: 10Jbond) [13:20:00] (03CR) 10Jbond: "added some cloud admins, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668016 (owner: 10Muehlenhoff) [13:20:30] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:25:15] (03PS2) 10Jbond: cfssl:::signer: Only notify Service if we actually manage Service [puppet] - 10https://gerrit.wikimedia.org/r/668079 [13:25:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/668041 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [13:26:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28342/console" [puppet] - 10https://gerrit.wikimedia.org/r/668079 (owner: 10Jbond) [13:26:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl:::signer: Only notify Service if we actually manage Service [puppet] - 10https://gerrit.wikimedia.org/r/668079 (owner: 10Jbond) [13:27:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/667898 (owner: 10Jbond) [13:28:31] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:profile::client::httpd: add param override the attribute delimiter [puppet] - 10https://gerrit.wikimedia.org/r/667898 (owner: 10Jbond) 
[13:31:53] (03PS3) 10Jbond: O:netmon: update delimiter to use ':' [puppet] - 10https://gerrit.wikimedia.org/r/667899 [13:32:08] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 71860352 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:34:04] (03CR) 10JMeybohm: Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [13:34:32] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 840 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:34:38] (03CR) 10Volans: [C: 03+2] dhcp: specify owner and group for option 82 config [puppet] - 10https://gerrit.wikimedia.org/r/668041 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [13:34:40] (03PS3) 10JMeybohm: deployment_server: Switch k8s staging to codfw [puppet] - 10https://gerrit.wikimedia.org/r/667996 (https://phabricator.wikimedia.org/T276305) [13:34:49] (03PS4) 10JMeybohm: Remove old kubernetes staging master neon.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668011 (https://phabricator.wikimedia.org/T276305) [13:35:31] (03PS3) 10JMeybohm: Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) [13:35:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/667899 (owner: 10Jbond) [13:38:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/667902 (owner: 10Jbond) [13:45:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) (owner: 
10JMeybohm) [13:47:32] (03PS4) 10JMeybohm: staging-eqiad: Apply role/hiera to new master [puppet] - 10https://gerrit.wikimedia.org/r/667867 (https://phabricator.wikimedia.org/T276305) (owner: 10Alexandros Kosiaris) [13:51:22] (03PS1) 10JMeybohm: kubernetes staging: Move k8s, docker and calico version to common [puppet] - 10https://gerrit.wikimedia.org/r/668081 (https://phabricator.wikimedia.org/T276305) [13:51:59] (03PS2) 10Filippo Giunchedi: pontoon: use wmcloud.org sso endpoint [puppet] - 10https://gerrit.wikimedia.org/r/666667 [13:52:01] (03PS2) 10Filippo Giunchedi: pontoon: symlink client ssl certs dir only if needed [puppet] - 10https://gerrit.wikimedia.org/r/666668 [13:52:03] (03PS2) 10Filippo Giunchedi: use wmcloud.org for icinga/alerts [puppet] - 10https://gerrit.wikimedia.org/r/666669 [13:53:49] (03PS4) 10Elukey: [WIP] Add profile to deploy the 5.10 Linux kernel on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/668001 [13:54:24] (03PS5) 10Elukey: Add profile to deploy the 5.10 Linux kernel on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/668001 [13:55:47] (03CR) 10Elukey: Add profile to deploy the 5.10 Linux kernel on Buster hosts (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [13:55:58] (03Abandoned) 10Filippo Giunchedi: use wmcloud.org for icinga/alerts [puppet] - 10https://gerrit.wikimedia.org/r/666669 (owner: 10Filippo Giunchedi) [13:57:24] (03PS4) 10JMeybohm: Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) [13:57:26] (03PS5) 10JMeybohm: staging-eqiad: Apply role/hiera to new master [puppet] - 10https://gerrit.wikimedia.org/r/667867 (https://phabricator.wikimedia.org/T276305) (owner: 10Alexandros Kosiaris) [13:57:28] (03PS2) 10JMeybohm: kubernetes staging: Move k8s, docker and calico version to common [puppet] - 10https://gerrit.wikimedia.org/r/668081 
(https://phabricator.wikimedia.org/T276305) [13:57:53] (03CR) 10JMeybohm: Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [13:58:57] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1018.eqiad.wmnet [13:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] liw and longma: Your horoscope predicts another unfortunate Mediawiki train - European+American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T1400). [14:00:42] train is blocked, not promoting it to group1 now [14:01:04] liw: re T276316, do we want to just revert https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/664877 or do something else? per my latest comment that patch is definitely causing this issue [14:01:05] T276316: Argument 1 passed to MediaWiki\User\UserNameUtils::getCanonical() must be of the type string, null given, called in /srv/mediawiki/php-1.36.0-wmf.33/extensions/CentralAuth/includes/CentralAuthGroupMembershipProxy.php on line 48 - https://phabricator.wikimedia.org/T276316 [14:01:14] (03PS1) 10JMeybohm: Fix kubernetes::master_hosts for staging-cofw [puppet] - 10https://gerrit.wikimedia.org/r/668083 [14:01:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [14:01:46] (03CR) 10jerkins-bot: [V: 04-1] Fix kubernetes::master_hosts for staging-cofw [puppet] - 10https://gerrit.wikimedia.org/r/668083 (owner: 10JMeybohm) [14:02:03] (03CR) 10JMeybohm: Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [14:02:06] Majavah, if that fixes the problem without introducing new ones, I assume it'd be the 
right thing to do (but I'm not really qualified to make that judgement call) [14:02:16] (03CR) 10Elukey: [C: 03+2] Add profile to deploy the 5.10 Linux kernel on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/668001 (owner: 10Elukey) [14:02:32] Majavah, would you be able to backport the revert? [14:03:15] (03PS2) 10JMeybohm: Fix kubernetes::master_hosts for staging-cofw [puppet] - 10https://gerrit.wikimedia.org/r/668083 [14:05:07] (03CR) 10Kosta Harlan: linkrecommendation: Use Envoy for requests to MediaWiki API (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [14:05:29] https://phabricator.wikimedia.org/T276316#6878552 [14:05:35] looks like we are getting a hotfix [14:05:42] nice [14:06:15] I'm going to need someone to help by deploying a backport with that to the train [14:06:36] (03PS1) 10Elukey: Update GPU settings for Hadoop workers to ROCm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/668085 (https://phabricator.wikimedia.org/T231067) [14:06:38] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1018.eqiad.wmnet [14:06:43] and then I can roll the train forward to group1 [14:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:28] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes staging: Move k8s, docker and calico version to common [puppet] - 10https://gerrit.wikimedia.org/r/668081 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [14:09:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [14:09:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] staging-eqiad: Apply role/hiera to new master [puppet] - 10https://gerrit.wikimedia.org/r/667867 (https://phabricator.wikimedia.org/T276305) (owner: 
10Alexandros Kosiaris) [14:10:04] liw: what kind of help? Are you needing someone to do the actual deploy, or like testing? [14:10:11] happy to help with the backport process if I can, not sure if you were looking for a deployer or not [14:10:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix kubernetes::master_hosts for staging-cofw [puppet] - 10https://gerrit.wikimedia.org/r/668083 (owner: 10JMeybohm) [14:10:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/668083 (owner: 10JMeybohm) [14:12:07] Zppix, to do the backport and deployment of the fix, since I don't yet know how; someone who does backports otherwise anyway would be perfect [14:12:20] (03PS1) 10Muehlenhoff: Add profile::base::linux510 for cloudgw and cloudnet [puppet] - 10https://gerrit.wikimedia.org/r/668087 [14:12:23] Oh sorry, I don't have prod access [14:12:37] let's wait for it to show up [14:13:32] and get merged into core (30 minutes) and backported (another 30 minutes) and then see where we are :-D [14:13:50] it's a CA patch, shouldn't take that long hopefully [14:13:55] apergos, yeah, agreed [14:14:17] ah true, the extensions are faster. well anyhow [14:14:23] I'm lurking around if needed [14:14:44] Hey, RhinosF1 pinged me i can be of help. What's up? [14:14:51] apergos, cool, thank you [14:14:57] gonna have a centralauth patch to be backported soon [14:15:03] Urbanecm, we're waiting for a fix to a train blocker [14:15:05] not me, the person fixing the ubn [14:15:12] it will need to go around, etc. [14:15:15] Urbanecm, https://phabricator.wikimedia.org/T276316 [14:15:23] Urbanecm: liw mentioned above that they weren't sure how to do backported [14:15:28] Backports* [14:15:48] liw: okay, looking [14:19:52] liw: Majavah: RhinosF1: Do we have a patch? 
[14:20:01] Urbanecm: not yet [14:20:13] Urbanecm: they said within 15 minutes I think [14:20:20] ack [14:20:22] you'll see it on the task I imagine [14:20:31] I'm happy to backport once there's sth to backport [14:20:34] RhinosF1: that was promised 16 minutes ago [14:20:37] (03CR) 10Elukey: [C: 03+2] Update GPU settings for Hadoop workers to ROCm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/668085 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [14:20:38] and remember that 15 minutes is engineer time [14:20:55] Majavah: heh, let's hope it's not long then [14:21:07] liw: or show you how to backport, maybe in a quick g meet session with screen sharing? [14:22:19] Urbanecm, not on train week, sorry. I'm capable of learning this stuff this week or next. [14:22:33] okay, up to you :) [14:23:13] Urbanecm: how different is this than backporting to core? [14:23:39] apergos: there is one additional step [14:24:11] it's in the docs someplace? I should probably read up, since if you weren't around I might have taken it [14:24:25] apergos: https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#mediawiki/extensions_and_mediawiki/skins [14:24:33] perfect [14:24:52] happy to answer any questions, if you have any [14:25:03] ah the submodule [14:25:10] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10sergeychernyshev) @jbond sorry, can't do - wrong Sergey ;) You probably meant @Sergey.Trofimovsky.SF [14:25:29] nope, that's clear enough, tyvm [14:25:36] apergos: yup, git submodule update is the key part [14:26:08] 👍 [14:26:10] (03PS5) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) [14:26:12] (03PS3) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 
10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) [14:26:15] (03PS1) 10David Caro: wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) [14:28:11] (03CR) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [14:28:15] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10jbond) >>! In T275722#6878623, @sergeychernyshev wrote: > @jbond sorry, can't do - wrong Sergey ;) > > You probably meant @Sergey.Trofimovsky.S... [14:28:18] (03PS1) 10Klausman: modules/hiera: clean out old (<3.8) ROCm configs [puppet] - 10https://gerrit.wikimedia.org/r/668091 [14:29:01] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [14:29:04] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [14:29:30] (03CR) 10jerkins-bot: [V: 04-1] modules/hiera: clean out old (<3.8) ROCm configs [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [14:29:42] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [14:29:54] (03PS2) 10Klausman: modules/hiera: clean out old (<3.8) ROCm configs [puppet] - 10https://gerrit.wikimedia.org/r/668091 [14:30:49] (03CR) 10Elukey: modules/hiera: clean out old (<3.8) ROCm configs (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [14:31:50] so...i assume we're waiting for Vlad to upload a new patch? Or should we prepare the revert just in case? I'm not sure how impactful the bug is. [14:32:18] Urbanecm: current known impact is in https://phabricator.wikimedia.org/T276316#6878224 [14:33:55] thanks Majavah [14:34:01] I don't have an option between reverting and waiting [14:34:19] I am still able to modify global groups [14:34:21] (in prod) [14:34:30] train is not yet on group1, that explains it :P [14:34:46] ah [14:34:58] (because this patch is blocking it) [14:35:14] indeed, broken in beta [14:35:16] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1099.eqiad.wmnet with reason: REIMAGE [14:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:29] changing groups or the front page? [14:35:50] Majavah: https://meta.wikimedia.beta.wmflabs.org/wiki/Special:GlobalUserRights shows no "username" field [14:36:00] https://meta.wikimedia.beta.wmflabs.org/wiki/Special:GlobalUserRights/Martin_Urbanec works [14:36:18] https://meta.wikimedia.beta.wmflabs.org/w/index.php?title=Special:Log&logid=87534 [14:36:28] I can change them, just the front page is broken [14:36:31] yup [14:37:16] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1100.eqiad.wmnet with reason: REIMAGE [14:37:18] (03PS3) 10Klausman: modules/hiera: clean out old (<3.8) ROCm configs [puppet] - 10https://gerrit.wikimedia.org/r/668091 [14:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1099.eqiad.wmnet with reason: REIMAGE [14:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:39] (03CR) 10Klausman: modules/hiera: clean out old (<3.8) ROCm configs (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [14:38:56] (03PS4) 10Klausman: modules/hiera: clean out old (<3.8) ROCm configs [puppet] - 10https://gerrit.wikimedia.org/r/668091 [14:39:19] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1101.eqiad.wmnet with reason: REIMAGE [14:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:24] (03PS1) 10Jbond: cfssl::cert: use db-config if signing locally [puppet] - 10https://gerrit.wikimedia.org/r/668094 [14:39:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1100.eqiad.wmnet with reason: REIMAGE [14:39:26] (03PS1) 10Jbond: P:pki::root_ca: add ability to generate intermidiate certificates [puppet] - 10https://gerrit.wikimedia.org/r/668095 [14:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:07] liw: 30 minutes passed, out of the 15 minutes promised :/ [14:40:18] Urbanecm, yeah [14:40:31] you want me to upload a revert? [14:40:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28343/console" [puppet] - 10https://gerrit.wikimedia.org/r/668094 (owner: 10Jbond) [14:40:41] Majavah: that depends on liw [14:40:51] I'm leaning towards revert and let them deal with it later [14:41:18] am I understanding correctly that the revert will bring back a bug? 
[14:41:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1101.eqiad.wmnet with reason: REIMAGE [14:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:01] I don't think so unless you consider using soft deprecated code as a bug [14:42:11] (03PS3) 10JMeybohm: Switch staging.svc.eqiad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/667982 (https://phabricator.wikimedia.org/T276305) [14:42:11] now promising five minutes [14:42:12] +1 [14:42:35] let's wait five and re-evaluate then [14:44:23] ack [14:46:36] (03CR) 10Muehlenhoff: profile::puppetmaster::common: Remove support for jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668016 (owner: 10Muehlenhoff) [14:47:00] (03PS4) 10JMeybohm: Switch staging.svc.eqiad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/667982 (https://phabricator.wikimedia.org/T276305) [14:47:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:47:34] (03CR) 10Elukey: [C: 03+1] "LGTM! 
Couple of things:" [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [14:47:59] (03PS1) 10David Caro: wmcs.backups: Retry a VM backup 3 times before failing [puppet] - 10https://gerrit.wikimedia.org/r/668097 (https://phabricator.wikimedia.org/T276096) [14:48:10] heads up: I'll be switching the "active" kubernetes staging cluster from eqiad to codfw in a bit [14:48:31] !log upgrade memcached on mc1027,mc2027 [14:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:47] 10SRE, 10Analytics: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10Ottomata) I know, "only NINETY NINE CENTS...WOW" right?! [14:50:57] it's been 5 min, but since there's another hour of the train deployment window remaining, let's give it another 5 [14:51:24] (03CR) 10Giuseppe Lavagetto: Add httpd image for MediaWiki (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 (owner: 10Giuseppe Lavagetto) [14:51:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:52:18] seems we have a patch [14:52:28] Majavah: mind co-reviewing it with me? [14:52:32] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/668098/ [14:53:08] sure [14:53:08] (03PS2) 10Giuseppe Lavagetto: Allow changing the IP of the fcgi server [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667885 [14:53:09] cool! 
[14:53:10] (03PS2) 10Giuseppe Lavagetto: Add httpd image for MediaWiki [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 [14:53:12] (03PS1) 10Giuseppe Lavagetto: Fix typo in php7.3-cli [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668099 [14:53:14] (03PS1) 10Giuseppe Lavagetto: Add memcached image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 [14:55:14] (03PS2) 10David Caro: wmcs.backups: Retry a VM backup 3 times before failing [puppet] - 10https://gerrit.wikimedia.org/r/668097 (https://phabricator.wikimedia.org/T276096) [14:55:48] Urbanecm: I'm seeing "Notice: Array to string conversion in /home/taavi/src/mediawiki/core/includes/language/Message.php on line 1159" on special:globaluserrights but I'm seeing that on master too [14:56:43] huh, a second patch set [14:56:57] which was a commit message update [14:57:06] good [14:58:01] I'd say let's go and try [14:58:03] we can always revert [14:58:14] it works locally, so at least no feature regression [14:58:28] (03PS3) 10David Caro: wmcs.backups: Retry a VM backup 3 times before failing [puppet] - 10https://gerrit.wikimedia.org/r/668097 (https://phabricator.wikimedia.org/T276096) [14:59:15] yeah it works, I'm mostly just wondering if casting is needed everywhere it was added [14:59:40] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28344/console" [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [14:59:41] i'm 99% sure it is not [15:00:28] (03PS4) 10David Caro: wmcs.backups: Retry a VM backup 3 times before failing [puppet] - 10https://gerrit.wikimedia.org/r/668097 (https://phabricator.wikimedia.org/T276096) [15:00:47] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl::cert: use db-config if signing locally [puppet] - 10https://gerrit.wikimedia.org/r/668094 (owner: 10Jbond) [15:01:06] do we want to improve patch or merge+backport now and improve it 
later if needed? [15:01:22] at least the maintenance changes shouldn't be necessary [15:01:31] i vote for "merge and improve later" [15:01:33] Majavah: what about you? [15:01:35] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28345/console" [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [15:01:53] up to you, sounds sensible but I'd like to somehow ensure that "later" is not "never" [15:01:58] (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: add ability to generate intermidiate certificates [puppet] - 10https://gerrit.wikimedia.org/r/668095 (owner: 10Jbond) [15:02:17] I propose we leave the task open and high priority so it stays visible [15:02:21] looks good [15:02:27] possibly even as a train blocker [15:02:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch staging.svc.eqiad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/667982 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [15:03:19] (03PS1) 10Urbanecm: Transform the first parameter to string [extensions/CentralAuth] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668048 (https://phabricator.wikimedia.org/T276316) [15:03:57] (03CR) 10JMeybohm: [C: 03+2] Switch staging.svc.eqiad.wmnet to point to codfw k8s [dns] - 10https://gerrit.wikimedia.org/r/667982 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [15:04:14] (03CR) 10Urbanecm: [C: 03+2] Transform the first parameter to string [extensions/CentralAuth] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668048 (https://phabricator.wikimedia.org/T276316) (owner: 10Urbanecm) [15:04:42] liw: I +1'ed the master patch, and merging the backport, so we can discuss whether castings are necessary later. 
[15:04:55] Urbanecm, excellent, thank you [15:05:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668016 (owner: 10Muehlenhoff) [15:06:38] (03CR) 10JMeybohm: [C: 03+2] deployment_server: Switch k8s staging to codfw [puppet] - 10https://gerrit.wikimedia.org/r/667996 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [15:07:52] can someone explain what this actually fixes? if a null is passed in, it's converted to what, the empty string? is that going to cause problems in this context? [15:08:10] (03CR) 10Arturo Borrero Gonzalez: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [15:08:32] * apergos looks at Urbanecm and Majavah [15:08:51] apergos: yes, it's converted to an empty string. Special:GlobalGroupMembership does not actually use the user when called with no parameter, as it has a box for you to input the username [15:08:59] like this https://usercontent.irccloud-cdn.com/file/mNU7Ahwl/image.png [15:09:19] (03Merged) 10jenkins-bot: Transform the first parameter to string [extensions/CentralAuth] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668048 (https://phabricator.wikimedia.org/T276316) (owner: 10Urbanecm) [15:09:25] apergos: and type hinting, the new method is defined as non-nullable `string $name` while the old was nullable `$name` [15:09:34] right [15:09:40] Urbanecm: fyi I'm reviewing the patch atm for unnecessary casts [15:09:45] ack [15:10:05] Majavah: if you upload a follow-up patch removing the unnecessary ones, I'm happy to merge both (or you can amend Vlad's, if you feel so). 
[15:11:15] works for me on testwiki, syncing [15:11:37] Urbanecm: anything works, currently just reviewing in gerrit, since that way I can add inline explanations everywhere, but if you think a follow-up patch is better I can do that too [15:11:53] i don't mind either way, whatever is more convenient for you [15:13:14] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/CentralAuth/: af899b6818223928e2da421122c19e64126370da: Transform the first parameter to string (T276316) (duration: 01m 11s) [15:13:18] liw: ^^ [15:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:21] T276316: Argument 1 passed to MediaWiki\User\UserNameUtils::getCanonical() must be of the type string, null given, called in /srv/mediawiki/php-1.36.0-wmf.33/extensions/CentralAuth/includes/CentralAuthGroupMembershipProxy.php on line 48 - https://phabricator.wikimedia.org/T276316 [15:13:31] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 45636400 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:13:36] Urbanecm, cool, thanks; I'll be promoting train to group1 now [15:13:45] I can make sure the page loads at testwiki, but not that it actually does what it is supposed to. 
I'd need meta to be at wmf.33 to test that [15:14:01] oh, cool, ping me once done please :) [15:14:20] Ty Urbanecm [15:14:38] 10Puppet, 10SRE, 10User-jbond: puppetmaster: clean up instances of the puppet-master package - https://phabricator.wikimedia.org/T276339 (10jbond) [15:15:18] (03PS1) 10Lars Wirzenius: group1 wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668105 [15:15:20] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668105 (owner: 10Lars Wirzenius) [15:15:28] I'm looking at getRemoteUserMailAddress which could result in a weird query there [15:15:36] the rest looks ok to my unpracticed eye [15:15:39] Urbanecm: added those as an inline comment, second look appreciated [15:15:41] (03CR) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [15:15:43] (03CR) 10Elukey: [C: 03+1] "LGTM from my limited understanding :)" (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 (owner: 10Giuseppe Lavagetto) [15:15:47] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 446208 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:15:50] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-drain-hypervisor.py: Better handling of VMS not in state ACTIVE [puppet] - 10https://gerrit.wikimedia.org/r/667928 (https://phabricator.wikimedia.org/T276208) (owner: 10Andrew Bogott) [15:15:55] thanks Majavah [15:15:56] (03PS4) 10Andrew Bogott: wmcs-drain-hypervisor.py: Better handling of VMS not in state ACTIVE [puppet] - 10https://gerrit.wikimedia.org/r/667928 (https://phabricator.wikimedia.org/T276208) [15:16:22] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.33 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/668105 (owner: 10Lars Wirzenius) [15:16:41] Urbanecm: what do you think about that one change? [15:16:52] which one? [15:17:04] ^ [15:17:05] (05:15:28 μμ) apergos: I'm looking at getRemoteUserMailAddress which could result in a weird query there [15:17:09] ah [15:17:14] (03CR) 10Jbond: [C: 03+1] "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668016 (owner: 10Muehlenhoff) [15:17:27] seems like if we got a false back from getCanonical there, well, ewww. [15:18:06] !log liw@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.33 [15:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:19] apergos: according to codesearch, it is only used from maintenance/sendConfirmAndMigrateEmail.php, and it is only used for migration purposes (ie. moving a wiki to centralauth) [15:18:36] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1027.eqiad.wmnet [15:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:15] !log liw@deploy1002 Synchronized php: group1 wikis to 1.36.0-wmf.33 (duration: 01m 08s) [15:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:29] liw: i can confirm the special page works as intended at meta [15:20:43] Urbanecm, cool, thank you; just came back from fridge to get something to drink [15:20:54] :) [15:20:58] anything else i can do now? [15:21:06] (03CR) 10Andrew Bogott: "Traditionally we don't remove our ability to build new base images until the use of a distro is actually fully done. 
Deployment-prep stil" [puppet] - 10https://gerrit.wikimedia.org/r/668005 (owner: 10Muehlenhoff) [15:21:24] (03CR) 10Alexandros Kosiaris: linkrecommendation: Use Envoy for requests to MediaWiki API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [15:22:33] Urbanecm, I don't think so; the task and Gerrit change seem to have further discussion if you're interested [15:22:51] yup, saw that. I'll give a summary from my side on the task. [15:22:59] cool, thanks [15:23:21] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28348/console" [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [15:23:32] (03PS1) 10Elukey: role::analytics_cluster::hadoop::worker: set linux 5.10 on GPU workers [puppet] - 10https://gerrit.wikimedia.org/r/668106 (https://phabricator.wikimedia.org/T231067) [15:23:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1027.eqiad.wmnet [15:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:58] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:06] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/668005 (owner: 10Muehlenhoff) [15:25:08] (03CR) 10Ppchelko: [C: 03+1] "This is good for unblocking the train." 
[extensions/CentralAuth] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668048 (https://phabricator.wikimedia.org/T276316) (owner: 10Urbanecm) [15:25:38] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28349/console" [puppet] - 10https://gerrit.wikimedia.org/r/668106 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [15:26:31] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [15:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:31] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [15:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:37] !log staging.svc.eqiad.wmnet now (temporarily) points to the staging-codfw kubernetes cluster (during upgrade in eqiad) [15:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:48] (03PS1) 10WMDE-Fisch: Remove conflicting gadget configuration for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668108 (https://phabricator.wikimedia.org/T276330) [15:27:53] (03CR) 10Effie Mouzeli: "Some nits, otherwise LGTM" (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668100 (owner: 10Giuseppe Lavagetto) [15:28:33] train is on group1 and things didn't explode immediately [15:28:43] That's good [15:28:53] nice to hear [15:29:01] thank you Urbanecm, Majavah, and everyone else who helped [15:29:10] thank you Majavah :) [15:29:19] liw: marked the task as a train blocker for wmf.34 [15:29:33] Urbanecm, excellent, thank you [15:31:53] (03CR) 10Bstorm: [C: 03+2] profile::wmcs::nfs::primary: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668008 (owner: 10Muehlenhoff) [15:32:34] (03PS4) 10Effie Mouzeli: mediawiki::alerts: add per cluster error/fatals 
rate alert [puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) [15:33:52] (03PS3) 10JMeybohm: Change kubestagemaster.svc.equiad.wmnet to point to new master [dns] - 10https://gerrit.wikimedia.org/r/667983 (https://phabricator.wikimedia.org/T276305) [15:34:11] (03CR) 10Kormat: [C: 03+1] "SSO pontoon? Fancy™" [puppet] - 10https://gerrit.wikimedia.org/r/666667 (owner: 10Filippo Giunchedi) [15:34:46] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1021.eqiad.wmnet [15:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:58] (03PS1) 10Cwhite: profile: swap gerrit log stream to be ecs-only [puppet] - 10https://gerrit.wikimedia.org/r/668109 (https://phabricator.wikimedia.org/T234565) [15:35:16] (03PS1) 10Hnowlan: Revert "Revert "mtail: create separate metrics histogram based on endpoint"" [puppet] - 10https://gerrit.wikimedia.org/r/668050 [15:35:26] (03PS1) 10Jbond: P:pki::root_ca (cloud): add a test intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/668110 [15:36:49] (03CR) 10jerkins-bot: [V: 04-1] profile: swap gerrit log stream to be ecs-only [puppet] - 10https://gerrit.wikimedia.org/r/668109 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:37:35] (03CR) 10Bstorm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [15:38:47] (03CR) 10Jbond: [C: 03+2] P:pki::root_ca (cloud): add a test intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/668110 (owner: 10Jbond) [15:39:23] (03PS2) 10Cwhite: profile: swap gerrit log stream to be ecs-only [puppet] - 10https://gerrit.wikimedia.org/r/668109 (https://phabricator.wikimedia.org/T234565) [15:41:04] (03CR) 10Muehlenhoff: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [15:42:25] (03PS1) 10Jbond: P:pki::root_ca: use correct title for 
intermediate certs [puppet] - 10https://gerrit.wikimedia.org/r/668112 [15:42:31] (03CR) 10Hnowlan: [C: 03+2] Revert "Revert "mtail: create separate metrics histogram based on endpoint"" [puppet] - 10https://gerrit.wikimedia.org/r/668050 (owner: 10Hnowlan) [15:42:34] (03PS1) 10Jdlrobson: Use more generic non-team specific name for alerts [puppet] - 10https://gerrit.wikimedia.org/r/668113 (https://phabricator.wikimedia.org/T264665) [15:42:38] (03PS1) 10Alexandros Kosiaris: ci/releases: Switch to using the codfw staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/668114 [15:43:31] (03PS2) 10JMeybohm: ci/releases: Switch to using the codfw staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/668114 (https://phabricator.wikimedia.org/T276305) (owner: 10Alexandros Kosiaris) [15:43:43] (03CR) 10Awight: [C: 03+1] "Good idea to leave it here, commented." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668108 (https://phabricator.wikimedia.org/T276330) (owner: 10WMDE-Fisch) [15:45:05] (03CR) 10JMeybohm: [C: 03+1] ci/releases: Switch to using the codfw staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/668114 (https://phabricator.wikimedia.org/T276305) (owner: 10Alexandros Kosiaris) [15:45:16] (03PS2) 10Jbond: P:pki::root_ca: use correct title for intermediate certs [puppet] - 10https://gerrit.wikimedia.org/r/668112 [15:45:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] ci/releases: Switch to using the codfw staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/668114 (https://phabricator.wikimedia.org/T276305) (owner: 10Alexandros Kosiaris) [15:45:27] 10Puppet, 10SRE, 10User-jbond: puppetmaster: clean up instances of the puppet-master package - https://phabricator.wikimedia.org/T276339 (10MoritzMuehlenhoff) On puppetmaster2001-2003 and 1002 we even have puppet-master installed (in dpkg "ii" state). From what I can tell cleaning that out with package=>purg... 
[15:46:41] (03PS5) 10Cwhite: profile: add scap log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/659426 (https://phabricator.wikimedia.org/T234565) [15:47:01] 10SRE, 10ops-codfw, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10Papaul) Thank you for the information. I will try to work on it again when i am back on site tomorrow. [15:47:06] 10Puppet, 10SRE, 10User-jbond: puppetmaster: clean up instances of the puppet-master package - https://phabricator.wikimedia.org/T276339 (10jbond) >>! In T276339#6879017, @MoritzMuehlenhoff wrote: > On puppetmaster2001-2003 and 1002 we even have puppet-master installed (in dpkg "ii" state). From what I can t... [15:47:55] (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: use correct title for intermediate certs [puppet] - 10https://gerrit.wikimedia.org/r/668112 (owner: 10Jbond) [15:50:51] 10SRE, 10Gerrit, 10observability, 10Patch-For-Review, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) I kind of forgot about this task. It is still not complete since we need to collect metrics for the replica. Seems the way to go is to use the server... [15:51:59] (03CR) 10Filippo Giunchedi: [C: 03+1] "I missed that the patch adds the metric too" [puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) (owner: 10Effie Mouzeli) [15:52:04] (03CR) 10Volans: [C: 03+1] "Is there any blocker for this?" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [15:53:21] 10SRE, 10Analytics, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech-focus: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Milimetric) This is a bit of a drive-by, but have we considered https://min.io/? I went a bit deeper t... 
[15:53:26] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [15:54:07] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (10mwilliams) [15:54:09] (03CR) 10Jbond: "> Patch Set 6:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [15:54:38] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (10mwilliams) [15:55:54] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1021.eqiad.wmnet [15:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:12] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (10mwilliams) Thanks for getting this started @nettrom_WMF! Adding my manager @lucyblackwell to get approval and I th... [15:56:30] (03CR) 10Cwhite: [C: 03+1] "LGTM, Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) (owner: 10Effie Mouzeli) [15:58:32] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::alerts: add per cluster error/fatals rate alert [puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) (owner: 10Effie Mouzeli) [15:58:56] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::alerts: add per cluster error/fatals rate alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) (owner: 10Effie Mouzeli) [15:58:58] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use wmcloud.org sso endpoint [puppet] - 10https://gerrit.wikimedia.org/r/666667 (owner: 10Filippo Giunchedi) [15:59:22] (03CR) 10Muehlenhoff: [C: 03+2] profile::puppetmaster::common: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668016 (owner: 10Muehlenhoff) [15:59:32] effie: merged your change too [15:59:41] great thanks ! [15:59:54] np [16:00:08] 10SRE, 10Desktop Improvements, 10Traffic, 10Bengali-Sites, and 4 others: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10Vito-Genovese) Could this possibly turn out to be a solution for the issue described at T119366? 
[16:02:50] PROBLEM - Prometheus k8s-staging cache not updating on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [16:03:30] PROBLEM - Prometheus k8s-staging cache not updating on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [16:05:45] !log jayme@cumin1001 START - Cookbook sre.hosts.decommission for hosts neon.eqiad.wmnet [16:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:20] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) [16:07:05] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts neon.eqiad.wmnet [16:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:46] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) So, ideally, SSDs would be detected as sda and sdb, and then the recipe custom/backup-format.cfg would work, but I would like to install it myself in case it fails so I don't mak... 
[16:09:16] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1004.eqiad.wmnet [16:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:37] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1004.eqiad.wmnet [16:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:48] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1005.eqiad.wmnet [16:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:45] (03CR) 10Jbond: [C: 03+2] O:netmon: update delimiter to use ':' [puppet] - 10https://gerrit.wikimedia.org/r/667899 (owner: 10Jbond) [16:14:45] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1006.eqiad.wmnet [16:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:08] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1005.eqiad.wmnet [16:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:49] (03PS1) 10Ottomata: Set canary_events_enabled: true for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668119 (https://phabricator.wikimedia.org/T273901) [16:16:17] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Fix typo in php7.3-cli [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/668099 (owner: 10Giuseppe Lavagetto) [16:17:24] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Allow changing the IP of the fcgi server [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667885 (owner: 10Giuseppe Lavagetto) [16:17:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:17:57] 10SRE, 10observability: alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc) - https://phabricator.wikimedia.org/T276299 (10CDanis) [16:18:29] (03CR) 10Jbond: [C: 03+2] P:idp::client::httpd::site: update default delimiter [puppet] - 10https://gerrit.wikimedia.org/r/667902 (owner: 10Jbond) [16:18:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1006.eqiad.wmnet [16:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:39] (03PS2) 10Jbond: P:idp::client::httpd::site: update default delimiter [puppet] - 10https://gerrit.wikimedia.org/r/667902 [16:19:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:20:42] (03PS6) 10JMeybohm: staging-eqiad: Apply role/hiera to new master [puppet] - 10https://gerrit.wikimedia.org/r/667867 (https://phabricator.wikimedia.org/T276305) (owner: 10Alexandros Kosiaris) [16:22:28] (03CR) 10JMeybohm: [C: 03+2] staging-eqiad: Apply role/hiera to new master [puppet] - 10https://gerrit.wikimedia.org/r/667867 (https://phabricator.wikimedia.org/T276305) (owner: 10Alexandros Kosiaris) [16:23:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gitlab1001.eqiad.wmnet with reason: decom [16:23:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab1001.eqiad.wmnet with reason: decom [16:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gitlab1002.eqiad.wmnet with reason: decom [16:23:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on gitlab1002.eqiad.wmnet with reason: decom [16:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:27] (03CR) 10Ottomata: [C: 03+2] Set canary_events_enabled: true for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668119 (https://phabricator.wikimedia.org/T273901) (owner: 10Ottomata) [16:25:19] (03Merged) 10jenkins-bot: Set canary_events_enabled: true for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668119 (https://phabricator.wikimedia.org/T273901) (owner: 10Ottomata) [16:25:33] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) This has been moved to this coming Friday at 10am local time (1500UTC) [16:25:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts gitlab1002.eqiad.wmnet [16:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:59] (03CR) 10Jbond: [C: 03+2] customscripts/interface_automation: skip slaac addresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [16:26:09] !log deleting gitlab VMs - we have to start over and decom old VMs, then create new VMs with public IPs (T274459) [16:26:12] (03PS9) 10Jbond: interface_automation: update is_primary logic. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [16:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:16] T274459: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 [16:26:50] (03CR) 10Jbond: [C: 03+2] interface_automation: update is_primary logic. 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [16:28:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::monitoring: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668010 (owner: 10Muehlenhoff) [16:28:07] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: canary_events_enabled: true for rdf-streaming-updater streams - T273901 (duration: 01m 49s) [16:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:14] T273901: Automate event stream ingestion into HDFS for streams that don't use EventGate - https://phabricator.wikimedia.org/T273901 [16:28:18] (03CR) 10Gergő Tisza: [C: 03+1] Use more generic non-team specific name for alerts [puppet] - 10https://gerrit.wikimedia.org/r/668113 (https://phabricator.wikimedia.org/T264665) (owner: 10Jdlrobson) [16:28:27] (03PS1) 10Dzahn: Revert "site: add gitlab VMs with placeholder role" [puppet] - 10https://gerrit.wikimedia.org/r/668056 (https://phabricator.wikimedia.org/T274459) [16:28:41] (03CR) 10jerkins-bot: [V: 04-1] Revert "site: add gitlab VMs with placeholder role" [puppet] - 10https://gerrit.wikimedia.org/r/668056 (https://phabricator.wikimedia.org/T274459) (owner: 10Dzahn) [16:29:11] (03CR) 10David Caro: [C: 03+1] utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:29:33] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gitlab1002.eqiad.wmnet [16:29:37] (03Abandoned) 10Dzahn: trafficserver: add director for gitlab to gitlab1001 [puppet] - 10https://gerrit.wikimedia.org/r/667731 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [16:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:42] 10SRE, 10GitLab, 
10vm-requests, 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `gitlab1002.eqiad.wmnet` - gitlab1002.eqiad.wmnet (**PASS**) - Do... [16:29:49] (03PS1) 10Dzahn: Revert "gitlab: open port 80 for traffic from caching servers" [puppet] - 10https://gerrit.wikimedia.org/r/668057 [16:30:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts gitlab1001.eqiad.wmnet [16:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:30] mutante: what failed? is there a bug? [16:30:48] volans: I dont see a failure? [16:30:58] dzahn@cumin1001 END (FAIL) [16:31:02] that log bot message says it failed [16:31:10] https://phabricator.wikimedia.org/T274459#6879233 [16:31:16] see the bold one [16:32:18] we'll know soon if it happens for the second one as well, already running [16:33:01] the failure it's in your shell, can also be if you didn't make the dns change go through [16:33:08] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10Urbanecm) [16:33:20] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gitlab1001.eqiad.wmnet [16:33:23] (03CR) 10Jbond: "have run manually on netbox and looks good to me" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [16:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:29] 10SRE, 10GitLab, 10vm-requests, 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `gitlab1001.eqiad.wmnet` - gitlab1001.eqiad.wmnet (**PASS**) - Do... 
[16:34:09] volans: https://phabricator.wikimedia.org/P14617 [16:34:55] mutante: ack, thx, please manually run the sre.dns.netbox cookbook [16:35:02] (03PS1) 10JMeybohm: files/ssl: Add cetificate for kubestagemaster.svc.eqiad.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/668122 (https://phabricator.wikimedia.org/T276305) [16:35:38] (03PS2) 10Alexandros Kosiaris: files/ssl: Add certificate for kubestagemaster.svc.eqiad.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/668122 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [16:36:09] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [16:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:10] (03PS1) 10Joal: Bump AQS druid datasource to 2021-02 [puppet] - 10https://gerrit.wikimedia.org/r/668123 [16:37:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] files/ssl: Add certificate for kubestagemaster.svc.eqiad.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/668122 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [16:38:02] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::node: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/666930 (owner: 10Muehlenhoff) [16:38:32] (03CR) 10Klausman: [V: 03+1 C: 03+2] modules/hiera: clean out old (<3.8) ROCm configs [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [16:38:34] (03CR) 10JMeybohm: [C: 03+2] files/ssl: Add certificate for kubestagemaster.svc.eqiad.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/668122 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [16:39:02] (03PS5) 10Klausman: modules/hiera: clean out old (<3.8) ROCm configs [puppet] - 10https://gerrit.wikimedia.org/r/668091 [16:39:10] (03CR) 10Klausman: [V: 03+2 C: 03+2] modules/hiera: clean out old (<3.8) ROCm configs [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [16:40:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:40:55] 
volans: done. generated DNS, checked diff, zonefiles have been deployed [16:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:04] both VMs are removed [16:41:13] mutante: thx [16:42:05] (03CR) 10Ahmon Dancy: [C: 03+1] "Tested this morning. Looks great." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667310 (owner: 10Brennen Bearnes) [16:43:04] (03CR) 10Dzahn: [C: 03+2] "doesn't make sense anymore if VMs are not behind caches" [puppet] - 10https://gerrit.wikimedia.org/r/668057 (owner: 10Dzahn) [16:43:12] (03PS1) 10Ottomata: Add a consumers.analytics-hadoop setting to automate ingestion of streams intod HDFS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) [16:43:40] (03PS1) 10Ottomata: an-test - Declare camus jobs based on a new stream setting instead of destination_event_service [puppet] - 10https://gerrit.wikimedia.org/r/668125 (https://phabricator.wikimedia.org/T273901) [16:43:51] elukey: if you have a minute - https://gerrit.wikimedia.org/r/c/operations/puppet/+/668123 [16:44:14] (03CR) 10JMeybohm: [C: 03+2] Change kubestagemaster.svc.equiad.wmnet to point to new master [dns] - 10https://gerrit.wikimedia.org/r/667983 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [16:44:48] (03CR) 10jerkins-bot: [V: 04-1] an-test - Declare camus jobs based on a new stream setting instead of destination_event_service [puppet] - 10https://gerrit.wikimedia.org/r/668125 (https://phabricator.wikimedia.org/T273901) (owner: 10Ottomata) [16:45:14] (03PS2) 10Dzahn: Revert "site: add gitlab VMs with placeholder role" [puppet] - 10https://gerrit.wikimedia.org/r/668056 (https://phabricator.wikimedia.org/T274459) [16:45:21] 10SRE, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10jbond) I have applied the changes in the import script and tested against against [[ 
https://netbox.wikimedia.org/dcim/devices/2552/ | ganeti224]] and i now see the Primary IPv6: 2620:0:860... [16:46:34] (03CR) 10Dzahn: [C: 03+2] Revert "site: add gitlab VMs with placeholder role" [puppet] - 10https://gerrit.wikimedia.org/r/668056 (https://phabricator.wikimedia.org/T274459) (owner: 10Dzahn) [16:46:41] (03PS3) 10Dzahn: Revert "site: add gitlab VMs with placeholder role" [puppet] - 10https://gerrit.wikimedia.org/r/668056 (https://phabricator.wikimedia.org/T274459) [16:46:47] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: weekly rebuild [puppet] - 10https://gerrit.wikimedia.org/r/665991 [16:49:24] (03CR) 10Klausman: "> It would be great if you could update https://wikitech.wikimedia.org/wiki/Reprepro with something like "Removing a component" when you'v" [puppet] - 10https://gerrit.wikimedia.org/r/668091 (owner: 10Klausman) [16:50:59] (03PS3) 10Giuseppe Lavagetto: docker::baseimages: weekly rebuild [puppet] - 10https://gerrit.wikimedia.org/r/665991 [16:51:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] service::node: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/666930 (owner: 10Muehlenhoff) [16:52:00] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28353/console" [puppet] - 10https://gerrit.wikimedia.org/r/665991 (owner: 10Giuseppe Lavagetto) [16:53:40] (03PS5) 10JMeybohm: Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) [16:53:42] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] docker::baseimages: weekly rebuild [puppet] - 10https://gerrit.wikimedia.org/r/665991 (owner: 10Giuseppe Lavagetto) [16:55:50] (03CR) 10JMeybohm: [C: 03+2] Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668015 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [16:56:15] (03PS3) 
10JMeybohm: kubernetes staging: Move k8s, docker and calico version to common [puppet] - 10https://gerrit.wikimedia.org/r/668081 (https://phabricator.wikimedia.org/T276305) [16:56:29] (03PS1) 10Jbond: cfssl::cert: update default key size to 256 and drop rsa [puppet] - 10https://gerrit.wikimedia.org/r/668127 [16:57:37] 10SRE, 10Mail, 10observability, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10jcrespo) [16:58:32] (03PS1) 10Jbond: P:pki::root_ca: create ocsp certificate [puppet] - 10https://gerrit.wikimedia.org/r/668128 [16:59:49] (03CR) 10JMeybohm: [C: 03+2] kubernetes staging: Move k8s, docker and calico version to common [puppet] - 10https://gerrit.wikimedia.org/r/668081 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [17:03:22] (03CR) 10Elukey: [C: 03+2] Bump AQS druid datasource to 2021-02 [puppet] - 10https://gerrit.wikimedia.org/r/668123 (owner: 10Joal) [17:03:33] (03CR) 10Jbond: [C: 03+2] cfssl::cert: update default key size to 256 and drop rsa [puppet] - 10https://gerrit.wikimedia.org/r/668127 (owner: 10Jbond) [17:03:39] (03CR) 10Jbond: [C: 03+2] P:pki::root_ca: create ocsp certificate [puppet] - 10https://gerrit.wikimedia.org/r/668128 (owner: 10Jbond) [17:03:44] mutante: ok to merge? [17:03:50] (also o/) [17:04:30] elukey: mutante: i now have the prompt :P [17:04:32] elukey: yes, but blocked on jaime [17:04:39] quadruple combo. that's a first [17:04:40] ahahhaha [17:04:48] merge it all (TM) [17:05:00] merging :) [17:05:02] +1 for me [17:05:05] :) [17:05:52] (03PS1) 10Papaul: DHCP: Add MAC adress for backup2003 [puppet] - 10https://gerrit.wikimedia.org/r/668129 (https://phabricator.wikimedia.org/T274185) [17:06:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:05] (03CR) 10Brennen Bearnes: "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667310 (owner: 10Brennen Bearnes) [17:07:07] (03PS1) 10Ottomata: Set destination_event_serivce: eventgate-main for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668131 (https://phabricator.wikimedia.org/T273901) [17:07:23] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: swap gerrit log stream to be ecs-only [puppet] - 10https://gerrit.wikimedia.org/r/668109 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:09:22] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [17:09:22] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Resyncing database from scratch [17:09:22] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Resyncing database from scratch [17:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:00] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (10lucyblackwell) Approved! 
[17:10:21] (03PS1) 10Bstorm: cloudvirt: disable systemd paging for virts that run backy [puppet] - 10https://gerrit.wikimedia.org/r/668132 [17:10:23] (03PS2) 10Ottomata: Set destination_event_serivce: eventgate-main for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668131 (https://phabricator.wikimedia.org/T273901) [17:10:39] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC adress for backup2003 [puppet] - 10https://gerrit.wikimedia.org/r/668129 (https://phabricator.wikimedia.org/T274185) (owner: 10Papaul) [17:11:16] (03PS1) 10Jbond: O:pki: drop rsa certificate as they are no longer supported [puppet] - 10https://gerrit.wikimedia.org/r/668133 [17:12:13] (03CR) 10Ottomata: [C: 03+2] Set destination_event_serivce: eventgate-main for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668131 (https://phabricator.wikimedia.org/T273901) (owner: 10Ottomata) [17:12:41] (03CR) 10Jbond: [C: 03+2] O:pki: drop rsa certificate as they are no longer supported [puppet] - 10https://gerrit.wikimedia.org/r/668133 (owner: 10Jbond) [17:13:03] (03CR) 10Bstorm: "We haven't used the role-based targeting much for cloudvirts, but it seemed appropriate here." 
[puppet] - 10https://gerrit.wikimedia.org/r/668132 (owner: 10Bstorm) [17:13:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [17:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:55] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Set destination_event_serivce: eventgate-main for rdf-streaming-updater streams - T273901 (duration: 01m 08s) [17:14:01] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10DVrandecic) [17:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:14] T273901: Automate event stream ingestion into HDFS for streams that don't use EventGate - https://phabricator.wikimedia.org/T273901 [17:15:28] (03PS1) 10Dzahn: Revert "DHCP: add MAC address for gitlab1001 VM" [puppet] - 10https://gerrit.wikimedia.org/r/668064 [17:15:42] (03PS1) 10KartikMistry: Update apertium to 2021-03-03-170806-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/668134 (https://phabricator.wikimedia.org/T274262) [17:15:44] (03CR) 10jerkins-bot: [V: 04-1] Revert "DHCP: add MAC address for gitlab1001 VM" [puppet] - 10https://gerrit.wikimedia.org/r/668064 (owner: 10Dzahn) [17:17:18] (03PS1) 10Dzahn: Revert "DHCP: add MAC address for gitlab1002" [puppet] - 10https://gerrit.wikimedia.org/r/668065 [17:20:22] (03PS2) 10Dzahn: Revert "DHCP: add MAC address for gitlab1001 VM" [puppet] - 10https://gerrit.wikimedia.org/r/668064 [17:20:38] 10SRE, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10Volans) Nice! I think that the script deletes ifaces not present in PuppetDB but doesn't do that for IPs... maybe we should also do that. Thoughts? 
[17:23:00] (03CR) 10Dzahn: [C: 03+2] Revert "DHCP: add MAC address for gitlab1001 VM" [puppet] - 10https://gerrit.wikimedia.org/r/668064 (owner: 10Dzahn) [17:23:23] (03CR) 10Dzahn: [C: 03+2] Revert "DHCP: add MAC address for gitlab1002" [puppet] - 10https://gerrit.wikimedia.org/r/668065 (owner: 10Dzahn) [17:23:33] (03PS1) 10Papaul: Add backup2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/668137 (https://phabricator.wikimedia.org/T274185) [17:25:21] (03CR) 10Papaul: [C: 03+2] Add backup2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/668137 (https://phabricator.wikimedia.org/T274185) (owner: 10Papaul) [17:26:27] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10Papaul) [17:27:44] (03PS2) 10Dzahn: Revert "DHCP: add MAC address for gitlab1002" [puppet] - 10https://gerrit.wikimedia.org/r/668065 [17:28:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10Papaul) a:05Papaul→03jcrespo @jcrespo all yours [17:28:08] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10Legoktm) I wholeheartedly endorse this request, though I had suggested to Daimona that they get deploy access rather than just "restricted" :) But yes, I think just being able to run mainten... 
[17:28:20] (03CR) 10Dzahn: [C: 03+2] Revert "DHCP: add MAC address for gitlab1002" [puppet] - 10https://gerrit.wikimedia.org/r/668065 (owner: 10Dzahn) [17:28:25] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10Legoktm) [17:29:48] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1001.eqiad.wmnet with reason: REIMAGE [17:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:58] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubestage1001.eqiad.wmnet with reason: REIMAGE [17:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:46] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1002.eqiad.wmnet with reason: REIMAGE [17:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:56] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubestage1002.eqiad.wmnet with reason: REIMAGE [17:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:45] (03CR) 10BryanDavis: [C: 03+1] striker: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668009 (owner: 10Muehlenhoff) [17:33:55] (03PS1) 10Ottomata: Bump to 2021-03-03-172637-production to get rdf_streaming_updater schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/668139 (https://phabricator.wikimedia.org/T273901) [17:37:13] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/28351/" [puppet] - 10https://gerrit.wikimedia.org/r/668013 (owner: 10Muehlenhoff) [17:38:04] (03PS2) 10Dzahn: hiera/scap: remove deploy2001 from firewalls and dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/667041 (https://phabricator.wikimedia.org/T275832) [17:38:46] (03PS3) 10Dzahn: hiera/scap: remove deploy2001 from deployment_hosts [puppet] 
- 10https://gerrit.wikimedia.org/r/667041 (https://phabricator.wikimedia.org/T275832) [17:41:09] (03CR) 10Dzahn: [C: 03+2] hiera/scap: remove deploy2001 from deployment_hosts [puppet] - 10https://gerrit.wikimedia.org/r/667041 (https://phabricator.wikimedia.org/T275832) (owner: 10Dzahn) [17:42:22] (03CR) 10Ottomata: [C: 03+2] Bump to 2021-03-03-172637-production to get rdf_streaming_updater schemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/668139 (https://phabricator.wikimedia.org/T273901) (owner: 10Ottomata) [17:42:29] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) Update on progress: I discussed the possibilities and situation with @jbond, with the idea that adapting RemoteUserBackend was the general consensus of the above discussion. I have made th... [17:46:27] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [17:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:26] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [17:49:26] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [17:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:25] (03PS1) 10Dzahn: tcpircbot: remove deploy1001/deploy2001 [puppet] - 10https://gerrit.wikimedia.org/r/668142 (https://phabricator.wikimedia.org/T275831) [17:55:27] (03CR) 10Dzahn: [C: 03+2] tcpircbot: remove deploy1001/deploy2001 [puppet] - 10https://gerrit.wikimedia.org/r/668142 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [17:56:01] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[17:56:01] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [17:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:04] (03CR) 10Dzahn: "puppet is broken on alert1001" [puppet] - 10https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) (owner: 10Effie Mouzeli) [17:58:37] oups I will fix it [17:59:00] effie: cool, you are still here, thanks [17:59:18] yeah I just realised what broke it [17:59:43] yup, just wanted to make sure i did not touch that IRC bot running there [17:59:59] the one that broke yesterday [18:00:40] (03PS1) 10JMeybohm: admin_ng: Enable staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/668143 (https://phabricator.wikimedia.org/T276305) [18:01:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: Enable staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/668143 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [18:03:10] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Enable staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/668143 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [18:04:10] (03Merged) 10jenkins-bot: admin_ng: Enable staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/668143 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [18:05:28] (03PS1) 10Ottomata: eventgate-main should also use schemas/event/secondary repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/668144 (https://phabricator.wikimedia.org/T273901) [18:07:52] brennen: let me know whenever you're ready and I'll merge 667310, don't want to jump the gun :) [18:08:20] (03CR) 10Ottomata: [C: 03+2] eventgate-main should also use schemas/event/secondary repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/668144 (https://phabricator.wikimedia.org/T273901) 
(owner: 10Ottomata) [18:09:00] (03PS1) 10Papaul: Add ms-backup200[1-2] MAC address and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/668145 (https://phabricator.wikimedia.org/T274202) [18:09:01] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [18:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:20] (03CR) 10Hashar: [C: 03+1] "I am entirely for dropping the legacy events injected to logstash-* . I have moved the dashboard to use the new ecs based event and that " [puppet] - 10https://gerrit.wikimedia.org/r/668109 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:12:19] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [18:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:41] (03PS2) 10Papaul: Add ms-backup200[1-2] MAC address and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/668145 (https://phabricator.wikimedia.org/T274202) [18:15:22] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [18:15:22] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [18:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:28] (03PS1) 10Effie Mouzeli: mediawiki::alerts: fix mediawiki-error-rate check [puppet] - 10https://gerrit.wikimedia.org/r/668166 [18:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:51] (03CR) 10Papaul: [C: 03+2] Add ms-backup200[1-2] MAC address and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/668145 (https://phabricator.wikimedia.org/T274202) (owner: 10Papaul) [18:16:13] rzl: thanks! it's been tested so should be ready to go. :) [18:16:14] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . 
[18:16:14] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' . [18:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:45] brennen: o7 [18:17:18] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [18:17:19] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [18:17:19] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [18:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:45] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [18:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:34] (03CR) 10RLazarus: [C: 03+2] logspam-watch: histograms, helptext, and utf-8 handling [puppet] - 10https://gerrit.wikimedia.org/r/667310 (owner: 10Brennen Bearnes) [18:18:58] (03PS1) 10Dzahn: site: remove deploy1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/668168 (https://phabricator.wikimedia.org/T275831) [18:19:06] RECOVERY - Prometheus k8s-staging cache not updating on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [18:19:54] RECOVERY - Prometheus k8s-staging cache not updating on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [18:20:40] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [18:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:21:04] (03CR) 10Ryan Kemper: [C: 03+2] [wdqs] buffer 250 messages instead of 1000 [puppet] - 10https://gerrit.wikimedia.org/r/667541 (owner: 10DCausse) [18:21:15] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [18:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:49] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28356/cloudvirt1024.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/668132 (owner: 10Bstorm) [18:22:00] (03CR) 10Urbanecm: [C: 03+1] Use more generic non-team specific name for alerts [puppet] - 10https://gerrit.wikimedia.org/r/668113 (https://phabricator.wikimedia.org/T264665) (owner: 10Jdlrobson) [18:22:22] annet: do you need help with backporting? [18:22:51] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [18:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:23] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [18:23:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:40] (03CR) 10Cwhite: [C: 03+1] mediawiki::alerts: fix mediawiki-error-rate check [puppet] - 10https://gerrit.wikimedia.org/r/668166 (owner: 10Effie Mouzeli) [18:24:07] (03CR) 10Legoktm: "recheck" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/659091 (owner: 10Legoktm) [18:24:23] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' . [18:24:24] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . [18:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:29] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::alerts: fix mediawiki-error-rate check [puppet] - 10https://gerrit.wikimedia.org/r/668166 (owner: 10Effie Mouzeli) [18:24:32] (03CR) 10Ahmon Dancy: [C: 04-1] "Holding for improvements." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 (owner: 10Ahmon Dancy) [18:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:09] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [18:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:05] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [18:26:05] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [18:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:32] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . 
[18:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:41] (03CR) 10Legoktm: [V: 03+2 C: 03+2] d/changelog: Bump version to 0.0.11 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/659091 (owner: 10Legoktm) [18:26:45] (03PS1) 10Papaul: Add ms-backup200[1-2] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/668171 (https://phabricator.wikimedia.org/T274202) [18:27:20] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [18:27:20] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [18:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:34] (03PS1) 10RhinosF1: Also requet timestamp|snippet from non-page results [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668147 (https://phabricator.wikimedia.org/T271174) [18:27:47] (03PS2) 10Urbanecm: Also requet timestamp|snippet from non-page results [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668147 (https://phabricator.wikimedia.org/T271174) (owner: 10RhinosF1) [18:28:23] RhinosF1: thanks for offering! I'm getting a patch ready to cherry-pick the fix but I will need someone to deploy. I could wait until the backport window in 30 minutes, though [18:28:23] liw: ^^^mind me backporting the above? [18:28:34] annet: I can do it for you now if you're ready :) [18:28:49] jouncebot: now [18:28:49] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [18:28:55] annet: we got a cherry pick ready. Just needs you to test it. 
[18:29:05] (03CR) 10Papaul: [C: 03+2] Add ms-backup200[1-2] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/668171 (https://phabricator.wikimedia.org/T274202) (owner: 10Papaul) [18:29:07] yup [18:29:09] Urbanecm: we're in a releng meeting, may be one moment [18:29:12] oh thanks, y'all are the best! looking... [18:29:19] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [18:29:21] it's not yet prepared for testing annet [18:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:33] cool, will look out for it [18:30:01] annet: I added you to the patch as a review. Urbanecm will ping you when it's ready to test. It's similar to the config patches you've done before. [18:30:12] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [18:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:23] brennen: I'm not sure i understand -- I'm happy to do it, it's more of a headsup [18:30:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-backup2001.codfw.wmnet ` The log can be foun... [18:30:38] Urbanecm, sure, go ahead, pick a suitable time [18:30:59] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[18:31:02] (03CR) 10Urbanecm: [C: 03+2] "UBN" [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668147 (https://phabricator.wikimedia.org/T271174) (owner: 10RhinosF1) [18:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:17] (03CR) 10Anne Tomasevich: [C: 03+1] Also requet timestamp|snippet from non-page results [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668147 (https://phabricator.wikimedia.org/T271174) (owner: 10RhinosF1) [18:32:48] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [18:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:10] (03PS3) 10Urbanecm: Also requet timestamp|snippet from non-page results [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668147 (https://phabricator.wikimedia.org/T271174) (owner: 10RhinosF1) [18:33:17] (03CR) 10Urbanecm: [C: 03+2] Also requet timestamp|snippet from non-page results [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668147 (https://phabricator.wikimedia.org/T271174) (owner: 10RhinosF1) [18:33:37] (just a commit msg update) [18:33:51] jenkins says 30 mins for merge, so we might be just ready for b&c [18:33:59] Yeah! [18:34:10] Ty [18:34:34] annet: FYI, should be ready in half an hour [18:34:36] (03PS1) 10Hashar: logspam-watch: redraw when terminal size changes [puppet] - 10https://gerrit.wikimedia.org/r/668172 [18:34:41] got it, thank you!! [18:36:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts deploy2001.codfw.wmnet [18:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:33] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . 
[18:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:14] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [18:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:26] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [18:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:08] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [18:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:39] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [18:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:46] (03CR) 10Ahmon Dancy: [C: 04-1] "I tested this but didn't see any difference in behavior when resizing the terminal. Perhaps a demo is in order." [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar) [18:42:58] !log uploaded python3-docker-report 0.0.11 to buster-wikimedia [18:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:24] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [18:43:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts deploy2001.codfw.wmnet [18:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:46] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[18:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:57] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [18:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:43] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [18:47:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:14] volans: another decom done just now. exit_code=0 and it included ACKing DNS change. this was hardware as opposed to VM. worked [18:49:42] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [18:49:42] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [18:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:47] mutante: yeah I noticed, thx, and in general it works, had an issue just with VMs in a couple of times, it seems a race or cache issue so far [18:50:58] we'll try some temporary workaround [18:51:08] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [18:51:12] cc chaomodus (I was thinking to add a sleep for VMs) [18:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:31] seems reasonable :) [18:51:33] volans: yep, thanks [18:52:14] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[18:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:08] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:02] (03PS3) 10Urbanecm: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667691 (https://phabricator.wikimedia.org/T275550) [18:56:16] (03PS4) 10Urbanecm: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667691 (https://phabricator.wikimedia.org/T275550) [18:56:46] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup2001.codfw.wmnet with reason: REIMAGE [18:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:50] annet: we should be good to go in a minute [18:58:11] RhinosF1: cool! [18:58:38] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-backup2001.codfw.wmnet with reason: REIMAGE [18:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:45] (03Merged) 10jenkins-bot: Also requet timestamp|snippet from non-page results [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668147 (https://phabricator.wikimedia.org/T271174) (owner: 10RhinosF1) [19:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T1900). Please do the needful. [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:05] liw and longma: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T1900). Please do the needful. [19:00:13] I'll deploy today [19:00:14] just in time [19:00:18] Perfect! 
[19:00:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:39] (03PS1) 10Ryan Kemper: wdqs: expose wdqs1009 externally [puppet] - 10https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) [19:01:01] (03CR) 10jerkins-bot: [V: 04-1] wdqs: expose wdqs1009 externally [puppet] - 10https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: 10Ryan Kemper) [19:01:03] annet: your patch is pulled to mwdebug1001, can you test it? [19:01:23] Urbanecm: yep, will look in just a minute [19:01:27] thanks [19:01:44] * RhinosF1 watching [19:03:07] (03PS4) 10Urbanecm: Enable Growth features on eowiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123) [19:04:49] (03PS9) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [19:04:54] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667691 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [19:05:50] (03Merged) 10jenkins-bot: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667691 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [19:06:06] annet: do let me know if you need any help. [19:06:21] 10SRE, 10SRE-tools: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435 (10awight) 05Open→03Declined Can we abandon this task now? Python 2 has been long buried. [19:06:53] Urbanecm: yeah, this error isn't visible in the UI, so I'm not sure how I should be testing it. Sorry, haven't run into this before [19:07:21] Urbanecm: anything in logstash or sync? 
[19:07:35] annet: so, you need to do some actions in the UI that previously triggered the notice. Then, go to logstash.wikimedia.org, look at the "mwdebug servers" dashboard, and make sure there is no notice. [19:07:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-backup2001.codfw.wmnet'] ` and were **ALL** successful. [19:07:49] 10SRE, 10vm-requests: eqiad: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276380 (10Legoktm) [19:07:54] 10SRE, 10vm-requests: eqiad: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276380 (10Legoktm) a:03Legoktm [19:08:29] (03PS2) 10Ryan Kemper: wdqs: expose wdqs1009 externally [puppet] - 10https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) [19:08:34] 10SRE, 10vm-requests: codfw: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276381 (10Legoktm) [19:08:38] Urbanecm: thanks, on it... [19:08:42] 10SRE, 10vm-requests: codfw: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276381 (10Legoktm) a:03Legoktm [19:08:58] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[19:09:00] 10SRE, 10vm-requests: codfw: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276381 (10Legoktm) [19:09:02] 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Legoktm) [19:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:06] 10SRE, 10vm-requests: eqiad: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276380 (10Legoktm) [19:09:30] I'm also monitoring that dashboard myself [19:09:37] (03PS10) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [19:10:19] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28359/console" [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [19:11:02] Urbanecm: ok, I've done the action that would have triggered the bug several times and am not seeing anything [19:11:08] cool [19:11:17] not seeing anything as well [19:11:18] syncing [19:11:22] great! [19:12:05] 10SRE, 10SRE-tools: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435 (10Andrew) There's certainly still a fair bit of active Python 2 code in the puppet repo.
[19:12:51] (03CR) 10Razzi: [V: 03+1] "@Marostegui thanks for the reminder; PCC run is at https://puppet-compiler.wmflabs.org/compiler1003/28359/" [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [19:13:14] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro) [19:14:20] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/WikibaseMediaInfo/src/Special/SpecialMediaSearch.php: b741dc32c59700cb0cdcd82f2d951cf993679689: Also requet timestamp|snippet from non-page results (T271174; T276353) (duration: 01m 09s) [19:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:28] T271174: [M] MediaSearch: display namespaces and snippets for search results on "Categories and Pages" tab - https://phabricator.wikimedia.org/T271174 [19:14:28] T276353: PHP Notice: Undefined index: timestamp - https://phabricator.wikimedia.org/T276353 [19:14:36] annet: enjoy, should be live. [19:15:23] Can we call that task resolved then? [19:15:42] 10SRE, 10SRE-tools: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435 (10awight) 05Declined→03Open Okay, thanks for explaining and apologies for the spam! It's slightly alarming to hear this because py27 has been end-of-life for over a year, but I understand that... [19:16:02] RhinosF1: I'd recommend to monitor the logs for a while [19:16:07] (03PS2) 10Ottomata: Configure spark to work better with conda environments [puppet] - 10https://gerrit.wikimedia.org/r/667689 (https://phabricator.wikimedia.org/T272313) [19:16:09] (03PS3) 10Ahmon Dancy: env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 [19:16:20] Urbanecm: ack [19:16:31] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/665058 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [19:16:37] (03CR) 10Ahmon Dancy: logspam-watch: histograms, helptext, and utf-8 handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667310 (owner: 10Brennen Bearnes) [19:16:53] liw: second blocker should be fixed [19:17:03] RhinosF1, thanks! [19:17:23] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ms-backup2002.codfw.wmnet ` The log can be found in `/var/log/wmf-aut... [19:17:37] RhinosF1 & Urbanecm: I'm not seeing the error pop up anymore (just visiting https://commons.wikimedia.org/wiki/Special:MediaSearch?type=bitmap&q=cat would have triggered it before), so hopefully we're out of the woods. Thanks so much for resolving this and for your kind help! [19:17:44] (03CR) 10Volans: [C: 03+1] "I'm missing the specific toolforge context, for the rest LGTM. See reply inline for unrelated comment." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [19:17:50] cool, that's nice to hear! [19:18:01] liw: np but I just clicked cherry pick and found everyone. Urbanecm + annet deserve most of the credit. 
[19:18:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:18:22] I deserve all the credit for *causing* the issue, for sure :D [19:18:35] :D [19:18:42] :D [19:19:08] (03CR) 10Ahmon Dancy: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 (owner: 10Ahmon Dancy) [19:19:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission analytics10[42-57] - https://phabricator.wikimedia.org/T267932 (10Cmjohnson) 05Open→03Resolved All of the servers have been removed from the racks, the netbox script was run and cookbook on cumin. [19:19:32] doesn't stashbot advertise stickers for those who break and fix the wikis :P [19:19:51] (03PS1) 10Urbanecm: rowiki: Make Growth features available to ro newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668175 (https://phabricator.wikimedia.org/T275130) [19:19:56] Yeah I'm pretty sure you get one for that Majavah [19:20:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:20:08] (03CR) 10Brennen Bearnes: logspam-watch: histograms, helptext, and utf-8 handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667310 (owner: 10Brennen Bearnes) [19:20:08] You'll have to get it to issue one [19:21:24] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:29] (03PS1) 10Urbanecm: Revert "Enable Growth features on sqwiki in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668148 [19:21:54] (03CR) 10Urbanecm: [C: 03+2] Revert "Enable Growth features on sqwiki in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668148 (owner: 10Urbanecm) [19:21:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1075.eqiad.wmnet - https://phabricator.wikimedia.org/T274235 (10Cmjohnson) [19:22:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1075.eqiad.wmnet - https://phabricator.wikimedia.org/T274235 (10Cmjohnson) 05Open→03Resolved [19:22:23] (03CR) 10Dduvall: [C: 03+1] env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 (owner: 10Ahmon Dancy) [19:22:45] (03Merged) 10jenkins-bot: Revert "Enable Growth features on sqwiki in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668148 (owner: 10Urbanecm) [19:22:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Cmjohnson) [19:22:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Cmjohnson) 05Open→03Resolved [19:22:57] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [19:23:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (10Cmjohnson) [19:23:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (10Cmjohnson) 05Open→03Resolved [19:23:39] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [19:24:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1090.eqiad.wmnet - https://phabricator.wikimedia.org/T274333 (10Cmjohnson) [19:24:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1090.eqiad.wmnet - https://phabricator.wikimedia.org/T274333 (10Cmjohnson) 05Open→03Resolved [19:24:34] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [19:25:24] (03PS2) 10Urbanecm: rowiki: Make Growth features available to ro newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668175 
(https://phabricator.wikimedia.org/T275130) [19:25:48] (03CR) 10Urbanecm: [C: 03+2] rowiki: Make Growth features available to ro newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668175 (https://phabricator.wikimedia.org/T275130) (owner: 10Urbanecm) [19:26:36] (03Merged) 10jenkins-bot: rowiki: Make Growth features available to ro newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668175 (https://phabricator.wikimedia.org/T275130) (owner: 10Urbanecm) [19:28:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7221371e12353120ce2d659f020ae666fe5dfb00: rowiki: Make Growth features available to ro newcomers (T275130) (duration: 01m 10s) [19:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:42] T275130: Deploy Growth features on Romanian Wikipedia - https://phabricator.wikimedia.org/T275130 [19:30:12] (03PS3) 10Ottomata: Configure spark to work better with conda environments [puppet] - 10https://gerrit.wikimedia.org/r/667689 (https://phabricator.wikimedia.org/T272313) [19:30:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:30:15] (03PS1) 10Urbanecm: Help panel: Do not require help desk to be configured [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668149 (https://phabricator.wikimedia.org/T273118) [19:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:22] (03CR) 10Urbanecm: [C: 03+2] Help panel: Do not require help desk to be configured [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668149 (https://phabricator.wikimedia.org/T273118) (owner: 10Urbanecm) [19:30:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1092.eqiad.wmnet - https://phabricator.wikimedia.org/T275019 (10Cmjohnson) [19:30:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1092.eqiad.wmnet - 
https://phabricator.wikimedia.org/T275019 (10Cmjohnson) 05Open→03Resolved [19:30:35] (03PS1) 10Urbanecm: Help panel: Do not require help desk to be configured [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/668150 (https://phabricator.wikimedia.org/T273118) [19:30:43] (03CR) 10Urbanecm: [C: 03+2] Help panel: Do not require help desk to be configured [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/668150 (https://phabricator.wikimedia.org/T273118) (owner: 10Urbanecm) [19:30:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1093.eqiad.wmnet - https://phabricator.wikimedia.org/T273955 (10Cmjohnson) [19:31:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1093.eqiad.wmnet - https://phabricator.wikimedia.org/T273955 (10Cmjohnson) 05Open→03Resolved [19:31:10] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [19:31:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 (10Cmjohnson) [19:32:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 (10Cmjohnson) 05Open→03Resolved [19:32:03] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [19:32:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1095 - https://phabricator.wikimedia.org/T273732 (10Cmjohnson) [19:32:29] 10SRE, 10vm-requests: eqiad: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276380 (10Legoktm) [19:32:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1095 - https://phabricator.wikimedia.org/T273732 (10Cmjohnson) 
05Open→03Resolved [19:32:36] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [19:32:43] 10SRE, 10vm-requests: codfw: 2 VM request for docker-registry - https://phabricator.wikimedia.org/T276381 (10Legoktm) [19:32:51] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:33:07] !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host registry1003.eqiad.wmnet [19:33:11] (03CR) 10Andrew Bogott: [C: 03+2] labs-ip-alias-dump.py: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663870 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [19:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:44] 10SRE, 10SRE-tools, 10Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435 (10Aklapper) [19:34:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T238957 (10Cmjohnson) 05Open→03Resolved This is off the rack and offlined in netbox. I do not see any ip info setup for it. 
[19:36:37] (03PS1) 10Urbanecm: dawiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668177 [19:36:59] (03CR) 10Urbanecm: [C: 03+2] dawiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668177 (owner: 10Urbanecm) [19:38:05] (03Merged) 10jenkins-bot: dawiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668177 (owner: 10Urbanecm) [19:38:39] (03CR) 10Ottomata: [C: 03+2] Configure spark to work better with conda environments (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667689 (https://phabricator.wikimedia.org/T272313) (owner: 10Ottomata) [19:38:59] !log urbanecm@deploy1002 sync-file aborted: 7acb37c9b89d192bafc3b54adc33b569b3cea869: dawiki: Deploy Growth features to newcomers (duration: 00m 03s) [19:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix bug in conda-deactivate-stacked that would cause infinite loop [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667909 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [19:40:25] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Symlink conda-(de)activate-stacked scripts into user env instead of cp [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667913 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [19:40:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7acb37c9b89d192bafc3b54adc33b569b3cea869: dawiki: Deploy Growth features to newcomers (T256126) (duration: 01m 09s) [19:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:33] T256126: Deploy Growth features on Danish Wikipedia - https://phabricator.wikimedia.org/T256126 [19:41:18] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host gitlab1001.wikimedia.org [19:41:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:51] 10SRE, 10GitLab, 10vm-requests, 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) ` dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A --vcpus 8 --memory 12 --disk 100 --network public gitlab1001.wikimedia.org Ready to create... [19:41:55] (03CR) 10Ahmon Dancy: Rsync private mediawiki files to releases server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [19:42:05] 10SRE, 10GitLab, 10vm-requests, 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) p:05Medium→03High [19:43:01] (03Merged) 10jenkins-bot: Help panel: Do not require help desk to be configured [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668149 (https://phabricator.wikimedia.org/T273118) (owner: 10Urbanecm) [19:43:04] (03Merged) 10jenkins-bot: Help panel: Do not require help desk to be configured [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/668150 (https://phabricator.wikimedia.org/T273118) (owner: 10Urbanecm) [19:43:08] 10SRE, 10vm-requests, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) [19:43:15] (03PS2) 10Ottomata: Symlink conda-(de)activate-stacked scripts into user env instead of cp [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667913 (https://phabricator.wikimedia.org/T224658) [19:43:24] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Symlink conda-(de)activate-stacked scripts into user env instead of cp [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/667913 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [19:44:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is 
CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:45:09] 10SRE, 10vm-requests, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) >>! In T274459#6877185, @wkandek wrote: > Please recreate the 2 VMs in the VLAN that allows for direct external IP addresses. > > Afte... [19:46:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:47:07] (03PS1) 10JMeybohm: kube_env: Add bash completion [puppet] - 10https://gerrit.wikimedia.org/r/668181 [19:47:32] (03PS2) 10Ottomata: Finalize WMDE Technical Wishes schema ingestion migration to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/666729 (https://phabricator.wikimedia.org/T275005) [19:47:50] (03PS2) 10Ottomata: Finalize Growth schema ingestion migration to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/666730 (https://phabricator.wikimedia.org/T267333) [19:48:22] !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry1003.eqiad.wmnet [19:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:58] (03PS1) 10Legoktm: install_server: Add registry1003.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668182 (https://phabricator.wikimedia.org/T276380) [19:50:31] (03CR) 10Ottomata: [C: 03+2] Finalize WMDE Technical Wishes schema ingestion migration to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/666729 (https://phabricator.wikimedia.org/T275005) (owner: 10Ottomata) [19:50:41] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28361/console" [puppet] - 10https://gerrit.wikimedia.org/r/668181 (owner: 10JMeybohm) [19:50:43] (03CR) 10Ottomata: [C: 03+2] Finalize Growth schema ingestion migration to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/666730 (https://phabricator.wikimedia.org/T267333) (owner: 10Ottomata) [19:51:32] (03CR) 10JMeybohm: kube_env: Add bash completion [puppet] - 10https://gerrit.wikimedia.org/r/668181 (owner: 10JMeybohm) [19:51:51] (03CR) 10JMeybohm: [V: 03+1] kube_env: Add bash completion [puppet] - 10https://gerrit.wikimedia.org/r/668181 (owner: 10JMeybohm) [19:52:36] (03PS1) 10Dzahn: site: add gitlab1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/668183 (https://phabricator.wikimedia.org/T274459) [19:53:26] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.32/extensions/GrowthExperiments/: a036d9fa10fb522279bd5b7f2c0a14e1de7bb0ae: Help panel: Do not require help desk to be configured (T273118) (duration: 01m 10s) [19:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:32] T273118: Help panel: Remove dependency on Help Desk title existing - https://phabricator.wikimedia.org/T273118 [19:53:38] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [19:53:54] (03PS1) 10Urbanecm: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668152 (https://phabricator.wikimedia.org/T275550) [19:54:02] (03CR) 10jerkins-bot: [V: 04-1] Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668152 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [19:54:12] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10Sergey.Trofimovsky.SF) 
Sergeys everywhere! @jbond No problem, here's the new key: ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIP9JEWVUhWekpKtJWQuA3c... [19:54:35] (03CR) 10Andrew Bogott: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/668005 (owner: 10Muehlenhoff) [19:55:31] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10RhinosF1) >>! In T275722#6880487, @Sergey.Trofimovsky.SF wrote: > Sergeys everywhere! > > @jbond No problem, here's the new key: > ` > ssh-ed2... [19:57:40] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/: 4cba1843591dc689ade47aab700a47134b8c15c4: Help panel: Do not require help desk to be configured (T273118) (duration: 01m 10s) [19:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:59] (03PS2) 10Urbanecm: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668152 (https://phabricator.wikimedia.org/T275550) [19:57:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:58:12] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668152 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [19:58:30] (03CR) 10Legoktm: [C: 03+2] install_server: Add registry1003.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668182 (https://phabricator.wikimedia.org/T276380) (owner: 10Legoktm) [19:59:00] (03PS2) 10Ottomata: eventlogging: Remove multiple unused modules [puppet] - 10https://gerrit.wikimedia.org/r/657538 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [19:59:19] (03PS1) 10Sahilgrewalhere: Fixed typo "paramaters" 
[puppet] - 10https://gerrit.wikimedia.org/r/668184 (https://phabricator.wikimedia.org/T201491) [19:59:21] (03Merged) 10jenkins-bot: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668152 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [20:00:05] liw and longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T2000). [20:00:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:00:22] (03CR) 10Ottomata: [C: 03+2] eventlogging: Remove multiple unused modules [puppet] - 10https://gerrit.wikimedia.org/r/657538 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [20:00:35] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup2002.codfw.wmnet with reason: REIMAGE [20:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:31] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-backup2002.codfw.wmnet with reason: REIMAGE [20:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:17] (03CR) 10Dzahn: "When you do this, could you also do https://gerrit.wikimedia.org/r/c/operations/puppet/+/667288 while at it? 
part of the same change" [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: 10Dzahn) [20:05:10] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0120778ee8b0ffc57f180778e1bae44e931f0ba9: Enable Growth features on sqwiki in stealth mode (T275550) (duration: 01m 09s) [20:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:17] T275550: Deploy Growth features on Albanian Wikipedia - https://phabricator.wikimedia.org/T275550 [20:07:07] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10jbond) >>! In T275722#6880487, @Sergey.Trofimovsky.SF wrote: > Sergeys everywhere! > > @jbond No problem, here's the new key: > ` > ssh-ed2551... [20:08:11] (03CR) 10Legoktm: "> Really? There should be nothing in deployment-prep which has a counterpart in production which is still on buster, mc* got upgraded to B" [puppet] - 10https://gerrit.wikimedia.org/r/668005 (owner: 10Muehlenhoff) [20:09:23] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10Sergey.Trofimovsky.SF) Thanks, let's keep it this way! [20:09:40] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-backup2002.codfw.wmnet'] ` and were **ALL** successful. 
[20:10:22] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [20:10:56] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) 05Open→03Resolved @jcrespo this is complete [20:11:55] (03CR) 10Brennen Bearnes: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar) [20:12:17] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 234 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:12:37] (03PS1) 10Jbond: admin: add strofimovsky01 shll account and to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/668186 (https://phabricator.wikimedia.org/T275722) [20:12:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10jbond) [20:16:33] (03PS1) 10Legoktm: site.pp: Add new registry VMs [puppet] - 10https://gerrit.wikimedia.org/r/668187 (https://phabricator.wikimedia.org/T276380) [20:17:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Revert: Enable Growth features on sqwiki in stealth mode (T275550) (duration: 01m 10s) [20:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:01] T275550: Deploy Growth features on Albanian Wikipedia - https://phabricator.wikimedia.org/T275550 [20:19:15] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:19:47] (03PS1) 10Urbanecm: Revert "Enable Growth features on sqwiki in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668188 [20:19:49] (03CR) 10Urbanecm: [C: 03+2] Revert "Enable Growth features on sqwiki in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668188 (owner: 10Urbanecm) [20:20:12] (03CR) 10Legoktm: [C: 03+2] site.pp: Add new registry VMs [puppet] - 10https://gerrit.wikimedia.org/r/668187 (https://phabricator.wikimedia.org/T276380) (owner: 10Legoktm) [20:20:16] /31/ [20:20:17] /31/ [20:20:32] (03Merged) 10jenkins-bot: Revert "Enable Growth features on sqwiki in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668188 (owner: 10Urbanecm) [20:26:33] (03PS1) 10Cwhite: logstash: ingest logstash logs as json and convert to ECS [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) [20:31:05] (03PS2) 10Cwhite: logstash: ingest logstash logs as json and convert to ECS [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) [20:35:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab1001.wikimedia.org [20:36:01] (03PS1) 10Legoktm: conftool: Add registry1003 [puppet] - 10https://gerrit.wikimedia.org/r/668190 (https://phabricator.wikimedia.org/T276380) [20:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:42] (03CR) 10Cwhite: [C: 03+2] profile: swap gerrit log stream to be ecs-only [puppet] - 10https://gerrit.wikimedia.org/r/668109 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:40:25] (03PS1) 10Andrew Bogott: prepare_cinder_volume.py: mount with 'discard' if we have the option [puppet] - 10https://gerrit.wikimedia.org/r/668191 [20:40:33] (03CR) 
10Andrew Bogott: [C: 03+2] prometheus-labs-targets: Replace use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663872 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [20:41:14] (03CR) 10Cwhite: [C: 03+2] Use more generic non-team specific name for alerts [puppet] - 10https://gerrit.wikimedia.org/r/668113 (https://phabricator.wikimedia.org/T264665) (owner: 10Jdlrobson) [20:53:49] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [debs/pygments] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/668194 [20:53:52] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [debs/pygments] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/668194 (owner: 10QChris) [20:54:37] (03PS1) 10QChris: Import done. Revoke import grants [debs/pygments] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/668195 [20:54:40] (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [debs/pygments] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/668195 (owner: 10QChris) [20:56:56] (03PS1) 10Hashar: wikitech: enable BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668196 (https://phabricator.wikimedia.org/T125941) [21:00:04] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210303T2100). 
[21:01:56] (03CR) 10Hashar: "+ James cause he is listed for BetaFeatures config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668196 (https://phabricator.wikimedia.org/T125941) (owner: 10Hashar) [21:11:55] (03CR) 10Bstorm: [C: 03+1] "Looks legit" [puppet] - 10https://gerrit.wikimedia.org/r/668191 (owner: 10Andrew Bogott) [21:16:46] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: name=registry1002.eqiad.wmnet [21:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:25] (03CR) 10Legoktm: [C: 03+2] conftool: Add registry1003 [puppet] - 10https://gerrit.wikimedia.org/r/668190 (https://phabricator.wikimedia.org/T276380) (owner: 10Legoktm) [21:21:58] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry1003.eqiad.wmnet [21:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:25] !log legoktm@deploy1002 conftool action : set/weight=10; selector: name=registry1003.eqiad.wmnet [21:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:20] (03PS1) 10Dzahn: DHCP: add gitlab1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/668198 (https://phabricator.wikimedia.org/T274459) [21:30:14] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: name=registry1003.eqiad.wmnet [21:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:52] (03CR) 10Dzahn: [C: 03+2] DHCP: add gitlab1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/668198 (https://phabricator.wikimedia.org/T274459) (owner: 10Dzahn) [21:31:02] (03PS2) 10Dzahn: DHCP: add gitlab1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/668198 (https://phabricator.wikimedia.org/T274459) [21:35:50] 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Legoktm) registry1003 is now pooled, I did a test pull to it specifically 
and it worked fine. [21:35:54] (03CR) 10Dzahn: [C: 03+2] site: add gitlab1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/668183 (https://phabricator.wikimedia.org/T274459) (owner: 10Dzahn) [21:36:00] (03PS2) 10Dzahn: site: add gitlab1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/668183 (https://phabricator.wikimedia.org/T274459) [21:50:19] !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host registry2003.codfw.wmnet [21:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:46] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@7f37d40]: replace refinery-drop-hive-partitions with refinery-drop-older-than [21:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:24] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@7f37d40]: replace refinery-drop-hive-partitions with refinery-drop-older-than (duration: 01m 37s) [21:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:45] (03CR) 10Andrew Bogott: [C: 03+1] cloudvirt: disable systemd paging for virts that run backy [puppet] - 10https://gerrit.wikimedia.org/r/668132 (owner: 10Bstorm) [21:58:32] !log puppetmaster1001 - signing puppet cert for gitlab1001.wikmedia.org (T274459) [21:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:38] T274459: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 [22:01:09] 10SRE, 10WMF-Legal, 10Readers-Web-Backlog (Tracking), 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Jdlrobson) [22:02:10] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10RobH) [22:02:20] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - 
https://phabricator.wikimedia.org/T276396 (10RobH) [22:04:34] (03PS2) 10Dzahn: smokeping: replace mwmaint2001 with puppetmaster2002 as D5 target [puppet] - 10https://gerrit.wikimedia.org/r/667957 (https://phabricator.wikimedia.org/T275905) [22:05:01] (03CR) 10Dzahn: "ack! amending to use puppetmaster2002 instead. still same rack and internal IP" [puppet] - 10https://gerrit.wikimedia.org/r/667957 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [22:05:05] !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2003.codfw.wmnet [22:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:13] 10SRE, 10vm-requests, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) @wkandek @thcipriani A new VM gitlab1001.wikimedia.org in the public network has been created while gitlab1001.eqiad.wmnet has been de... [22:09:16] 10SRE, 10netops: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10RobH) p:05Triage→03Medium [22:09:26] (03CR) 10Andrew Bogott: [C: 03+2] prepare_cinder_volume.py: mount with 'discard' if we have the option [puppet] - 10https://gerrit.wikimedia.org/r/668191 (owner: 10Andrew Bogott) [22:09:31] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10RobH) [22:09:34] 10SRE, 10netops: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10RobH) [22:10:40] (03PS1) 10Legoktm: install_server: Add registry2003.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668210 (https://phabricator.wikimedia.org/T276381) [22:11:17] (03CR) 10Bstorm: [C: 03+2] cloudvirt: disable systemd paging for virts that run backy [puppet] - 10https://gerrit.wikimedia.org/r/668132 (owner: 10Bstorm) [22:13:13] (03PS2) 10Legoktm: install_server: Add 
registry2003.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668210 (https://phabricator.wikimedia.org/T276381) [22:13:52] (03CR) 10Dzahn: [C: 03+2] smokeping: replace mwmaint2001 with puppetmaster2002 as D5 target [puppet] - 10https://gerrit.wikimedia.org/r/667957 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [22:14:12] (03CR) 10Dzahn: [C: 03+2] smokeping: replace mwmaint2001 with puppetmaster2002 as D5 target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667957 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [22:20:30] (03CR) 10Legoktm: [C: 03+2] install_server: Add registry2003.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668210 (https://phabricator.wikimedia.org/T276381) (owner: 10Legoktm) [22:21:07] mutante: ok to merge "smokeping: replace mwmaint2001 with puppetmaster2002 as D5 target"? [22:21:07] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [22:21:11] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (10Dzahn) Please note gitlab1001/gitlab1002 with private IPs have been deleted and instead `gitlab1001.wikimedia.org` with public IP has been created,... 
[22:23:10] legoktm: yes please [22:23:14] thanks [22:23:25] {{done}} [22:42:04] (03CR) 10Dzahn: [C: 03+2] site: remove mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/667958 (https://phabricator.wikimedia.org/T275928) (owner: 10Dzahn) [22:42:10] (03PS2) 10Dzahn: site: remove mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/667958 (https://phabricator.wikimedia.org/T275928) [22:42:51] (03CR) 10Dzahn: [C: 04-1] site: remove mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/667958 (https://phabricator.wikimedia.org/T275928) (owner: 10Dzahn) [22:45:41] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 270 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:47:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mwmaint2001.codfw.wmnet [22:47:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 82 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:54] (03PS1) 10Legoktm: conftool: Add registry2003.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668221 [22:50:00] (03CR) 10Legoktm: [C: 03+2] conftool: Add registry2003.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668221 (owner: 10Legoktm) [22:50:56] !log legoktm@deploy1002 conftool action : set/pooled=no; selector: name=registry2003.codfw.wmnet [22:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:03] !log legoktm@deploy1002 conftool action : set/weight=10; selector: name=registry2003.codfw.wmnet [22:51:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:27] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:18] mutante: ^ fyi, in case you weren't already aware [22:55:23] legoktm: ACK, just reset-failed.. shrug [22:55:32] should recover [22:55:35] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwmaint2001.codfw.wmnet [22:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:57] (03CR) 10Dzahn: [C: 03+2] site: remove mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/667958 (https://phabricator.wikimedia.org/T275928) (owner: 10Dzahn) [23:03:45] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:05:03] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 48 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:05:36] (03Abandoned) 10Dzahn: site: remove deploy1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/635112 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [23:08:27] !log legoktm@deploy1002 conftool action : set/pooled=yes; selector: name=registry2003.codfw.wmnet [23:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:27] (03Abandoned) 10Dzahn: mariadb: remove mwmaint2001 from 
production-m5 grants [puppet] - 10https://gerrit.wikimedia.org/r/667288 (https://phabricator.wikimedia.org/T275928) (owner: 10Dzahn) [23:11:49] (03PS2) 10Dzahn: mariadb: prod-m5 grants: add mwmaint2002, rm mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) [23:12:02] (03CR) 10jerkins-bot: [V: 04-1] mariadb: prod-m5 grants: add mwmaint2002, rm mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: 10Dzahn) [23:13:10] (03PS3) 10Dzahn: mariadb: prod-m5 grants: add mwmaint2002, rm mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) [23:14:15] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: 10Dzahn) [23:15:22] (03CR) 10Dzahn: mariadb: prod-m5 grants: add mwmaint2002, rm mwmaint2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: 10Dzahn) [23:23:44] PROBLEM - Host mc1027 is DOWN: PING CRITICAL - Packet loss = 100% [23:32:24] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:33:56] PROBLEM - Check health of redis instance on 6379 on mc2027 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 344573 keys, up 23 days 6 hours - replication_delay is 626 https://wikitech.wikimedia.org/wiki/Redis [23:36:45] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@e47f735]: search_satisfaction_daily: make files readable by druid ingestion [23:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:14] 
PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:49:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:50:04] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 57 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:51:27] (03PS1) 10Cwhite: httpd: add wmfecsjson logformat to defaults.conf [puppet] - 10https://gerrit.wikimedia.org/r/668231 (https://phabricator.wikimedia.org/T234565)