[00:37:35] Urbanecm, Amir1: BTW, you scheduled a wiki creation window for tomorrow for T271260, but that's a no-deploy day, so it was commented out. You should pick another day.
[00:37:36] T271260: Create Wikivoyage Turkish - https://phabricator.wikimedia.org/T271260
[00:38:02] ahh...didn't know this monday's no-deploy day. Thanks for the heads-up James_F
[00:38:55] we'll reschedule :)
[00:39:01] Cool.
[05:27:48] (PS3) Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211)
[05:45:00] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:24] (PS4) Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211)
[06:04:48] (CR) Ryan Kemper: "Thanks for the first round of review! I forgot to do the obvious thing and grep for `relforge` and see what else needed to be modified." [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[06:08:57] SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (Marostegui)
[06:22:00] !log Reboot db1154 and db1155 for kernel upgrade
[06:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:29] !log Reboot dbproxy2001, dbproxy2002, dbproxy2003 for kernel upgrade
[06:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:03] (PS1) Marostegui: db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656652
[06:53:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P13782 and previous config saved to /var/cache/conftool/dbconfig/20210118-065312-marostegui.json
[06:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:46] (CR) Marostegui: [C: +2] db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656652 (owner: Marostegui)
[07:09:21] (PS1) Marostegui: Revert "db1138: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656481
[07:10:05] SRE, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:15:48] (CR) Marostegui: [C: +2] Revert "db1138: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656481 (owner: Marostegui)
[07:16:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: After restarting for kernel upgraed', diff saved to https://phabricator.wikimedia.org/P13783 and previous config saved to /var/cache/conftool/dbconfig/20210118-071611-root.json
[07:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:26] (PS1) Elukey: Fix patch 01-disable_jetty_dir_listing.patch layout [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/656655
[07:17:02] (CR) Elukey: [C: +2] Fix patch 01-disable_jetty_dir_listing.patch layout [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/656655 (owner: Elukey)
[07:31:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 50%: After restarting for kernel upgraed', diff saved to https://phabricator.wikimedia.org/P13784 and previous config saved to /var/cache/conftool/dbconfig/20210118-073115-root.json
[07:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: After restarting for kernel upgraed', diff saved to https://phabricator.wikimedia.org/P13785 and previous config saved to /var/cache/conftool/dbconfig/20210118-074618-root.json
[07:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210118T0800)
[08:01:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: After restarting for kernel upgraed', diff saved to https://phabricator.wikimedia.org/P13786 and previous config saved to /var/cache/conftool/dbconfig/20210118-080122-root.json
[08:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:44] (CR) Gehel: [C: -1] "See comments inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[08:09:18] (PS2) Gehel: query_service: Migrate hiera() to lookup() in gui [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:09:45] (CR) Gehel: [C: +1] "LGTM, I'll let Ryan deploy" [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:11:13] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656404 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan)
[08:12:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:13:52] !log clean up old archiva debs and upload 2.2.4-3 to buster-wikimedia - T272082
[08:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:56] T272082: Reflected XSS on archiva.wikimedia.org - https://phabricator.wikimedia.org/T272082
[08:14:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:15:28] (CR) Gehel: "LGTM overall. As discussed with @hnowlan, there is a lot that is untested in the overall procedure to split the cluster, but I trust that " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/656460 (owner: Hnowlan)
[08:15:50] !log installing remaining openssl 1.0 security updated on stretch
[08:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:10] ops-eqiad, Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (ayounsi) p:Triage→High
[08:17:35] (CR) Gehel: [C: +2] query_service: Migrate hiera() to lookup() in gui [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:17:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 to stop replication, place db1105:3311 temporarily in vslow T272008', diff saved to https://phabricator.wikimedia.org/P13787 and previous config saved to /var/cache/conftool/dbconfig/20210118-081740-marostegui.json
[08:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:44] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008
[08:17:48] ops-eqiad, Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (ayounsi)
[08:19:01] (PS1) Marostegui: db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656831
[08:20:32] (CR) Marostegui: [C: +2] db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656831 (owner: Marostegui)
[08:36:37] (PS1) Marostegui: Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656484
[08:37:27] (CR) Marostegui: [C: +2] Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656484 (owner: Marostegui)
[08:39:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13788 and previous config saved to /var/cache/conftool/dbconfig/20210118-083919-root.json
[08:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:26] SRE, ops-codfw, netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (ayounsi) Resolved→Open Can we have a decom task for the faulty device? (switch port is still alerting as being down)
[08:42:18] !log kormat@cumin1001 START - Cookbook sre.hosts.decommission
[08:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:45] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:51] SRE, DBA, Orchestrator, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `dborch1001.eqiad.wmnet` - dborch1001.eqiad.wmnet (**PASS**) - Downtimed host on...
[08:48:49] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:51:11] (PS1) DCausse: [wdqs] disable async imports [puppet] - https://gerrit.wikimedia.org/r/656833 (https://phabricator.wikimedia.org/T267175)
[08:54:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13790 and previous config saved to /var/cache/conftool/dbconfig/20210118-085422-root.json
[08:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:49] !log kormat@cumin1001 START - Cookbook sre.dns.netbox
[09:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:29] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1119.eqiad.wmnet'] ` The log can be fou...
[09:09:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13791 and previous config saved to /var/cache/conftool/dbconfig/20210118-090926-root.json
[09:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:15] SRE, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (fgiunchedi) @Cmjohnson @Jclark-ctr host is OOW, please replace the 4TB drive (led should be blinking)
[09:13:44] !log installing openssl 1.1 security updates on stretch
[09:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:10] SRE, Traffic, serviceops, Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Joe)
[09:19:45] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:19:45] SRE, DBA, Orchestrator, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[09:20:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1105:3311 from vslow', diff saved to https://phabricator.wikimedia.org/P13793 and previous config saved to /var/cache/conftool/dbconfig/20210118-092003-marostegui.json
[09:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:06] SRE, Traffic, serviceops, Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Joe) p:Triage→Unbreak! a:Joe
[09:24:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13794 and previous config saved to /var/cache/conftool/dbconfig/20210118-092429-root.json
[09:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:51] !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm
[09:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 to stop replication T272008', diff saved to https://phabricator.wikimedia.org/P13795 and previous config saved to /var/cache/conftool/dbconfig/20210118-092546-marostegui.json
[09:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:50] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008
[09:26:34] (PS1) Marostegui: db1074: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656836
[09:27:08] (PS1) Filippo Giunchedi: swift: decrease object replicator concurrency [puppet] - https://gerrit.wikimedia.org/r/656837 (https://phabricator.wikimedia.org/T271415)
[09:27:10] (CR) Marostegui: [C: +2] db1074: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656836 (owner: Marostegui)
[09:43:11] (PS1) Marostegui: Revert "db1074: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656485
[09:44:22] (CR) Marostegui: [C: +2] Revert "db1074: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656485 (owner: Marostegui)
[09:44:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13796 and previous config saved to /var/cache/conftool/dbconfig/20210118-094449-root.json
[09:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:53] (PS1) Giuseppe Lavagetto: safe-service-restart: proper error handling with poolcounter [puppet] - https://gerrit.wikimedia.org/r/656838 (https://phabricator.wikimedia.org/T272262)
[09:45:04] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/27504/" [puppet] - https://gerrit.wikimedia.org/r/656837 (https://phabricator.wikimedia.org/T271415) (owner: Filippo Giunchedi)
[09:45:28] (CR) Filippo Giunchedi: "> Patch Set 3: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: Filippo Giunchedi)
[09:51:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[09:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:38] (PS1) Ayounsi: Add rmaung to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/656840 (https://phabricator.wikimedia.org/T266250)
[09:54:05] (CR) Ayounsi: "> Principal successfully created." [puppet] - https://gerrit.wikimedia.org/r/656840 (https://phabricator.wikimedia.org/T266250) (owner: Ayounsi)
[09:55:22] (CR) Muehlenhoff: [C: +1] "Look good" [puppet] - https://gerrit.wikimedia.org/r/656840 (https://phabricator.wikimedia.org/T266250) (owner: Ayounsi)
[09:56:00] (CR) Ayounsi: [C: +2] Add rmaung to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/656840 (https://phabricator.wikimedia.org/T266250) (owner: Ayounsi)
[09:57:21] (PS1) Kormat: install_server: Update mac for dborch1001 [puppet] - https://gerrit.wikimedia.org/r/656841 (https://phabricator.wikimedia.org/T266106)
[09:58:28] (CR) Kormat: [C: +2] install_server: Update mac for dborch1001 [puppet] - https://gerrit.wikimedia.org/r/656841 (https://phabricator.wikimedia.org/T266106) (owner: Kormat)
[09:58:52] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (ayounsi) Open→Resolved Access created, you should have received an email as well about your kerberos ac...
[09:58:59] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/656842 (https://phabricator.wikimedia.org/T128546)
[09:59:29] (PS12) MSantos: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949)
[09:59:49] (CR) MSantos: start using imposm as OSM sync tool (5 comments) [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[09:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13797 and previous config saved to /var/cache/conftool/dbconfig/20210118-095952-root.json
[09:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:25] <_joe_> !log restarting pybal on lvs1016, not talking to its etcd server
[10:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:59] (CR) jerkins-bot: [V: -1] start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[10:09:28] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1119.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1119.eqiad.wmnet'] `
[10:09:58] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656838 (https://phabricator.wikimedia.org/T272262) (owner: Giuseppe Lavagetto)
[10:14:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13798 and previous config saved to /var/cache/conftool/dbconfig/20210118-101456-root.json
[10:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:20:39] (CR) Elukey: [C: +1] safe-service-restart: proper error handling with poolcounter [puppet] - https://gerrit.wikimedia.org/r/656838 (https://phabricator.wikimedia.org/T272262) (owner: Giuseppe Lavagetto)
[10:20:41] (PS1) Volans: logging: fix base path and name to setup logging [software/spicerack] - https://gerrit.wikimedia.org/r/656843
[10:27:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:30:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13799 and previous config saved to /var/cache/conftool/dbconfig/20210118-102959-root.json
[10:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:37] (CR) Giuseppe Lavagetto: [C: +2] safe-service-restart: proper error handling with poolcounter [puppet] - https://gerrit.wikimedia.org/r/656838 (https://phabricator.wikimedia.org/T272262) (owner: Giuseppe Lavagetto)
[10:33:23] SRE, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:38:03] SRE, serviceops, Wikimedia-Incident: High latency on appservers - https://phabricator.wikimedia.org/T272215 (Joe)
[10:38:07] SRE, Traffic, serviceops, Patch-For-Review, Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Joe)
[10:38:37] SRE, serviceops, Wikimedia-Incident: High latency on appservers - https://phabricator.wikimedia.org/T272215 (Joe)
[10:38:40] SRE, Traffic, serviceops, Patch-For-Review, Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Joe) Open→Resolved The script has been merged and will deploy everywhere in the next 20 mi...
[10:39:06] (PS1) Giuseppe Lavagetto: Revert "role::mediawiki::canary_appserver: disable php-fpm restart timer" [puppet] - https://gerrit.wikimedia.org/r/656846
[10:39:39] (PS1) Giuseppe Lavagetto: Revert "role::mediawiki::appserver: temporary disable php-fpm restarts" [puppet] - https://gerrit.wikimedia.org/r/656847
[10:41:15] (CR) Elukey: [C: +1] logging: fix base path and name to setup logging [software/spicerack] - https://gerrit.wikimedia.org/r/656843 (owner: Volans)
[10:41:45] (CR) Volans: [C: +2] logging: fix base path and name to setup logging [software/spicerack] - https://gerrit.wikimedia.org/r/656843 (owner: Volans)
[10:41:50] (CR) Elukey: [C: +1] Revert "role::mediawiki::canary_appserver: disable php-fpm restart timer" [puppet] - https://gerrit.wikimedia.org/r/656846 (owner: Giuseppe Lavagetto)
[10:42:06] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656846 (owner: Giuseppe Lavagetto)
[10:42:09] (CR) Elukey: [C: +1] Revert "role::mediawiki::appserver: temporary disable php-fpm restarts" [puppet] - https://gerrit.wikimedia.org/r/656847 (owner: Giuseppe Lavagetto)
[10:42:28] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656847 (owner: Giuseppe Lavagetto)
[10:42:44] (CR) Giuseppe Lavagetto: [C: +2] Revert "role::mediawiki::canary_appserver: disable php-fpm restart timer" [puppet] - https://gerrit.wikimedia.org/r/656846 (owner: Giuseppe Lavagetto)
[10:46:24] (CR) Giuseppe Lavagetto: [C: +2] Revert "role::mediawiki::appserver: temporary disable php-fpm restarts" [puppet] - https://gerrit.wikimedia.org/r/656847 (owner: Giuseppe Lavagetto)
[10:47:27] (Merged) jenkins-bot: logging: fix base path and name to setup logging [software/spicerack] - https://gerrit.wikimedia.org/r/656843 (owner: Volans)
[10:49:26] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) I fixed the boot order on an-worker1119 but PXE doesn't really work, I noticed that all NICs show no link up status, maybe there is something not s...
[10:52:51] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.48 [software/spicerack] - https://gerrit.wikimedia.org/r/656845
[10:53:37] SRE, SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (Tobi_WMDE_SW) Approving that @lilients_WMDE is in my team.
[10:55:21] SRE, DBA, Orchestrator, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[10:58:05] !log installing python2.7 security updates on Stretch
[10:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:32] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.48 [software/spicerack] - https://gerrit.wikimedia.org/r/656845 (owner: Volans)
[11:05:20] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1120.eqiad.wmnet'] ` The log can be fou...
[11:05:31] I am going to restart both Gerrit instances to clear out a memory leak. Should be back automagically after a couple minutes.
[11:05:55] SRE, serviceops, Wikimedia-Incident: High latency on appservers - https://phabricator.wikimedia.org/T272215 (Joe) The timers have been reenabled, and the next scap deployment should properly run check_and_restart for php7-fpm, and restart those.
[11:08:21] !log Restarting Gerrit replica on gerrit2001.wikimedia.org
[11:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:11] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.48 [software/spicerack] - https://gerrit.wikimedia.org/r/656845 (owner: Volans)
[11:10:20] !log Restarting Gerrit main instance on gerrit1001.wikimedia.org
[11:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1005.eqiad.wmnet
[11:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:12] SRE, WMF-NDA-Requests: NDA for Superset Request from WMDE Employee Amrutha Chandra - https://phabricator.wikimedia.org/T272287 (amy_rc)
[11:17:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1005.eqiad.wmnet
[11:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:51] SRE, WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (amy_rc) a:herron→None
[11:18:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1120.eqiad.wmnet with reason: REIMAGE
[11:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:36] SRE, WMF-NDA-Requests: NDA for Superset Request from WMDE Intern Amrutha Chandra - https://phabricator.wikimedia.org/T272287 (amy_rc)
[11:22:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1120.eqiad.wmnet with reason: REIMAGE
[11:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:38] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1121.eqiad.wmnet'] ` The log can be fou...
[11:25:19] (PS1) Filippo Giunchedi: debian: add packaging [debs/phalerts] - https://gerrit.wikimedia.org/r/656866
[11:26:10] (CR) JMeybohm: [C: -1] "It would be nice to have SQLALCHEMY_DATABASE_URI and BASIC_AUTH_PASSWORD in .Values.config.private/the kubernetes secrets object instead o" [deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[11:28:51] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1120.eqiad.wmnet'] ` and were **ALL** successful.
[11:28:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1006.eqiad.wmnet
[11:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:45] (CR) JMeybohm: [V: +1 C: +2] k8s_infrastructure_users: Amend to support groups, avoid uid conflicts [puppet] - https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: Alexandros Kosiaris)
[11:30:38] (PS17) Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[11:33:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1006.eqiad.wmnet
[11:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:28] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1121.eqiad.wmnet with reason: REIMAGE
[11:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:03] (CR) JMeybohm: [C: -1] sockpuppet-api: Create basic chart and service config (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[11:37:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1121.eqiad.wmnet with reason: REIMAGE
[11:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:02] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1122.eqiad.wmnet'] ` The log can be fou...
[11:40:18] (PS1) Volans: Upstream release v0.0.48 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/656868
[11:40:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1007.eqiad.wmnet
[11:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:00] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1121.eqiad.wmnet'] ` and were **ALL** successful.
[11:44:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1007.eqiad.wmnet
[11:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:57] (CR) Urbanecm: "> Patch Set 5: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[11:46:58] <_joe_> jouncebot: next
[11:46:58] In 24 hour(s) and 13 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T1200)
[11:47:12] <_joe_> ok, I might try a null deploy before then
[11:48:07] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1008.eqiad.wmnet
[11:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:49:15] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1123.eqiad.wmnet'] ` The log can be fou...
[11:49:52] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1122.eqiad.wmnet with reason: REIMAGE
[11:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1008.eqiad.wmnet
[11:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1122.eqiad.wmnet with reason: REIMAGE
[11:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:49] (CR) Volans: [C: +2] Upstream release v0.0.48 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/656868 (owner: Volans)
[11:54:50] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2005.codfw.wmnet
[11:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:54] SRE, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (MoritzMuehlenhoff) Thanks Papaul and Rob, I'll take care of re-adding ganeti5002 to the eqsin Ganeti cluster.
[11:59:24] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1122.eqiad.wmnet'] ` and were **ALL** successful.
[11:59:41] (03PS3) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [11:59:43] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1124.eqiad.wmnet'] ` The log can be fou... [12:01:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1123.eqiad.wmnet with reason: REIMAGE [12:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:48] (03Merged) 10jenkins-bot: Upstream release v0.0.48 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/656868 (owner: 10Volans) [12:03:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1123.eqiad.wmnet with reason: REIMAGE [12:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2005.codfw.wmnet [12:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:25] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:01] !log uploaded spicerack_0.0.48 to apt.wikimedia.org buster-wikimedia [12:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:12] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2006.codfw.wmnet [12:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, with some minor comments inline." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [12:08:32] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1125.eqiad.wmnet'] ` The log can be fou... [12:10:46] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1123.eqiad.wmnet'] ` and were **ALL** successful. [12:11:32] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1124.eqiad.wmnet with reason: REIMAGE [12:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:31] (03PS18) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [12:13:16] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1126.eqiad.wmnet'] ` The log can be fou... 
[12:13:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1124.eqiad.wmnet with reason: REIMAGE [12:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2006.codfw.wmnet [12:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:18:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2007.codfw.wmnet [12:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:19:16] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1127.eqiad.wmnet'] ` The log can be fou... [12:19:48] (03CR) 10Hnowlan: sockpuppet-api: Create basic chart and service config (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:20:10] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1124.eqiad.wmnet'] ` and were **ALL** successful. 
[12:20:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1125.eqiad.wmnet with reason: REIMAGE [12:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2007.codfw.wmnet [12:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1125.eqiad.wmnet with reason: REIMAGE [12:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:04] (03CR) 10Urbanecm: [C: 03+1] "looks good, can you schedule it at [[wikitech:Deployments]] as well, please?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655282 (owner: 10Majavah) [12:24:09] (03CR) 10Urbanecm: [C: 03+1] Revert "Switch fiwiki to their 500k temporary logo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655281 (owner: 10Majavah) [12:24:56] Urbanecm: won't the temporary logo stay used in some cached pages? [12:25:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1126.eqiad.wmnet with reason: REIMAGE [12:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1126.eqiad.wmnet with reason: REIMAGE [12:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:06] Majavah: hmm, good point. IIRC static resources are heavily cached, so I _think_ it will stay in the frontend caches for some time. 
[12:28:39] (but it doesn't hurt to clean those up after; we should have some script for identifying orphan static files, at least for logos I guess) [12:29:55] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1125.eqiad.wmnet'] ` and were **ALL** successful. [12:31:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1127.eqiad.wmnet with reason: REIMAGE [12:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2008.codfw.wmnet [12:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1127.eqiad.wmnet with reason: REIMAGE [12:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:39] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1126.eqiad.wmnet'] ` and were **ALL** successful. 
[12:36:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2008.codfw.wmnet [12:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:57] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: refresh key for croit.io repository [puppet] - 10https://gerrit.wikimedia.org/r/656875 (https://phabricator.wikimedia.org/T259873) [12:39:59] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) [12:40:10] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1127.eqiad.wmnet'] ` and were **ALL** successful. [12:41:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: refresh key for croit.io repository [puppet] - 10https://gerrit.wikimedia.org/r/656875 (https://phabricator.wikimedia.org/T259873) (owner: 10Arturo Borrero Gonzalez) [12:55:27] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:02] 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): apt key for `thirdparty/ceph-nautilus/buster` has expired. 
- https://phabricator.wikimedia.org/T259873 (10aborrero) 05Open→03Resolved [12:56:28] !log add NAT rule on pfw3-codfw - T272066 [12:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:08] (03PS1) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud: NAT egress connections to WMF wikis [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) [13:08:28] !log add NAT rule on pfw3-eqiad - T272066 [13:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:42] (03PS1) 10Muehlenhoff: Bump changelog for new version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656885 [13:10:41] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [13:12:03] !log Upgrade db2071 to 10.4.17 - T268457 [13:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:08] T268457: Investigate possible optimizer regression on 10.4.17 with DELETE statements - https://phabricator.wikimedia.org/T268457 [13:13:28] (03PS1) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: NAT egress connections to WMF wikis Before this patch, all connections originating from inside CloudVPS (including Toolforge) would hit the Neutron dmz_cidr setting, therefore skipping the general cloud egress NAT. In practice, this meant that WMF wikis saw the internal private IP of each VM (172.16.x.x), which is undesirable for several reasons. A patch [13:13:28] setting, therefore enabling the general egress NAT. When that happens, we will no longer need this ACL entry in the core routers.
Bug: T209011 Signed-off-by: Arturo Borrero Gonzalez Change-Id: I3ed10bc5c4e833355ed3bd2ec3fd6f3a9ee7a917 [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011) [13:14:33] (03PS2) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: NAT egress connections to WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011) [13:15:42] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) [13:16:31] (03PS3) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: drop ACL entry for WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011) [13:18:18] (03PS1) 10Kormat: mariadb: Allow public orchestrator ip to connect. [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) [13:21:03] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump changelog for new version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656885 (owner: 10Muehlenhoff) [13:22:21] (03PS2) 10Kormat: mariadb: Allow public orchestrator ip to connect. [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) [13:25:09] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27505/console" [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:26:25] !log installed spicerack 0.0.48-1+deb10u1 on cumin hosts [13:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:03] (03CR) 10JMeybohm: [C: 04-1] Add support for php deployments (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [13:34:18] (03PS3) 10Kormat: mariadb: Allow public orchestrator ip to connect. 
[puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) [13:34:56] !log uploaded wmf-sre-laptop 0.3.2 to apt.wikimedia.org [13:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:21] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27506/console" [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:40:45] (03PS4) 10Kormat: mariadb: Allow public orchestrator ip to connect. [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) [13:42:49] (03CR) 10JMeybohm: [C: 04-1] sockpuppet-api: Create basic chart and service config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [13:43:31] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27507/console" [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:47:38] (03CR) 10Kormat: [V: 03+1] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/27507/" [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:50:41] (03CR) 10Marostegui: [C: 03+1] mariadb: Allow public orchestrator ip to connect. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:57:47] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Allow public orchestrator ip to connect. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [13:58:57] (03CR) 10Ayounsi: [C: 03+1] "The change LGTM, we will have to update the routers ACL accordingly once merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [14:17:19] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:08] !log kormat@cumin1001 START - Cookbook sre.hosts.decommission [14:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:44] (03PS1) 10Muehlenhoff: Disable bast3004/bast4002/bast5001 as bastions [puppet] - 10https://gerrit.wikimedia.org/r/656894 (https://phabricator.wikimedia.org/T257324) [14:23:51] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [14:26:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:33] 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `dborch1001.eqiad.wmnet` - dborch1001.eqiad.wmnet (**PASS**) - Downtimed host on... 
[14:26:57] !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm [14:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:45] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:35] (03PS1) 10Muehlenhoff: Update bastions in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/656895 (https://phabricator.wikimedia.org/T257324) [14:30:12] (03PS1) 10Volans: logging: improve logging format [software/spicerack] - 10https://gerrit.wikimedia.org/r/656896 [14:30:25] !log updating packages in buster-wikimedia/thirdparty/ceph-nautilus-buster (T272296) [14:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:29] T272296: ceph: nautilus: decide on pending package upgrades - https://phabricator.wikimedia.org/T272296 [14:31:04] !log kormat@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [14:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:19] (03PS16) 10Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) [14:33:21] (03CR) 10Andrew Bogott: Nova: add a simple vendordata REST service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [14:35:37] (03CR) 10JMeybohm: [C: 03+1] "While I don't particularly like what this file becomes, I do see the value of adding more stuff to global config. 
😊" [puppet] - 10https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) (owner: 10Ottomata) [14:36:18] (03CR) 10Ayounsi: [C: 03+1] Update bastions in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/656895 (https://phabricator.wikimedia.org/T257324) (owner: 10Muehlenhoff) [14:36:57] (03CR) 10Muehlenhoff: [C: 03+2] Update bastions in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/656895 (https://phabricator.wikimedia.org/T257324) (owner: 10Muehlenhoff) [14:42:17] (03PS1) 10Muehlenhoff: Update Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/656898 [14:42:44] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/656898 (owner: 10Muehlenhoff) [14:43:07] !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm [14:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:12] (03PS19) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [14:45:39] (03CR) 10Hnowlan: sockpuppet-api: Create basic chart and service config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [14:46:05] (03CR) 10JMeybohm: [C: 03+1] sockpuppet-api: Create basic chart and service config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [14:47:30] (03CR) 10Volans: Introduce linkrecommendation{,-external} (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris) [14:48:55] (03CR) 10Muehlenhoff: [C: 03+2] Update Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/656898 (owner: 10Muehlenhoff) [14:49:33] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet [14:49:35] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:46] (03CR) 10Volans: [C: 03+1] "did a quick pass and seems reasonable." [cookbooks] - 10https://gerrit.wikimedia.org/r/656462 (owner: 10Elukey) [14:53:27] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet [14:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:20] 10SRE, 10MW-on-K8s, 10serviceops: Create a yaml structure for defining apache virtualhosts for mediawiki, that can be used both in puppet and in helm charts. - https://phabricator.wikimedia.org/T272305 (10Joe) [14:56:56] (03CR) 10Elukey: [C: 03+1] logging: improve logging format [software/spicerack] - 10https://gerrit.wikimedia.org/r/656896 (owner: 10Volans) [14:56:56] (03PS1) 10Kormat: install_server: Fix domain for dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/656903 (https://phabricator.wikimedia.org/T266106) [14:57:07] RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [14:58:47] (03CR) 10Volans: [C: 03+2] logging: improve logging format [software/spicerack] - 10https://gerrit.wikimedia.org/r/656896 (owner: 10Volans) [15:00:45] (03CR) 10Giuseppe Lavagetto: Add support for php deployments (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [15:02:36] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1128.eqiad.wmnet'] ` The log can be fou... [15:05:27] (03Merged) 10jenkins-bot: logging: improve logging format [software/spicerack] - 10https://gerrit.wikimedia.org/r/656896 (owner: 10Volans) [15:07:19] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1130.eqiad.wmnet'] ` The log can be fou... [15:08:15] 10SRE, 10Traffic, 10serviceops: Upgrade envoyproxy to 1.16.2 - https://phabricator.wikimedia.org/T271407 (10Vgutierrez) there are some issues with the python requirements of envoy 1.16.2 as it requires python 3.6 or higher and clearly the building environment isn't fulfilling the requirement. So a tiny worka... 
[15:09:17] (03PS2) 10Alexandros Kosiaris: Introduce linkrecommendation{,-external} [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) [15:10:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:09] 10SRE, 10WMF-NDA-Requests: NDA for Superset Request from WMDE Intern Amrutha Chandra - https://phabricator.wikimedia.org/T272287 (10jcrespo) [15:13:28] (03CR) 10Alexandros Kosiaris: "One thing that crossed my mind is that we are anyway going to expose this on different ports and it may make sense to reuse the first IP a" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris) [15:13:29] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) [15:14:01] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) Merging the requests since, as far as I can tell, both requested the same thing for the same person. [15:14:07] 10SRE, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Moving this task to DONE.
[15:14:26] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1128.eqiad.wmnet with reason: REIMAGE [15:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:52] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) [15:16:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1128.eqiad.wmnet with reason: REIMAGE [15:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:59] (03CR) 10Elukey: [C: 03+2] analytics:refinery:job:data_purge Activate netflow auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/655120 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns) [15:18:15] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1130.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1130.eqiad.wmnet'] ` [15:23:55] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1130.eqiad.wmnet'] ` The log can be fou... [15:24:02] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1128.eqiad.wmnet'] ` and were **ALL** successful. 
[15:27:42] (03CR) 10Marostegui: [C: 03+1] install_server: Fix domain for dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/656903 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat) [15:30:38] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) @amy_rc: To sign the NDA with the Wikimedia Foundation, please read it carefully and use this form to sign it for legal: https://phabricator.wikimedia.org/L2 More detailed instructions can be foun... [15:32:49] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) a:03amy_rc Adding @elukey here just for awareness (no actions needed), as a new superset user will appear. [15:33:47] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) No link found for an-worker1131 either, skipping. [15:36:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1130.eqiad.wmnet with reason: REIMAGE [15:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:39] PROBLEM - SSH on logstash2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:39:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1130.eqiad.wmnet with reason: REIMAGE [15:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:19] RECOVERY - SSH on logstash2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:43:19] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url:
/_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f7b57a35518: Failed to establish a new connection: [Errno 111] Connection [15:43:19] ://wikitech.wikimedia.org/wiki/Search%23Administration [15:44:41] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: delayed_unassigned_shards: 0, unassigned_shards: 0, relocating_shards: 0, number_of_data_nodes: 3, active_shards: 862, timed_out: False, active_shards_percent_as_number: 100.0, initializing_shards: 0, status: green, cluster_name: production-logstash-codfw, number_of_in_flight_fetch: 0, number_of_pen [15:44:41] mber_of_nodes: 6, task_max_waiting_in_queue_millis: 0, active_primary_shards: 456 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:45:39] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1132.eqiad.wmnet'] ` The log can be fou... [15:46:38] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1130.eqiad.wmnet'] ` and were **ALL** successful. [15:48:21] !log installing wavpack security updates [15:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:39] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) Apologies, @amy_rc, I previously gave you the volunteer NDA path. As WMDE staff, NDA has to be handled by legal. Please @KFrancis, could you handle the necessary steps previous to provide NDA acce...
[15:51:39] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1135.eqiad.wmnet'] ` The log can be fou... [15:52:59] 10SRE, 10Traffic, 10serviceops: Upgrade envoyproxy to 1.16.2 - https://phabricator.wikimedia.org/T271407 (10Joe) @Vgutierrez we can create a new building env based on buster I think, that's much better as an option. [15:57:15] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1136.eqiad.wmnet'] ` The log can be fou... [15:57:29] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1132.eqiad.wmnet with reason: REIMAGE [15:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1132.eqiad.wmnet with reason: REIMAGE [15:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) [16:00:36] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [16:01:46] (03PS2) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud: NAT egress connections to WMF wikis [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) [16:01:48] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 3 others: Service 
operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) [16:01:51] (03PS4) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: drop ACL entry for WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011) [16:03:29] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1135.eqiad.wmnet with reason: REIMAGE [16:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:10] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 3 others: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) >>! In T258978#6751367, @akosiaris wrote: >>>! In T258978#6729580, @kostajh wrote: >> @akosiaris... [16:04:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) [16:04:43] (03CR) 10Ayounsi: [C: 03+1] [DONT MERGE] cloud-in4: drop ACL entry for WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [16:05:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1135.eqiad.wmnet with reason: REIMAGE [16:05:28] (03CR) 10Hnowlan: [C: 03+2] sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [16:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:10] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1132.eqiad.wmnet'] ` and were **ALL** successful. 
[16:07:02] (03Merged) 10jenkins-bot: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [16:07:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) [16:09:12] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1136.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1136.eqiad.wmnet'] ` [16:09:16] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1137.eqiad.wmnet'] ` The log can be fou... [16:10:44] (03CR) 10Volans: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris) [16:12:02] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1135.eqiad.wmnet'] ` and were **ALL** successful. [16:15:25] (03PS2) 10Hnowlan: similar-users: add helmfile configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) [16:18:39] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1138.eqiad.wmnet'] ` The log can be fou... 
[16:19:51] (03CR) 10Elukey: [C: 03+2] sre.hadoop.reboot-workers: add improvements to reboot logic [cookbooks] - 10https://gerrit.wikimedia.org/r/656462 (owner: 10Elukey) [16:22:06] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:43] (03Merged) 10jenkins-bot: sre.hadoop.reboot-workers: add improvements to reboot logic [cookbooks] - 10https://gerrit.wikimedia.org/r/656462 (owner: 10Elukey) [16:24:16] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [16:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) a:03fdans @lilients_WMDE You only had LDAP access before, correct? Assigning to @fdans as, to the best of my understanding, is the right manager to approve new access to the anal... [16:25:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) p:05Triage→03High [16:27:23] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) p:05Triage→03High [16:27:37] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 3 others: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) [16:29:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) a:05fdans→03Ottomata I've been told Andres may be the right person to approve, apologies. 
[16:30:48] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1138.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1138.eqiad.wmnet'] ` [16:31:56] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1137.eqiad.wmnet'] ` and were **ALL** successful. [16:34:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:35:30] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:37:56] (03PS4) 10JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967) [16:41:56] (03CR) 10JMeybohm: [C: 04-1] similar-users: add helmfile configuration. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [16:42:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Nova: add a simple vendordata REST service [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [16:48:49] (03PS1) 10Elukey: sre.hadoop.init-hadoop-workers: move to Class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) [16:59:26] 10SRE, 10Inuka-Team, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10jcrespo) Based on previous comment by @sbassett, it seems the right direction is to contact Legal & IT support for permission/be... [17:05:57] (03PS3) 10Hnowlan: similar-users: add helmfile configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) [17:06:18] (03CR) 10Hnowlan: similar-users: add helmfile configuration. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [17:08:42] 10SRE, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10Aklapper) [17:08:57] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team): Add all of CPT to snapshot/dumpsdata admins - https://phabricator.wikimedia.org/T271718 (10jcrespo) Next meeting is expected to happen on 25 January, will add to the list of topics to disc... 
[17:09:30] (03PS1) 10Filippo Giunchedi: WIP apt package from component [puppet] - 10https://gerrit.wikimedia.org/r/656953 [17:09:31] 10SRE, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10Aklapper) [17:09:33] 10SRE, 10WMF-NDA-Requests: NDA for Superset Request from WMDE Intern Amrutha Chandra - https://phabricator.wikimedia.org/T272287 (10Aklapper) [17:09:48] 10SRE, 10observability, 10User-fgiunchedi: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (10fgiunchedi) This particular failure mode seems to be fixed with rsyslog 8.2008.0-1~bpo10+1, I can't find any other rsyslog segmentation faults since deploying the new vers... [17:10:15] (03CR) 10Volans: "Looks mostly reasonable, thanks for the migration! Couple of questions inline" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [17:11:18] (03CR) 10jerkins-bot: [V: 04-1] WIP apt package from component [puppet] - 10https://gerrit.wikimedia.org/r/656953 (owner: 10Filippo Giunchedi) [17:11:51] (03PS2) 10Filippo Giunchedi: WIP apt package from component [puppet] - 10https://gerrit.wikimedia.org/r/656953 [17:14:34] (03CR) 10CRusnov: "This change is ready for review." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [17:14:36] (03PS3) 10Filippo Giunchedi: rsyslog: install rsyslog from component/rsyslog on Buster [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780) [17:14:43] (03CR) 10jerkins-bot: [V: 04-1] interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [17:15:47] (03PS2) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) [17:19:03] (03PS2) 10Elukey: sre.hadoop.init-hadoop-workers: move to Class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) [17:19:46] (03PS3) 10Elukey: sre.hadoop.init-hadoop-workers: move to Class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) [17:20:04] (03CR) 10Elukey: "Followed up to all the comments, thanks!" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [17:22:12] (03CR) 10Volans: [C: 03+1] "Looks ok to me, ship and test it!" [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [17:22:46] (03CR) 10Elukey: [C: 03+2] sre.hadoop.init-hadoop-workers: move to Class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey) [17:25:30] 10SRE, 10serviceops, 10Wikimedia-Incident: High latency on appservers - https://phabricator.wikimedia.org/T272215 (10jcrespo) p:05Triage→03Medium This was UBN on Saturday, based on Joe's comment, I am putting this now to Medium. 
More details are yet to be provided on the Incident report, I can help with... [17:28:13] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [17:28:31] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [17:28:59] (03CR) 10Volans: [C: 04-1] "I like the refactor direction, there are some things that look wrong though, see inline." (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [17:29:20] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [17:30:11] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[17:32:51] !log reimaging mw2271,mw2273,mw2274,mw227 (codfw only) [17:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:41] (03CR) 10Ayounsi: interface_automation.py: Minor refactors and fixes for 2.9 (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [17:33:45] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo) @ppelberg This is blocked only on providing additional information requested by @Joe and @Elukey above. [17:34:19] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo) [17:35:54] (03PS3) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) [17:35:56] (03CR) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [17:36:37] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1118.eqiad.wmnet [17:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1118.eqiad.wmnet [17:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:21] US is holiday but working a bit because I could not on Friday [17:42:29] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1120.eqiad.wmnet [17:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:33] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 
(10jcrespo) We would also need @DannyH to sign off the request, as to the best of my understanding, this is not a team's "standardized request". [17:43:57] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo) [17:44:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1120.eqiad.wmnet [17:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:24] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo) a:03ppelberg [17:45:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2271.codfw.wmnet with reason: REIMAGE [17:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2273.codfw.wmnet with reason: REIMAGE [17:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2274.codfw.wmnet with reason: REIMAGE [17:46:15] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1121-1123].eqiad.wmnet [17:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2275.codfw.wmnet with reason: REIMAGE [17:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2271.codfw.wmnet with reason: REIMAGE [17:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers
(exit_code=0) for hosts an-worker[1121-1123].eqiad.wmnet [17:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:49:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2273.codfw.wmnet with reason: REIMAGE [17:49:07] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2275.codfw.wmnet with reason: REIMAGE [17:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:31] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1124-1127].eqiad.wmnet [17:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:50:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2274.codfw.wmnet with reason: REIMAGE [17:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1124-1127].eqiad.wmnet [17:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:12] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10jcrespo) p:05Triage→03Medium Assigning medium status to remove from SRE untriaged inbox, feel free to edit on disagreement. [17:54:53] (03CR) 10Volans: interface_automation.py: Minor refactors and fixes for 2.9 (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [17:55:50] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) [17:56:14] 10SRE: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (10jcrespo) Hey, Aklapper, Sorry if you discussed this with someone already. Could you provide a bit more of context: is this a suggestion but not high priority? I... [17:56:55] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Dzahn) @addshore I think all the check boxes (except maybe the last one) on this ticket are already done? 
[17:58:21] PROBLEM - PHP7 rendering on mw2275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:59:04] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) [17:59:53] 10SRE, 10SRE-tools: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (10jcrespo) Adding @Volans, but I could do the patch myself if it is easy enough, and one of the 2 main users of it (media and dbs). [18:04:17] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1136.eqiad.wmnet'] ` The log can be fou... [18:09:19] RECOVERY - PHP7 rendering on mw2275 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:10:04] (03PS4) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) [18:10:06] (03CR) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [18:10:09] PROBLEM - PHP opcache health on mw2275 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:10:20] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 
(10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1138.eqiad.wmnet'] ` The log can be fou... [18:10:34] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2274.codfw.wmnet'] ` and were **ALL** s... [18:11:06] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2275.codfw.wmnet'] ` an... [18:11:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2273.codfw.wmnet'] ` an... [18:11:44] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2271.codfw.wmnet'] ` an... 
[18:12:43] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1128.eqiad.wmnet [18:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:29] PROBLEM - mediawiki-installation DSH group on mw2275 is CRITICAL: Host mw2275 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:14:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1128.eqiad.wmnet [18:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:07] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [18:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:39] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1130.eqiad.wmnet [18:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [18:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1130.eqiad.wmnet [18:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:08] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1132.eqiad.wmnet [18:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:21] 10SRE, 10SRE-tools: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (10Volans) Sure, let's move to PHIDs. 
That's a general thing, not only for raid_handler.py, but the above linked task should have all the pointers to... [18:21:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1132.eqiad.wmnet [18:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:08] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [18:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:29] (03CR) 10Volans: [C: 03+1] "LGTM to test it!" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [18:24:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [18:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:40] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1136.eqiad.wmnet'] ` and were **ALL** successful. [18:29:21] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) [18:29:27] 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10jcrespo) Adding observability team, although for initial reactions not sure if a separate ticket for infrastructure logs... 
[18:33:11] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1138.eqiad.wmnet'] ` and were **ALL** successful. [18:33:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2271.codfw.wmnet [18:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2273.codfw.wmnet [18:34:08] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1136,1138].eqiad.wmnet [18:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2275.codfw.wmnet [18:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2274.codfw.wmnet [18:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:05] (03PS1) 10Ladsgroup: Drop profile::analytics::refinery::job::streams_check [puppet] - 10https://gerrit.wikimedia.org/r/656961 [18:35:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1136,1138].eqiad.wmnet [18:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:11] (03CR) 10Elukey: [C: 03+2] "elukey@cumin1001:~$ sudo cumin 'c:profile::analytics::refinery::job::streams_check'" [puppet] - 10https://gerrit.wikimedia.org/r/656961 (owner: 10Ladsgroup) [18:38:51] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2271.codfw.wmnet [18:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:07] !log 
dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2273.codfw.wmnet [18:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:41] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) @Cmjohnson for an-worker1119 and an-worker1131 I don't have any network link, could you please check if anything is missing from the cabling/config... [18:40:26] (03CR) 10Elukey: [C: 03+2] eventlogging: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/656531 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [18:40:29] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2274.codfw.wmnet [18:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:41] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2275.codfw.wmnet [18:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:15:59] RECOVERY - mediawiki-installation DSH group on mw2275 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:19:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:20:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:31:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2276.codfw.wmnet with reason: REIMAGE [19:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2277.codfw.wmnet with reason: REIMAGE [19:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2276.codfw.wmnet with reason: REIMAGE [19:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2277.codfw.wmnet with reason: REIMAGE [19:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:27] (03PS1) 10Luke081515: Adding namespace aliases on arbcom-ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656964 (https://phabricator.wikimedia.org/T272292) [19:38:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2303.codfw.wmnet with reason: REIMAGE [19:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2305.codfw.wmnet with reason: REIMAGE [19:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2303.codfw.wmnet with reason: REIMAGE [19:40:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:49] 10SRE, 10SRE-tools: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (10Aklapper) Hi, this is a suggestion to avoid potential future breakage / to make code more robust. [19:42:01] (03PS5) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) [19:42:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2305.codfw.wmnet with reason: REIMAGE [19:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:11] (03CR) 10CRusnov: "With a minor fix for the new method's arguments, this works on -dev importing new VMs." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:53:54] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) >>! In T266703#6756059, @Dzahn wrote: > @addshore I think all the check boxes (except maybe the last one) on this ticket are already done? This tic... [19:54:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2276.codfw.wmnet'] ` an... [19:54:53] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2277.codfw.wmnet'] ` an... 
[19:56:12] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2303.codfw.wmnet'] ` an... [20:01:41] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2305.codfw.wmnet'] ` an... [20:19:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2276.codfw.wmnet [20:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2277.codfw.wmnet [20:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2305.codfw.wmnet [20:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2303.codfw.wmnet [20:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2276.codfw.wmnet [20:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2277.codfw.wmnet [20:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:04] !log dzahn@cumin1001 
conftool action : set/pooled=yes; selector: name=mw2303.codfw.wmnet [20:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2305.codfw.wmnet [20:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:28:04] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:29:00] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:30:10] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:30:17] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2310.codfw.wmnet'] ` Of... 
[20:30:37] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:41:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:43:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:46:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2307.codfw.wmnet with reason: REIMAGE [20:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2309.codfw.wmnet with reason: REIMAGE [20:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:03] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2310.codfw.wmnet with reason: REIMAGE [20:48:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2307.codfw.wmnet with reason: REIMAGE [20:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2311.codfw.wmnet with reason: REIMAGE [20:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
mw2310.codfw.wmnet with reason: REIMAGE [20:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:04] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2309.codfw.wmnet with reason: REIMAGE [20:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:25] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2311.codfw.wmnet with reason: REIMAGE [20:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:39] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2307.codfw.wmnet'] ` an... [21:10:06] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2309.codfw.wmnet'] ` an... [21:10:52] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2310.codfw.wmnet'] ` an... 
[21:12:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2311.codfw.wmnet'] ` an... [21:22:33] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:51] 10SRE, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10Ricardoa2020) Very interesting information about TLS, if we want to know more about IT and its security we should visit [[ https://demyo.com/ | Demyo ]] [21:29:02] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2311.codfw.wmnet [21:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:23] PROBLEM - PHP opcache health on mw2274 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:29:25] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2310.codfw.wmnet [21:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:44] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2307.codfw.wmnet [21:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2309.codfw.wmnet [21:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2307.codfw.wmnet [21:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:22] !log dzahn@cumin1001 conftool action : 
set/pooled=yes; selector: name=mw2309.codfw.wmnet [21:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:31] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2310.codfw.wmnet [21:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:37] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2311.codfw.wmnet [21:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:03] PROBLEM - PHP opcache health on mw2273 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:37:03] PROBLEM - PHP opcache health on mw2277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:39:52] ACKNOWLEDGEMENT - PHP opcache health on mw2268 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:39:52] ACKNOWLEDGEMENT - PHP opcache health on mw2271 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:39:52] ACKNOWLEDGEMENT - PHP opcache health on mw2273 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:39:52] ACKNOWLEDGEMENT - PHP opcache health on mw2274 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:39:52] ACKNOWLEDGEMENT - PHP opcache health on mw2275 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:39:53] ACKNOWLEDGEMENT - PHP opcache health on mw2276 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:39:53] ACKNOWLEDGEMENT - PHP opcache health on mw2277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:42:15] PROBLEM - PHP opcache health on mw2303 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:43:17] PROBLEM - PHP opcache health on mw2305 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:45:32] ACKNOWLEDGEMENT - PHP opcache health on mw2303 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:45:32] ACKNOWLEDGEMENT - PHP opcache health on mw2305 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:54:47] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:41] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:23] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:07] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [22:50:23] PROBLEM - PHP opcache health on mw2310 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:54:07] PROBLEM - PHP opcache health on mw2311 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:55:41] PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [23:02:57] ACKNOWLEDGEMENT - PHP opcache health on mw2310 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:02:57] ACKNOWLEDGEMENT - PHP opcache health on mw2311 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:42:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:45:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets