[00:37:35] <James_F>	 Urbanecm, Amir1: BTW, you scheduled a wiki creation window for tomorrow for T271260, but that's a no-deploy day, so it was commented out. You should pick another day.
[00:37:36] <stashbot>	 T271260: Create Wikivoyage Turkish - https://phabricator.wikimedia.org/T271260
[00:38:02] <Urbanecm>	 ahh... didn't know this Monday's a no-deploy day. Thanks for the heads-up, James_F
[00:38:55] <Urbanecm>	 we'll reschedule :)
[00:39:01] <James_F>	 Cool.
[05:27:48] <wikibugs>	 (PS3) Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211)
[05:45:00] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:24] <wikibugs>	 (PS4) Ryan Kemper: search: bring "new" relforge hosts into service [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211)
[06:04:48] <wikibugs>	 (CR) Ryan Kemper: "Thanks for the first round of review! I forgot to do the obvious thing and grep for `relforge` and see what else needed to be modified." [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[06:08:57] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (Marostegui)
[06:22:00] <marostegui>	 !log Reboot db1154 and db1155 for kernel upgrade
[06:22:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:29] <marostegui>	 !log Reboot dbproxy2001, dbproxy2002, dbproxy2003 for kernel upgrade
[06:35:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:03] <wikibugs>	 (PS1) Marostegui: db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656652
[06:53:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P13782 and previous config saved to /var/cache/conftool/dbconfig/20210118-065312-marostegui.json
[06:53:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:46] <wikibugs>	 (CR) Marostegui: [C: +2] db1138: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656652 (owner: Marostegui)
[07:09:21] <wikibugs>	 (PS1) Marostegui: Revert "db1138: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656481
[07:10:05] <wikibugs>	 SRE, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:15:48] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1138: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656481 (owner: Marostegui)
[07:16:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: After restarting for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P13783 and previous config saved to /var/cache/conftool/dbconfig/20210118-071611-root.json
[07:16:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:26] <wikibugs>	 (PS1) Elukey: Fix patch 01-disable_jetty_dir_listing.patch layout [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/656655
[07:17:02] <wikibugs>	 (CR) Elukey: [C: +2] Fix patch 01-disable_jetty_dir_listing.patch layout [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/656655 (owner: Elukey)
[07:31:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 50%: After restarting for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P13784 and previous config saved to /var/cache/conftool/dbconfig/20210118-073115-root.json
[07:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: After restarting for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P13785 and previous config saved to /var/cache/conftool/dbconfig/20210118-074618-root.json
[07:46:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210118T0800)
[08:01:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: After restarting for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P13786 and previous config saved to /var/cache/conftool/dbconfig/20210118-080122-root.json
[08:01:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:44] <wikibugs>	 (CR) Gehel: [C: -1] "See comments inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[08:09:18] <wikibugs>	 (PS2) Gehel: query_service: Migrate hiera() to lookup() in gui [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:09:45] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM, I'll let Ryan deploy" [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:11:13] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656404 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan)
[08:12:31] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:13:52] <elukey>	 !log clean up old archiva debs and upload 2.2.4-3 to buster-wikimedia - T272082
[08:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:56] <stashbot>	 T272082: Reflected XSS on archiva.wikimedia.org - https://phabricator.wikimedia.org/T272082
[08:14:39] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:15:28] <wikibugs>	 (CR) Gehel: "LGTM overall. As discussed with @hnowlan, there is a lot that is untested in the overall procedure to split the cluster, but I trust that " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/656460 (owner: Hnowlan)
[08:15:50] <moritzm>	 !log installing remaining openssl 1.0 security updates on stretch
[08:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:10] <wikibugs>	 ops-eqiad, Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (ayounsi) p:Triage→High
[08:17:35] <wikibugs>	 (CR) Gehel: [C: +2] query_service: Migrate hiera() to lookup() in gui [puppet] - https://gerrit.wikimedia.org/r/656530 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:17:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 to stop replication, place db1105:3311 temporarily in vslow T272008', diff saved to https://phabricator.wikimedia.org/P13787 and previous config saved to /var/cache/conftool/dbconfig/20210118-081740-marostegui.json
[08:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:44] <stashbot>	 T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008
[08:17:48] <wikibugs>	 ops-eqiad, Traffic: lvs1015 interface errors - https://phabricator.wikimedia.org/T272258 (ayounsi)
[08:19:01] <wikibugs>	 (PS1) Marostegui: db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656831
[08:20:32] <wikibugs>	 (CR) Marostegui: [C: +2] db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656831 (owner: Marostegui)
[08:36:37] <wikibugs>	 (PS1) Marostegui: Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656484
[08:37:27] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656484 (owner: Marostegui)
[08:39:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13788 and previous config saved to /var/cache/conftool/dbconfig/20210118-083919-root.json
[08:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:26] <wikibugs>	 SRE, ops-codfw, netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (ayounsi) Resolved→Open Can we have a decom task for the faulty device? (switch port is still alerting as being down)
[08:42:18] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.decommission
[08:42:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:45] <logmsgbot>	 !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:51] <wikibugs>	 SRE, DBA, Orchestrator, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `dborch1001.eqiad.wmnet` - dborch1001.eqiad.wmnet (**PASS**)   - Downtimed host on...
[08:48:49] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:51:11] <wikibugs>	 (PS1) DCausse: [wdqs] disable async imports [puppet] - https://gerrit.wikimedia.org/r/656833 (https://phabricator.wikimedia.org/T267175)
[08:54:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13790 and previous config saved to /var/cache/conftool/dbconfig/20210118-085422-root.json
[08:54:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:49] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.dns.netbox
[09:01:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:28] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:06:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:29] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1119.eqiad.wmnet'] ` The log can be fou...
[09:09:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13791 and previous config saved to /var/cache/conftool/dbconfig/20210118-090926-root.json
[09:09:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:15] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (fgiunchedi) @Cmjohnson @Jclark-ctr host is OOW, please replace the 4TB drive (led should be blinking)
[09:13:44] <moritzm>	 !log installing openssl 1.1 security updates on stretch
[09:13:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:10] <wikibugs>	 SRE, Traffic, serviceops, Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Joe)
[09:19:45] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:19:45] <wikibugs>	 SRE, DBA, Orchestrator, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[09:20:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1105:3311 from vslow', diff saved to https://phabricator.wikimedia.org/P13793 and previous config saved to /var/cache/conftool/dbconfig/20210118-092003-marostegui.json
[09:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:06] <wikibugs>	 SRE, Traffic, serviceops, Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Joe) p:Triage→Unbreak! a:Joe
[09:24:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13794 and previous config saved to /var/cache/conftool/dbconfig/20210118-092429-root.json
[09:24:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:51] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm
[09:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 to stop replication T272008', diff saved to https://phabricator.wikimedia.org/P13795 and previous config saved to /var/cache/conftool/dbconfig/20210118-092546-marostegui.json
[09:25:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:50] <stashbot>	 T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008
[09:26:34] <wikibugs>	 (PS1) Marostegui: db1074: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656836
[09:27:08] <wikibugs>	 (PS1) Filippo Giunchedi: swift: decrease object replicator concurrency [puppet] - https://gerrit.wikimedia.org/r/656837 (https://phabricator.wikimedia.org/T271415)
[09:27:10] <wikibugs>	 (CR) Marostegui: [C: +2] db1074: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/656836 (owner: Marostegui)
[09:43:11] <wikibugs>	 (PS1) Marostegui: Revert "db1074: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656485
[09:44:22] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1074: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/656485 (owner: Marostegui)
[09:44:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13796 and previous config saved to /var/cache/conftool/dbconfig/20210118-094449-root.json
[09:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:53] <wikibugs>	 (PS1) Giuseppe Lavagetto: safe-service-restart: proper error handling with poolcounter [puppet] - https://gerrit.wikimedia.org/r/656838 (https://phabricator.wikimedia.org/T272262)
[09:45:04] <wikibugs>	 (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/27504/" [puppet] - https://gerrit.wikimedia.org/r/656837 (https://phabricator.wikimedia.org/T271415) (owner: Filippo Giunchedi)
[09:45:28] <wikibugs>	 (CR) Filippo Giunchedi: "> Patch Set 3: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: Filippo Giunchedi)
[09:51:53] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[09:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:38] <wikibugs>	 (PS1) Ayounsi: Add rmaung to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/656840 (https://phabricator.wikimedia.org/T266250)
[09:54:05] <wikibugs>	 (CR) Ayounsi: "> Principal successfully created." [puppet] - https://gerrit.wikimedia.org/r/656840 (https://phabricator.wikimedia.org/T266250) (owner: Ayounsi)
[09:55:22] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Look good" [puppet] - https://gerrit.wikimedia.org/r/656840 (https://phabricator.wikimedia.org/T266250) (owner: Ayounsi)
[09:56:00] <wikibugs>	 (CR) Ayounsi: [C: +2] Add rmaung to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/656840 (https://phabricator.wikimedia.org/T266250) (owner: Ayounsi)
[09:57:21] <wikibugs>	 (PS1) Kormat: install_server: Update mac for dborch1001 [puppet] - https://gerrit.wikimedia.org/r/656841 (https://phabricator.wikimedia.org/T266106)
[09:58:28] <wikibugs>	 (CR) Kormat: [C: +2] install_server: Update mac for dborch1001 [puppet] - https://gerrit.wikimedia.org/r/656841 (https://phabricator.wikimedia.org/T266106) (owner: Kormat)
[09:58:52] <wikibugs>	 SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (ayounsi) Open→Resolved Access created, you should have received an email as well about your kerberos ac...
[09:58:59] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/656842 (https://phabricator.wikimedia.org/T128546)
[09:59:29] <wikibugs>	 (PS12) MSantos: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949)
[09:59:49] <wikibugs>	 (CR) MSantos: start using imposm as OSM sync tool (5 comments) [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[09:59:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13797 and previous config saved to /var/cache/conftool/dbconfig/20210118-095952-root.json
[09:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:25] <_joe_>	 !log restarting pybal on lvs1016, not talking to its etcd server
[10:00:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[10:09:28] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1119.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1119.eqiad.wmnet'] `
[10:09:58] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656838 (https://phabricator.wikimedia.org/T272262) (owner: Giuseppe Lavagetto)
[10:14:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13798 and previous config saved to /var/cache/conftool/dbconfig/20210118-101456-root.json
[10:14:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:20:39] <wikibugs>	 (CR) Elukey: [C: +1] safe-service-restart: proper error handling with poolcounter [puppet] - https://gerrit.wikimedia.org/r/656838 (https://phabricator.wikimedia.org/T272262) (owner: Giuseppe Lavagetto)
[10:20:41] <wikibugs>	 (PS1) Volans: logging: fix base path and name to setup logging [software/spicerack] - https://gerrit.wikimedia.org/r/656843
[10:27:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:30:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13799 and previous config saved to /var/cache/conftool/dbconfig/20210118-102959-root.json
[10:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:37] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] safe-service-restart: proper error handling with poolcounter [puppet] - https://gerrit.wikimedia.org/r/656838 (https://phabricator.wikimedia.org/T272262) (owner: Giuseppe Lavagetto)
[10:33:23] <wikibugs>	 SRE, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:38:03] <wikibugs>	 SRE, serviceops, Wikimedia-Incident: High latency on appservers - https://phabricator.wikimedia.org/T272215 (Joe)
[10:38:07] <wikibugs>	 SRE, Traffic, serviceops, Patch-For-Review, Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Joe)
[10:38:37] <wikibugs>	 SRE, serviceops, Wikimedia-Incident: High latency on appservers - https://phabricator.wikimedia.org/T272215 (Joe)
[10:38:40] <wikibugs>	 SRE, Traffic, serviceops, Patch-For-Review, Wikimedia-Incident: The safe service restart script doesn't detect failure when running with poolcounter. - https://phabricator.wikimedia.org/T272262 (Joe) Open→Resolved The script has been merged and will deploy everywhere in the next 20 mi...
[10:39:06] <wikibugs>	 (PS1) Giuseppe Lavagetto: Revert "role::mediawiki::canary_appserver: disable php-fpm restart timer" [puppet] - https://gerrit.wikimedia.org/r/656846
[10:39:39] <wikibugs>	 (PS1) Giuseppe Lavagetto: Revert "role::mediawiki::appserver: temporary disable php-fpm restarts" [puppet] - https://gerrit.wikimedia.org/r/656847
[10:41:15] <wikibugs>	 (CR) Elukey: [C: +1] logging: fix base path and name to setup logging [software/spicerack] - https://gerrit.wikimedia.org/r/656843 (owner: Volans)
[10:41:45] <wikibugs>	 (CR) Volans: [C: +2] logging: fix base path and name to setup logging [software/spicerack] - https://gerrit.wikimedia.org/r/656843 (owner: Volans)
[10:41:50] <wikibugs>	 (CR) Elukey: [C: +1] Revert "role::mediawiki::canary_appserver: disable php-fpm restart timer" [puppet] - https://gerrit.wikimedia.org/r/656846 (owner: Giuseppe Lavagetto)
[10:42:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656846 (owner: Giuseppe Lavagetto)
[10:42:09] <wikibugs>	 (CR) Elukey: [C: +1] Revert "role::mediawiki::appserver: temporary disable php-fpm restarts" [puppet] - https://gerrit.wikimedia.org/r/656847 (owner: Giuseppe Lavagetto)
[10:42:28] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656847 (owner: Giuseppe Lavagetto)
[10:42:44] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Revert "role::mediawiki::canary_appserver: disable php-fpm restart timer" [puppet] - https://gerrit.wikimedia.org/r/656846 (owner: Giuseppe Lavagetto)
[10:46:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Revert "role::mediawiki::appserver: temporary disable php-fpm restarts" [puppet] - https://gerrit.wikimedia.org/r/656847 (owner: Giuseppe Lavagetto)
[10:47:27] <wikibugs>	 (Merged) jenkins-bot: logging: fix base path and name to setup logging [software/spicerack] - https://gerrit.wikimedia.org/r/656843 (owner: Volans)
[10:49:26] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) I fixed the boot order on an-worker1119 but PXE doesn't really work, I noticed that all NICs show no link up status, maybe there is something not s...
[10:52:51] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v0.0.48 [software/spicerack] - https://gerrit.wikimedia.org/r/656845
[10:53:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (Tobi_WMDE_SW) Approving that @lilients_WMDE is in my team.
[10:55:21] <wikibugs>	 SRE, DBA, Orchestrator, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[10:58:05] <moritzm>	 !log installing python2.7 security updates on Stretch
[10:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:32] <wikibugs>	 (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.48 [software/spicerack] - https://gerrit.wikimedia.org/r/656845 (owner: Volans)
[11:05:20] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1120.eqiad.wmnet'] ` The log can be fou...
[11:05:31] <hashar>	 I am going to restart both Gerrit instances to clear out a memory leak.  Should be back automagically after a couple minutes.
[11:05:55] <wikibugs>	 SRE, serviceops, Wikimedia-Incident: High latency on appservers - https://phabricator.wikimedia.org/T272215 (Joe) The timers have been reenabled, and the next scap deployment should properly run check_and_restart for php7-fpm, and restart those.
[11:08:21] <hashar>	 !log Restarting Gerrit replica on gerrit2001.wikimedia.org
[11:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:11] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.48 [software/spicerack] - https://gerrit.wikimedia.org/r/656845 (owner: Volans)
[11:10:20] <hashar>	 !log Restarting Gerrit main instance on gerrit1001.wikimedia.org
[11:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:55] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1005.eqiad.wmnet
[11:11:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:12] <wikibugs>	 SRE, WMF-NDA-Requests: NDA for Superset Request from WMDE Employee Amrutha Chandra - https://phabricator.wikimedia.org/T272287 (amy_rc)
[11:17:17] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1005.eqiad.wmnet
[11:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:51] <wikibugs>	 SRE, WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (amy_rc) a:herron→None
[11:18:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1120.eqiad.wmnet with reason: REIMAGE
[11:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:33] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:36] <wikibugs>	 SRE, WMF-NDA-Requests: NDA for Superset Request from WMDE Intern Amrutha Chandra - https://phabricator.wikimedia.org/T272287 (amy_rc)
[11:22:12] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1120.eqiad.wmnet with reason: REIMAGE
[11:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:38] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1121.eqiad.wmnet'] ` The log can be fou...
[11:25:19] <wikibugs>	 (PS1) Filippo Giunchedi: debian: add packaging [debs/phalerts] - https://gerrit.wikimedia.org/r/656866
[11:26:10] <wikibugs>	 (CR) JMeybohm: [C: -1] "It would be nice to have SQLALCHEMY_DATABASE_URI and BASIC_AUTH_PASSWORD in .Values.config.private/the kubernetes secrets object instead o" [deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[11:28:51] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1120.eqiad.wmnet'] `  and were **ALL** successful.
[11:28:56] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1006.eqiad.wmnet
[11:28:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:45] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] k8s_infrastructure_users: Amend to support groups, avoid uid conflicts [puppet] - https://gerrit.wikimedia.org/r/647011 (https://phabricator.wikimedia.org/T269461) (owner: Alexandros Kosiaris)
[11:30:38] <wikibugs>	 (PS17) Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[11:33:39] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1006.eqiad.wmnet
[11:33:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1121.eqiad.wmnet with reason: REIMAGE
[11:35:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:03] <wikibugs>	 (CR) JMeybohm: [C: -1] sockpuppet-api: Create basic chart and service config (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: Hnowlan)
[11:37:28] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1121.eqiad.wmnet with reason: REIMAGE
[11:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1122.eqiad.wmnet'] ` The log can be fou...
[11:40:18] <wikibugs>	 (03PS1) 10Volans: Upstream release v0.0.48 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/656868
[11:40:37] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1007.eqiad.wmnet
[11:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:00] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1121.eqiad.wmnet'] `  and were **ALL** successful.
[11:44:32] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1007.eqiad.wmnet
[11:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:57] <wikibugs>	 (03CR) 10Urbanecm: "> Patch Set 5: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[11:46:58] <_joe_>	 jouncebot: next
[11:46:58] <jouncebot>	 In 24 hour(s) and 13 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210119T1200)
[11:47:12] <_joe_>	 ok, I might try a null deploy before then
[11:48:07] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1008.eqiad.wmnet
[11:48:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:49:15] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1123.eqiad.wmnet'] ` The log can be fou...
[11:49:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1122.eqiad.wmnet with reason: REIMAGE
[11:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:21] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1008.eqiad.wmnet
[11:52:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:50] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1122.eqiad.wmnet with reason: REIMAGE
[11:52:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:49] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v0.0.48 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/656868 (owner: 10Volans)
[11:54:50] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2005.codfw.wmnet
[11:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:54] <wikibugs>	 10SRE, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10MoritzMuehlenhoff) Thanks Papaul and Rob, I'll take care of re-adding ganeti5002 to the eqsin Ganeti cluster.
[11:59:24] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1122.eqiad.wmnet'] `  and were **ALL** successful.
[11:59:41] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757
[11:59:43] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1124.eqiad.wmnet'] ` The log can be fou...
[12:01:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1123.eqiad.wmnet with reason: REIMAGE
[12:01:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:48] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v0.0.48 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/656868 (owner: 10Volans)
[12:03:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1123.eqiad.wmnet with reason: REIMAGE
[12:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:43] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2005.codfw.wmnet
[12:04:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:25] <icinga-wm>	 PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:01] <volans>	 !log uploaded spicerack_0.0.48 to apt.wikimedia.org buster-wikimedia
[12:08:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:12] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2006.codfw.wmnet
[12:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, with some minor comments inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[12:08:32] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1125.eqiad.wmnet'] ` The log can be fou...
[12:10:46] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1123.eqiad.wmnet'] `  and were **ALL** successful.
[12:11:32] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1124.eqiad.wmnet with reason: REIMAGE
[12:11:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:31] <wikibugs>	 (03PS18) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[12:13:16] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1126.eqiad.wmnet'] ` The log can be fou...
[12:13:38] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1124.eqiad.wmnet with reason: REIMAGE
[12:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:23] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2006.codfw.wmnet
[12:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:59] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:18:02] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2007.codfw.wmnet
[12:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:15] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:19:16] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1127.eqiad.wmnet'] ` The log can be fou...
[12:19:48] <wikibugs>	 (03CR) 10Hnowlan: sockpuppet-api: Create basic chart and service config (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[12:20:10] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1124.eqiad.wmnet'] `  and were **ALL** successful.
[12:20:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1125.eqiad.wmnet with reason: REIMAGE
[12:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:55] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2007.codfw.wmnet
[12:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1125.eqiad.wmnet with reason: REIMAGE
[12:22:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "looks good, can you schedule it at [[wikitech:Deployments]] as well, please?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655282 (owner: 10Majavah)
[12:24:09] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] Revert "Switch fiwiki to their 500k temporary logo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655281 (owner: 10Majavah)
[12:24:56] <Majavah>	 Urbanecm: won't the temporary logo stay used in some cached pages?
[12:25:51] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1126.eqiad.wmnet with reason: REIMAGE
[12:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1126.eqiad.wmnet with reason: REIMAGE
[12:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:06] <Urbanecm>	 Majavah: hmm, good point. IIRC static resources are heavily cached, so I _think_ it will stay in the frontend caches for some time. 
[12:28:39] <Urbanecm>	 (but it doesn't hurt to clean those up after; we should have some script for identifying orphan static files, at least for logos I guess)
[12:29:55] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1125.eqiad.wmnet'] `  and were **ALL** successful.
[12:31:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1127.eqiad.wmnet with reason: REIMAGE
[12:31:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:16] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2008.codfw.wmnet
[12:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:05] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1127.eqiad.wmnet with reason: REIMAGE
[12:33:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:39] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1126.eqiad.wmnet'] `  and were **ALL** successful.
[12:36:52] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2008.codfw.wmnet
[12:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:57] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: aptrepo: refresh key for croit.io repository [puppet] - 10https://gerrit.wikimedia.org/r/656875 (https://phabricator.wikimedia.org/T259873)
[12:39:59] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey)
[12:40:10] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1127.eqiad.wmnet'] `  and were **ALL** successful.
[12:41:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: refresh key for croit.io repository [puppet] - 10https://gerrit.wikimedia.org/r/656875 (https://phabricator.wikimedia.org/T259873) (owner: 10Arturo Borrero Gonzalez)
[12:55:27] <icinga-wm>	 PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:02] <wikibugs>	 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): apt key for `thirdparty/ceph-nautilus/buster` has expired. - https://phabricator.wikimedia.org/T259873 (10aborrero) 05Open→03Resolved
[12:56:28] <XioNoX>	 !log add NAT rule on pfw3-codfw - T272066
[12:56:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:08] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud: NAT egress connections to WMF wikis [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011)
[13:08:28] <XioNoX>	 !log add NAT rule on pfw3-eqiad - T272066
[13:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump changelog for new version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656885
[13:10:41] <wikibugs>	 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero)
[13:12:03] <marostegui>	 !log Upgrade db2071 to 10.4.17 - T268457
[13:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:08] <stashbot>	 T268457: Investigate possible optimizer regression on 10.4.17 with DELETE statements - https://phabricator.wikimedia.org/T268457
[13:13:28] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: NAT egress connections to WMF wikis      Previous to this patch, all connections originating from inside CloudVPS (including Toolforge) would hit the Neutron dmz_cidr setting, therefore skipping the general cloud egress NAT. In practice, this meant that WMF wikis saw the internal private IP of each VM (172.16.x.x), which is undesirable for several reasons.      A patch 
[13:13:28] <wikibugs>	 setting, therefore enabling the general egress NAT. When that happens, we will no longer need this ACL entry in the core routers.      Bug: T209011 Signed-off-by: Arturo Borrero Gonzalez <aborrero@wikimedia.org> Change-Id: I3ed10bc5c4e833355ed3bd2ec3fd6f3a9ee7a917 [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011)
[13:14:33] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: NAT egress connections to WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011)
[13:15:42] <wikibugs>	 10SRE, 10Wikidata, 10Wikidata Query Builder, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore)
[13:16:31] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: drop ACL entry for WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011)
[13:18:18] <wikibugs>	 (03PS1) 10Kormat: mariadb: Allow public orchestrator ip to connect. [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106)
[13:21:03] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump changelog for new version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/656885 (owner: 10Muehlenhoff)
[13:22:21] <wikibugs>	 (03PS2) 10Kormat: mariadb: Allow public orchestrator ip to connect. [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106)
[13:25:09] <wikibugs>	 (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27505/console" [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat)
[13:26:25] <volans>	 !log installed spicerack 0.0.48-1+deb10u1 on cumin hosts
[13:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:03] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] Add support for php deployments (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto)
[13:34:18] <wikibugs>	 (03PS3) 10Kormat: mariadb: Allow public orchestrator ip to connect. [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106)
[13:34:56] <moritzm>	 !log uploaded wmf-sre-laptop 0.3.2 to apt.wikimedia.org
[13:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:21] <wikibugs>	 (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27506/console" [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat)
[13:40:45] <wikibugs>	 (03PS4) 10Kormat: mariadb: Allow public orchestrator ip to connect. [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106)
[13:42:49] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] sockpuppet-api: Create basic chart and service config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[13:43:31] <wikibugs>	 (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27507/console" [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat)
[13:47:38] <wikibugs>	 (03CR) 10Kormat: [V: 03+1] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/27507/" [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat)
[13:50:41] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Allow public orchestrator ip to connect. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat)
[13:57:47] <wikibugs>	 (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Allow public orchestrator ip to connect. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656887 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat)
[13:58:57] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "The change LGTM, we will have to update the routers ACL accordingly once merged." [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez)
[14:17:19] <icinga-wm>	 PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:08] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.decommission
[14:18:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Disable bast3004/bast4002/bast5001 as bastions [puppet] - 10https://gerrit.wikimedia.org/r/656894 (https://phabricator.wikimedia.org/T257324)
[14:23:51] <icinga-wm>	 PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[14:26:28] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[14:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:33] <wikibugs>	 10SRE, 10DBA, 10Orchestrator, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kormat@cumin1001 for hosts: `dborch1001.eqiad.wmnet` - dborch1001.eqiad.wmnet (**PASS**)   - Downtimed host on...
[14:26:57] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm
[14:26:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:45] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Update bastions in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/656895 (https://phabricator.wikimedia.org/T257324)
[14:30:12] <wikibugs>	 (03PS1) 10Volans: logging: improve logging format [software/spicerack] - 10https://gerrit.wikimedia.org/r/656896
[14:30:25] <arturo>	 !log updating packages in buster-wikimedia/thirdparty/ceph-nautilus-buster (T272296)
[14:30:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:29] <stashbot>	 T272296: ceph: nautilus: decide on pending package upgrades - https://phabricator.wikimedia.org/T272296
[14:31:04] <logmsgbot>	 !log kormat@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[14:31:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:19] <wikibugs>	 (03PS16) 10Andrew Bogott: Nova: add a simple vendordata REST service [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273)
[14:33:21] <wikibugs>	 (03CR) 10Andrew Bogott: Nova: add a simple vendordata REST service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[14:35:37] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "While I don't particularly like what this file becomes, I do see the value of adding more stuff to global config. 😊" [puppet] - 10https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) (owner: 10Ottomata)
[14:36:18] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Update bastions in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/656895 (https://phabricator.wikimedia.org/T257324) (owner: 10Muehlenhoff)
[14:36:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update bastions in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/656895 (https://phabricator.wikimedia.org/T257324) (owner: 10Muehlenhoff)
[14:42:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Update Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/656898
[14:42:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/656898 (owner: 10Muehlenhoff)
[14:43:07] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm
[14:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:12] <wikibugs>	 (03PS19) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[14:45:39] <wikibugs>	 (03CR) 10Hnowlan: sockpuppet-api: Create basic chart and service config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[14:46:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] sockpuppet-api: Create basic chart and service config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[14:47:30] <wikibugs>	 (03CR) 10Volans: Introduce linkrecommendation{,-external} (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[14:48:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/656898 (owner: 10Muehlenhoff)
[14:49:33] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet
[14:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:46] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "did a quick pass and seems reasonable." [cookbooks] - 10https://gerrit.wikimedia.org/r/656462 (owner: 10Elukey)
[14:53:27] <icinga-wm>	 PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:40] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet
[14:53:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:20] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Create a yaml structure for defining apache virtualhosts for mediawiki, that can be used both in puppet and in helm charts. - https://phabricator.wikimedia.org/T272305 (10Joe)
[14:56:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] logging: improve logging format [software/spicerack] - 10https://gerrit.wikimedia.org/r/656896 (owner: 10Volans)
[14:56:56] <wikibugs>	 (03PS1) 10Kormat: install_server: Fix domain for dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/656903 (https://phabricator.wikimedia.org/T266106)
[14:57:07] <icinga-wm>	 RECOVERY - Elevated latency for icinga checks in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[14:58:47] <wikibugs>	 (03CR) 10Volans: [C: 03+2] logging: improve logging format [software/spicerack] - 10https://gerrit.wikimedia.org/r/656896 (owner: 10Volans)
[15:00:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add support for php deployments (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto)
[15:02:36] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1128.eqiad.wmnet'] ` The log can be fou...
[15:05:27] <wikibugs>	 (03Merged) 10jenkins-bot: logging: improve logging format [software/spicerack] - 10https://gerrit.wikimedia.org/r/656896 (owner: 10Volans)
[15:07:19] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1130.eqiad.wmnet'] ` The log can be fou...
[15:08:15] <wikibugs>	 10SRE, 10Traffic, 10serviceops: Upgrade envoyproxy to 1.16.2 - https://phabricator.wikimedia.org/T271407 (10Vgutierrez) there are some issues with the python requirements of envoy 1.16.2 as it requires python 3.6 or higher and clearly the building environment isn't fulfilling the requirement. So a tiny worka...
[15:09:17] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Introduce linkrecommendation{,-external} [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978)
[15:10:44] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[15:10:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:09] <wikibugs>	 10SRE, 10WMF-NDA-Requests: NDA for Superset Request from WMDE Intern Amrutha Chandra - https://phabricator.wikimedia.org/T272287 (10jcrespo)
[15:13:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "One thing that crossed my mind is that we are anyway going to expose this on different ports and it may make sense to reuse the first IP a" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[15:13:29] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo)
[15:14:01] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) Merging the requests since, as far as I can tell, both ask for the same thing for the same person.
[15:14:07] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Moving this task to DONE.
[15:14:26] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1128.eqiad.wmnet with reason: REIMAGE
[15:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:52] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo)
[15:16:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1128.eqiad.wmnet with reason: REIMAGE
[15:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:59] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] analytics:refinery:job:data_purge Activate netflow auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/655120 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns)
[15:18:15] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1130.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1130.eqiad.wmnet'] `
[15:23:55] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1130.eqiad.wmnet'] ` The log can be fou...
[15:24:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1128.eqiad.wmnet'] `  and were **ALL** successful.
[15:27:42] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] install_server: Fix domain for dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/656903 (https://phabricator.wikimedia.org/T266106) (owner: 10Kormat)
[15:30:38] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) @amy_rc:  To sign the NDA towards the Wikimedia Foundation, please read carefully and use this form to sign it for legal: https://phabricator.wikimedia.org/L2 More detailed instructions can be foun...
[15:32:49] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) a:03amy_rc Adding @elukey here just for awareness (no actions needed), as a new superset user will appear.
[15:33:47] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) No link found as well for an-worker1131, skipping..
[15:36:57] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1130.eqiad.wmnet with reason: REIMAGE
[15:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:39] <icinga-wm>	 PROBLEM - SSH on logstash2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:39:01] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1130.eqiad.wmnet with reason: REIMAGE
[15:39:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:19] <icinga-wm>	 RECOVERY - SSH on logstash2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:43:19] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f7b57a35518: Failed to establish a new connection: [Errno 111] Connection
[15:43:19] <icinga-wm>	 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:44:41] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: delayed_unassigned_shards: 0, unassigned_shards: 0, relocating_shards: 0, number_of_data_nodes: 3, active_shards: 862, timed_out: False, active_shards_percent_as_number: 100.0, initializing_shards: 0, status: green, cluster_name: production-logstash-codfw, number_of_in_flight_fetch: 0, number_of_pen
[15:44:41] <icinga-wm>	 mber_of_nodes: 6, task_max_waiting_in_queue_millis: 0, active_primary_shards: 456 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:45:39] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1132.eqiad.wmnet'] ` The log can be fou...
[15:46:38] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1130.eqiad.wmnet'] `  and were **ALL** successful.
[15:48:21] <moritzm>	 !log installing wavpack security updates
[15:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:39] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) Apologies, @amy_rc, I previously gave you the volunteer NDA path. As WMDE staff, the NDA has to be handled by legal. Please @KFrancis, could you handle the necessary steps previous to provide NDA acce...
[15:51:39] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1135.eqiad.wmnet'] ` The log can be fou...
[15:52:59] <wikibugs>	 10SRE, 10Traffic, 10serviceops: Upgrade envoyproxy to 1.16.2 - https://phabricator.wikimedia.org/T271407 (10Joe) @Vgutierrez we can create a new building env based on buster I think, that's much better as an option.
[15:57:15] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1136.eqiad.wmnet'] ` The log can be fou...
[15:57:29] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1132.eqiad.wmnet with reason: REIMAGE
[15:57:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:30] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1132.eqiad.wmnet with reason: REIMAGE
[15:59:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:08] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo)
[16:00:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez)
[16:01:46] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud: NAT egress connections to WMF wikis [puppet] - 10https://gerrit.wikimedia.org/r/656883 (https://phabricator.wikimedia.org/T209011)
[16:01:48] <wikibugs>	 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 3 others: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh)
[16:01:51] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: drop ACL entry for WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011)
[16:03:29] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1135.eqiad.wmnet with reason: REIMAGE
[16:03:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:10] <wikibugs>	 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 3 others: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) >>! In T258978#6751367, @akosiaris wrote: >>>! In T258978#6729580, @kostajh wrote: >> @akosiaris...
[16:04:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo)
[16:04:43] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] [DONT MERGE] cloud-in4: drop ACL entry for WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez)
[16:05:27] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1135.eqiad.wmnet with reason: REIMAGE
[16:05:28] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[16:05:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:10] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1132.eqiad.wmnet'] `  and were **ALL** successful.
[16:07:02] <wikibugs>	 (03Merged) 10jenkins-bot: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[16:07:44] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo)
[16:09:12] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1136.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1136.eqiad.wmnet'] `
[16:09:16] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1137.eqiad.wmnet'] ` The log can be fou...
[16:10:44] <wikibugs>	 (03CR) 10Volans: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/656430 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[16:12:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1135.eqiad.wmnet'] `  and were **ALL** successful.
[16:15:25] <wikibugs>	 (03PS2) 10Hnowlan: similar-users: add helmfile configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837)
[16:18:39] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1138.eqiad.wmnet'] ` The log can be fou...
[16:19:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.reboot-workers: add improvements to reboot logic [cookbooks] - 10https://gerrit.wikimedia.org/r/656462 (owner: 10Elukey)
[16:22:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE
[16:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:43] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hadoop.reboot-workers: add improvements to reboot logic [cookbooks] - 10https://gerrit.wikimedia.org/r/656462 (owner: 10Elukey)
[16:24:16] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE
[16:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:56] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) a:03fdans @lilients_WMDE You only had LDAP access before, correct?  Assigning to @fdans as, to the best of my understanding, is the right manager to approve new access to the anal...
[16:25:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) p:05Triage→03High
[16:27:23] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10jcrespo) p:05Triage→03High
[16:27:37] <wikibugs>	 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 3 others: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh)
[16:29:56] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10jcrespo) a:05fdans→03Ottomata I've been told Andres may be the right person to approve, apologies.
[16:30:48] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1138.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1138.eqiad.wmnet'] `
[16:31:56] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1137.eqiad.wmnet'] `  and were **ALL** successful.
[16:34:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:35:30] <icinga-wm>	 PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:37:56] <wikibugs>	 (03PS4) 10JMeybohm: Allow the kube-controller-manager to run without superuser permissions [puppet] - 10https://gerrit.wikimedia.org/r/650469 (https://phabricator.wikimedia.org/T228967)
[16:41:56] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] similar-users: add helmfile configuration. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[16:42:19] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Nova: add a simple vendordata REST service [puppet] - 10https://gerrit.wikimedia.org/r/656640 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[16:48:49] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.init-hadoop-workers: move to Class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925)
[16:59:26] <wikibugs>	 10SRE, 10Inuka-Team, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10jcrespo) Based on previous comment by @sbassett, it seems the right direction is to contact Legal & IT support for permission/be...
[17:05:57] <wikibugs>	 (03PS3) 10Hnowlan: similar-users: add helmfile configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837)
[17:06:18] <wikibugs>	 (03CR) 10Hnowlan: similar-users: add helmfile configuration. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[17:08:42] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10Aklapper)
[17:08:57] <wikibugs>	 10SRE, 10Dumps-Generation, 10SRE-Access-Requests, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team): Add all of CPT to snapshot/dumpsdata admins - https://phabricator.wikimedia.org/T271718 (10jcrespo) Next meeting is expected to happen on 25 January, will add to the list of topics to disc...
[17:09:30] <wikibugs>	 (03PS1) 10Filippo Giunchedi: WIP apt package from component [puppet] - 10https://gerrit.wikimedia.org/r/656953
[17:09:31] <wikibugs>	 10SRE, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10Aklapper)
[17:09:33] <wikibugs>	 10SRE, 10WMF-NDA-Requests: NDA for Superset Request from WMDE Intern Amrutha Chandra - https://phabricator.wikimedia.org/T272287 (10Aklapper)
[17:09:48] <wikibugs>	 10SRE, 10observability, 10User-fgiunchedi: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (10fgiunchedi) This particular failure mode seems to be fixed with rsyslog 8.2008.0-1~bpo10+1, I can't find any other rsyslog segmentation faults since deploying the new vers...
[17:10:15] <wikibugs>	 (03CR) 10Volans: "Looks mostly reasonable, thanks for the migration! Couple of questions inline" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey)
[17:11:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP apt package from component [puppet] - 10https://gerrit.wikimedia.org/r/656953 (owner: 10Filippo Giunchedi)
[17:11:51] <wikibugs>	 (03PS2) 10Filippo Giunchedi: WIP apt package from component [puppet] - 10https://gerrit.wikimedia.org/r/656953
[17:14:34] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov)
[17:14:36] <wikibugs>	 (03PS3) 10Filippo Giunchedi: rsyslog: install rsyslog from component/rsyslog on Buster [puppet] - 10https://gerrit.wikimedia.org/r/656953 (https://phabricator.wikimedia.org/T259780)
[17:14:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov)
[17:15:47] <wikibugs>	 (03PS2) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487)
[17:19:03] <wikibugs>	 (03PS2) 10Elukey: sre.hadoop.init-hadoop-workers: move to Class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925)
[17:19:46] <wikibugs>	 (03PS3) 10Elukey: sre.hadoop.init-hadoop-workers: move to Class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925)
[17:20:04] <wikibugs>	 (03CR) 10Elukey: "Followed up to all the comments, thanks!" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey)
[17:22:12] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Looks ok to me, ship and test it!" [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey)
[17:22:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.init-hadoop-workers: move to Class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656952 (https://phabricator.wikimedia.org/T269925) (owner: 10Elukey)
[17:25:30] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-Incident: High latency on appservers - https://phabricator.wikimedia.org/T272215 (10jcrespo) p:05Triage→03Medium This was UBN on Saturday, based on Joe's comment, I am putting this now to Medium.  More details are yet to be provided on the Incident report, I can help with...
[17:28:13] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[17:28:31] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[17:28:59] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "I like the refactor direction, there are some things that looks wrong though, see inline." (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov)
[17:29:20] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[17:30:11] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[17:32:51] <mutante>	 !log reimaging mw2271,mw2273,mw2274,mw227 (codfw only)
[17:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:41] <wikibugs>	 (03CR) 10Ayounsi: interface_automation.py: Minor refactors and fixes for 2.9 (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov)
[17:33:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo) @ppelberg This is blocked only on providing additional information requested by @Joe and @Elukey above.
[17:34:19] <wikibugs>	 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo)
[17:35:54] <wikibugs>	 (03PS3) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487)
[17:35:56] <wikibugs>	 (03CR) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov)
[17:36:37] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1118.eqiad.wmnet
[17:36:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1118.eqiad.wmnet
[17:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:21] <mutante>	 It's a US holiday, but I'm working a bit because I could not on Friday
[17:42:29] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1120.eqiad.wmnet
[17:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo) We would also need @DannyH to sign off the request, as to the best of my understanding, this is not a team's "standardized request".
[17:43:57] <wikibugs>	 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo)
[17:44:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1120.eqiad.wmnet
[17:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10jcrespo) a:03ppelberg
[17:45:07] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2271.codfw.wmnet with reason: REIMAGE
[17:45:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2273.codfw.wmnet with reason: REIMAGE
[17:45:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:14] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2274.codfw.wmnet with reason: REIMAGE
[17:46:15] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1121-1123].eqiad.wmnet
[17:46:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2275.codfw.wmnet with reason: REIMAGE
[17:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:07] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2271.codfw.wmnet with reason: REIMAGE
[17:47:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1121-1123].eqiad.wmnet
[17:48:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:49:06] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2273.codfw.wmnet with reason: REIMAGE
[17:49:07] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2275.codfw.wmnet with reason: REIMAGE
[17:49:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:31] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1124-1127].eqiad.wmnet
[17:49:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:50:51] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2274.codfw.wmnet with reason: REIMAGE
[17:50:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:22] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1124-1127].eqiad.wmnet
[17:51:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:12] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10jcrespo) p:05Triage→03Medium Assigning medium status to remove from SRE untriaged inbox, feel free to edit on disagreement.
[17:54:53] <wikibugs>	 (03CR) 10Volans: interface_automation.py: Minor refactors and fixes for 2.9 (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov)
[17:55:50] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey)
[17:56:14] <wikibugs>	 10SRE: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (10jcrespo) Hey, Aklapper,  Sorry if you discussed this with someone already. Could you provide a bit more of context: is this a suggestion but not high priority? I...
[17:56:55] <wikibugs>	 10SRE, 10Wikidata, 10Wikidata Query Builder, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Dzahn) @addshore I think all the check boxes (except maybe the last one) on this ticket are already done?
[17:58:21] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:59:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey)
[17:59:53] <wikibugs>	 10SRE, 10SRE-tools: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (10jcrespo) Adding @Volans, but I could do the patch myself if it is easy enough, and one of the 2 main users of it (media and dbs).
[18:04:17] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1136.eqiad.wmnet'] ` The log can be fou...
[18:09:19] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2275 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:10:04] <wikibugs>	 (03PS4) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487)
[18:10:06] <wikibugs>	 (03CR) 10CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov)
[18:10:09] <icinga-wm>	 PROBLEM - PHP opcache health on mw2275 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[18:10:20] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1138.eqiad.wmnet'] ` The log can be fou...
[18:10:34] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2274.codfw.wmnet'] `  and were **ALL** s...
[18:11:06] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2275.codfw.wmnet'] `  an...
[18:11:22] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2273.codfw.wmnet'] `  an...
[18:11:44] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2271.codfw.wmnet'] `  an...
[18:12:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1128.eqiad.wmnet
[18:12:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:29] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2275 is CRITICAL: Host mw2275 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[18:14:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1128.eqiad.wmnet
[18:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE
[18:16:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1130.eqiad.wmnet
[18:17:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE
[18:18:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:27] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1130.eqiad.wmnet
[18:19:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1132.eqiad.wmnet
[18:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:21] <wikibugs>	 SRE, SRE-tools: Use static PHIDs instead of fragile Phab project names in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (Volans) Sure, let's move to PHIDs. That's a general thing, not only for raid_handler.py, but the above linked task should have all the pointers to...
[18:21:57] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1132.eqiad.wmnet
[18:21:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE
[18:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:29] <wikibugs>	 (CR) Volans: [C: +1] "LGTM to test it!" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: CRusnov)
[18:24:06] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE
[18:24:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:40] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1136.eqiad.wmnet'] `  and were **ALL** successful.
[18:29:21] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey)
[18:29:27] <wikibugs>	 SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (jcrespo) Adding observability team, although for initial reactions not sure if a separate ticket for infrastructure logs...
[18:33:11] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1138.eqiad.wmnet'] `  and were **ALL** successful.
[18:33:55] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2271.codfw.wmnet
[18:33:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2273.codfw.wmnet
[18:34:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1136,1138].eqiad.wmnet
[18:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2275.codfw.wmnet
[18:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:48] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2274.codfw.wmnet
[18:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:05] <wikibugs>	 (PS1) Ladsgroup: Drop profile::analytics::refinery::job::streams_check [puppet] - https://gerrit.wikimedia.org/r/656961
[18:35:59] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1136,1138].eqiad.wmnet
[18:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:11] <wikibugs>	 (CR) Elukey: [C: +2] "elukey@cumin1001:~$ sudo cumin 'c:profile::analytics::refinery::job::streams_check'" [puppet] - https://gerrit.wikimedia.org/r/656961 (owner: Ladsgroup)
[18:38:51] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2271.codfw.wmnet
[18:38:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2273.codfw.wmnet
[18:39:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:41] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) @Cmjohnson for an-worker1119 and an-worker1131 I don't have any network link, could you please check if anything is missing from the cabling/config...
[18:40:26] <wikibugs>	 (CR) Elukey: [C: +2] eventlogging: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/656531 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[18:40:29] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2274.codfw.wmnet
[18:40:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:41] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2275.codfw.wmnet
[18:40:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:32] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:15:59] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2275 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:19:26] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:20:50] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:31:38] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2276.codfw.wmnet with reason: REIMAGE
[19:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:30] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2277.codfw.wmnet with reason: REIMAGE
[19:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:37] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2276.codfw.wmnet with reason: REIMAGE
[19:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:24] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2277.codfw.wmnet with reason: REIMAGE
[19:35:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:27] <wikibugs>	 (PS1) Luke081515: Adding namespace aliases on arbcom-ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/656964 (https://phabricator.wikimedia.org/T272292)
[19:38:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2303.codfw.wmnet with reason: REIMAGE
[19:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2305.codfw.wmnet with reason: REIMAGE
[19:39:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:24] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2303.codfw.wmnet with reason: REIMAGE
[19:40:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:49] <wikibugs>	 SRE, SRE-tools: Use static PHIDs instead of fragile Phab project names in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (Aklapper) Hi, this is a suggestion to avoid potential future breakage / to make code more robust.
[19:42:01] <wikibugs>	 (PS5) CRusnov: interface_automation.py: Minor refactors and fixes for 2.9 [software/netbox-extras] - https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487)
[19:42:25] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2305.codfw.wmnet with reason: REIMAGE
[19:42:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:11] <wikibugs>	 (CR) CRusnov: "With a minor fix for the new method's arguments, this works on -dev importing new VMs." [software/netbox-extras] - https://gerrit.wikimedia.org/r/656954 (https://phabricator.wikimedia.org/T266487) (owner: CRusnov)
[19:53:54] <wikibugs>	 SRE, Wikidata, Wikidata Query Builder, User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (Addshore) >>! In T266703#6756059, @Dzahn wrote: > @addshore I think all the check boxes (except maybe the last one) on this ticket are already done?  This tic...
[19:54:47] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2276.codfw.wmnet'] `  an...
[19:54:53] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2277.codfw.wmnet'] `  an...
[19:56:12] <icinga-wm>	 PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:50] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2303.codfw.wmnet'] `  an...
[20:01:41] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2305.codfw.wmnet'] `  an...
[20:19:27] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2276.codfw.wmnet
[20:19:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2277.codfw.wmnet
[20:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:51] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2305.codfw.wmnet
[20:19:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:04] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2303.codfw.wmnet
[20:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:36] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2276.codfw.wmnet
[20:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:43] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2277.codfw.wmnet
[20:24:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:04] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2303.codfw.wmnet
[20:25:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:12] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2305.codfw.wmnet
[20:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:03] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:28:04] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:29:00] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:30:10] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:30:17] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2310.codfw.wmnet'] `  Of...
[20:30:37] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:41:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:43:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:46:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2307.codfw.wmnet with reason: REIMAGE
[20:46:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:04] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2309.codfw.wmnet with reason: REIMAGE
[20:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:03] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2310.codfw.wmnet with reason: REIMAGE
[20:48:06] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2307.codfw.wmnet with reason: REIMAGE
[20:48:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:37] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2311.codfw.wmnet with reason: REIMAGE
[20:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:10] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2310.codfw.wmnet with reason: REIMAGE
[20:50:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:04] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2309.codfw.wmnet with reason: REIMAGE
[20:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:25] <icinga-wm>	 PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:05] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2311.codfw.wmnet with reason: REIMAGE
[20:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:39] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2307.codfw.wmnet'] `  an...
[21:10:06] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2309.codfw.wmnet'] `  an...
[21:10:52] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2311.codfw.wmnet'] `  an...
[21:12:47] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2311.codfw.wmnet'] `  an...
[21:22:33] <icinga-wm>	 PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:51] <wikibugs>	 SRE, Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (Ricardoa2020) Very interesting information about TLS, if we want to know more about IT and its security we should visit [[ https://demyo.com/ | Demyo ]]
[21:29:02] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2311.codfw.wmnet
[21:29:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:23] <icinga-wm>	 PROBLEM - PHP opcache health on mw2274 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:29:25] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2310.codfw.wmnet
[21:29:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:44] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2307.codfw.wmnet
[21:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2309.codfw.wmnet
[21:29:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2307.codfw.wmnet
[21:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:22] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2309.codfw.wmnet
[21:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2310.codfw.wmnet
[21:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:37] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2311.codfw.wmnet
[21:30:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:03] <icinga-wm>	 PROBLEM - PHP opcache health on mw2273 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:37:03] <icinga-wm>	 PROBLEM - PHP opcache health on mw2277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:39:52] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2268 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:39:52] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2271 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:39:52] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2273 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:39:52] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2274 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:39:52] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2275 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:39:53] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2276 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:39:53] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:42:15] <icinga-wm>	 PROBLEM - PHP opcache health on mw2303 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:43:17] <icinga-wm>	 PROBLEM - PHP opcache health on mw2305 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:45:32] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2303 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:45:32] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2305 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:54:47] <icinga-wm>	 PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:41] <icinga-wm>	 PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:10:23] <icinga-wm>	 PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:07] <icinga-wm>	 PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[22:50:23] <icinga-wm>	 PROBLEM - PHP opcache health on mw2310 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:54:07] <icinga-wm>	 PROBLEM - PHP opcache health on mw2311 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:55:41] <icinga-wm>	 PROBLEM - Elevated latency for icinga checks in eqiad on alert1001 is CRITICAL: cluster=alerting instance=alert1001 job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[23:02:57] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2310 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:02:57] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2311 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:42:57] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:45:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets