[01:07:53] <icinga-wm>	 RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:13:05] <icinga-wm>	 PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:17] <icinga-wm>	 RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:25] <icinga-wm>	 PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:55] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 5870 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:33:31] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:42:29] <icinga-wm>	 RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:47:15] <icinga-wm>	 PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:29] <icinga-wm>	 RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:39] <icinga-wm>	 PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:49] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:35:57] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:43:13] <icinga-wm>	 RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:23] <icinga-wm>	 PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:27] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[05:04:03] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[06:02:28] <wikibugs>	 SRE, DBA: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (Marostegui) p:Triage→Medium a:Kormat Yeah, as far as I remember we're not using this for anything Assigning it for Stevie for confirmation and removal (if that applies)
[06:10:23] <wikibugs>	 SRE, ops-eqiad, DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (Marostegui) Thanks everyone who responded to this incident!
[06:17:33] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) >>! In T258361#6822070, @jcrespo wrote: > I am taking db1163 to, at least temporarily, substitute db1134 due to T274472.  Thanks. I...
[06:19:15] <wikibugs>	 (PS1) Marostegui: db1162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/664087 (https://phabricator.wikimedia.org/T258361)
[06:20:14] <wikibugs>	 (CR) Marostegui: [C: +2] db1162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/664087 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:36:31] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1162 to dbctl [puppet] - https://gerrit.wikimedia.org/r/664088 (https://phabricator.wikimedia.org/T258361)
[06:37:05] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1162 to dbctl [puppet] - https://gerrit.wikimedia.org/r/664088 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:40:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1162 to dbctl - depooled T258361', diff saved to https://phabricator.wikimedia.org/P14339 and previous config saved to /var/cache/conftool/dbconfig/20210215-064001-marostegui.json
[06:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:08] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:46:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1162 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14340 and previous config saved to /var/cache/conftool/dbconfig/20210215-064628-marostegui.json
[06:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:33] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:56:50] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1162 and db1163 [puppet] - https://gerrit.wikimedia.org/r/664089
[06:57:31] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1162 and db1163 [puppet] - https://gerrit.wikimedia.org/r/664089 (owner: Marostegui)
[06:58:07] <icinga-wm>	 RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1162 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14341 and previous config saved to /var/cache/conftool/dbconfig/20210215-070206-marostegui.json
[07:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:12] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:09:46] <wikibugs>	 SRE, ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (elukey) Resolved→Open ms-be1034 is down again, same issue as the one described by Filippo... :(
[07:10:31] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T274488
[07:14:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
[07:14:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:37] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
[07:16:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:41] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet
[07:20:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:37] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet
[07:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:21] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1009.eqiad.wmnet
[07:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:40] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1009.eqiad.wmnet
[07:26:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:21] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet
[07:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:24] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet
[07:33:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:23] <icinga-wm>	 RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:54] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes - elukey@cumin1001
[07:42:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:33] <icinga-wm>	 PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:47:21] <wikibugs>	 (PS1) ArielGlenn: wikidata json dumps: re-add source of shared functions [puppet] - https://gerrit.wikimedia.org/r/664090
[07:48:16] <wikibugs>	 (CR) ArielGlenn: [C: +2] wikidata json dumps: re-add source of shared functions [puppet] - https://gerrit.wikimedia.org/r/664090 (owner: ArielGlenn)
[07:49:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 3%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14342 and previous config saved to /var/cache/conftool/dbconfig/20210215-074932-root.json
[07:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:16] <wikibugs>	 (PS1) ArielGlenn: now that snapshot1005 is testbed host, make snapshot1007 the enwiki dumps runner [puppet] - https://gerrit.wikimedia.org/r/664091 (https://phabricator.wikimedia.org/T269377)
[08:04:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 4%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14343 and previous config saved to /var/cache/conftool/dbconfig/20210215-080435-root.json
[08:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:33] <icinga-wm>	 RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:21] <wikibugs>	 (CR) ArielGlenn: [C: +2] now that snapshot1005 is testbed host, make snapshot1007 the enwiki dumps runner [puppet] - https://gerrit.wikimedia.org/r/664091 (https://phabricator.wikimedia.org/T269377) (owner: ArielGlenn)
[08:10:50] <wikibugs>	 (PS1) ArielGlenn: prep snapshot1005 and 1006 for reinstall with buster [puppet] - https://gerrit.wikimedia.org/r/664092 (https://phabricator.wikimedia.org/T269377)
[08:13:14] <wikibugs>	 (CR) ArielGlenn: [C: +2] prep snapshot1005 and 1006 for reinstall with buster [puppet] - https://gerrit.wikimedia.org/r/664092 (https://phabricator.wikimedia.org/T269377) (owner: ArielGlenn)
[08:17:33] <wikibugs>	 SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` snapshot1005.eqiad.wmnet ` The log can be fo...
[08:19:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14344 and previous config saved to /var/cache/conftool/dbconfig/20210215-081940-root.json
[08:26:51] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:27:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 T274235', diff saved to https://phabricator.wikimedia.org/P14345 and previous config saved to /var/cache/conftool/dbconfig/20210215-082718-marostegui.json
[08:27:47] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:05] <gehel>	 !log powercycle wdqs1009
[08:29:22] <wikibugs>	 (PS1) Marostegui: db1075: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/664093 (https://phabricator.wikimedia.org/T274235)
[08:29:24] <wikibugs>	 (PS1) Elukey: profile::hadoop::backup::namenode: add a more precise notes_url [puppet] - https://gerrit.wikimedia.org/r/664094
[08:29:25] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1005.eqiad.wmnet with reason: REIMAGE
[08:30:06] <wikibugs>	 (CR) Marostegui: [C: +2] db1075: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/664093 (https://phabricator.wikimedia.org/T274235) (owner: Marostegui)
[08:31:30] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1005.eqiad.wmnet with reason: REIMAGE
[08:31:48] <wikibugs>	 (PS1) JMeybohm: tiller: Run tiller as user nobody [docker-images/production-images] - https://gerrit.wikimedia.org/r/664095 (https://phabricator.wikimedia.org/T274254)
[08:31:50] <wikibugs>	 (PS1) JMeybohm: eventrouter: Use numeric UID [docker-images/production-images] - https://gerrit.wikimedia.org/r/664096 (https://phabricator.wikimedia.org/T274254)
[08:31:52] <wikibugs>	 (PS1) JMeybohm: fluent-bit: Use numeric UID [docker-images/production-images] - https://gerrit.wikimedia.org/r/664097 (https://phabricator.wikimedia.org/T274254)
[08:31:57] <wikibugs>	 (PS1) JMeybohm: ratelimit: Use numeric UID [docker-images/production-images] - https://gerrit.wikimedia.org/r/664098 (https://phabricator.wikimedia.org/T274254)
[08:32:20] <wikibugs>	 (CR) Elukey: [C: +2] profile::hadoop::backup::namenode: add a more precise notes_url [puppet] - https://gerrit.wikimedia.org/r/664094 (owner: Elukey)
[08:34:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14346 and previous config saved to /var/cache/conftool/dbconfig/20210215-083444-root.json
[08:44:12] <wikibugs>	 (PS1) Elukey: hadoop: enable HDFS service port for Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629)
[08:45:24] <wikibugs>	 (CR) JMeybohm: [C: +1] "Nice!" [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[08:47:53] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28056/console" [puppet] - https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629) (owner: Elukey)
[08:48:01] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[08:49:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 15%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14347 and previous config saved to /var/cache/conftool/dbconfig/20210215-084947-root.json
[08:50:59] <wikibugs>	 ops-eqiad, DC-Ops, Wikidata, Wikidata-Query-Service: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (Gehel)
[08:53:53] <wikibugs>	 SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1005.eqiad.wmnet'] `  and were **ALL** successful.
[08:58:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes - elukey@cumin1001
[09:01:22] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[09:01:30] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[09:01:32] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] hadoop: enable HDFS service port for Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629) (owner: Elukey)
[09:04:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 20%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14348 and previous config saved to /var/cache/conftool/dbconfig/20210215-090451-root.json
[09:05:56] <wikibugs>	 (PS1) Joal: Update oozie sharelib creation [puppet] - https://gerrit.wikimedia.org/r/664172 (https://phabricator.wikimedia.org/T274322)
[09:06:00] <wikibugs>	 (CR) JMeybohm: [C: -1] "You do mix list indention styles a bit, don't know if we should argue about it or just leave it be." (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/651757 (owner: Giuseppe Lavagetto)
[09:06:03] <joal>	 elukey: --^
[09:06:06] <joal>	 for when you have time
[09:07:48] <wikibugs>	 (CR) Elukey: [C: +2] Update oozie sharelib creation [puppet] - https://gerrit.wikimedia.org/r/664172 (https://phabricator.wikimedia.org/T274322) (owner: Joal)
[09:11:52] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[09:12:50] <wikibugs>	 SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` snapshot1006.eqiad.wmnet ` The log can be fo...
[09:13:58] <wikibugs>	 (PS1) Filippo Giunchedi: grafana: stop POST to /api/snapshots [puppet] - https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736)
[09:15:13] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28057/console" [puppet] - https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: Filippo Giunchedi)
[09:15:53] <wikibugs>	 (CR) Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)
[09:17:11] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: Filippo Giunchedi)
[09:17:49] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] grafana: stop POST to /api/snapshots [puppet] - https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: Filippo Giunchedi)
[09:19:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14349 and previous config saved to /var/cache/conftool/dbconfig/20210215-091955-root.json
[09:24:00] <wikibugs>	 (PS1) ArielGlenn: misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377)
[09:24:02] <wikibugs>	 (CR) David Caro: "Got a couple questions, nits you can safely ignore :)" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: Arturo Borrero Gonzalez)
[09:24:20] <wikibugs>	 (PS2) JMeybohm: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663873 (https://phabricator.wikimedia.org/T274262) (owner: PipelineBot)
[09:24:39] <wikibugs>	 (CR) Ayounsi: [C: +2] Remove sampling feature flag [homer/public] - https://gerrit.wikimedia.org/r/663533 (owner: Ayounsi)
[09:25:47] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1006.eqiad.wmnet with reason: REIMAGE
[09:26:49] <wikibugs>	 (CR) Ayounsi: "confirmed NOOP." [homer/public] - https://gerrit.wikimedia.org/r/663533 (owner: Ayounsi)
[09:27:52] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1006.eqiad.wmnet with reason: REIMAGE
[09:28:41] <wikibugs>	 (PS1) Vgutierrez: admin: Add christinedk user [puppet] - https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304)
[09:28:43] <wikibugs>	 (PS1) Vgutierrez: admin: Add christinedk to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/664227 (https://phabricator.wikimedia.org/T274304)
[09:34:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 30%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14350 and previous config saved to /var/cache/conftool/dbconfig/20210215-093458-root.json
[09:35:26] <wikibugs>	 (CR) Muehlenhoff: admin: Add christinedk user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304) (owner: Vgutierrez)
[09:37:27] <wikibugs>	 SRE, observability: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662 (Volans) If we allow for normal reboots going unnoticed, would we catch a scenario in which the icinga host reboots every 5 minutes due to a bug or DoS?  P.S. Keyholder is not armed aft...
[09:43:50] <elukey>	 !log roll restart HDFS daemons in Analytics Hadoop to pick up new RPC queue changes - T273629
[09:47:55] <wikibugs>	 (CR) Volans: "Optional nit inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663860 (owner: Hnowlan)
[09:50:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 40%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14351 and previous config saved to /var/cache/conftool/dbconfig/20210215-095002-root.json
[09:50:41] <wikibugs>	 SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1006.eqiad.wmnet'] `  and were **ALL** successful.
[09:55:54] <wikibugs>	 (PS1) Jcrespo: Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - https://gerrit.wikimedia.org/r/663961
[09:56:15] <wikibugs>	 (PS2) Jcrespo: Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - https://gerrit.wikimedia.org/r/663961
[09:57:14] <wikibugs>	 SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ArielGlenn) I was not going to re-image snapshot1005 and 6 because their replacements were due to have come in, but the boxes have not arrived yet a...
[09:57:18] <wikibugs>	 SRE: Create cookbook to add a node to a Ganeti cluster - https://phabricator.wikimedia.org/T274527 (MoritzMuehlenhoff) p:Triage→Medium
[09:57:34] <wikibugs>	 (PS2) ArielGlenn: misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377)
[09:57:51] <wikibugs>	 SRE, Packaging: Copy cassandra packages to buster-wikimedia - https://phabricator.wikimedia.org/T274119 (MoritzMuehlenhoff) p:Triage→Medium
[09:58:12] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - https://gerrit.wikimedia.org/r/663961 (owner: Jcrespo)
[09:59:02] <wikibugs>	 (CR) ArielGlenn: [C: +2] misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377) (owner: ArielGlenn)
[10:00:12] <apergos>	 jynus: may I merge your puppet patch "backup::set { 'mysql-srv-backups-dumps-latest':" etc?
[10:00:17] <jynus>	 yes
[10:00:41] <apergos>	 done!
[10:00:44] <jynus>	 thanks
[10:02:02] <apergos>	 thanks for the quick response!
[10:05:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14352 and previous config saved to /var/cache/conftool/dbconfig/20210215-100505-root.json
[10:09:14] <hashar>	 !log Switching Jenkins jobs to Quibble 0.0.46
[10:15:52] <wikibugs>	 SRE, ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (fgiunchedi) Thank you for all the work ! LMK how I can help e.g. if speeding up the decom of one host in T272836 would help (as opposed as decom'ing all hosts at the same time)
[10:20:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 60%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14353 and previous config saved to /var/cache/conftool/dbconfig/20210215-102009-root.json
[10:23:30] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host netmon1002.wikimedia.org
[10:27:29] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1002.wikimedia.org
[10:30:08] <wikibugs>	 (CR) Marostegui: [C: +2] db1134: Do not be tag as candidate master [puppet] - https://gerrit.wikimedia.org/r/664230 (https://phabricator.wikimedia.org/T274472) (owner: Marostegui)
[10:31:09] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: dumps: distribution: nfs: allow establishing connections with TCP ports > 1024 [puppet] - https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397)
[10:35:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 70%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14355 and previous config saved to /var/cache/conftool/dbconfig/20210215-103512-root.json
[10:41:25] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: dumps: distribution: nfs: allow establishing connections with TCP ports >= 1024 [puppet] - https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397)
[10:44:09] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] dumps: distribution: nfs: allow establishing connections with TCP ports >= 1024 [puppet] - https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397) (owner: Arturo Borrero Gonzalez)
[10:47:08] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: labstore: allow NFS connections from public cloud networks [puppet] - https://gerrit.wikimedia.org/r/664233 (https://phabricator.wikimedia.org/T272397)
[10:48:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] labstore: allow NFS connections from public cloud networks [puppet] - https://gerrit.wikimedia.org/r/664233 (https://phabricator.wikimedia.org/T272397) (owner: Arturo Borrero Gonzalez)
[10:49:05] <wikibugs>	 (PS1) ArielGlenn: swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713)
[10:50:16] <godog>	 jouncebot: next
[10:50:16] <jouncebot>	 In 0 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1130)
[10:50:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 80%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14356 and previous config saved to /var/cache/conftool/dbconfig/20210215-105016-root.json
[10:57:30] <wikibugs>	 (PS2) ArielGlenn: swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713)
[10:57:59] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet
[10:58:44] <wikibugs>	 (PS1) Jcrespo: Preventive commit for jynus to misspell "bullseye", next Debian version [puppet] - https://gerrit.wikimedia.org/r/664237
[10:58:59] <wikibugs>	 (CR) ArielGlenn: [C: +2] swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713) (owner: ArielGlenn)
[11:00:25] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet
[11:02:02] <wikibugs>	 (CR) Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/28058/" [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[11:03:17] <wikibugs>	 (PS2) Hnowlan: mtail: add exception handling in tests for non-Debian OSes [puppet] - https://gerrit.wikimedia.org/r/663860
[11:05:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 90%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14357 and previous config saved to /var/cache/conftool/dbconfig/20210215-110519-root.json
[11:06:51] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[11:07:27] <icinga-wm>	 RECOVERY - tilerator on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[11:08:21] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:25] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps2007.codfw.wmnet
[11:11:57] <wikibugs>	 (CR) Hnowlan: mtail: add exception handling in tests for non-Debian OSes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663860 (owner: Hnowlan)
[11:14:57] <wikibugs>	 (PS1) Elukey: profile::hadoop::master: raise threshold for corrupt blocks [puppet] - https://gerrit.wikimedia.org/r/664238
[11:16:50] <wikibugs>	 (CR) Elukey: [C: +2] profile::hadoop::master: raise threshold for corrupt blocks [puppet] - https://gerrit.wikimedia.org/r/664238 (owner: Elukey)
[11:20:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14358 and previous config saved to /var/cache/conftool/dbconfig/20210215-112023-root.json
[11:27:16] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloud: drop NAT exceptions for dumps NFS [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397)
[11:28:11] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes - elukey@cumin1001
[11:28:44] <elukey>	 this may trigger (I hope not) AQS alerts --^
[11:28:52] <elukey>	 in case it does, it is my fault and you can blame me
[11:29:05] * elukey sees kormat ready for it
[11:29:31] * kormat nods solemnly 
[11:29:57] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[11:30:04] <jouncebot>	 jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1130).
[11:32:31] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: move common hiera into proper file [puppet] - 10https://gerrit.wikimedia.org/r/664241 (https://phabricator.wikimedia.org/T272963)
[11:33:13] <wikibugs>	 (03CR) 10Jbond: "See comments inline, also wonder if you considered using pathlib for the file operations." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[11:33:17] <wikibugs>	 (03PS4) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115)
[11:33:19] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: move common hiera into proper file [puppet] - 10https://gerrit.wikimedia.org/r/664241 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[11:34:50] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: cloud: drop NAT exceptions for dumps NFS [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397)
[11:37:34] <moritzm>	 !log reimaging bast5001 to buster
[11:45:23] <wikibugs>	 (03CR) 10Jbond: "Adding Andrew to approve privatedata-users access" [puppet] - 10https://gerrit.wikimedia.org/r/664227 (https://phabricator.wikimedia.org/T274304) (owner: 10Vgutierrez)
[11:52:45] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1007.eqiad.wmnet
[11:54:09] <wikibugs>	 (03CR) 10Jbond: "see comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663993 (owner: 10Urbanecm)
[11:55:13] <wikibugs>	 (03CR) 10Urbanecm: Update urbanecm's dotfiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663993 (owner: 10Urbanecm)
[11:55:23] <wikibugs>	 (03PS2) 10Urbanecm: Update urbanecm's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/663993
[11:56:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Update urbanecm's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/663993 (owner: 10Urbanecm)
[11:56:21] <jbond42>	 Urbanecm: ^^ merged
[11:56:24] <Urbanecm>	 thanks jbond42 !
[11:56:28] <jbond42>	 :) np
[11:58:52] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1007.eqiad.wmnet
[12:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1200).
[12:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[12:00:14] <Urbanecm>	 I'll deploy regardless
[12:01:12] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Enable SandboxLink at viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663736 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm)
[12:02:46] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Enable SandboxLink at viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663736 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm)
[12:04:02] <wikibugs>	 (03PS1) 10Effie Mouzeli: hiera: install memcached 1.6 on mc1037 [puppet] - 10https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315)
[12:06:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thanks this will also be a big help to me 😊" [puppet] - 10https://gerrit.wikimedia.org/r/664237 (owner: 10Jcrespo)
[12:07:47] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast5001.wikimedia.org with reason: REIMAGE
[12:07:55] <wikibugs>	 (03PS22) 10Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893)
[12:08:54] <Urbanecm>	 can someone check mwdebug1002.eqiad.wmnet status, and remove it from scap if it is still broken (as mutante said in ops list)?
[12:09:16] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/28065/mc2037.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli)
[12:09:47] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast5001.wikimedia.org with reason: REIMAGE
[12:09:59] <wikibugs>	 (03PS2) 10Muehlenhoff: Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - 10https://gerrit.wikimedia.org/r/662918
[12:10:35] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 662d5f6af01f6cf6ce7e9d56cf1bc3ba282afee1: Revert "Revert "Enable SandboxLink at viwiki"" (T272796) (duration: 05m 26s)
[12:10:41] <Urbanecm>	 finally
[12:11:36] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10MoritzMuehlenhoff) Also needs approval by @Ottomata for Hadoop access.
[12:13:39] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[12:14:25] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[12:15:59] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[12:16:19] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] delete class tlsproxy::prometheus and nginx template [puppet] - 10https://gerrit.wikimedia.org/r/659377 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn)
[12:16:21] <wikibugs>	 (03PS2) 10Urbanecm: ukwikisource: Finish removal of NS Translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628)
[12:16:24] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] ukwikisource: Finish removal of NS Translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628) (owner: 10Urbanecm)
[12:17:21] <wikibugs>	 (03Merged) 10jenkins-bot: ukwikisource: Finish removal of NS Translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628) (owner: 10Urbanecm)
[12:17:27] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "left a nit for the commit msg, LGTM otherwise!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli)
[12:18:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Effie can you run a pcc to see if everything looks good?" [puppet] - 10https://gerrit.wikimedia.org/r/663868 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli)
[12:18:47] <logmsgbot>	 !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[12:21:30] <Urbanecm>	 repeating myself: can someone depool mwdebug1002? it's currently down (see mail from dzahn in ops list), but still pooled and thus in scap dsh group :/
[12:22:25] <wikibugs>	 10SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10MoritzMuehlenhoff) p:05Triage→03Medium
[12:23:45] <wikibugs>	 10SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10MoritzMuehlenhoff) Adding a few tags for affected sub teams, simply untag when completed
[12:24:38] <wikibugs>	 10SRE, 10Analytics, 10observability, 10serviceops, 10cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10MoritzMuehlenhoff)
[12:25:33] <wikibugs>	 (03CR) 10Volans: "quick direct reply, will have a pass later" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[12:25:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "Thanks for the review!" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[12:30:51] <wikibugs>	 (03PS1) 10JMeybohm: admin: Allow tiller to create batch ressources [deployment-charts] - 10https://gerrit.wikimedia.org/r/664273
[12:32:02] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] admin: Allow tiller to create batch ressources [deployment-charts] - 10https://gerrit.wikimedia.org/r/664273 (owner: 10JMeybohm)
[12:32:29] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet
[12:33:32] <wikibugs>	 (03Merged) 10jenkins-bot: admin: Allow tiller to create batch ressources [deployment-charts] - 10https://gerrit.wikimedia.org/r/664273 (owner: 10JMeybohm)
[12:35:00] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[12:35:39] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cdf15981f7c6f7e02a3fb1c1ce61dc14815f216d: ukwikisource: Finish removal of NS Translations (T270628) (duration: 01m 07s)
[12:36:24] <wikibugs>	 (03PS1) 10Elukey: Add/Fix kerberos fake keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/664274 (https://phabricator.wikimedia.org/T274392)
[12:36:46] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add/Fix kerberos fake keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/664274 (https://phabricator.wikimedia.org/T274392) (owner: 10Elukey)
[12:37:06] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes - elukey@cumin1001
[12:37:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (10MoritzMuehlenhoff) p:05Triage→03Medium
[12:38:28] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] cloudgw: introduce HA by using keepalived/VRRP (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[12:38:36] <wikibugs>	 (03PS9) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963)
[12:38:38] <wikibugs>	 10SRE, 10observability, 10serviceops, 10Patch-For-Review, 10cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10elukey)
[12:39:18] <logmsgbot>	 !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[12:40:12] <wikibugs>	 (03PS10) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963)
[12:43:59] <moritzm>	 !log reimaging bast4002 to buster
[12:44:04] <wikibugs>	 (03PS11) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963)
[12:44:09] <logmsgbot>	 !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[12:44:39] <icinga-wm>	 PROBLEM - etherpad_lite_process_running on etherpad1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[12:44:59] <icinga-wm>	 PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:45:53] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: connect to address 10.64.32.178 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[12:46:25] <icinga-wm>	 RECOVERY - etherpad_lite_process_running on etherpad1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[12:47:24] <wikibugs>	 (03PS12) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963)
[12:47:35] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9184 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[12:47:58] <wikibugs>	 (03CR) 10Effie Mouzeli: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/663868 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli)
[12:48:27] <icinga-wm>	 RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:49:10] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/28075/" [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[12:49:13] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[12:49:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 T273955', diff saved to https://phabricator.wikimedia.org/P14359 and previous config saved to /var/cache/conftool/dbconfig/20210215-124944-marostegui.json
[12:50:24] <wikibugs>	 (03PS2) 10David Caro: utils: add script to run docker ci tests locally [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338)
[12:50:27] <wikibugs>	 (03PS1) 10Marostegui: db1093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/664276 (https://phabricator.wikimedia.org/T273955)
[12:50:50] <logmsgbot>	 !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[12:51:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/664276 (https://phabricator.wikimedia.org/T273955) (owner: 10Marostegui)
[12:58:16] <logmsgbot>	 !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[12:58:16] <logmsgbot>	 !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[12:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:10] <Lucas_WMDE>	 we lost a whole bunch of SAL messages because stashbot was out
[13:01:12] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4002.wikimedia.org with reason: REIMAGE
[13:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:21] <Lucas_WMDE>	 is it worth repeating them all?
[13:01:49] <Lucas_WMDE>	 cc marostegui, ryankemper, ariel, elukey…
[13:02:04] <marostegui>	 Lucas_WMDE: not from my side, thanks though! :)
[13:02:10] <Lucas_WMDE>	 ok
[13:02:26] <Lucas_WMDE>	 sometimes I do it but this seems to be almost 50 missed messages and I’m lazy :D
[13:02:41] <logmsgbot>	 !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[13:02:41] <logmsgbot>	 !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[13:02:44] <Lucas_WMDE>	 (they’re all in the IRC log)
[13:02:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:50] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1162 is fully pooled
[13:03:18] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4002.wikimedia.org with reason: REIMAGE
[13:03:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:58] <Lucas_WMDE>	 !log notice: stashbot had issues between 8:19 and 12:50, see https://wm-bot.wmflabs.org/browser/index.php?start=02%2F15%2F2021&end=02%2F15%2F2021&display=%23wikimedia-operations for missed !log messages
[13:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:54] <godog>	 !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
[13:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:58] <stashbot>	 T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[13:14:05] <wikibugs>	 (03PS1) 10JMeybohm: linkrecommendation: Read DB_USER from public config [deployment-charts] - 10https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893)
[13:14:16] <jayme>	 ^ kostajh
[13:14:58] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Read DB_USER from public config [deployment-charts] - 10https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm)
[13:15:30] <kostajh>	 jayme: cheers
[13:17:35] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Read DB_USER from public config [deployment-charts] - 10https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm)
[13:19:28] <logmsgbot>	 !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[13:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:47] <wikibugs>	 (03PS4) 10Hnowlan: mtail: create separate  metrics histogram based on endpoint [puppet] - 10https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727)
[13:22:04] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] tegola: Add docker image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan)
[13:28:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Shouldn't this instead be done via the pipeline? It would greatly decouple upgrading tegola from requiring an SRE to build newer versions " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan)
[13:33:36] <marostegui>	 !log Stop MySQL on db1093 - T273955
[13:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:41] <stashbot>	 T273955: decommission db1093.eqiad.wmnet - https://phabricator.wikimedia.org/T273955
[13:34:02] <wikibugs>	 (03PS5) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[13:34:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[13:38:10] <moritzm>	 !log installing subversion security updates
[13:38:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:38] <wikibugs>	 (03PS6) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[13:43:11] <icinga-wm>	 RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:55] <wikibugs>	 (03PS2) 10Muehlenhoff: admin: Add christinedk user [puppet] - 10https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304) (owner: 10Vgutierrez)
[13:48:03] <wikibugs>	 (03PS7) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[13:48:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[13:53:00] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.wdqs.data-reload
[13:53:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:13] <moritzm>	 !log installing libonig security update for stretch
[13:57:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:09] <godog>	 !log swift eqiad-prod: add weight back to sdg on ms-be1054 - T273582
[14:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:15] <stashbot>	 T273582: Put sdg1 on ms-be1054 back in service - https://phabricator.wikimedia.org/T273582
[14:10:43] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10fgiunchedi) 05Open→03Resolved I'm boldly resolving this again since limiting memory usage for object replication processes helped a whole lot to...
[14:12:42] <wikibugs>	 (03PS1) 10Urbanecm: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789)
[14:13:04] <Urbanecm>	 jouncebot: now
[14:13:05] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 46 minute(s)
[14:13:15] <wikibugs>	 (03PS2) 10Urbanecm: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789)
[14:13:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789) (owner: 10Urbanecm)
[14:14:07] <wikibugs>	 (03Merged) 10jenkins-bot: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789) (owner: 10Urbanecm)
[14:17:02] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 00905c4a7e4bb69f39e52e1c4d4d6168006b0e7b: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T274789) (duration: 01m 09s)
[14:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:07] <stashbot>	 T274789: Add <https://static.president.az/> to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T274789
[14:19:43] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:44] <wikibugs>	 (03PS8) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[14:25:33] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:37] <wikibugs>	 (03CR) 10David Caro: utils: add script to run docker ci tests locally (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro)
[14:31:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[14:34:23] <wikibugs>	 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MoritzMuehlenhoff) p:05Triage→03Medium
[14:34:33] <wikibugs>	 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10MoritzMuehlenhoff) p:05Triage→03Medium
[14:45:09] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:25] <wikibugs>	 (03PS1) 10Jbond: Gemfile: increase dependency for wmf_style-stylegude-check [puppet] - 10https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953)
[15:04:50] <godog>	 !log upgrade grafana to 7.4.1 on grafana1002 - T263747
[15:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:55] <stashbot>	 T263747: Upgrade Grafana to 7.4 - https://phabricator.wikimedia.org/T263747
[15:06:15] <wikibugs>	 (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[15:06:27] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10MoritzMuehlenhoff) Also adding @Ottomata for approval for analytics-privatedata-users.
[15:09:46] <moritzm>	 !log reimaging bast3004 to buster
[15:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:06] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: CommentFormatter: Fix problems with editsection and quotes [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664254 (https://phabricator.wikimedia.org/T274709)
[15:17:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet
[15:17:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:21] <wikibugs>	 (03CR) 10Jbond: "did a quick pass however im not that familiar with the current decom cook book" (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey)
[15:20:05] <wikibugs>	 (03PS1) 10Kormat: integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664300
[15:20:10] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:27:56] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Can be merged anytime, the CI job always does a gem update :]" [puppet] - 10https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond)
[15:28:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Gemfile: increase dependency for wmf_style-stylegude-check [puppet] - 10https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond)
[15:30:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet
[15:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:21] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5012 is OK: HTTP OK: HTTP/1.0 200 OK - 23547 bytes in 0.829 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:33:08] <moritzm>	 !log installing linux-4.19 update for Stretch on servers which have it installed (no reboots, just updating the kernels)
[15:33:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:35] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664300 (owner: 10Kormat)
[15:34:16] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast3004.wikimedia.org with reason: REIMAGE
[15:34:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:30] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Preventive commit for jynus to misspell "bullseye", next Debian version [puppet] - 10https://gerrit.wikimedia.org/r/664237 (owner: 10Jcrespo)
[15:36:11] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[15:36:12] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[15:36:12] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[15:36:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:20] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast3004.wikimedia.org with reason: REIMAGE
[15:36:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:46] <wikibugs>	 (03PS1) 10Jcrespo: testing test at test at testing [puppet] - 10https://gerrit.wikimedia.org/r/664301
[15:36:54] <wikibugs>	 (03Merged) 10jenkins-bot: integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664300 (owner: 10Kormat)
[15:38:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] testing test at test at testing [puppet] - 10https://gerrit.wikimedia.org/r/664301 (owner: 10Jcrespo)
[15:38:36] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet
[15:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:49] <wikibugs>	 (03CR) 10Jcrespo: "16:37:55 Typo found!" [puppet] - 10https://gerrit.wikimedia.org/r/664301 (owner: 10Jcrespo)
[15:39:13] <wikibugs>	 (03Abandoned) 10Jcrespo: testing test at test at testing [puppet] - 10https://gerrit.wikimedia.org/r/664301 (owner: 10Jcrespo)
[15:39:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "1 pedantic comment but perhaps we can solve this more easily, see inline." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/659863 (owner: 10JMeybohm)
[15:39:52] <wikibugs>	 10SRE: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10fgiunchedi)
[15:40:39] <wikibugs>	 (03PS1) 10Elukey: hadoop: update the HDFS Namenode rack configuration [puppet] - 10https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795)
[15:41:13] <wikibugs>	 10SRE: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10fgiunchedi)
[15:44:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "+1, but perhaps we don't even need it? See dependent commit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/659864 (owner: 10JMeybohm)
[15:45:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a comment to the snapshot block [puppet] - 10https://gerrit.wikimedia.org/r/664303
[15:46:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet
[15:46:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:44] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - 10https://gerrit.wikimedia.org/r/664255
[15:46:53] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - 10https://gerrit.wikimedia.org/r/664255
[15:47:25] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - 10https://gerrit.wikimedia.org/r/664255 (https://phabricator.wikimedia.org/T272963)
[15:47:45] <wikibugs>	 (03PS2) 10Elukey: hadoop: update the HDFS Namenode rack configuration [puppet] - 10https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795)
[15:48:09] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[15:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:54] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet
[15:48:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:03] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - 10https://gerrit.wikimedia.org/r/664255 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[15:50:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete cloudera config from reprepro [puppet] - 10https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797)
[15:50:56] <wikibugs>	 (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[15:51:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet
[15:51:26] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - 10https://gerrit.wikimedia.org/r/664256
[15:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:39] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - 10https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963)
[15:51:47] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - 10https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963)
[15:52:15] <wikibugs>	 10SRE, 10Patch-For-Review: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10fgiunchedi) Note that the elastic 5 "not found" errors seem flappy, I just got a `checkupdate` run without those errors
[15:53:19] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/664257
[15:53:26] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/664257
[15:53:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet
[15:53:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:37] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963)
[15:53:49] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - 10https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[15:53:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add a comment to the snapshot block [puppet] - 10https://gerrit.wikimedia.org/r/664303 (owner: 10Muehlenhoff)
[15:57:26] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev"      This reverts commit 5ca98c9df08f6c6e2d97bc7b6279cdaf573eddce.      Reason for revert: rebuilding the cloudgw setup      Bug: T272963 Change-Id: I8185f4fa36a70255940d78db45b0f50cfc6abb98 Signed-off-by: Arturo Borrero Gonzalez <aborrero@wikimedia.org> [puppet] - 10https://gerrit.wikimedia.org/r/664257 (https://phabricator.wi
[15:58:00] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet
[15:58:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:12] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963)
[15:58:20] <wikibugs>	 10SRE, 10SRE-tools, 10User-Joe: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (10jijiki)
[16:02:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[16:04:06] <wikibugs>	 (03CR) 10Volans: "Thanks for the refactor, some comments inline, some already discussed over IRC." (0314 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro)
[16:04:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 3 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (10Gehel)
[16:05:18] <logmsgbot>	 !log aborrero@cumin2001 START - Cookbook sre.hosts.reboot-single for host cloudnet2003-dev.codfw.wmnet
[16:05:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:56] <wikibugs>	 10SRE: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (10MoritzMuehlenhoff)
[16:07:37] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage2001.codfw.wmnet
[16:07:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:49] <logmsgbot>	 !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2003-dev.codfw.wmnet
[16:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:23] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[16:11:29] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963)
[16:11:55] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:12:12] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2001.codfw.wmnet
[16:12:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:57] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage2002.codfw.wmnet
[16:13:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:35] <hoo>	 !log Updated the Wikidata property suggester with data from the 2021-02-01 JSON dump (with pre-applied T132839 workarounds)
[16:14:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:40] <stashbot>	 T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[16:16:34] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963)
[16:18:20] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[16:18:35] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2002.codfw.wmnet
[16:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add a comment to the snapshot block [puppet] - 10https://gerrit.wikimedia.org/r/664303 (owner: 10Muehlenhoff)
[16:22:14] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet
[16:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:53] <wikibugs>	 10SRE: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (10Volans) p:05Triage→03High a:03Volans
[16:25:11] <wikibugs>	 (03PS1) 10Volans: interface automation: fix typo in method name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664308 (https://phabricator.wikimedia.org/T274802)
[16:26:03] <jayme>	 !log rolled back linkrecommendation helm releases to the most recent revision running chart version linkrecommendation-0.0.4 on clusters codfw and eqiad (cc: kostajh)
[16:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:09] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage1001.eqiad.wmnet
[16:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:09] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "self merging as it's just a typo, will run the script against bast3004 manually to verify it" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664308 (https://phabricator.wikimedia.org/T274802) (owner: 10Volans)
[16:32:38] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1001.eqiad.wmnet
[16:32:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:48] <volans>	 !log restarted netbox on netbox1001
[16:33:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:18] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:42] <wikibugs>	 (03PS1) 10Volans: interface automation: fix typo in method name (2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802)
[16:37:12] <volans>	 mmmh icinga, are you sure? it's all good there, it was me and was already fixed
[16:37:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] interface automation: fix typo in method name (2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802) (owner: 10Volans)
[16:37:56] <wikibugs>	 (03PS2) 10Volans: interface automation: fix typo in method name (2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802)
[16:39:57] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage1002.eqiad.wmnet
[16:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:06] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Typo fix." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802) (owner: 10Volans)
[16:40:14] <wikibugs>	 (03PS1) 10Kosta Harlan: linkrecommendation: Set backoffLimit to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893)
[16:40:14] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:40:45] <jayme>	 ^ that's "expected" (kind of) from reboots
[16:41:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Set backoffLimit to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[16:41:40] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:43:00] <wikibugs>	 (03PS2) 10Kosta Harlan: linkrecommendation: Set backoffLimit to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893)
[16:43:18] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:44:44] <wikibugs>	 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Gehel) Removing discovery-search, if you need our help again, please ping us!
[16:46:44] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1002.eqiad.wmnet
[16:46:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:30] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:48:50] <wikibugs>	 10SRE, 10Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (10Volans) a:05Volans→03crusnov @crusnov passing it over to you. I've fixed the basic typos, but the problem now is that the scri...
[16:49:43] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: switch data place interface config modes to manual [puppet] - 10https://gerrit.wikimedia.org/r/664311 (https://phabricator.wikimedia.org/T272963)
[16:49:51] <wikibugs>	 10SRE, 10Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (10crusnov) That seems reasonable, I'll look at it and get a patch out soonish.
[16:52:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: switch data place interface config modes to manual [puppet] - 10https://gerrit.wikimedia.org/r/664311 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[16:53:09] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:57:37] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:00:58] <wikibugs>	 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10akosiaris) Thanks for this task!  So I 've studied the diagrams a bit, they are helpful.  The deployment pipeline definitely suppor...
[17:03:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Just to confirm - this will keep the cloudera components but clear all the pull-specific bits. If so, big +1, thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) (owner: 10Muehlenhoff)
[17:16:13] <wikibugs>	 (03CR) 10Elukey: "John thanks a lot for the review! For this particular use case, I'd prefer to just move the existing code base to the class api and then m" [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey)
[17:27:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] hadoop: update the HDFS Namenode rack configuration [puppet] - 10https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795) (owner: 10Elukey)
[17:28:16] <wikibugs>	 (03PS1) 10Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573)
[17:28:18] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182)
[17:29:54] <wikibugs>	 (03Abandoned) 10Jcrespo: jessie: Remove old openssl override after revert to package version [puppet] - 10https://gerrit.wikimedia.org/r/660857 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo)
[17:30:04] <wikibugs>	 (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[17:32:07] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Set backoffLimit to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[17:32:43] <wikibugs>	 (03PS10) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412)
[17:32:43] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:16] <wikibugs>	 (03CR) 10David Caro: "Done all the changes as requested" (0313 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro)
[17:39:15] <wikibugs>	 (03CR) 10Jcrespo: "Have you tested backups with the script on etcd3? I don't see anything, like a path, completely wrong, but I don't know enough about what " [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo)
[17:41:17] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10jcrespo) I've sent: https://gerrit.wikimedia.org/r/c/operations/puppet/+/664313  Independently of the pace of upgrading, we should give some priority to generating fresh backups from the...
[17:43:56] <wikibugs>	 (03PS2) 10Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573)
[17:44:23] <wikibugs>	 (03PS3) 10Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573)
[17:55:42] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: interfaces: relax check on routing setup by using 'onlink' [puppet] - 10https://gerrit.wikimedia.org/r/664317 (https://phabricator.wikimedia.org/T272963)
[17:57:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: interfaces: relax check on routing setup by using 'onlink' [puppet] - 10https://gerrit.wikimedia.org/r/664317 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez)
[17:59:36] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) (owner: 10Muehlenhoff)
[18:00:04] <jouncebot>	 ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1800).
[18:05:14] <wikibugs>	 (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[18:10:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey)
[18:14:52] <wikibugs>	 10SRE, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo)
[18:15:15] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo)
[18:15:40] <wikibugs>	 (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[18:15:41] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) 05Open→03Resolved Regarding the last 2 points, we have, in a way, done the last point "parametrize better the jobdefaults i...
[18:17:39] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:28:38] <wikibugs>	 (03PS1) 10Effie Mouzeli: (WIP) mediawiki::alerts add alert when 20% of servers is saturated [puppet] - 10https://gerrit.wikimedia.org/r/664319 (https://phabricator.wikimedia.org/T267176)
[18:33:52] <wikibugs>	 (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[18:41:27] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[18:41:47] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[18:45:40] <jynus>	 that looks like DPLA bot on commons
[18:46:29] <jynus>	 I see no issues, but keep an eye out in case something degrades (thumbnail generation, codfw s4 replication, etc.)
[18:47:54] <jynus>	 that's 10 1MB files per second
[18:48:16] <tabbycat>	 jynus: is swift TimedMediaHandler, or just the place where uploads are being stored?
[18:49:21] <jynus>	 swift is our OpenStack Swift cluster, our backend storage for media and rendered stuff: https://wikitech.wikimedia.org/wiki/Swift
[18:49:59] <jynus>	 the alert is just a warning about a high rate of uploads; that doesn't mean there is a problem, but it is an unusual state
[18:50:23] <jynus>	 normally we worry when it is very low, because it means there is a problem with uploads
[19:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1900). Please do the needful.
[19:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:00:56] <Urbanecm>	 jynus: do we want to do T248177?
[19:00:56] <stashbot>	 T248177: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177
[19:01:29] <Urbanecm>	 (but 999 uploads per second is effectively no rate limit anyway :/ )
[19:02:09] <tabbycat>	 999/s is o_O
[19:03:32] <tabbycat>	 IIRC there is/was an UploadStash for large or batch uploads Urbanecm ?
[19:04:10] <Urbanecm>	 there's still uploadstash, dunno if it helps with ratelimited uploads
[19:11:01] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[19:21:03] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[19:28:58] <wikibugs>	 (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)
[19:31:51] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664332 (https://phabricator.wikimedia.org/T274802) (owner: 10CRusnov)
[20:10:06] <wikibugs>	 (03PS1) 10Ladsgroup: [DNM] Test jenkins new rule on banning use of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/664350
[20:11:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [DNM] Test jenkins new rule on banning use of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/664350 (owner: 10Ladsgroup)
[20:25:00] <wikibugs>	 (03Abandoned) 10Ladsgroup: [DNM] Test jenkins new rule on banning use of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/664350 (owner: 10Ladsgroup)
[20:30:51] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10leila) approved. Thank you for your support!
[20:46:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1097 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:46:24] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1097 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T274819 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:46:27] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (10ops-monitoring-bot)
[20:47:01] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (10Peachey88)
[21:00:04] <jouncebot>	 chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T2100).
[21:51:52] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[21:52:04] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[22:00:04] <jouncebot>	 Reedy and sbassett: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T2200).
[22:50:50] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Code looks good to me, please test it on netbox-next to be sure." (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664332 (https://phabricator.wikimedia.org/T274802) (owner: 10CRusnov)
[22:52:34] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on an-worker1097 is CRITICAL: cluster=analytics device=sat+megaraid,13 instance=an-worker1097 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1097&var-datasource=eqiad+prometheus/ops
[23:31:52] <wikibugs>	 (03CR) 10Gergő Tisza: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan)