[01:07:53] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:13:05] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:17] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:25] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:55] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 5870 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:33:31] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:42:29] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:47:15] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:29] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:39] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:35:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:43:13] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:23] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:27] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[05:04:03] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[06:02:28] SRE, DBA: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (Marostegui) p:Triage→Medium a:Kormat Yeah, as far as I remember we're not using this for anything Assigning it for Stevie for confirmation and removal (if that applies)
[06:10:23] SRE, ops-eqiad, DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (Marostegui) Thanks everyone who responded to this incident!
[06:17:33] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) >>! In T258361#6822070, @jcrespo wrote: > I am taking db1163 to, at least temporarily, substitute db1134 due to T274472. Thanks. I...
[06:19:15] (PS1) Marostegui: db1162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/664087 (https://phabricator.wikimedia.org/T258361)
[06:20:14] (CR) Marostegui: [C: +2] db1162: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/664087 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:36:31] (PS1) Marostegui: instances.yaml: Add db1162 to dbctl [puppet] - https://gerrit.wikimedia.org/r/664088 (https://phabricator.wikimedia.org/T258361)
[06:37:05] (CR) Marostegui: [C: +2] instances.yaml: Add db1162 to dbctl [puppet] - https://gerrit.wikimedia.org/r/664088 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:40:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1162 to dbctl - depooled T258361', diff saved to https://phabricator.wikimedia.org/P14339 and previous config saved to /var/cache/conftool/dbconfig/20210215-064001-marostegui.json
[06:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:08] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:46:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1162 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14340 and previous config saved to /var/cache/conftool/dbconfig/20210215-064628-marostegui.json
[06:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:33] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:56:50] (PS1) Marostegui: install_server: Do not reimage db1162 and db1163 [puppet] - https://gerrit.wikimedia.org/r/664089
[06:57:31] (CR) Marostegui: [C: +2] install_server: Do not reimage db1162 and db1163 [puppet] - https://gerrit.wikimedia.org/r/664089 (owner: Marostegui)
[06:58:07] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1162 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14341 and previous config saved to /var/cache/conftool/dbconfig/20210215-070206-marostegui.json
[07:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:12] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[07:09:46] SRE, ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (elukey) Resolved→Open ms-be1034 is down again, same issue as the one described by Filippo... :(
[07:10:31] ACKNOWLEDGEMENT - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T274488
[07:14:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
[07:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
[07:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:41] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet
[07:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet
[07:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:21] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1009.eqiad.wmnet
[07:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1009.eqiad.wmnet
[07:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:21] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet
[07:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet
[07:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:23] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:54] !log elukey@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes - elukey@cumin1001
[07:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:33] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:47:21] (PS1) ArielGlenn: wikidata json dumps: re-add source of shared functions [puppet] - https://gerrit.wikimedia.org/r/664090
[07:48:16] (CR) ArielGlenn: [C: +2] wikidata json dumps: re-add source of shared functions [puppet] - https://gerrit.wikimedia.org/r/664090 (owner: ArielGlenn)
[07:49:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 3%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14342 and previous config saved to /var/cache/conftool/dbconfig/20210215-074932-root.json
[07:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:16] (PS1) ArielGlenn: now that snapshot1005 is testbed host, make snapshot1007 the enwiki dumps runner [puppet] - https://gerrit.wikimedia.org/r/664091 (https://phabricator.wikimedia.org/T269377)
[08:04:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 4%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14343 and previous config saved to /var/cache/conftool/dbconfig/20210215-080435-root.json
[08:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:33] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:21] (CR) ArielGlenn: [C: +2] now that snapshot1005 is testbed host, make snapshot1007 the enwiki dumps runner [puppet] - https://gerrit.wikimedia.org/r/664091 (https://phabricator.wikimedia.org/T269377) (owner: ArielGlenn)
[08:10:50] (PS1) ArielGlenn: prep snapshot1005 and 1006 for reinstall with buster [puppet] - https://gerrit.wikimedia.org/r/664092 (https://phabricator.wikimedia.org/T269377)
[08:13:14] (CR) ArielGlenn: [C: +2] prep snapshot1005 and 1006 for reinstall with buster [puppet] - https://gerrit.wikimedia.org/r/664092 (https://phabricator.wikimedia.org/T269377) (owner: ArielGlenn)
[08:17:33] SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` snapshot1005.eqiad.wmnet ` The log can be fo...
[08:19:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14344 and previous config saved to /var/cache/conftool/dbconfig/20210215-081940-root.json
[08:26:51] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:27:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 T274235', diff saved to https://phabricator.wikimedia.org/P14345 and previous config saved to /var/cache/conftool/dbconfig/20210215-082718-marostegui.json
[08:27:47] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:05] !log powercycle wdqs1009
[08:29:22] (PS1) Marostegui: db1075: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/664093 (https://phabricator.wikimedia.org/T274235)
[08:29:24] (PS1) Elukey: profile::hadoop::backup::namenode: add a more precise notes_url [puppet] - https://gerrit.wikimedia.org/r/664094
[08:29:25] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1005.eqiad.wmnet with reason: REIMAGE
[08:30:06] (CR) Marostegui: [C: +2] db1075: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/664093 (https://phabricator.wikimedia.org/T274235) (owner: Marostegui)
[08:31:30] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1005.eqiad.wmnet with reason: REIMAGE
[08:31:48] (PS1) JMeybohm: tiller: Run tiller as user nobody [docker-images/production-images] - https://gerrit.wikimedia.org/r/664095 (https://phabricator.wikimedia.org/T274254)
[08:31:50] (PS1) JMeybohm: eventrouter: Use numeric UID [docker-images/production-images] - https://gerrit.wikimedia.org/r/664096 (https://phabricator.wikimedia.org/T274254)
[08:31:52] (PS1) JMeybohm: fluent-bit: Use numeric UID [docker-images/production-images] - https://gerrit.wikimedia.org/r/664097 (https://phabricator.wikimedia.org/T274254)
[08:31:57] (PS1) JMeybohm: ratelimit: Use numeric UID [docker-images/production-images] - https://gerrit.wikimedia.org/r/664098 (https://phabricator.wikimedia.org/T274254)
[08:32:20] (CR) Elukey: [C: +2] profile::hadoop::backup::namenode: add a more precise notes_url [puppet] - https://gerrit.wikimedia.org/r/664094 (owner: Elukey)
[08:34:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14346 and previous config saved to /var/cache/conftool/dbconfig/20210215-083444-root.json
[08:44:12] (PS1) Elukey: hadoop: enable HDFS service port for Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629)
[08:45:24] (CR) JMeybohm: [C: +1] "Nice!" [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[08:47:53] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28056/console" [puppet] - https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629) (owner: Elukey)
[08:48:01] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[08:49:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 15%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14347 and previous config saved to /var/cache/conftool/dbconfig/20210215-084947-root.json
[08:50:59] ops-eqiad, DC-Ops, Wikidata, Wikidata-Query-Service: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (Gehel)
[08:53:53] SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1005.eqiad.wmnet'] ` and were **ALL** successful.
[08:58:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes - elukey@cumin1001
[09:01:22] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[09:01:30] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[09:01:32] (CR) Elukey: [V: +1 C: +2] hadoop: enable HDFS service port for Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629) (owner: Elukey)
[09:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 20%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14348 and previous config saved to /var/cache/conftool/dbconfig/20210215-090451-root.json
[09:05:56] (PS1) Joal: Update oozie sharelib creation [puppet] - https://gerrit.wikimedia.org/r/664172 (https://phabricator.wikimedia.org/T274322)
[09:06:00] (CR) JMeybohm: [C: -1] "You do mix list indention styles a bit, don't know if we should argue about it or just leave it be." (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/651757 (owner: Giuseppe Lavagetto)
[09:06:03] elukey: --^
[09:06:06] for when you have time
[09:07:48] (CR) Elukey: [C: +2] Update oozie sharelib creation [puppet] - https://gerrit.wikimedia.org/r/664172 (https://phabricator.wikimedia.org/T274322) (owner: Joal)
[09:11:52] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[09:12:50] SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` snapshot1006.eqiad.wmnet ` The log can be fo...
[09:13:58] (PS1) Filippo Giunchedi: grafana: stop POST to /api/snapshots [puppet] - https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736)
[09:15:13] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28057/console" [puppet] - https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: Filippo Giunchedi)
[09:15:53] (CR) Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)
[09:17:11] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: Filippo Giunchedi)
[09:17:49] (CR) Filippo Giunchedi: [V: +1 C: +2] grafana: stop POST to /api/snapshots [puppet] - https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: Filippo Giunchedi)
[09:19:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14349 and previous config saved to /var/cache/conftool/dbconfig/20210215-091955-root.json
[09:24:00] (PS1) ArielGlenn: misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377)
[09:24:02] (CR) David Caro: "Got a couple questions, nits you can safely ignore :)" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: Arturo Borrero Gonzalez)
[09:24:20] (PS2) JMeybohm: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663873 (https://phabricator.wikimedia.org/T274262) (owner: PipelineBot)
[09:24:39] (CR) Ayounsi: [C: +2] Remove sampling feature flag [homer/public] - https://gerrit.wikimedia.org/r/663533 (owner: Ayounsi)
[09:25:47] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1006.eqiad.wmnet with reason: REIMAGE
[09:26:49] (CR) Ayounsi: "confirmed NOOP." [homer/public] - https://gerrit.wikimedia.org/r/663533 (owner: Ayounsi)
[09:27:52] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1006.eqiad.wmnet with reason: REIMAGE
[09:28:41] (PS1) Vgutierrez: admin: Add christinedk user [puppet] - https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304)
[09:28:43] (PS1) Vgutierrez: admin: Add christinedk to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/664227 (https://phabricator.wikimedia.org/T274304)
[09:34:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 30%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14350 and previous config saved to /var/cache/conftool/dbconfig/20210215-093458-root.json
[09:35:26] (CR) Muehlenhoff: admin: Add christinedk user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304) (owner: Vgutierrez)
[09:37:27] SRE, observability: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662 (Volans) If we allow for normal reboots going unnoticed, would we catch a scenario in which the icinga host reboots every 5 minutes due to a bug or DoS? P.S. Keyholder is not armed aft...
[09:43:50] !log roll restart HDFS daemons in Analytics Hadoop to pick up new RPC queue changes - T273629
[09:47:55] (CR) Volans: "Optional nit inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663860 (owner: Hnowlan)
[09:50:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 40%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14351 and previous config saved to /var/cache/conftool/dbconfig/20210215-095002-root.json
[09:50:41] SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1006.eqiad.wmnet'] ` and were **ALL** successful.
[09:55:54] (PS1) Jcrespo: Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - https://gerrit.wikimedia.org/r/663961
[09:56:15] (PS2) Jcrespo: Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - https://gerrit.wikimedia.org/r/663961
[09:57:14] SRE, Dumps-Generation, Platform Engineering, serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ArielGlenn) I was not going to re-image snapshot1005 and 6 because their replacements were due to have come in, but the boxes have not arrived yet a...
[09:57:18] SRE: Create cookbook to add a node to a Ganeti cluster - https://phabricator.wikimedia.org/T274527 (MoritzMuehlenhoff) p:Triage→Medium
[09:57:34] (PS2) ArielGlenn: misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377)
[09:57:51] SRE, Packaging: Copy cassandra packages to buster-wikimedia - https://phabricator.wikimedia.org/T274119 (MoritzMuehlenhoff) p:Triage→Medium
[09:58:12] (CR) Jcrespo: [C: +2] Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - https://gerrit.wikimedia.org/r/663961 (owner: Jcrespo)
[09:59:02] (CR) ArielGlenn: [C: +2] misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377) (owner: ArielGlenn)
[10:00:12] jynus: may I merge your puppet patch "backup::set { 'mysql-srv-backups-dumps-latest':" etc?
[10:00:17] yes
[10:00:41] done!
[10:00:44] thanks
[10:02:02] thanks for the quick response!
[10:05:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14352 and previous config saved to /var/cache/conftool/dbconfig/20210215-100505-root.json
[10:09:14] !log Switching Jenkins jobs to Quibble 0.0.46
[10:15:52] SRE, ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (fgiunchedi) Thank you for all the work ! LMK how I can help e.g. if speeding up the decom of one host in T272836 would help (as opposed as decom'ing all hosts at the same time)
[10:20:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 60%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14353 and previous config saved to /var/cache/conftool/dbconfig/20210215-102009-root.json
[10:23:30] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host netmon1002.wikimedia.org
[10:27:29] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1002.wikimedia.org
[10:30:08] (CR) Marostegui: [C: +2] db1134: Do not be tag as candidate master [puppet] - https://gerrit.wikimedia.org/r/664230 (https://phabricator.wikimedia.org/T274472) (owner: Marostegui)
[10:31:09] (PS1) Arturo Borrero Gonzalez: dumps: distribution: nfs: allow establishing connections with TCP ports > 1024 [puppet] - https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397)
[10:35:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 70%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14355 and previous config saved to /var/cache/conftool/dbconfig/20210215-103512-root.json
[10:41:25] (PS2) Arturo Borrero Gonzalez: dumps: distribution: nfs: allow establishing connections with TCP ports >= 1024 [puppet] - https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397)
[10:44:09] (CR) Arturo Borrero Gonzalez: [C: +2] dumps: distribution: nfs: allow establishing connections with TCP ports >= 1024 [puppet] - https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397) (owner: Arturo Borrero Gonzalez)
[10:47:08] (PS1) Arturo Borrero Gonzalez: labstore: allow NFS connections from public cloud networks [puppet] - https://gerrit.wikimedia.org/r/664233 (https://phabricator.wikimedia.org/T272397)
[10:48:49] (CR) Arturo Borrero Gonzalez: [C: +2] labstore: allow NFS connections from public cloud networks [puppet] - https://gerrit.wikimedia.org/r/664233 (https://phabricator.wikimedia.org/T272397) (owner: Arturo Borrero Gonzalez)
[10:49:05] (PS1) ArielGlenn: swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713)
[10:50:16] jouncebot: next
[10:50:16] In 0 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1130)
[10:50:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 80%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14356 and previous config saved to /var/cache/conftool/dbconfig/20210215-105016-root.json
[10:57:30] (PS2) ArielGlenn: swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713)
[10:57:59] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet
[10:58:44] (PS1) Jcrespo: Preventive commit for jynus to misspell "bullseye", next Debian version [puppet] - https://gerrit.wikimedia.org/r/664237
[10:58:59] (CR) ArielGlenn: [C: +2] swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713) (owner: ArielGlenn)
[11:00:25] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet
[11:02:02] (CR) Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/28058/" [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[11:03:17] (PS2) Hnowlan: mtail: add exception handling in tests for non-Debian OSes [puppet] - https://gerrit.wikimedia.org/r/663860
[11:05:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 90%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14357 and previous config saved to /var/cache/conftool/dbconfig/20210215-110519-root.json
[11:06:51] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[11:07:27] RECOVERY - tilerator on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[11:08:21] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:25] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps2007.codfw.wmnet
[11:11:57] (CR) Hnowlan: mtail: add exception handling in tests for non-Debian OSes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663860 (owner: Hnowlan)
[11:14:57] (PS1) Elukey: profile::hadoop::master: raise threshold for corrupt blocks [puppet] - https://gerrit.wikimedia.org/r/664238
[11:16:50] (CR) Elukey: [C: +2] profile::hadoop::master: raise threshold for corrupt blocks [puppet] - https://gerrit.wikimedia.org/r/664238 (owner: Elukey)
[11:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14358 and previous config saved to /var/cache/conftool/dbconfig/20210215-112023-root.json
[11:27:16] (PS4) Arturo Borrero Gonzalez: cloud: drop NAT exceptions for dumps NFS [puppet] - https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397)
[11:28:11] !log elukey@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes - elukey@cumin1001
[11:28:44] this may trigger (I hope not) AQS alerts --^
[11:28:52] in case it is my fault and you can blame me [11:29:05] * elukey sees kormat ready for it [11:29:31] * kormat nods solemnly [11:29:57] (03CR) 10Hnowlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [11:30:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1130). [11:32:31] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: move common hiera into proper file [puppet] - 10https://gerrit.wikimedia.org/r/664241 (https://phabricator.wikimedia.org/T272963) [11:33:13] (03CR) 10Jbond: "See comments inline, also wonder if you considered using pathlib for the file operations." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:33:17] (03PS4) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [11:33:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: move common hiera into proper file [puppet] - 10https://gerrit.wikimedia.org/r/664241 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [11:34:50] (03PS5) 10Arturo Borrero Gonzalez: cloud: drop NAT exceptions for dumps NFS [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397) [11:37:34] !log reimaging bast5001 to buster [11:45:23] (03CR) 10Jbond: "Adding Andrew to approve privatedata-users access" [puppet] - 10https://gerrit.wikimedia.org/r/664227 (https://phabricator.wikimedia.org/T274304) (owner: 10Vgutierrez) [11:52:45] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1007.eqiad.wmnet [11:54:09] (03CR) 
Jbond: "see comments" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663993 (owner: Urbanecm) [11:55:13] (CR) Urbanecm: Update urbanecm's dotfiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663993 (owner: Urbanecm) [11:55:23] (PS2) Urbanecm: Update urbanecm's dotfiles [puppet] - https://gerrit.wikimedia.org/r/663993 [11:56:00] (CR) Jbond: [C: +2] Update urbanecm's dotfiles [puppet] - https://gerrit.wikimedia.org/r/663993 (owner: Urbanecm) [11:56:21] Urbanecm: ^^ merged [11:56:24] thanks jbond42 ! [11:56:28] :) np [11:58:52] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1007.eqiad.wmnet [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1200). [12:00:05] No GERRIT patches in the queue for this window AFAICS. 
[12:00:14] I'll deploy regardless [12:01:12] (CR) Urbanecm: [C: +2] Revert "Revert "Enable SandboxLink at viwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/663736 (https://phabricator.wikimedia.org/T272796) (owner: Urbanecm) [12:02:46] (Merged) jenkins-bot: Revert "Revert "Enable SandboxLink at viwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/663736 (https://phabricator.wikimedia.org/T272796) (owner: Urbanecm) [12:04:02] (PS1) Effie Mouzeli: hiera: install memcached 1.6 on mc1037 [puppet] - https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) [12:06:36] (CR) Jbond: [C: +1] "thanks this will also be a big help to me 😊" [puppet] - https://gerrit.wikimedia.org/r/664237 (owner: Jcrespo) [12:07:47] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast5001.wikimedia.org with reason: REIMAGE [12:07:55] (PS22) Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [12:08:54] can someone check mwdebug1002.eqiad.wmnet status, and remove it from scap if it is still broken (as mutante said in ops list)? 
[12:09:16] (CR) Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/28065/mc2037.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) (owner: Effie Mouzeli) [12:09:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast5001.wikimedia.org with reason: REIMAGE [12:09:59] (PS2) Muehlenhoff: Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - https://gerrit.wikimedia.org/r/662918 [12:10:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 662d5f6af01f6cf6ce7e9d56cf1bc3ba282afee1: Revert "Revert "Enable SandboxLink at viwiki"" (T272796) (duration: 05m 26s) [12:10:41] finally [12:11:36] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (MoritzMuehlenhoff) Also needs approval by @Ottomata for Hadoop access. 
[12:13:39] (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [12:14:25] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [12:15:59] (03Merged) 10jenkins-bot: linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [12:16:19] (03CR) 10Vgutierrez: [C: 03+1] delete class tlsproxy::prometheus and nginx template [puppet] - 10https://gerrit.wikimedia.org/r/659377 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [12:16:21] (03PS2) 10Urbanecm: ukwikisource: Finish removal of NS Translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628) [12:16:24] (03CR) 10Urbanecm: [C: 03+2] ukwikisource: Finish removal of NS Translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628) (owner: 10Urbanecm) [12:17:21] (03Merged) 10jenkins-bot: ukwikisource: Finish removal of NS Translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628) (owner: 10Urbanecm) [12:17:27] (03CR) 10Elukey: [C: 03+1] "left a nit for the commit msg, LGTM otherwise!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [12:18:18] (03CR) 10Elukey: [C: 03+1] "Effie can you run a pcc to see if everything looks good?" 
[puppet] - https://gerrit.wikimedia.org/r/663868 (https://phabricator.wikimedia.org/T270315) (owner: Effie Mouzeli) [12:18:47] !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [12:21:30] repeating myself: can someone depool mwdebug1002? it's currently down (see mail from dzahn in ops list), but still pooled and thus in scap dsh group :/ [12:22:25] SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (MoritzMuehlenhoff) p:Triage→Medium [12:23:45] SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (MoritzMuehlenhoff) Adding a few tags for affected sub teams, simply untag when completed [12:24:38] SRE, Analytics, observability, serviceops, cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (MoritzMuehlenhoff) [12:25:33] (CR) Volans: "quick direct reply, will have a pass later" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: CRusnov) [12:25:55] (CR) Arturo Borrero Gonzalez: "Thanks for the review!" 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [12:30:51] (03PS1) 10JMeybohm: admin: Allow tiller to create batch ressources [deployment-charts] - 10https://gerrit.wikimedia.org/r/664273 [12:32:02] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] admin: Allow tiller to create batch ressources [deployment-charts] - 10https://gerrit.wikimedia.org/r/664273 (owner: 10JMeybohm) [12:32:29] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet [12:33:32] (03Merged) 10jenkins-bot: admin: Allow tiller to create batch ressources [deployment-charts] - 10https://gerrit.wikimedia.org/r/664273 (owner: 10JMeybohm) [12:35:00] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [12:35:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cdf15981f7c6f7e02a3fb1c1ce61dc14815f216d: ukwikisource: Finish removal of NS Translations (T270628) (duration: 01m 07s) [12:36:24] (03PS1) 10Elukey: Add/Fix kerberos fake keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/664274 (https://phabricator.wikimedia.org/T274392) [12:36:46] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add/Fix kerberos fake keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/664274 (https://phabricator.wikimedia.org/T274392) (owner: 10Elukey) [12:37:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes - elukey@cumin1001 [12:37:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 2 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:38:28] (03CR) 10David Caro: [C: 03+1] cloudgw: introduce HA by using keepalived/VRRP (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo 
Borrero Gonzalez) [12:38:36] (PS9) Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [12:38:38] SRE, observability, serviceops, Patch-For-Review, cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (elukey) [12:39:18] !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [12:40:12] (PS10) Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [12:43:59] !log reimaging bast4002 to buster [12:44:04] (PS11) Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [12:44:09] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[12:44:39] PROBLEM - etherpad_lite_process_running on etherpad1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [12:44:59] PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:45:53] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: connect to address 10.64.32.178 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [12:46:25] RECOVERY - etherpad_lite_process_running on etherpad1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [12:47:24] (03PS12) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [12:47:35] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9184 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [12:47:58] (03CR) 10Effie Mouzeli: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/663868 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [12:48:27] RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:49:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/28075/" [puppet] - 10https://gerrit.wikimedia.org/r/663823 
(https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [12:49:13] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [12:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 T273955', diff saved to https://phabricator.wikimedia.org/P14359 and previous config saved to /var/cache/conftool/dbconfig/20210215-124944-marostegui.json [12:50:24] (03PS2) 10David Caro: utils: add script to run docker ci tests locally [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) [12:50:27] (03PS1) 10Marostegui: db1093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/664276 (https://phabricator.wikimedia.org/T273955) [12:50:50] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [12:51:16] (03CR) 10Marostegui: [C: 03+2] db1093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/664276 (https://phabricator.wikimedia.org/T273955) (owner: 10Marostegui) [12:58:16] !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [12:58:16] !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [12:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:10] we lost a whole bunch of SAL messages because stashbot was out [13:01:12] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4002.wikimedia.org with reason: REIMAGE [13:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:21] is it worth repeating them all? 
[13:01:49] cc marostegui, ryankemper, ariel, elukey… [13:02:04] Lucas_WMDE: not from my side, thanks though! :) [13:02:10] ok [13:02:26] sometimes I do it but this seems to be almost 50 missed messages and I’m lazy :D [13:02:41] !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [13:02:41] !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [13:02:44] (they’re all in the IRC log) [13:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:50] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1162 is fully pooled [13:03:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4002.wikimedia.org with reason: REIMAGE [13:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:58] !log notice: stashbot had issues between 8:19 and 12:50, see for https://wm-bot.wmflabs.org/browser/index.php?start=02%2F15%2F2021&end=02%2F15%2F2021&display=%23wikimedia-operations for missed !log messages [13:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:54] !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836 [13:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:58] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [13:14:05] (03PS1) 10JMeybohm: linkrecommendation: Read DB_USER from public config [deployment-charts] - 10https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893) [13:14:16] ^ kostajh [13:14:58] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Read DB_USER 
from public config [deployment-charts] - https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893) (owner: JMeybohm) [13:15:30] jayme: cheers [13:17:35] (Merged) jenkins-bot: linkrecommendation: Read DB_USER from public config [deployment-charts] - https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893) (owner: JMeybohm) [13:19:28] !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [13:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:47] (PS4) Hnowlan: mtail: create separate metrics histogram based on endpoint [puppet] - https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) [13:22:04] (CR) Hnowlan: [V: +2 C: +2] tegola: Add docker image. [docker-images/production-images] - https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: Hnowlan) [13:28:57] (CR) Alexandros Kosiaris: "Shouldn't this instead be done via the pipeline? 
It would greatly decouple upgrading tegola from requiring an SRE to build newer versions " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan) [13:33:36] !log Stop MySQL on db1093 - T273955 [13:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:41] T273955: decommission db1093.eqiad.wmnet - https://phabricator.wikimedia.org/T273955 [13:34:02] (03PS5) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:34:39] (03CR) 10jerkins-bot: [V: 04-1] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:38:10] !log installing subversion security updates [13:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:38] (03PS6) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:43:11] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:55] (03PS2) 10Muehlenhoff: admin: Add christinedk user [puppet] - 10https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304) (owner: 10Vgutierrez) [13:48:03] (03PS7) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:48:13] (03CR) 10jerkins-bot: [V: 04-1] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 
(https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:53:00] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-reload [13:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:13] !log installing libonig security update for stretch [13:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:09] !log swift eqiad-prod: add weight back to sdg on ms-be1054 - T273582 [14:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:15] T273582: Put sdg1 on ms-be1054 back in service - https://phabricator.wikimedia.org/T273582 [14:10:43] 10SRE, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10fgiunchedi) 05Open→03Resolved I'm boldly resolving this again since limiting memory usage for object replication processes helped a whole lot to... [14:12:42] (03PS1) 10Urbanecm: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789) [14:13:04] jouncebot: now [14:13:05] No deployments scheduled for the next 3 hour(s) and 46 minute(s) [14:13:15] (03PS2) 10Urbanecm: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789) [14:13:18] (03CR) 10Urbanecm: [C: 03+2] Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789) (owner: 10Urbanecm) [14:14:07] (03Merged) 10jenkins-bot: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789) (owner: 10Urbanecm) [14:17:02] !log 
urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 00905c4a7e4bb69f39e52e1c4d4d6168006b0e7b: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T274789) (duration: 01m 09s) [14:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:07] T274789: Add to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T274789 [14:19:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:44] (03PS8) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [14:25:33] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:37] (03CR) 10David Caro: utils: add script to run docker ci tests locally (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [14:31:40] (03CR) 10Jbond: [C: 03+2] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [14:34:23] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:34:33] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:45:09] RECOVERY - Check systemd state on netbox1001 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:25] (03PS1) 10Jbond: Gemfile: increase dependency for wmf_style-stylegude-check [puppet] - 10https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953) [15:04:50] !log upgrade grafana to 7.4.1 on grafana1002 - T263747 [15:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] T263747: Upgrade Grafana to 7.4 - https://phabricator.wikimedia.org/T263747 [15:06:15] (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [15:06:27] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10MoritzMuehlenhoff) Also adding @Ottomata for approval for analytics-privatedata-users. [15:09:46] !log reimaging bast3004 to buster [15:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:06] (03PS1) 10Bartosz Dziewoński: CommentFormatter: Fix problems with editsection and quotes [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664254 (https://phabricator.wikimedia.org/T274709) [15:17:18] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet [15:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:21] (03CR) 10Jbond: "did a quick pass however im not that familiar with the current decom cook book" (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 (owner: 10Elukey) [15:20:05] (03PS1) 10Kormat: integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664300 [15:20:10] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5012 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:27:56] (03CR) 10Hashar: [C: 03+1] "Can be merged anytime, the CI job always does a gem update :]" [puppet] - 10https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond) [15:28:49] (03CR) 10Jbond: [C: 03+2] Gemfile: increase dependency for wmf_style-stylegude-check [puppet] - 10https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond) [15:30:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet [15:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:21] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5012 is OK: HTTP OK: HTTP/1.0 200 OK - 23547 bytes in 0.829 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:33:08] !log installing linux-4.19 update for Stretch on servers which have it installed (no reboots, just updating the kernels) [15:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:35] (03CR) 10Kormat: [C: 03+2] integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664300 (owner: 10Kormat) [15:34:16] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast3004.wikimedia.org with reason: REIMAGE [15:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:30] (03CR) 10Jcrespo: [C: 03+2] Preventive commit for jynus to misspell "bullseye", next Debian version [puppet] - 10https://gerrit.wikimedia.org/r/664237 (owner: 10Jcrespo) [15:36:11] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[15:36:12] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [15:36:12] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [15:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast3004.wikimedia.org with reason: REIMAGE [15:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:46] (03PS1) 10Jcrespo: testing test at test at testing [puppet] - 10https://gerrit.wikimedia.org/r/664301 [15:36:54] (03Merged) 10jenkins-bot: integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664300 (owner: 10Kormat) [15:38:02] (03CR) 10jerkins-bot: [V: 04-1] testing test at test at testing [puppet] - 10https://gerrit.wikimedia.org/r/664301 (owner: 10Jcrespo) [15:38:36] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet [15:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:49] (03CR) 10Jcrespo: "16:37:55 Typo found!" [puppet] - 10https://gerrit.wikimedia.org/r/664301 (owner: 10Jcrespo) [15:39:13] (03Abandoned) 10Jcrespo: testing test at test at testing [puppet] - 10https://gerrit.wikimedia.org/r/664301 (owner: 10Jcrespo) [15:39:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "1 pedantic comment but perhaps we can solve this more easily, see inline." 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/659863 (owner: 10JMeybohm) [15:39:52] 10SRE: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10fgiunchedi) [15:40:39] (03PS1) 10Elukey: hadoop: update the HDFS Namenode rack configuration [puppet] - 10https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795) [15:41:13] 10SRE: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10fgiunchedi) [15:44:52] (03CR) 10Alexandros Kosiaris: "+1, but perhaps we don't even need it? See dependent commit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/659864 (owner: 10JMeybohm) [15:45:07] (03PS1) 10Muehlenhoff: Add a comment to the snapshot block [puppet] - 10https://gerrit.wikimedia.org/r/664303 [15:46:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet [15:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:44] (03PS1) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - 10https://gerrit.wikimedia.org/r/664255 [15:46:53] (03PS2) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - 10https://gerrit.wikimedia.org/r/664255 [15:47:25] (03PS3) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - 10https://gerrit.wikimedia.org/r/664255 (https://phabricator.wikimedia.org/T272963) [15:47:45] (03PS2) 10Elukey: hadoop: update the HDFS Namenode rack configuration [puppet] - 10https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795) [15:48:09] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[15:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:54] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet [15:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - 10https://gerrit.wikimedia.org/r/664255 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [15:50:13] (03PS1) 10Muehlenhoff: Remove obsolete cloudera config from reprepro [puppet] - 10https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) [15:50:56] (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [15:51:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet [15:51:26] (03PS1) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - 10https://gerrit.wikimedia.org/r/664256 [15:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:39] (03PS2) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - 10https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963) [15:51:47] (03PS3) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - 10https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963) [15:52:15] 10SRE, 10Patch-For-Review: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10fgiunchedi) Note that the elastic 5 "not found" errors seem flappy, I just got a `checkupdate` run without those 
errors [15:53:19] (PS1) Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/664257 [15:53:26] (PS2) Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/664257 [15:53:34] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet [15:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:37] (PS3) Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963) [15:53:49] (CR) Arturo Borrero Gonzalez: [C: +2] Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963) (owner: Arturo Borrero Gonzalez) [15:53:57] (CR) Filippo Giunchedi: [C: +1] Add a comment to the snapshot block [puppet] - https://gerrit.wikimedia.org/r/664303 (owner: Muehlenhoff) [15:57:26] (PS4) Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" This reverts commit 5ca98c9df08f6c6e2d97bc7b6279cdaf573eddce. 
Reason for revert: rebuilding the cloudgw setup Bug: T272963 Change-Id: I8185f4fa36a70255940d78db45b0f50cfc6abb98 Signed-off-by: Arturo Borrero Gonzalez [puppet] - https://gerrit.wikimedia.org/r/664257 (https://phabricator.wi [15:58:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet [15:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:12] (PS5) Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963) [15:58:20] SRE, SRE-tools, User-Joe: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (jijiki) [16:02:38] (CR) Arturo Borrero Gonzalez: [C: +2] Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963) (owner: Arturo Borrero Gonzalez) [16:04:06] (CR) Volans: "Thanks for the refactor, some comments inline, some already discussed over IRC." 
(14 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: David Caro) [16:04:51] SRE, ops-eqiad, DC-Ops, Wikidata, and 3 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (Gehel) [16:05:18] !log aborrero@cumin2001 START - Cookbook sre.hosts.reboot-single for host cloudnet2003-dev.codfw.wmnet [16:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:56] SRE: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (MoritzMuehlenhoff) [16:07:37] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage2001.codfw.wmnet [16:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:49] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2003-dev.codfw.wmnet [16:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [16:11:29] (PS1) Arturo Borrero Gonzalez: cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963) [16:11:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:12:12] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2001.codfw.wmnet [16:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:57] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
kubestage2002.codfw.wmnet [16:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:35] !log Updated the Wikidata property suggester with data from the 2021-02-01 JSON dump (with pre-applied T132839 workarounds) [16:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:40] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [16:16:34] (PS2) Arturo Borrero Gonzalez: cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963) [16:18:20] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963) (owner: Arturo Borrero Gonzalez) [16:18:35] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2002.codfw.wmnet [16:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:08] (CR) Muehlenhoff: [C: +2] Add a comment to the snapshot block [puppet] - https://gerrit.wikimedia.org/r/664303 (owner: Muehlenhoff) [16:22:14] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet [16:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:53] SRE: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (Volans) p:Triage→High a:Volans [16:25:11] (PS1) Volans: interface automation: fix typo in method name [software/netbox-extras] - https://gerrit.wikimedia.org/r/664308 (https://phabricator.wikimedia.org/T274802) [16:26:03] !log rolled back linkrecommendation helm releases to the most recent 
revision running chart version linkrecommendation-0.0.4 on clusters codfw and eqiad (cc: kostajh) [16:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:09] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage1001.eqiad.wmnet [16:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:09] (CR) Volans: [C: +2] "self merging as it's just a typo, will run the script against bast3004 manually to verify it" [software/netbox-extras] - https://gerrit.wikimedia.org/r/664308 (https://phabricator.wikimedia.org/T274802) (owner: Volans) [16:32:38] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1001.eqiad.wmnet [16:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:48] !log restarted netbox on netbox1001 [16:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:18] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:42] (PS1) Volans: interface automation: fix typo in method name (2) [software/netbox-extras] - https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802) [16:37:12] mmmh icinga, are you sure? 
it's all good there, it was me and was already fixed [16:37:20] (CR) jerkins-bot: [V: -1] interface automation: fix typo in method name (2) [software/netbox-extras] - https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802) (owner: Volans) [16:37:56] (PS2) Volans: interface automation: fix typo in method name (2) [software/netbox-extras] - https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802) [16:39:57] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage1002.eqiad.wmnet [16:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:06] (CR) Volans: [C: +2] "Typo fix." [software/netbox-extras] - https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802) (owner: Volans) [16:40:14] (PS1) Kosta Harlan: linkrecommendation: Set backoffLimit to 1 [deployment-charts] - https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) [16:40:14] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:40:45] ^ thats "expected" (kind of) from reboots [16:41:29] (CR) jerkins-bot: [V: -1] linkrecommendation: Set backoffLimit to 1 [deployment-charts] - https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [16:41:40] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:43:00] (PS2) Kosta Harlan: linkrecommendation: Set backoffLimit to 1 [deployment-charts] - https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) [16:43:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:44] SRE, CAS-SSO, Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (Gehel) Removing discovery-search, if you need our help again, please ping us! [16:46:44] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1002.eqiad.wmnet [16:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:30] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:50] SRE, Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (Volans) a:Volans→crusnov @crusnov passing it over to you. I've fixed the basic typos, but the problem now is that the scri... [16:49:43] (PS1) Arturo Borrero Gonzalez: cloudgw: switch data place interface config modes to manual [puppet] - https://gerrit.wikimedia.org/r/664311 (https://phabricator.wikimedia.org/T272963) [16:49:51] SRE, Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (crusnov) That seems reasonable, I'll look at it and get a patch out soonish. 
[16:52:45] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: switch data place interface config modes to manual [puppet] - https://gerrit.wikimedia.org/r/664311 (https://phabricator.wikimedia.org/T272963) (owner: Arturo Borrero Gonzalez) [16:53:09] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:57:37] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:00:58] SRE, Maps, Product-Infrastructure-Team-Backlog, Services, Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (akosiaris) Thanks for this task! So I 've studied the diagrams a bit, they are helpful. The deployment pipeline definitely suppor... [17:03:18] (CR) Elukey: [C: +1] "Just to confirm - this will keep the cloudera components but clear all the pull-specific bits. If so, big +1, thanks :)" [puppet] - https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) (owner: Muehlenhoff) [17:16:13] (CR) Elukey: "John thanks a lot for the review! 
For this particular use case, I'd prefer to just move the existing code base to the class api and then m" [cookbooks] - https://gerrit.wikimedia.org/r/663878 (owner: Elukey) [17:27:06] (CR) Elukey: [C: +2] hadoop: update the HDFS Namenode rack configuration [puppet] - https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795) (owner: Elukey) [17:28:16] (PS1) Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) [17:28:18] (PS1) Jcrespo: bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) [17:29:54] (Abandoned) Jcrespo: jessie: Remove old openssl override after revert to package version [puppet] - https://gerrit.wikimedia.org/r/660857 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo) [17:30:04] (CR) Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan) [17:32:07] (CR) JMeybohm: [C: +1] linkrecommendation: Set backoffLimit to 1 [deployment-charts] - https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [17:32:43] (PS10) David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [17:32:43] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:16] (CR) David Caro: "Done all the changes as requested" (13 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 
David Caro) [17:39:15] (CR) Jcrespo: "Have you tested backups with the script on etcd3? I don't see anything, like a path, completely wrong, but I don't know enough about what " [puppet] - https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: Jcrespo) [17:41:17] SRE, serviceops, Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (jcrespo) I've sent: https://gerrit.wikimedia.org/r/c/operations/puppet/+/664313 Independently of the pace of upgrading, we should give some priority to generating fresh backups from the... [17:43:56] (PS2) Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) [17:44:23] (PS3) Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) [17:55:42] (PS1) Arturo Borrero Gonzalez: cloudgw: interfaces: relax check on routing setup by using 'onlink' [puppet] - https://gerrit.wikimedia.org/r/664317 (https://phabricator.wikimedia.org/T272963) [17:57:40] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: interfaces: relax check on routing setup by using 'onlink' [puppet] - https://gerrit.wikimedia.org/r/664317 (https://phabricator.wikimedia.org/T272963) (owner: Arturo Borrero Gonzalez) [17:59:36] (CR) Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) (owner: Muehlenhoff) [18:00:04] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1800). 
[18:05:14] (CR) Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan) [18:10:38] (CR) Jbond: [C: +1] "> Patch Set 1:" [cookbooks] - https://gerrit.wikimedia.org/r/663878 (owner: Elukey) [18:14:52] SRE, DBA, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) [18:15:15] SRE, Data-Persistence-Backup, Goal, Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (jcrespo) [18:15:40] (CR) Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan) [18:15:41] SRE, Data-Persistence-Backup, Goal, Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (jcrespo) Open→Resolved Regarding the last 2 points, we have, in a way, done the last point "parametrize better the jobdefaults i... 
[18:17:39] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:28:38] (PS1) Effie Mouzeli: (WIP) mediawiki::alerts add alert when 20% of servers is saturated [puppet] - https://gerrit.wikimedia.org/r/664319 (https://phabricator.wikimedia.org/T267176) [18:33:52] (CR) Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan) [18:41:27] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:41:47] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:45:40] that looks like DPLA bot on commons [18:46:29] I see no issues, but keep an eye in case something degrades (thumbnail generation, codfw s4 replication, etc.) [18:47:54] that's 10 1MB files per second [18:48:16] jynus: swift is TimedMediaHandler or just the place where uploads are being stored? 
[18:49:21] swift is our OpenStack Swift cluster, our backend storage for media and rendered stuff: https://wikitech.wikimedia.org/wiki/Swift [18:49:59] the alert is just a warning on a high rate of uploads- that doesn't mean there is a problem, but it is an unusual state [18:50:23] normally we worry when it is very low, because it means there is a problem with uploads [19:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1900). Please do the needful. [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:56] jynus: do we want to do T248177? [19:00:56] T248177: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177 [19:01:29] (but 999 uploads per second is effectively no rate limit anyway :/ ) [19:02:09] 999/s is o_O [19:03:32] IIRC there is/was an UploadStash for large or batch uploads Urbanecm ? [19:04:10] there's still uploadstash, dunno if it helps with ratelimited uploads [19:11:01] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [19:21:03] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:28:58] (CR) Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 
(https://phabricator.wikimedia.org/T269581) (owner: Hnowlan) [19:31:51] (CR) CRusnov: "This change is ready for review." [software/netbox-extras] - https://gerrit.wikimedia.org/r/664332 (https://phabricator.wikimedia.org/T274802) (owner: CRusnov) [20:10:06] (PS1) Ladsgroup: [DNM] Test jenkins new rule on banning use of hiera() [puppet] - https://gerrit.wikimedia.org/r/664350 [20:11:43] (CR) jerkins-bot: [V: -1] [DNM] Test jenkins new rule on banning use of hiera() [puppet] - https://gerrit.wikimedia.org/r/664350 (owner: Ladsgroup) [20:25:00] (Abandoned) Ladsgroup: [DNM] Test jenkins new rule on banning use of hiera() [puppet] - https://gerrit.wikimedia.org/r/664350 (owner: Ladsgroup) [20:30:51] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (leila) approved. Thank you for your support! [20:46:21] PROBLEM - MegaRAID on an-worker1097 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:46:24] ACKNOWLEDGEMENT - MegaRAID on an-worker1097 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T274819 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:46:27] SRE, ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (ops-monitoring-bot) [20:47:01] SRE, ops-eqiad, Analytics: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (Peachey88) [21:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T2100). 
[21:51:52] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [21:52:04] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [22:00:04] Reedy and sbassett: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T2200). [22:50:50] (CR) Volans: [C: +1] "Code looks good to me, please test it on netbox-next to be sure." (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/664332 (https://phabricator.wikimedia.org/T274802) (owner: CRusnov) [22:52:34] PROBLEM - Device not healthy -SMART- on an-worker1097 is CRITICAL: cluster=analytics device=sat+megaraid,13 instance=an-worker1097 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1097&var-datasource=eqiad+prometheus/ops [23:31:52] (CR) Gergő Tisza: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)