[00:00:48] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[00:02:58] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[00:04:16] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2397.codfw.wmnet with reason: REIMAGE
[00:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:15] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2397.codfw.wmnet with reason: REIMAGE
[00:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:28] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2021-03-30 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:10:44] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2398.codfw.wmnet with reason: REIMAGE
[00:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:47] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2398.codfw.wmnet with reason: REIMAGE
[00:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:45] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2397.codfw.wmnet'] ` and were **ALL** successful.
[00:15:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:17:48] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:19:35] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2398.codfw.wmnet'] ` and were **ALL** successful.
[00:22:00] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2399.codfw.wmnet ` The log can be found in `/...
[00:28:27] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2400.codfw.wmnet ` The log can be found in `/...
[00:36:31] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2399.codfw.wmnet with reason: REIMAGE
[00:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2399.codfw.wmnet with reason: REIMAGE
[00:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:00] SRE, Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (Ladsgroup) I imported pywikibot-bugs. Its mbox is 290MB and has 50,058 emails. Importing subscribers, etc. was easy and was done in...
[00:41:05] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[00:41:27] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[00:42:23] SRE, DBA, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (Ladsgroup) So with importing pywikibot-bugs (More details: T278609#6978553), the better estimation is 27GB size with 3.2GB/year growth (assuming linear growth)
[00:43:00] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2400.codfw.wmnet with reason: REIMAGE
[00:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:44] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (Papaul)
[00:44:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:49] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (Papaul)
[00:45:06] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2400.codfw.wmnet with reason: REIMAGE
[00:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:24] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2399.codfw.wmnet'] ` and were **ALL** successful.
[00:46:11] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2401.codfw.wmnet ` The log can be found in `/...
[00:50:50] (PS2) Jforrester: Disable LocalisationUpdate, part I [mediawiki-config] - https://gerrit.wikimedia.org/r/677325 (https://phabricator.wikimedia.org/T158360)
[00:50:52] (PS2) Jforrester: Disable LocalisationUpdate, part II [mediawiki-config] - https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360)
[00:50:54] (PS2) Jforrester: Disable LocalisationUpdate, part III [mediawiki-config] - https://gerrit.wikimedia.org/r/677327 (https://phabricator.wikimedia.org/T158360)
[00:50:56] (PS1) Jforrester: [BETA CLUSTER] Disable LocalisationUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/677385 (https://phabricator.wikimedia.org/T158360)
[00:51:58] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2400.codfw.wmnet'] ` and were **ALL** successful.
[00:53:19] (PS1) Papaul: DHCH partman Add cloudcephmod2004-dev [puppet] - https://gerrit.wikimedia.org/r/677407 (https://phabricator.wikimedia.org/T276509)
[00:54:48] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2402.codfw.wmnet ` The log can be found in `/...
[00:55:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy_failure_flags.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:24] (CR) Papaul: [C: +2] DHCH partman Add cloudcephmod2004-dev [puppet] - https://gerrit.wikimedia.org/r/677407 (https://phabricator.wikimedia.org/T276509) (owner: Papaul)
[01:01:27] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2401.codfw.wmnet with reason: REIMAGE
[01:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2401.codfw.wmnet with reason: REIMAGE
[01:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:36] (PS1) Papaul: Add cloudcephmon2004-dev to site with role insetup [puppet] - https://gerrit.wikimedia.org/r/677411 (https://phabricator.wikimedia.org/T276509)
[01:07:08] (CR) Papaul: [C: +2] Add cloudcephmon2004-dev to site with role insetup [puppet] - https://gerrit.wikimedia.org/r/677411 (https://phabricator.wikimedia.org/T276509) (owner: Papaul)
[01:08:42] (CR) Krinkle: [C: +1] "Yay!" [mediawiki-config] - https://gerrit.wikimedia.org/r/677385 (https://phabricator.wikimedia.org/T158360) (owner: Jforrester)
[01:09:32] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2402.codfw.wmnet with reason: REIMAGE
[01:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:45] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudcephmon20...
[01:10:17] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2401.codfw.wmnet'] ` and were **ALL** successful.
[01:10:25] (PS1) Ladsgroup: flaggedrevs: Disable quality and pristine tier in all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/677412 (https://phabricator.wikimedia.org/T277883)
[01:11:35] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2402.codfw.wmnet with reason: REIMAGE
[01:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:55] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2402.codfw.wmnet'] ` and were **ALL** successful.
[01:25:11] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2403.codfw.wmnet ` The log can be found in `/...
[01:25:17] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephmon2004-dev.codfw.wmnet'] ` Of which those **FAILED**: ` ['cloudcephmon200...
[01:28:46] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudcephmon2004-dev.codfw.wmnet ` T...
[01:36:44] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2404.codfw.wmnet ` The log can be found in `/...
[01:39:43] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2403.codfw.wmnet with reason: REIMAGE
[01:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:47] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2403.codfw.wmnet with reason: REIMAGE
[01:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:10] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: REIMAGE
[01:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:14] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: REIMAGE
[01:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:34] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2403.codfw.wmnet'] ` and were **ALL** successful.
[01:49:07] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2405.codfw.wmnet ` The log can be found in `/...
[01:51:16] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2404.codfw.wmnet with reason: REIMAGE
[01:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:53:08] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2404.codfw.wmnet with reason: REIMAGE
[01:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:58] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephmon2004-dev.codfw.wmnet'] ` and were **ALL** successful.
[01:56:40] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (Papaul)
[02:00:11] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2404.codfw.wmnet'] ` and were **ALL** successful.
[02:03:23] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (Papaul) Open→Resolved @Andrew this is complete
[02:03:39] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2406.codfw.wmnet ` The log can be found in `/...
[02:03:41] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2405.codfw.wmnet with reason: REIMAGE
[02:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:54] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2405.codfw.wmnet with reason: REIMAGE
[02:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:05] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2405.codfw.wmnet'] ` and were **ALL** successful.
[02:18:06] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2406.codfw.wmnet with reason: REIMAGE
[02:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:15] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:10] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2406.codfw.wmnet with reason: REIMAGE
[02:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:00] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2406.codfw.wmnet'] ` and were **ALL** successful.
[03:03:04] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2407.codfw.wmnet ` The log can be found in `/...
[03:12:14] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2408.codfw.wmnet ` The log can be found in `/...
[03:17:37] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2407.codfw.wmnet with reason: REIMAGE
[03:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:19:35] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2407.codfw.wmnet with reason: REIMAGE
[03:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:31] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2407.codfw.wmnet'] ` and were **ALL** successful.
[03:26:49] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2408.codfw.wmnet with reason: REIMAGE
[03:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:55] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2408.codfw.wmnet with reason: REIMAGE
[03:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:42] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2408.codfw.wmnet'] ` and were **ALL** successful.
[03:36:01] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (Papaul)
[04:11:39] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:13:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:24:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:28:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:31:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:37:01] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:51:19] (PS1) Krinkle: mc: Add 'wanRoutingPrefix' (replaces 'mcrouterAware' and 'cluster') [mediawiki-config] - https://gerrit.wikimedia.org/r/677418
[04:57:18] (PS1) Andrew Bogott: wmfkeystonehooks ldap groups: Handle groups with no members [puppet] - https://gerrit.wikimedia.org/r/677419 (https://phabricator.wikimedia.org/T279491)
[04:57:53] (CR) jerkins-bot: [V: -1] wmfkeystonehooks ldap groups: Handle groups with no members [puppet] - https://gerrit.wikimedia.org/r/677419 (https://phabricator.wikimedia.org/T279491) (owner: Andrew Bogott)
[04:58:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:58:40] (PS2) Andrew Bogott: wmfkeystonehooks ldap groups: Handle groups with no members [puppet] - https://gerrit.wikimedia.org/r/677419 (https://phabricator.wikimedia.org/T279491)
[05:00:33] (CR) Andrew Bogott: "cc'ing all of you in case you're interested in what was happening. I will test this in codfw1dev tomorrow when I'm more awake." [puppet] - https://gerrit.wikimedia.org/r/677419 (https://phabricator.wikimedia.org/T279491) (owner: Andrew Bogott)
[05:02:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:16] SRE, DBA, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (Marostegui) That's a very doable number, thanks @Ladsgroup!
[05:05:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1135 for schema change', diff saved to https://phabricator.wikimedia.org/P15185 and previous config saved to /var/cache/conftool/dbconfig/20210407-050530-marostegui.json
[05:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:09] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: wikimedia-discovery-golden.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:59] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:07:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020 for upgrade', diff saved to https://phabricator.wikimedia.org/P15186 and previous config saved to /var/cache/conftool/dbconfig/20210407-050758-marostegui.json
[05:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:13] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:16:03] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:18:23] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:23:40] (PS1) Marostegui: db1184: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/677420 (https://phabricator.wikimedia.org/T275633)
[05:29:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: Repool es1020', diff saved to https://phabricator.wikimedia.org/P15187 and previous config saved to /var/cache/conftool/dbconfig/20210407-052901-root.json
[05:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:23] SRE, DBA, Platform Engineering, Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (Marostegui) I was on holidays when all this happened, is there anything else to follow up with?
[05:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: Repool db1135 after schema change', diff saved to https://phabricator.wikimedia.org/P15188 and previous config saved to /var/cache/conftool/dbconfig/20210407-053940-root.json
[05:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:41:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020 for upgrade', diff saved to https://phabricator.wikimedia.org/P15189 and previous config saved to /var/cache/conftool/dbconfig/20210407-054127-marostegui.json
[05:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:53] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:45:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: Repool es1020', diff saved to https://phabricator.wikimedia.org/P15190 and previous config saved to /var/cache/conftool/dbconfig/20210407-054522-root.json
[05:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:46:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:54:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 50%: Repool db1135 after schema change', diff saved to https://phabricator.wikimedia.org/P15191 and previous config saved to /var/cache/conftool/dbconfig/20210407-055444-root.json
[05:54:48] !log installing curl security updates
[05:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: Repool es1020', diff saved to https://phabricator.wikimedia.org/P15192 and previous config saved to /var/cache/conftool/dbconfig/20210407-060026-root.json
[06:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:08:32] (PS1) Marostegui: misc,phabricator,dbstore_multiinstance.my.cnf: Set innodb_change_buffering = none [puppet] - https://gerrit.wikimedia.org/r/677423 (https://phabricator.wikimedia.org/T263443)
[06:08:58] (CR) jerkins-bot: [V: -1] misc,phabricator,dbstore_multiinstance.my.cnf: Set innodb_change_buffering = none [puppet] - https://gerrit.wikimedia.org/r/677423 (https://phabricator.wikimedia.org/T263443) (owner: Marostegui)
[06:09:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:07] the Lumen transport is down (esams - eqiad), probably maintenance
[06:09:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: Repool db1135 after schema change', diff saved to https://phabricator.wikimedia.org/P15193 and previous config saved to /var/cache/conftool/dbconfig/20210407-060948-root.json
[06:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:17] (PS2) Marostegui: mariadb: Set innodb_change_buffering = none on a few roles [puppet] - https://gerrit.wikimedia.org/r/677423 (https://phabricator.wikimedia.org/T263443)
[06:11:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:12] (CR) Marostegui: [C: +2] mariadb: Set innodb_change_buffering = none on a few roles [puppet] - https://gerrit.wikimedia.org/r/677423 (https://phabricator.wikimedia.org/T263443) (owner: Marostegui)
[06:13:58] I don't find any maintenance email related to Apr 7 and Lumen though
[06:15:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: Repool es1020', diff saved to https://phabricator.wikimedia.org/P15194 and previous config saved to /var/cache/conftool/dbconfig/20210407-061529-root.json
[06:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: Repool db1135 after schema change', diff saved to https://phabricator.wikimedia.org/P15195 and previous config saved to /var/cache/conftool/dbconfig/20210407-062451-root.json
[06:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:53] (PS3) Muehlenhoff: wmflib: Switch spec test to www.example.prg [puppet] - https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399)
[06:28:26] !log restarting apache/FPM on mw canaries to pick up curl updates
[06:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:20] (CR) jerkins-bot: [V: -1] wmflib: Switch spec test to www.example.prg [puppet] - https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: Muehlenhoff)
[06:30:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: Repool es1020', diff saved to https://phabricator.wikimedia.org/P15196 and previous config saved to /var/cache/conftool/dbconfig/20210407-063033-root.json
[06:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:09] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:13] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[06:31:23] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[06:40:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 for schema change', diff saved to https://phabricator.wikimedia.org/P15197 and previous config saved to /var/cache/conftool/dbconfig/20210407-065450-marostegui.json
[06:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:10] (PS4) Muehlenhoff: wmflib: Switch spec test to www.example.org [puppet] - https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399)
[06:57:51] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:59:23] (CR) jerkins-bot: [V: -1] wmflib: Switch spec test to www.example.org [puppet] - https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: Muehlenhoff)
[06:59:31] !log depooling wdqs1005, restarting blazegraph and waiting for it to catchup on lag
[06:59:36] ryankemper: ^
[06:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:01] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 1.922 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:03:36] !log repooling wdqs1005, catched up on lag
[07:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:25] (PS5) Muehlenhoff: wmflib: Switch spec test to www.example.org [puppet] - https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399)
[07:05:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:31] (CR) jerkins-bot: [V: -1] wmflib: Switch spec test to www.example.org [puppet] - https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: Muehlenhoff)
[07:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1163 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15198 and previous config saved to /var/cache/conftool/dbconfig/20210407-071219-marostegui.json
[07:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: Repool db1163 after upgrade', diff saved to https://phabricator.wikimedia.org/P15199 and previous config saved to /var/cache/conftool/dbconfig/20210407-072027-root.json
[07:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Repool db1134', diff saved to https://phabricator.wikimedia.org/P15200 and previous config saved to /var/cache/conftool/dbconfig/20210407-072957-root.json
[07:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:57] SRE, Wikimedia-Mailing-lists: All WMF mailing lists should be publicly listed - https://phabricator.wikimedia.org/T124324 (Legoktm) There are 256 unadvertised lists out of 711 total or 36%, a significantly larger number than I was expecting. At least 37 of those are disabled/renamed/obsolete lists (I loo...
[07:33:38] (03PS1) 10ArielGlenn: make batch testing reliably pass [dumps] - 10https://gerrit.wikimedia.org/r/677482 (https://phabricator.wikimedia.org/T252396) [07:35:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: Repool db1163 after upgrade', diff saved to https://phabricator.wikimedia.org/P15201 and previous config saved to /var/cache/conftool/dbconfig/20210407-073530-root.json [07:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:20] (03PS1) 10ArielGlenn: get rid of the code backing up job fragments files [dumps] - 10https://gerrit.wikimedia.org/r/677483 (https://phabricator.wikimedia.org/T252396) [07:45:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Repool db1134', diff saved to https://phabricator.wikimedia.org/P15203 and previous config saved to /var/cache/conftool/dbconfig/20210407-074501-root.json [07:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Repool db1163 after upgrade', diff saved to https://phabricator.wikimedia.org/P15204 and previous config saved to /var/cache/conftool/dbconfig/20210407-075034-root.json [07:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:28] (03CR) 10Muehlenhoff: debian: Add an alias for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [07:57:18] 10SRE, 10Analytics, 10Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou) +1 on the approach (updating the task description for details) [07:58:21] 10SRE, 10Analytics, 10Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou) [08:00:05] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Repool db1134', diff saved to https://phabricator.wikimedia.org/P15205 and previous config saved to /var/cache/conftool/dbconfig/20210407-080005-root.json [08:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:23] (03PS1) 10Muehlenhoff: sretest: Restrict profile::docker::firewall and cuminunpriv to buster only [puppet] - 10https://gerrit.wikimedia.org/r/677486 [08:02:02] 10SRE, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10Peachey88) [08:04:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311', diff saved to https://phabricator.wikimedia.org/P15206 and previous config saved to /var/cache/conftool/dbconfig/20210407-080410-marostegui.json [08:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:18] 10SRE, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10Peachey88) For information about reporting network connectivity issues, have a look at https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue [08:05:21] (03CR) 10Muehlenhoff: [C: 03+2] sretest: Restrict profile::docker::firewall and cuminunpriv to buster only [puppet] - 10https://gerrit.wikimedia.org/r/677486 (owner: 10Muehlenhoff) [08:05:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Repool db1163 after upgrade', diff saved to https://phabricator.wikimedia.org/P15207 and previous config saved to /var/cache/conftool/dbconfig/20210407-080537-root.json [08:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/677335 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [08:11:26] (03CR) 10Muehlenhoff: "I think I'll abandon the patch, in theory we could adapt our Puppet manifests to also work with unreleased Debian versions, but that's a l" [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [08:12:39] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [08:12:53] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [08:15:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Repool db1134', diff saved to https://phabricator.wikimedia.org/P15209 and previous config saved to /var/cache/conftool/dbconfig/20210407-081508-root.json [08:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:48] 10SRE, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10A189605) Please see the below information from the link you provided. Output from http://test-ipv6.com/helpdesk/: ` Your Internet help desk may ask you for the information below.... [08:21:55] 10SRE, 10OTRS, 10Security, 10User-notice: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303 (10akosiaris) >>! In T279303#6977714, @Keegan wrote: > @akosiaris I assume the usual process of emails being held in queue during the migration will occur? Yes. > My other question,... 
[08:22:44] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet [08:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:51] RECOVERY - HP RAID on ms-be2028 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:29:09] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2028.codfw.wmnet [08:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:01] (03PS1) 10Muehlenhoff: Rebuild for bullseye T257873 [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 [08:30:17] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (10fgiunchedi) 05Open→03Resolved Thank you @papaul ! [08:38:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 25%: Repool db1105:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P15210 and previous config saved to /var/cache/conftool/dbconfig/20210407-083809-root.json [08:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:28] (03CR) 10Majavah: [C: 04-1] Rebuild for bullseye T257873 (031 comment) [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 (owner: 10Muehlenhoff) [08:43:02] (03PS1) 10Kosta Harlan: linkrecommendation: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/677493 (https://phabricator.wikimedia.org/T278719) [08:44:05] (03CR) 10Kosta Harlan: "The profiler doesn't seem to work with the CLI, so this enables it for incoming web requests on staging. 
We'd then need to get the logs fr" [deployment-charts] - 10https://gerrit.wikimedia.org/r/677493 (https://phabricator.wikimedia.org/T278719) (owner: 10Kosta Harlan) [08:48:32] 10SRE, 10Analytics, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10ayounsi) >>! In T279429#6976000, @ayounsi wrote: > There is also a term permitting UDP fragments, I added a "count" to know if/why we're using it. Looks like we're not. I'll remove it as well. [08:48:47] (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/677493 (https://phabricator.wikimedia.org/T278719) (owner: 10Kosta Harlan) [08:49:03] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/677493 (https://phabricator.wikimedia.org/T278719) (owner: 10Kosta Harlan) [08:50:43] (03Merged) 10jenkins-bot: linkrecommendation: Enable profiling on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/677493 (https://phabricator.wikimedia.org/T278719) (owner: 10Kosta Harlan) [08:52:29] (03CR) 10Jbond: "Sorry for the bad advice, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [08:52:38] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [08:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 50%: Repool db1105:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P15211 and previous config saved to /var/cache/conftool/dbconfig/20210407-085313-root.json [08:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:33] (03CR) 10JMeybohm: [C: 03+1] "This sounds like the right thing to do!" 
[puppet] - 10https://gerrit.wikimedia.org/r/524186 (https://phabricator.wikimedia.org/T277876) (owner: 10Alexandros Kosiaris) [08:55:47] (03CR) 10JMeybohm: [C: 03+1] "Maybe you want to adopt this for ML cluster as well?" [puppet] - 10https://gerrit.wikimedia.org/r/524186 (https://phabricator.wikimedia.org/T277876) (owner: 10Alexandros Kosiaris) [08:56:11] 10SRE, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10ayounsi) Investigation started a while ago on a noc@ thread. As additional data point, our Netflow (captured at our edge) show that we're getting your initial TCP SYN and replying... [08:56:50] (03PS2) 10Muehlenhoff: Rebuild for bullseye T257873 [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 [08:56:53] (03CR) 10Jbond: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [08:57:50] (03Abandoned) 10Muehlenhoff: debian: Add an alias for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [08:58:33] !log imported quickstack for bullseye/main (part of standard packages) T275873 [08:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:41] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [08:59:20] 10SRE, 10Traffic, 10serviceops: Feedback for new service IP flowchart - https://phabricator.wikimedia.org/T279296 (10akosiaris) Hi, thanks for this! This is interesting, some feedback from my side: * What is the intended audience? If it's an SRE or at least someone who knows what LVS is, it's pretty fine,... 
[09:00:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:02:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:03:24] 10SRE, 10Analytics, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10elukey) @razzi this is a good task to get started with the firewall rules of our VLAN :) [09:08:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 75%: Repool db1105:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P15212 and previous config saved to /var/cache/conftool/dbconfig/20210407-090817-root.json [09:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:31] (03PS1) 10Jbond: debian: Fail if the release values are not numberes [puppet] - 10https://gerrit.wikimedia.org/r/677496 [09:10:49] (03CR) 10Jbond: "related: https://gerrit.wikimedia.org/r/c/operations/puppet/+/677496" [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [09:11:12] 10SRE, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10A189605) >>! In T279503#6979248, @ayounsi wrote: > Investigation started a while ago on a noc@ thread. > > As additional data point, our Netflow (captured at our edge) show that we... 
[09:13:23] 10SRE, 10Python3-Porting: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10MoritzMuehlenhoff) [09:16:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P15213 and previous config saved to /var/cache/conftool/dbconfig/20210407-091610-marostegui.json [09:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:41] (03PS1) 10Jbond: gitlab: use correct IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/677497 (https://phabricator.wikimedia.org/T276148) [09:18:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28931/console" [puppet] - 10https://gerrit.wikimedia.org/r/677497 (https://phabricator.wikimedia.org/T276148) (owner: 10Jbond) [09:19:13] (03PS1) 10David Caro: Add project to the puppet failed to run email [puppet] - 10https://gerrit.wikimedia.org/r/677498 [09:19:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] gitlab: use correct IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/677497 (https://phabricator.wikimedia.org/T276148) (owner: 10Jbond) [09:23:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 100%: Repool db1105:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P15214 and previous config saved to /var/cache/conftool/dbconfig/20210407-092320-root.json [09:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:37] 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10jbond) > Thanks, got it. I confirm gitlab.wikimedia.org is available for SSH access public... 
[09:29:17] (03PS1) 10Muehlenhoff: Build for bullseye: Use compat 13 and dh-python [debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/677501 [09:40:50] !log imported git-lfs for bullseye/main (part of standard packages) T275873 [09:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:59] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [09:47:34] (03PS1) 10Muehlenhoff: base::debdeploy: Install python3-dateutil instead of the Py2 package [puppet] - 10https://gerrit.wikimedia.org/r/677503 [09:47:46] (03CR) 10Muehlenhoff: [C: 03+2] Build for bullseye: Use compat 13 and dh-python [debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/677501 (owner: 10Muehlenhoff) [09:58:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host kraz.wikimedia.org [09:58:29] !log reboot kraz to nudge reconnects to irc2001.w.o for remaining connected clients [09:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:48] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kraz.wikimedia.org [09:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:47] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool db2106 and db2147 T279406', diff saved to https://phabricator.wikimedia.org/P15215 and previous config saved to /var/cache/conftool/dbconfig/20210407-100147-kormat.json [10:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:55] T279406: db2106 and db2147 crashed - https://phabricator.wikimedia.org/T279406 [10:03:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
[10:03:59] (03CR) 10Marostegui: [C: 03+2] tendril: Remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/677111 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [10:05:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:06:21] (03PS3) 10Jbond: P:debmonitor::server: move the internal server function to apache [puppet] - 10https://gerrit.wikimedia.org/r/677292 [10:06:23] (03PS1) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/677506 [10:07:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28932/console" [puppet] - 10https://gerrit.wikimedia.org/r/677506 (owner: 10Jbond) [10:08:02] 10SRE, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review, 10User-notice: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10MoritzMuehlenhoff) I've rebooted kraz to force the remaining bots still connected to kraz to reconnect to irc2001.w.o. Those connections are quite l... [10:08:23] (03CR) 10Zoranzoki21: [C: 04-1] "How is archiving extension related to this? T257873 is for archiving extension." 
[debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 (owner: 10Muehlenhoff) [10:10:03] (03CR) 10Muehlenhoff: "> Patch Set 2: Code-Review-1" [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 (owner: 10Muehlenhoff) [10:10:08] (03PS3) 10Muehlenhoff: Rebuild for bullseye T275873 [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 [10:11:35] (03CR) 10Zoranzoki21: [C: 04-1] Rebuild for bullseye T275873 (031 comment) [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 (owner: 10Muehlenhoff) [10:12:57] 10SRE, 10Wikimedia-Mailing-lists: All WMF mailing lists should be publicly listed - https://phabricator.wikimedia.org/T124324 (10Ladsgroup) I understand the reasoning not to advertise a mailing list (spam etc.) in mailman2 given its perfect, most usable interface (specially for a person who is not familiar wit... [10:14:30] (03PS4) 10Muehlenhoff: Rebuild for bullseye T275873 [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 [10:15:28] (03CR) 10Zoranzoki21: [C: 03+1] "Thanks!" [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 (owner: 10Muehlenhoff) [10:15:53] (03PS1) 10Muehlenhoff: Assign mw_rc_irc role to irc1001 [puppet] - 10https://gerrit.wikimedia.org/r/677509 (https://phabricator.wikimedia.org/T278255) [10:17:09] 10SRE, 10Traffic, 10serviceops: Feedback for new service IP flowchart - https://phabricator.wikimedia.org/T279296 (10ayounsi) Audience is indeed SREs, to be used to help them know what the (high level and preferred) options are when deploying a new service. `Service-deployment-requests` is nice, I'd imagine... 
[10:19:53] (03PS1) 10Jbond: P:debmonitor::server: convert cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) [10:20:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:22:13] (03PS2) 10Jbond: P:debmonitor::server: convert cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) [10:22:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:23:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28934/console" [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) (owner: 10Jbond) [10:28:15] (03PS1) 10Elukey: camus: fix Yarn queue settings in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/677513 [10:29:49] (03CR) 10Elukey: [C: 03+2] camus: fix Yarn queue settings in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/677513 (owner: 10Elukey) [10:31:01] (03PS3) 10Jbond: P:debmonitor::server: convert cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) [10:31:03] (03PS1) 10Jbond: P:debmonitor::Server: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/677514 [10:31:24] (03CR) 10Jbond: P:debmonitor::server: convert cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) (owner: 10Jbond) [10:31:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28935/console" [puppet] - 10https://gerrit.wikimedia.org/r/677514 (owner: 10Jbond) [10:33:13] (03PS1) 10Elukey: camus: fix the remaining queue name for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/677515 [10:34:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106', diff saved to https://phabricator.wikimedia.org/P15216 and previous config saved to /var/cache/conftool/dbconfig/20210407-103404-marostegui.json [10:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:04] (03CR) 10Elukey: [C: 03+2] camus: fix the remaining queue name for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/677515 (owner: 10Elukey) [10:35:10] (03CR) 10Arturo Borrero Gonzalez: "LGTM in general. Thanks for working on this! a couple of comments inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677498 (owner: 10David Caro) [10:37:24] (03PS1) 10Muehlenhoff: Update bastion in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/677516 [10:38:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Designate/Victoria: remove a hacked file [puppet] - 10https://gerrit.wikimedia.org/r/677350 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [10:38:51] (03CR) 10Jbond: wmflib: Switch spec test to www.example.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [10:39:27] (03PS23) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [10:39:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack: remove config and manifests for version 'Stein' [puppet] - 10https://gerrit.wikimedia.org/r/677337 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [10:41:30] (03CR) 10Muehlenhoff: wmflib: Switch spec test to www.example.org (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [10:45:30] (03PS2) 10David Caro: Add project to the puppet failed to run email [puppet] - 10https://gerrit.wikimedia.org/r/677498 [10:45:32] (03CR) 10David Caro: Add project to the puppet failed to run email (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677498 (owner: 10David Caro) [10:46:20] (03CR) 10Arturo Borrero Gonzalez: "LGTM, some comments inlined." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677306 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:50:10] (03CR) 10David Caro: doc: Introduce a code reviewing guideline (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/666601 (owner: 10David Caro) [10:51:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "there are several changes on the file that aren't related to the patch, I guess the result of running the black autoformater." [puppet] - 10https://gerrit.wikimedia.org/r/677498 (owner: 10David Caro) [10:51:42] !log Stop apache on dbmonitor1001 T224589 [10:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:51] T224589: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 [10:54:53] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Marostegui) I have stopped apache on dbmonitor1001 (and done chmod -x to apache2 binary so puppet doesn't bring it up), let's leave it till next week and if nothing breaks, let's decommission it [10:56:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118', diff saved to https://phabricator.wikimedia.org/P15217 and previous config saved to /var/cache/conftool/dbconfig/20210407-105617-marostegui.json [10:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:31] (03PS3) 10David Caro: ceph.mon: parametrize the repository to pull the packages from 
[puppet] - 10https://gerrit.wikimedia.org/r/677306 (https://phabricator.wikimedia.org/T274566) [10:58:33] (03CR) 10David Caro: ceph.mon: parametrize the repository to pull the packages from (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677306 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for [[Backport windows|European mid-day backport window]]. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210407T1100). [11:00:04] Amir1: A patch you scheduled for [[Backport windows|European mid-day backport window]] is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:00] (03PS3) 10David Caro: ceph: run tests on debian 10 buster [puppet] - 10https://gerrit.wikimedia.org/r/677307 [11:01:06] o/ [11:01:29] * Urbanecm waves, but he doesn't see patches without experienced deployers [11:01:59] flaggedrevs has its own config file? ._. [11:02:06] (03CR) 10David Caro: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/677498 (owner: 10David Caro) [11:02:07] yup [11:02:16] o/ [11:03:47] I can deploy it I guess [11:05:01] Lucas_WMDE: I'm planning to kill that thing with my bare hands [11:05:14] flaggedrevs or config file? [11:05:19] hehehe, I knew some reply like that was coming :> [11:06:07] Urbanecm: the config file. Flaggedrevs hopefully will stay as a tiny extension instead of the monster it is currently [11:06:17] sounds cool :) [11:07:09] (03CR) 10Ladsgroup: [C: 03+2] flaggedrevs: Disable quality and pristine tier in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677412 (https://phabricator.wikimedia.org/T277883) (owner: 10Ladsgroup) [11:07:23] yeah but it's scary, this extension is a big mess [11:07:49] o/ If there's time, I'd like to try deploying the private configuration value that I brought up yesterday.
I've not done this before so I might have questions [11:07:51] (03Merged) 10jenkins-bot: flaggedrevs: Disable quality and pristine tier in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677412 (https://phabricator.wikimedia.org/T277883) (owner: 10Ladsgroup) [11:11:08] (03CR) 10Muehlenhoff: [C: 03+2] Rebuild for bullseye T275873 [debs/quickstack] - 10https://gerrit.wikimedia.org/r/677492 (owner: 10Muehlenhoff) [11:12:03] so far looks good, syncing [11:14:25] (03PS1) 10Marostegui: instances.yaml: Add db1184 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/677520 (https://phabricator.wikimedia.org/T275633) [11:14:57] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1184 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/677520 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [11:15:11] !log ladsgroup@deploy1002 Synchronized wmf-config/flaggedrevs.php: [[gerrit:677412|flaggedrevs: Disable quality and pristine tier in all wikis]] (T277883) (duration: 02m 15s) [11:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:20] T277883: Drop all low-use and unused features of FlaggedRevs to make it more maintainable - https://phabricator.wikimedia.org/T277883 [11:17:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1184 to s1 depooled T275633', diff saved to https://phabricator.wikimedia.org/P15218 and previous config saved to /var/cache/conftool/dbconfig/20210407-111708-marostegui.json [11:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:16] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [11:20:51] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: remove parsoidJS from production 1 [puppet] - 10https://gerrit.wikimedia.org/r/676883 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [11:32:05] PROBLEM - gdnsd checkconf on authdns2001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure 
https://wikitech.wikimedia.org/wiki/gdnsd [11:33:13] PROBLEM - gdnsd checkconf on dns3001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:34:21] PROBLEM - gdnsd checkconf on dns3002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:35:09] PROBLEM - gdnsd checkconf on dns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:36:13] PROBLEM - gdnsd checkconf on dns1002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:37:44] 10SRE, 10Traffic, 10serviceops: Feedback for new service IP flowchart - https://phabricator.wikimedia.org/T279296 (10akosiaris) >>! In T279296#6979590, @ayounsi wrote: > Audience is indeed SREs, to be used to help them know what the (high level and preferred) options are when deploying a new service. OK, go... [11:37:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: Repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P15219 and previous config saved to /var/cache/conftool/dbconfig/20210407-113753-root.json [11:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:53] !log Deploy schema change on s3 codfw, lag will appear T276150 T276156 [11:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:03] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 [11:40:03] T276156: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 [11:41:15] PROBLEM - gdnsd checkconf on dns2002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:42:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:42:29] PROBLEM - gdnsd checkconf on dns4002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:43:41] PROBLEM - gdnsd checkconf on authdns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:44:58] uhhh authdns failures sound alarming ^ [11:47:48] Majavah: thanks looking [11:48:50] error: plugin_geoip: Invalid resource name 'disc-parsoid' detected from zonefile lookup [11:49:11] PROBLEM - gdnsd checkconf on dns4001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:50:57] seems caused by https://gerrit.wikimedia.org/r/676883? [11:50:59] ^ effie [11:51:27] hmmm [11:52:05] (03CR) 10Muehlenhoff: [C: 03+2] base::debdeploy: Install python3-dateutil instead of the Py2 package [puppet] - 10https://gerrit.wikimedia.org/r/677503 (owner: 10Muehlenhoff) [11:52:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: Repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P15220 and previous config saved to /var/cache/conftool/dbconfig/20210407-115257-root.json [11:53:01] moritzm: effie: indeed i see this in puppet after the change [11:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:06] _var_lib_gdnsd_discovery-parsoid.state.tmpl absent [11:53:21] also absented /etc/confd/conf.d/_var_lib_gdnsd_discovery-parsoid.state.toml [11:53:32] jbond42: yes, I was under the impression that order didn't matter here [11:53:39] I will fix it [11:54:37] effie: ack thanks, i think bbl.ack has mentioned before that some things are order dependent but i don't know what the 'things' are, let me know if i can help [11:55:27] I will remove the discovery record and see [11:55:30] PROBLEM - gdnsd checkconf on dns5001 is CRITICAL:
CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [11:55:39] in theory, we are using parsoid-php [11:57:48] (03PS1) 10Muehlenhoff: Stub base package file for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/677555 [11:58:10] (03PS2) 10Muehlenhoff: Stub base package file for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/677555 [11:58:24] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:58:48] PROBLEM - gdnsd checkconf on dns5002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [12:00:25] (03CR) 10Filippo Giunchedi: [C: 03+1] Update bastion in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [12:01:43] (03CR) 10Muehlenhoff: [C: 03+2] Stub base package file for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/677555 (owner: 10Muehlenhoff) [12:01:44] jbond42: I will revert the patch, I think there is something more that I need to do before that [12:02:28] PROBLEM - gdnsd checkconf on dns2001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [12:02:34] (03PS1) 10Effie Mouzeli: Revert "hieradata: remove parsoidJS from production 1" [puppet] - 10https://gerrit.wikimedia.org/r/677390 [12:03:16] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "hieradata: remove parsoidJS from production 1" [puppet] - 10https://gerrit.wikimedia.org/r/677390 (owner: 10Effie Mouzeli) [12:04:35] effie: https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service says that you need to remove the operations/dns records first, before removing it from lvs [12:05:04] RECOVERY - gdnsd checkconf on dns4002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:05:06] RECOVERY -
gdnsd checkconf on authdns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:05:11] Majavah: thank you, I am aware [12:05:18] RECOVERY - gdnsd checkconf on dns5001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:05:20] RECOVERY - gdnsd checkconf on dns2001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:05:28] RECOVERY - gdnsd checkconf on dns5002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:05:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [12:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:52] RECOVERY - gdnsd checkconf on dns3001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:06:12] RECOVERY - gdnsd checkconf on authdns2001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:06:36] (03PS1) 10KartikMistry: Update cxserver to 2021-04-07-062518-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/677557 (https://phabricator.wikimedia.org/T278141) [12:07:05] effie: ack thanks [12:08:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: Repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P15221 and previous config saved to /var/cache/conftool/dbconfig/20210407-120800-root.json [12:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:10] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:26] RECOVERY - gdnsd checkconf on dns4001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:09:55] (03CR) 10Ayounsi: [C: 04-1] Update bastion in smokeping config (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [12:10:55] (03PS1) 10Jbond: initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 [12:12:26] (03CR) 10jerkins-bot: [V: 04-1] initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 (owner: 10Jbond) [12:12:36] (03CR) 10Ayounsi: [C: 03+1] remove Zayo from transit providers [homer/public] - 10https://gerrit.wikimedia.org/r/659383 (owner: 10CDanis) [12:12:54] RECOVERY - gdnsd checkconf on dns1002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:14:31] (03CR) 10Ayounsi: [C: 03+1] lldp: add new per interface neighbours fact [puppet] - 10https://gerrit.wikimedia.org/r/649592 (https://phabricator.wikimedia.org/T268802) (owner: 10Jbond) [12:15:16] RECOVERY - gdnsd checkconf on dns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:15:34] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10Tobi_WMDE_SW) [12:15:46] (03CR) 10Muehlenhoff: Update bastion in smokeping config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [12:17:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1173', diff saved to https://phabricator.wikimedia.org/P15222 and previous config saved to /var/cache/conftool/dbconfig/20210407-121659-marostegui.json [12:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:32] RECOVERY - gdnsd checkconf on dns3002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:18:14] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet [12:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:25] !log Upgrade db1173's kernel [12:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:18:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [12:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:22:14] RECOVERY - gdnsd checkconf on dns2002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [12:23:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: Repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P15224 and previous config saved to /var/cache/conftool/dbconfig/20210407-122304-root.json [12:23:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:11] (03CR) 10Ayounsi: [C: 03+1] "Not sure if this is still needed after the diffscan improvements, but the change LGTM regardless." 
[puppet] - 10https://gerrit.wikimedia.org/r/633156 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [12:25:08] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10Lena_WMDE) [12:28:48] (03CR) 10Filippo Giunchedi: [C: 03+1] Update bastion in smokeping config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [12:31:53] (03PS2) 10Noa wmde: Remove all remains of idGeneratorLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677560 (https://phabricator.wikimedia.org/T274156) [12:36:39] (03CR) 10Ayounsi: [C: 04-1] Update bastion in smokeping config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [12:42:46] PROBLEM - DPKG on sretest1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:45:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update cxserver to 2021-04-07-062518-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/677557 (https://phabricator.wikimedia.org/T278141) (owner: 10KartikMistry) [12:45:38] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2028.codfw.wmnet [12:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:32] 10SRE, 10Traffic: Add exp cache admission policy parameters to hiera - https://phabricator.wikimedia.org/T279533 (10ema) [12:46:41] 10SRE, 10Traffic: Add exp cache admission policy parameters to hiera - https://phabricator.wikimedia.org/T279533 (10ema) p:05Triage→03Medium [12:46:46] RECOVERY - Long running screen/tmux on phab1001 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [12:51:02] (03PS1) 10David Caro: backy2: use ensure packages for dependencies [puppet] - 10https://gerrit.wikimedia.org/r/677567 [12:54:23] (03PS24) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [12:57:24] (03CR) 10Jbond: [C: 03+2] "will merge" [puppet] - 10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [12:58:32] (03CR) 10Jbond: [C: 03+2] lldp: add new per interface neighbours fact [puppet] - 10https://gerrit.wikimedia.org/r/649592 (https://phabricator.wikimedia.org/T268802) (owner: 10Jbond) [13:01:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] backy2: use ensure packages for dependencies [puppet] - 10https://gerrit.wikimedia.org/r/677567 (owner: 10David Caro) [13:04:27] (03CR) 10Muehlenhoff: Update bastion in smokeping config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [13:04:37] (03PS1) 10Elukey: service::deploy::gitclone: do not use mediawiki/services as prefix [puppet] - 10https://gerrit.wikimedia.org/r/677569 [13:05:25] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) so I attempted to import wikitech-l and it basically failed the moment I started it. Running cleanarch on the mbox fixed...
[13:05:32] (03PS1) 10Effie Mouzeli: Remove parsoid discovery record [dns] - 10https://gerrit.wikimedia.org/r/677570 (https://phabricator.wikimedia.org/T279059) [13:05:36] 10SRE, 10netops: Higher latency on Lumen eqiad/esams link - https://phabricator.wikimedia.org/T277654 (10ayounsi) a:05ayounsi→03wiki_willy [13:06:16] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01008 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:07:53] Duplicate declaration: Package[python3-dateutil] is already declared at (file: /etc/puppet/modules/base/manifests/debdeploy.pp, line: 71) [13:07:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove parsoid discovery record [dns] - 10https://gerrit.wikimedia.org/r/677570 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [13:09:20] cannot redeclare (file: /etc/puppet/modules/backy2/manifests/init.pp, line: 48 [13:09:36] dcaro: o/ - is it something that you are working on --^ [13:10:11] elukey: yep, on it https://gerrit.wikimedia.org/r/c/operations/puppet/+/677567, thanks for the ping [13:10:13] :) [13:10:14] 10SRE, 10Traffic, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10ayounsi) @ssingh did we get results from those test data? [13:10:29] dcaro: <3 [13:10:30] it's happening in a couple of manifests [13:10:43] (03PS2) 10David Caro: backy2,nfs: use ensure packages for dependencies [puppet] - 10https://gerrit.wikimedia.org/r/677567 [13:11:21] (03CR) 10Ottomata: [C: 03+1] "+1. I doubt there are many (any?) node 'mediawiki' services left out there that use scap?"
[puppet] - 10https://gerrit.wikimedia.org/r/677569 (owner: 10Elukey) [13:11:26] (03PS2) 10Muehlenhoff: Update server in row C in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/677516 [13:11:45] (03PS1) 10Klausman: wmf-update-ssh-config: Add functionality to write specific file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677571 [13:11:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph.mon: parametrize the repository to pull the packages from [puppet] - 10https://gerrit.wikimedia.org/r/677306 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [13:12:38] (03CR) 10Muehlenhoff: Update server in row C in smokeping config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [13:13:09] (03CR) 10Ayounsi: [C: 03+1] Update server in row C in smokeping config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [13:13:18] (03PS1) 10Marostegui: mariadb: Add db1180 to s6 [puppet] - 10https://gerrit.wikimedia.org/r/677572 (https://phabricator.wikimedia.org/T275633) [13:13:36] RECOVERY - DPKG on sretest1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:15:20] (03CR) 10David Caro: "Pcc with two of the failing hosts (nfs + backy2): https://puppet-compiler.wmflabs.org/compiler1001/28942/" [puppet] - 10https://gerrit.wikimedia.org/r/677567 (owner: 10David Caro) [13:15:54] (03CR) 10David Caro: [C: 03+2] backy2,nfs: use ensure packages for dependencies [puppet] - 10https://gerrit.wikimedia.org/r/677567 (owner: 10David Caro) [13:16:24] (03CR) 10Ayounsi: "I should probably merge this one day." 
[homer/public] - 10https://gerrit.wikimedia.org/r/636392 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [13:17:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Add db1180 to s6 [puppet] - 10https://gerrit.wikimedia.org/r/677572 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [13:18:55] (03PS1) 10Alexandros Kosiaris: eventgate: Bump analytics and analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/677573 (https://phabricator.wikimedia.org/T274262) [13:20:24] (03CR) 10Hnowlan: service::deploy::gitclone: do not use mediawiki/services as prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677569 (owner: 10Elukey) [13:20:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] "This will require some minor orchestration to merge. Namely:" [puppet] - 10https://gerrit.wikimedia.org/r/524186 (https://phabricator.wikimedia.org/T277876) (owner: 10Alexandros Kosiaris) [13:20:44] (03CR) 10Filippo Giunchedi: [C: 03+1] Update server in row C in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [13:21:09] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [13:21:43] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [13:22:00] I'll silence those ^ expected [13:22:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] helmfile: Remove deployment_server_secrets::admin_services [puppet] - 
10https://gerrit.wikimedia.org/r/677228 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [13:22:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/677228 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [13:23:52] (03CR) 10Elukey: service::deploy::gitclone: do not use mediawiki/services as prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677569 (owner: 10Elukey) [13:24:13] (03PS1) 10BBlack: Update mock_etc commentary [dns] - 10https://gerrit.wikimedia.org/r/677574 [13:24:18] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove deployment_server_secrets::admin_services [labs/private] - 10https://gerrit.wikimedia.org/r/677227 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [13:24:59] (03CR) 10Effie Mouzeli: [C: 03+2] Remove parsoid discovery record [dns] - 10https://gerrit.wikimedia.org/r/677570 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [13:25:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate: Bump analytics and analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/677573 (https://phabricator.wikimedia.org/T274262) (owner: 10Alexandros Kosiaris) [13:26:07] (03PS2) 10BBlack: Update mock_etc commentary [dns] - 10https://gerrit.wikimedia.org/r/677574 [13:26:37] (03Merged) 10jenkins-bot: eventgate: Bump analytics and analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/677573 (https://phabricator.wikimedia.org/T274262) (owner: 10Alexandros Kosiaris) [13:28:10] 10SRE, 10Wikimedia-Mailing-lists: All WMF mailing lists should be publicly listed - https://phabricator.wikimedia.org/T124324 (10Nemo_bis) > We definitely should revisit this with mailman3 as its interface is much more usable. Indeed. Out of 256 mailing lists, I expect a vast majority to be set to unlisted du... 
[13:28:13] (03CR) 10BBlack: [C: 03+2] Update mock_etc commentary [dns] - 10https://gerrit.wikimedia.org/r/677574 (owner: 10BBlack) [13:28:54] (03CR) 10Muehlenhoff: Update server in row C in smokeping config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [13:30:13] (03PS1) 10Effie Mouzeli: hieradata: remove parsoidJS from production 1 [puppet] - 10https://gerrit.wikimedia.org/r/677575 (https://phabricator.wikimedia.org/T279059) [13:32:13] (03CR) 10Lucas Werkmeister (WMDE): "It’s probably best to split this into two changes – first Wikibase.php, then InitialiseSettings(-labs).php – to ensure that it’s deployed " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677560 (https://phabricator.wikimedia.org/T274156) (owner: 10Noa wmde) [13:33:09] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:33:09] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[13:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:08] (03PS1) 10Elukey: hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH [puppet] - 10https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) [13:35:10] (03PS2) 10Jbond: initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 [13:35:16] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: remove parsoidJS from production 1 [puppet] - 10https://gerrit.wikimedia.org/r/677575 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [13:35:18] (03CR) 10Elukey: "Worked in hadoop test with the HDFS Namenode audit log, and in theory it could be applied to all hadoop-related daemons (Yarn NM, RM, HDFS" [puppet] - 10https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: 10Elukey) [13:35:38] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:35:39] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:09] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004149 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:36:57] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:36:57] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[13:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:43] (03CR) 10Muehlenhoff: Update server in row C in smokeping config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [13:37:45] (03CR) 10jerkins-bot: [V: 04-1] initsystem: drop initsystem variable [puppet] - 10https://gerrit.wikimedia.org/r/677558 (owner: 10Jbond) [13:37:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2409.codfw.wmnet ` The log can be found in `/... [13:39:31] !log imported jenkins 2.277.2 to apt.wikimedia.org (thirdparty/ci) T279033 [13:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:38] T279033: Upgrade Jenkins to 2.277.x - https://phabricator.wikimedia.org/T279033 [13:40:03] (03PS1) 10David Caro: analytics: use correct path to the git repo [puppet] - 10https://gerrit.wikimedia.org/r/677578 [13:40:33] (03CR) 10jerkins-bot: [V: 04-1] analytics: use correct path to the git repo [puppet] - 10https://gerrit.wikimedia.org/r/677578 (owner: 10David Caro) [13:41:19] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:41:19] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[13:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:52] (03PS2) 10Effie Mouzeli: hieradata: remove parsoidJS from production 2 [puppet] - 10https://gerrit.wikimedia.org/r/676886 (https://phabricator.wikimedia.org/T279059) [13:42:12] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [13:42:12] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:04] (03CR) 10Hnowlan: [C: 03+1] service::deploy::gitclone: do not use mediawiki/services as prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677569 (owner: 10Elukey) [13:43:20] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [13:43:20] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:44] (03CR) 10Elukey: "TIL that we also have mediawiki/services/aqs/deploy, but my understanding is that we never really used it. 
I filed https://gerrit.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/677578 (owner: 10David Caro) [13:43:57] (03Abandoned) 10David Caro: analytics: use correct path to the git repo [puppet] - 10https://gerrit.wikimedia.org/r/677578 (owner: 10David Caro) [13:44:18] (03CR) 10Elukey: [C: 03+2] service::deploy::gitclone: do not use mediawiki/services as prefix [puppet] - 10https://gerrit.wikimedia.org/r/677569 (owner: 10Elukey) [13:46:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, but we will need to manually cleanup the old service" [puppet] - 10https://gerrit.wikimedia.org/r/676886 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [13:47:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:49:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2410.codfw.wmnet ` The log can be found in `/... [13:49:26] 10ops-eqiad, 10Analytics-Clusters: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) [13:50:15] 10ops-eqiad, 10Analytics-Clusters: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) @Cmjohnson this is kind of strange, I don't see any problem reported by megacli for the BBU but I cannot enforce WriteBack on the RAID controller, as if the BBU wasn't working. Any i... 
[13:50:57] (03CR) 10David Caro: [C: 03+2] "pcc: https://puppet-compiler.wmflabs.org/compiler1003/28943/" [puppet] - 10https://gerrit.wikimedia.org/r/677306 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [13:51:33] (03PS1) 10Ema: vlc: get exp cache admission policy parameters from hiera [puppet] - 10https://gerrit.wikimedia.org/r/677580 (https://phabricator.wikimedia.org/T279533) [13:52:27] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2409.codfw.wmnet with reason: REIMAGE [13:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:28] (03PS2) 10Ema: vlc: get exp cache admission policy parameters from hiera [puppet] - 10https://gerrit.wikimedia.org/r/677580 (https://phabricator.wikimedia.org/T279533) [13:54:18] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2409.codfw.wmnet with reason: REIMAGE [13:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:26] (03PS4) 10Jgiannelos: Initial chart for maps-vector-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) [13:55:41] (03PS3) 10Ema: vlc: get exp cache admission policy parameters from hiera [puppet] - 10https://gerrit.wikimedia.org/r/677580 (https://phabricator.wikimedia.org/T279533) [13:55:53] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677580 (https://phabricator.wikimedia.org/T279533) (owner: 10Ema) [14:01:22] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677580 (https://phabricator.wikimedia.org/T279533) (owner: 10Ema) [14:01:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2409.codfw.wmnet'] ` and were **ALL** successful. 
[14:03:53] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2410.codfw.wmnet with reason: REIMAGE [14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:40] (03PS4) 10Ema: vlc: get exp cache admission policy parameters from hiera [puppet] - 10https://gerrit.wikimedia.org/r/677580 (https://phabricator.wikimedia.org/T279533) [14:05:59] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2410.codfw.wmnet with reason: REIMAGE [14:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:28] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:07:07] (03CR) 10Muehlenhoff: [C: 03+2] Update server in row C in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/677516 (owner: 10Muehlenhoff) [14:08:14] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [14:09:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677496 (owner: 10Jbond) [14:10:22] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [14:11:15] (03PS1) 10David Caro: ceph.codfw1: enable ceph octopus repo [puppet] - 10https://gerrit.wikimedia.org/r/677583 (https://phabricator.wikimedia.org/T274566) [14:11:26] 10ops-eqiad, 10Analytics-Clusters: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10Cmjohnson) @elukey that's a first! Maybe the raid bios settings are wrong? [14:11:42] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: remove parsoidJS from production 2 [puppet] - 10https://gerrit.wikimedia.org/r/676886 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [14:12:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2410.codfw.wmnet'] ` and were **ALL** successful. 
[14:13:06] (03CR) 10Jgiannelos: [C: 03+2] Initial chart for maps-vector-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos) [14:14:44] (03Merged) 10jenkins-bot: Initial chart for maps-vector-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos) [14:16:07] !log restarting pybal on lvs2010, lvs1016 [14:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:32] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.28:8000]) https://wikitech.wikimedia.org/wiki/PyBal [14:17:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2411.codfw.wmnet ` The log can be found in `/... 
[14:17:44] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.28:8000]) https://wikitech.wikimedia.org/wiki/PyBal [14:17:59] ^ that is me [14:18:36] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:19:06] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.28:8000]) https://wikitech.wikimedia.org/wiki/PyBal [14:19:16] ^ still me [14:19:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Repool db1173 after cloning db1180', diff saved to https://phabricator.wikimedia.org/P15225 and previous config saved to /var/cache/conftool/dbconfig/20210407-141925-root.json [14:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:44] !log restarting pybal on lvs2009, lvs1015 [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:54] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:27:31] (03CR) 10Andrew Bogott: [C: 04-1] "This doesn't work quite how I expected; needs some more attention." 
[puppet] - 10https://gerrit.wikimedia.org/r/677419 (https://phabricator.wikimedia.org/T279491) (owner: 10Andrew Bogott) [14:28:00] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor2001 is CRITICAL: 16 ge 4 Effie Mouzeli known issues, servers are scheduled to be refreshed https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops [14:28:16] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:28:48] (03PS1) 10Papaul: Add moss-fe200[1-2] MAC address and to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/677586 (https://phabricator.wikimedia.org/T275513) [14:29:59] hnowlan: hi, could you merge+deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/676563 at some point? [14:30:50] (03PS2) 10Majavah: changeprop: Update beta jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/676563 [14:31:55] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2411.codfw.wmnet with reason: REIMAGE [14:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:52] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2411.codfw.wmnet with reason: REIMAGE [14:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Repool db1173 after cloning db1180', diff saved to https://phabricator.wikimedia.org/P15226 and previous config saved to /var/cache/conftool/dbconfig/20210407-143429-root.json [14:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:16] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [14:35:44] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:35:55] (03PS4) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [14:36:05] (03CR) 10Jbond: "thanks see comments inline" (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677571 (owner: 10Klausman) [14:36:10] (03PS5) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [14:37:19] (03PS6) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [14:38:04] (03PS2) 10Effie Mouzeli: hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) [14:38:28] (03PS2) 10Effie Mouzeli: hieradata: remove parsoidJS from production 5 [puppet] - 10https://gerrit.wikimedia.org/r/677119 (https://phabricator.wikimedia.org/T279059) [14:39:03] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/676631 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [14:39:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] hieradata: remove parsoidJS from production 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [14:40:19] (03PS2) 10Jbond: debian: Fail if the release values are not numberes [puppet] - 10https://gerrit.wikimedia.org/r/677496 [14:40:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2411.codfw.wmnet'] ` and were **ALL** successful. [14:42:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] hieradata: remove parsoidJS from production 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [14:42:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 and this should have been merged already since the LVS service is no longer around." [puppet] - 10https://gerrit.wikimedia.org/r/677119 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [14:43:12] (03CR) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [14:43:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 'd bundle in conftool-data/ removals in this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [14:43:42] (03PS1) 10Jbond: base::monitoring: ensure script has rb extensopn so its checked by CI [puppet] - 10https://gerrit.wikimedia.org/r/677588 [14:43:53] (03CR) 10Jbond: [C: 03+2] debian: Fail if the release values are not numberes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677496 (owner: 10Jbond) [14:44:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28944/console" [puppet] - 10https://gerrit.wikimedia.org/r/677588 (owner: 10Jbond) [14:45:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) [14:45:34] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Papaul) [14:46:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) 05Open→03Resolved @Dzahn this is complete [14:46:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] hieradata: remove parsoidJS from production 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [14:47:01] (03CR) 10Joal: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: 10Elukey) [14:48:59] (03CR) 10Ahmon Dancy: [BETA CLUSTER] Disable LocalisationUpdate (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677385 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [14:49:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Repool db1173 after cloning db1180', diff saved to 
https://phabricator.wikimedia.org/P15227 and previous config saved to /var/cache/conftool/dbconfig/20210407-144933-root.json [14:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph.codfw1: enable ceph octopus repo [puppet] - 10https://gerrit.wikimedia.org/r/677583 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [14:59:18] (03CR) 10Bstorm: Add project to the puppet failed to run email (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677498 (owner: 10David Caro) [14:59:43] (03CR) 10Hnowlan: [C: 03+2] changeprop: Update beta jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/676563 (owner: 10Majavah) [15:00:47] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-166437.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:36] (03Merged) 10jenkins-bot: changeprop: Update beta jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/676563 (owner: 10Majavah) [15:04:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Repool db1173 after cloning db1180', diff saved to https://phabricator.wikimedia.org/P15228 and previous config saved to /var/cache/conftool/dbconfig/20210407-150436-root.json [15:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:07] (03PS7) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [15:10:51] (03PS2) 10Klausman: wmf-update-ssh-config: Add functionality to write specific file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677571 [15:11:06] (03CR) 10Klausman: wmf-update-ssh-config: Add functionality to write specific file (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677571 (owner: 10Klausman) [15:11:38] (03CR) 10Effie 
Mouzeli: hieradata: remove parsoidJS from production 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [15:11:59] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10ema) >>! In T279244#6977339, @Dzahn wrote: > I think racktables is replaced by netbox for Reedy's needs and he does have access to that. @Reedy: can you find the information you need in netbox? If so,... [15:12:23] (03PS3) 10Klausman: wmf-update-ssh-config: Add functionality to write specific file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677571 [15:13:18] !log setting enwiki and enwikibooks to wmf.38 on mwdebug1002 to test flagged revs [15:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:15] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10ema) p:05Triage→03Medium [15:16:32] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10ema) @RStallman-legalteam, @KFrancis: hello! We have a NDA request for WMDE. Thanks! [15:17:09] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10MoritzMuehlenhoff) >>! In T279244#6977553, @RobH wrote: >>>! In T279244#6977339, @Dzahn wrote: >> I think racktables is replaced by netbox for Reedy's needs and he does have access to that. This ticket... [15:19:41] (03CR) 10Jbond: [V: 03+1 C: 03+2] base::monitoring: ensure script has rb extensopn so its checked by CI [puppet] - 10https://gerrit.wikimedia.org/r/677588 (owner: 10Jbond) [15:20:29] (03CR) 10JMeybohm: "I think this is fine but I'm bugged by the name "values/services.yaml". 
While it can be derived from context what that means (and I don't " [deployment-charts] - 10https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: 10Elukey) [15:20:47] 10SRE, 10LDAP-Access-Requests: Grant access to Superset for Mikeraish - https://phabricator.wikimedia.org/T279147 (10ema) @MRaishWMF: hi, we need approval from your manager here on the ticket. Thanks! [15:20:50] 10SRE, 10LDAP-Access-Requests: Grant access to Superset for Mikeraish - https://phabricator.wikimedia.org/T279147 (10ema) p:05Triage→03Medium [15:24:12] (03PS1) 10Elukey: aqs: update the mediawiki reduced druid settings [puppet] - 10https://gerrit.wikimedia.org/r/677591 [15:25:33] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01065 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:26:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` moss-fe2001.codfw.wmnet ` The log can be fou... [15:26:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "eqiad missing and conftool-data/discovery missing." [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [15:27:38] I'm done with testing wmf.38 fingers crossed nothing will explode [15:28:11] (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/677591 (owner: 10Elukey) [15:28:29] 10SRE, 10OTRS, 10Security, 10User-notice: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303 (10Keegan) Sounds good, thank you. If we need help from Znuny we can open a ticket with them at support@znuny.com. 
[15:28:31] (03CR) 10Elukey: [C: 03+2] aqs: update the mediawiki reduced druid settings [puppet] - 10https://gerrit.wikimedia.org/r/677591 (owner: 10Elukey) [15:29:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] hieradata: remove parsoidJS from production 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [15:29:18] (03CR) 10Papaul: [C: 03+2] Add moss-fe200[1-2] MAC address and to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/677586 (https://phabricator.wikimedia.org/T275513) (owner: 10Papaul) [15:30:50] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [15:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:23] (03CR) 10Alexandros Kosiaris: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: 10Elukey) [15:32:45] (03CR) 10Jforrester: [BETA CLUSTER] Disable LocalisationUpdate (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677385 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [15:34:23] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) David Valverde 12:39 AM (9 hours ago) to me, support Good night Papaul Hope you’re doing well I would like to inform that this RMA was already processed by the Logistics Te... [15:35:34] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) Good morning Papaul Hope you’re doing well Thank you for your response, let me answer the additional question that you have today. What do we do with the license we currently... 
[15:38:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677571 (owner: 10Klausman) [15:39:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [15:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:16] (03PS1) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) [15:43:21] PROBLEM - Disk space on ms-be2028 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [15:44:31] (03CR) 10jerkins-bot: [V: 04-1] logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [15:45:30] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: REIMAGE [15:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:01] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10Tobi_WMDE_SW) FWIW, I can confirm @Lena_WMDE is working as a product manager in the same team as me. [15:46:31] (03PS2) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) [15:47:32] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: REIMAGE [15:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:28] (03CR) 10Cwhite: "This changeset is an attempt to capture what we talked about on Tuesday." 
[puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [15:53:30] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001775 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:54:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-fe2001.codfw.wmnet'] ` and were **ALL** successful. [15:56:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` moss-fe2002.codfw.wmnet ` The log can be fou... [15:57:28] (03PS1) 10David Caro: ceph: use ensure_packages instead of package directly [puppet] - 10https://gerrit.wikimedia.org/r/677595 (https://phabricator.wikimedia.org/T274566) [15:58:10] (03CR) 10jerkins-bot: [V: 04-1] ceph: use ensure_packages instead of package directly [puppet] - 10https://gerrit.wikimedia.org/r/677595 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [16:01:52] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Legoktm) + 1 to those from me too. Note that we'll also need to make this distinction in codfw too. >>! In T279100#697307... 
[16:03:50] RECOVERY - Disk space on ms-be2028 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [16:04:42] (03PS2) 10David Caro: ceph: use ensure_packages instead of package directly [puppet] - 10https://gerrit.wikimedia.org/r/677595 (https://phabricator.wikimedia.org/T274566) [16:05:01] (03PS1) 10Bstorm: cloud email alerts: remove f-strings in case of stretch vms [puppet] - 10https://gerrit.wikimedia.org/r/677599 [16:05:53] (03CR) 10Ahmon Dancy: [BETA CLUSTER] Disable LocalisationUpdate (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677385 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [16:09:00] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10DVrandecic) p:05Medium→03Low [16:24:42] (03CR) 10Klausman: [C: 03+2] wmf-update-ssh-config: Add functionality to write specific file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677571 (owner: 10Klausman) [16:24:55] (03CR) 10Klausman: [V: 03+2 C: 03+2] wmf-update-ssh-config: Add functionality to write specific file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677571 (owner: 10Klausman) [16:31:09] (03CR) 10Dzahn: "ooh, good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/677497 (https://phabricator.wikimedia.org/T276148) (owner: 10Jbond) [16:32:17] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [16:33:22] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677611 [16:33:54] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677611 (owner: 10Kosta Harlan) [16:36:17] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677611 (owner: 10Kosta Harlan) [16:40:31] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10DVrandecic) a:03Jdforrester-WMF [16:42:26] (03PS1) 10Hnowlan: service::deploy::gitclone [puppet] - 10https://gerrit.wikimedia.org/r/677614 [16:43:35] (03CR) 10Elukey: [C: 03+1] service::deploy::gitclone [puppet] - 10https://gerrit.wikimedia.org/r/677614 (owner: 10Hnowlan) [16:45:01] !log tgr@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [16:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:28] (03PS2) 10Hnowlan: service::deploy::gitclone: change clone path to match repo name [puppet] - 10https://gerrit.wikimedia.org/r/677614 [16:53:18] (03CR) 10Hnowlan: [C: 03+2] service::deploy::gitclone: change clone path to match repo name [puppet] - 10https://gerrit.wikimedia.org/r/677614 (owner: 10Hnowlan) [16:54:10] (03PS1) 10Razzi: nagios: add victorops-analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/677617 (https://phabricator.wikimedia.org/T273064) [16:56:14] 10SRE, 10Wikimedia-Mailing-lists: All WMF mailing lists should be publicly listed - https://phabricator.wikimedia.org/T124324 (10Legoktm) >>! In T124324#6979583, @Ladsgroup wrote: > - Not being advertised doesn't mean its existence is private and bound to NDA. 
It means it's just not **advertised** Thanks for... [16:57:59] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-fe2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['moss-fe2002.codfw.wmnet'] ` [17:02:32] (03PS1) 10Hnowlan: service::deploy::gitclone: don't append deploy to repo [puppet] - 10https://gerrit.wikimedia.org/r/677620 [17:03:17] (03CR) 10jerkins-bot: [V: 04-1] service::deploy::gitclone: don't append deploy to repo [puppet] - 10https://gerrit.wikimedia.org/r/677620 (owner: 10Hnowlan) [17:03:51] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` moss-fe2002.codfw.wmnet ` The log can be found in `/var/log/wmf-au... [17:04:31] (03PS2) 10Hnowlan: service::deploy::gitclone: don't append deploy to repo [puppet] - 10https://gerrit.wikimedia.org/r/677620 [17:05:27] 10SRE, 10Wikimedia-Logstash, 10observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (10colewhite) p:05Medium→03High The `logstash::elasticsearch7` nodes (and Pontoon) do not have a curator version available that can run against ES 7.0.... [17:07:11] (03PS3) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) [17:08:29] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [17:08:55] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Legoktm) >>! In T278609#6979976, @Ladsgroup wrote: > so I attempted to import wikitech-l and it basically failed the moment I starte... [17:10:11] (03CR) 10Cwhite: [C: 04-1] "Fixed some compile-time issues found in Pontoon testing." [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [17:12:14] (03CR) 1020after4: [C: 03+1] "fwiw, I can help with testing this in production, even though that isn't ideal I don't know of another way. The worst case impact if it do" [puppet] - 10https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:18:51] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2002.codfw.wmnet with reason: REIMAGE [17:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:48] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on moss-fe2002.codfw.wmnet with reason: REIMAGE [17:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:30] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF) [17:27:58] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-fe2002.codfw.wmnet'] ` and were **ALL** successful. 
[17:28:08] (03CR) 10JMeybohm: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: 10Elukey) [17:29:48] !log tgr@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [17:29:48] !log tgr@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [17:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:03] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10Papaul) [17:32:38] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10Papaul) 05Open→03Resolved @fgiunchedi this is complete [17:33:14] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Legoktm) ` legoktm@lists1002:/home/ladsgroup$ diff wikitech-l.mbox.backup-lego wikitech-l.mbox 1070508c1070508 < charset="iso-8859-... [17:33:39] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10KFrancis) @ema Hello, I am working on this now. Will confirm once it's signed. Thanks! [17:39:15] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF) [17:39:40] (03PS1) 10Jforrester: Add wikifunctions.org [dns] - 10https://gerrit.wikimedia.org/r/677626 (https://phabricator.wikimedia.org/T275904) [17:40:12] !log tgr@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[17:40:12] !log tgr@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [17:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:08] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Sergey.Trofimovsky.SF) >>! In T276148#6979390, @jbond wrote: > As mentioned there is no need for us to manage t... [17:47:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:49:37] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:49:45] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:50:53] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:53:26] (03PS3) 10Andrew Bogott: wmfkeystonehooks ldap groups: Handle groups with no members [puppet] - 10https://gerrit.wikimedia.org/r/677419 (https://phabricator.wikimedia.org/T279491) [18:00:05] marxarelli and twentyafterfour: May I have your attention please! Train log triage with CPT. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210407T1800) [18:00:05] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the [[Backport windows|Morning backport window]] deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210407T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:06:03] (03CR) 10Ottomata: [C: 03+1] "nit and +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: 10Elukey) [18:07:06] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks ldap groups: Handle groups with no members [puppet] - 10https://gerrit.wikimedia.org/r/677419 (https://phabricator.wikimedia.org/T279491) (owner: 10Andrew Bogott) [18:30:26] (03PS1) 10Papaul: Add new conf nodes MAC addresses, partman recipe and role insetup [puppet] - 10https://gerrit.wikimedia.org/r/677634 (https://phabricator.wikimedia.org/T275637) [18:32:23] (03CR) 10Papaul: [C: 03+2] Add new conf nodes MAC addresses, partman recipe and role insetup [puppet] - 10https://gerrit.wikimedia.org/r/677634 (https://phabricator.wikimedia.org/T275637) (owner: 10Papaul) [18:33:25] (03CR) 10CRusnov: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [18:37:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` conf2004.codfw.wmnet ` The log...
[18:38:11] (03CR) 10Razzi: [C: 03+1] "LGTM as well" [puppet] - 10https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: 10Elukey) [18:49:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:51:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:52:36] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2004.codfw.wmnet with reason: REIMAGE [18:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:32] 10SRE, 10Add-Link, 10Data-Persistence (Consultation), 10Growth-Team, 10Patch-For-Review: Determine why querying is slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (10kostajh) [18:54:24] 10SRE, 10Add-Link, 10Data-Persistence (Consultation), 10Growth-Team, 10Patch-For-Review: Determine why querying is slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (10kostajh) >>! In T279411#6980869, @gerritbot wrote: > Change 677007 **merged** by jenkins-bot: > %%%[research/m... [18:54:30] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on conf2004.codfw.wmnet with reason: REIMAGE [18:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:13] (03CR) 10Dzahn: [C: 03+1] "yes, looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/677617 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [18:58:53] (03CR) 10Dzahn: "what I see looks good to me but for some reason Evaluation Error: Operator '[]' is not applicable to an Undef Value. 
(file: /srv/workspace" [puppet] - 10https://gerrit.wikimedia.org/r/677558 (owner: 10Jbond) [18:59:33] (03CR) 10Razzi: [C: 03+2] nagios: add victorops-analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/677617 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [19:00:05] marxarelli and twentyafterfour: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210407T1900). [19:00:15] mutante: I'm going to try to roll out the victorops-analytics icinga alert again, this time I'll be here in case there are fireworks :) [19:01:36] razzi: thank you for the heads-up, all seems good to me [19:01:50] that is the fix I had in mind, thumbs up [19:01:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['conf2004.codfw.wmnet'] ` and were **ALL** successful. [19:03:03] (03PS1) 10Razzi: superset: Add victorops alerting for superset [puppet] - 10https://gerrit.wikimedia.org/r/677642 (https://phabricator.wikimedia.org/T273064) [19:04:05] (03CR) 10Dzahn: [C: 03+1] "should be fine now after I651251a2deef5d86e2c" [puppet] - 10https://gerrit.wikimedia.org/r/677642 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [19:04:15] (03CR) 10Razzi: [C: 03+2] superset: Add victorops alerting for superset [puppet] - 10https://gerrit.wikimedia.org/r/677642 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [19:04:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` conf2005.codfw.wmnet ` The log... 
[19:06:21] 10SRE, 10DBA, 10Platform Engineering, 10Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (10Krinkle) I suppose its up to you. Is this growth is acceptable, expected, and normal? [19:10:10] 10SRE, 10WMF-Annual-Report: Update annual.wikimedia.org redirect to point to 2020 Annual Report - https://phabricator.wikimedia.org/T279571 (10spatton) [19:16:04] (03PS2) 10Krinkle: [BETA CLUSTER] Disable LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677385 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [19:16:07] (03CR) 10Krinkle: [C: 03+2] [BETA CLUSTER] Disable LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677385 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [19:17:10] (03Merged) 10jenkins-bot: [BETA CLUSTER] Disable LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677385 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [19:17:45] (03PS5) 10Andrew Bogott: Designate/Victoria: remove a hacked file [puppet] - 10https://gerrit.wikimedia.org/r/677350 (https://phabricator.wikimedia.org/T261137) [19:17:47] (03PS5) 10Andrew Bogott: Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) [19:17:49] (03PS4) 10Andrew Bogott: Removed an uneeded setting from nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/677360 (https://phabricator.wikimedia.org/T261137) [19:17:51] (03PS1) 10Andrew Bogott: OpenStack Victoria: pin librdkafka1 to the bpo [puppet] - 10https://gerrit.wikimedia.org/r/677643 (https://phabricator.wikimedia.org/T279470) [19:18:35] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: remove config and manifests for version 'Stein' [puppet] - 10https://gerrit.wikimedia.org/r/677337 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [19:18:46] (03CR) 10Andrew 
Bogott: [C: 03+2] Pontoon: update openstack version to Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/677335 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [19:19:35] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2005.codfw.wmnet with reason: REIMAGE [19:19:38] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Victoria: pin librdkafka1 to the bpo [puppet] - 10https://gerrit.wikimedia.org/r/677643 (https://phabricator.wikimedia.org/T279470) (owner: 10Andrew Bogott) [19:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:13] (03PS1) 10Dduvall: group1 wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677644 [19:20:15] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677644 (owner: 10Dduvall) [19:21:04] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677644 (owner: 10Dduvall) [19:21:44] marxarelli: I've landed a beta-only patch just now as well, which might get pulled in for you [19:21:48] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on conf2005.codfw.wmnet with reason: REIMAGE [19:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:59] (03PS1) 10Ottomata: Release 2020.02~wmf4 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/677645 (https://phabricator.wikimedia.org/T279480) [19:22:02] could do a no-op sync for it first if you like, no rush either way. [19:22:14] Krinkle: i just saw that during the promote [19:22:24] yeah, a sync is fine by me [19:22:50] ok, checking on mwdebug1002 now [19:22:59] cool. 
thanks [19:23:28] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.38 [19:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:35] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.36.0-wmf.38 (duration: 01m 06s) [19:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:11] (03CR) 10BryanDavis: cloud email alerts: remove f-strings in case of stretch vms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677599 (owner: 10Bstorm) [19:27:38] (03CR) 10BryanDavis: cloud email alerts: remove f-strings in case of stretch vms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677599 (owner: 10Bstorm) [19:28:30] (03PS1) 10Ottomata: Revert "Revert "Fix bug in jupyterhub-conda ..."" [puppet] - 10https://gerrit.wikimedia.org/r/677403 [19:30:22] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['conf2005.codfw.wmnet'] ` and were **ALL** successful. [19:30:41] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` conf2006.codfw.wmnet ` The log can be found in `/var... [19:31:21] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Dzahn) ACK, thanks Legoktm. That does make sense. [19:32:25] seeing a handful of "LocalFileLockError" errors today, though they seem to have occurred intermittently for wmf.37 as well [19:32:44] * marxarelli looks in phab for known issues [19:33:37] marxarelli: that sync didn't cover CS.php, right? 
[19:33:42] hmm, looks like T275072 [19:33:44] sorry, I got distracted and forgot to actually sync it [19:33:44] T275072: LocalFileLockError: Could not acquire lock for "mwstore: …" (via ApiUpload.php) - https://phabricator.wikimedia.org/T275072 [19:33:55] Krinkle: no worries. yeah, it shouldn't have [19:34:00] ok, syncing now then [19:34:34] (03CR) 10Andrew Bogott: [C: 03+2] Designate/Victoria: remove a hacked file [puppet] - 10https://gerrit.wikimedia.org/r/677350 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [19:35:14] (03Abandoned) 10Ottomata: Release 2020.02~wmf4 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/677645 (https://phabricator.wikimedia.org/T279480) (owner: 10Ottomata) [19:35:29] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: no-op for Beta (disable LocalisationUpdate extension) (duration: 01m 06s) [19:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:58] (03Restored) 10Ottomata: Release 2020.02~wmf4 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/677645 (https://phabricator.wikimedia.org/T279480) (owner: 10Ottomata) [19:36:05] Krinkle: ty! [19:36:15] (03CR) 10Ottomata: [C: 03+2] Release 2020.02~wmf4 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/677645 (https://phabricator.wikimedia.org/T279480) (owner: 10Ottomata) [19:36:17] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Release 2020.02~wmf4 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/677645 (https://phabricator.wikimedia.org/T279480) (owner: 10Ottomata) [19:39:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) Thank you very much @Papaul for the swift and great work. 
[19:39:47] (03PS6) 10Andrew Bogott: Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) [19:39:49] (03PS5) 10Andrew Bogott: Removed an unneeded setting from nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/677360 (https://phabricator.wikimedia.org/T261137) [19:39:51] (03PS1) 10Andrew Bogott: Neutron: forward our dmz hacks to Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677647 (https://phabricator.wikimedia.org/T261137) [19:40:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) work continues on T278396 now [19:40:53] (03CR) 10jerkins-bot: [V: 04-1] Neutron: forward our dmz hacks to Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677647 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [19:44:27] (03CR) 10Dzahn: "Looks good to me, just also not sure about the value for Google site verification. Ema, I boldly added you because you are both traffic a" [dns] - 10https://gerrit.wikimedia.org/r/677626 (https://phabricator.wikimedia.org/T275904) (owner: 10Jforrester) [19:44:32] (03CR) 10Ottomata: [C: 03+2] Revert "Revert "Fix bug in jupyterhub-conda ..."" [puppet] - 10https://gerrit.wikimedia.org/r/677403 (owner: 10Ottomata) [19:45:05] andrewbogott: pupept merge conflict! ok to merge yours? 
[19:45:16] Andrew Bogott: Designate/Victoria: remove a hacked file (e620be7b7d) [19:45:22] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on conf2006.codfw.wmnet with reason: REIMAGE [19:45:25] ottomata: yes please [19:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:32] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on conf2006.codfw.wmnet with reason: REIMAGE [19:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:05] (03PS1) 10Dzahn: Revert "site/conftool-data: mw2397 through mw2402 back to insetup, not ready yet" [puppet] - 10https://gerrit.wikimedia.org/r/677404 [19:52:21] (03CR) 10jerkins-bot: [V: 04-1] Revert "site/conftool-data: mw2397 through mw2402 back to insetup, not ready yet" [puppet] - 10https://gerrit.wikimedia.org/r/677404 (owner: 10Dzahn) [19:52:50] (03CR) 10Dzahn: [C: 03+2] "we are reverting that we were not ready, because now we are, remaining servers can be taken into prod service" [puppet] - 10https://gerrit.wikimedia.org/r/677404 (owner: 10Dzahn) [19:54:23] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['conf2006.codfw.wmnet'] ` and were **ALL** successful. [19:54:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul) [19:55:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul) 05Open→03Resolved @Joe this is complete. 
[19:56:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [19:57:59] (03CR) 10Bstorm: cloud email alerts: remove f-strings in case of stretch vms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677599 (owner: 10Bstorm) [19:59:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [20:00:05] chrisalbon and accraze: Time to snap out of that daydream and deploy [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]]. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210407T2000). [20:00:10] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 88535 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [20:00:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15527 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [20:06:20] Amir1: monitoring works and indicates ongoing work? 
[20:06:50] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul) [20:07:02] mutante: that wasn't me [20:07:06] legoktm: ^ [20:07:30] and it's mailman2 so it should use the old monitoring [20:08:11] ack, it's the old server, just assumed it's the exporting of data [20:08:13] or so [20:16:47] uhm [20:17:26] PROBLEM - DPKG on an-worker1081 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:18:00] PROBLEM - DPKG on an-worker1095 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:18:12] PROBLEM - DPKG on analytics1077 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:18:16] PROBLEM - DPKG on an-worker1103 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:18:35] well, it was just a socket timeout, could have been hickup anywhere [20:18:49] never got an active response that somethig is not ok on that server [20:19:12] and looks ok now [20:19:18] Apr 07 19:56:23 lists1001 nrpe[24951]: Error: (!log_opts) Could not complete SSL handshake with 208.80.154.84: 5 [20:19:20] PROBLEM - DPKG on an-worker1082 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:19:21] stuff like that in the logs [20:19:48] PROBLEM - DPKG on analytics1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:20:03] legoktm: that is alert1001 talking to icinga1001 it seems [20:20:27] but yes, I don't see anything wrong with lists1001 itself [20:20:28] eh, i mean, there are both, icinga1001 and alert1001 but alert1001 is the active icinga server [20:20:43] that IP up there was icinga1001 [20:20:53] *nod*, yes [20:23:10] PROBLEM - DPKG on an-worker1085 is CRITICAL: 
DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:23:34] PROBLEM - DPKG on an-worker1126 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:24:56] PROBLEM - DPKG on an-worker1088 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:26:07] ^ that stuff must be upgrades or something temporary. an-worker1085 doesnt show it to me [20:26:58] what the DPKG check does is pretty much "dpkg -l | grep -v ^ii" [20:30:51] !log mw2397 through mw2402 - new hardware moving into production, initial puppet runs as appservers, added to monitoring etc (T278396) [20:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:59] T278396: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 [20:31:22] jouncebot: next [20:31:23] In 2 hour(s) and 28 minute(s): [[Backport windows|Evening backport window]]
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210407T2300) [20:35:01] (03CR) 10Cwhite: [C: 03+2] logstash: remove logstash output on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/676477 (https://phabricator.wikimedia.org/T234854) (owner: 10Cwhite) [20:37:56] PROBLEM - DPKG on an-worker1100 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:44:33] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [20:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:52] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01003 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:48:32] RECOVERY - DPKG on an-worker1081 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:49:02] RECOVERY - DPKG on an-worker1095 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:49:12] RECOVERY - DPKG on analytics1077 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:49:16] RECOVERY - DPKG on an-worker1103 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:50:28] re: "widespread" puppet failures. it just adds up because multiple people are installing [20:50:33] should recover soonish [20:51:01] speaking for mw but an-worker looks like it too [20:51:11] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:17] deploy* has issues too..
that i do NOT know about yet [20:53:14] RECOVERY - DPKG on an-worker1085 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:53:17] (03PS1) 10Urbanecm: Add growthexperiments_mentee_data to private tables [puppet] - 10https://gerrit.wikimedia.org/r/677653 (https://phabricator.wikimedia.org/T279587) [20:53:24] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.00531 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:53:38] RECOVERY - DPKG on an-worker1126 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:54:17] !log mw2397 - mw2402 - rebooting [20:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:37] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [20:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:00] RECOVERY - DPKG on an-worker1088 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:55:20] PROBLEM - Host mw2400 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:52] PROBLEM - Host mw2399 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:52] PROBLEM - Host mw2398 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mw[2397-2399].codfw.wmnet with reason: new_install [20:55:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw[2397-2399].codfw.wmnet with reason: new_install [20:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mw[2400-2401].codfw.wmnet with reason: new_install [20:56:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 
mw[2400-2401].codfw.wmnet with reason: new_install [20:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:26] RECOVERY - Host mw2399 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [20:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:30] RECOVERY - Host mw2398 is UP: PING OK - Packet loss = 0%, RTA = 31.71 ms [20:56:44] RECOVERY - Host mw2400 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [20:56:50] eh, puppet hasn't run on deploy1002 [20:56:58] The last Puppet run was at Wed Apr 7 13:25:10 UTC 2021 (450 minutes ago). [20:57:12] legoktm: yea, i noticed that on the puppetboard but did not get to looking at it [20:57:26] PROBLEM - DPKG on stat1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:57:31] juggling the mw boxes [20:58:16] I'm looking into it [20:58:21] thanks [20:58:23] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:47] legoktm: 2002 is same [21:02:07] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw239[7-9].codfw.wmnet [21:02:07] looks like it was https://gerrit.wikimedia.org/r/c/operations/puppet/+/677228 [21:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:25] akosiaris: still around? 
[21:03:20] !log clearing watchlist of bots in enwiki (https://en.wikipedia.org/w/index.php?title=Wikipedia:Bots/Noticeboard&oldid=1016563560#Clearing_bot_watchlists) [21:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:00] RECOVERY - DPKG on analytics1070 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:04:23] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw240[0-2].codfw.wmnet [21:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:30] (03PS1) 10Cwhite: remove alerting_host role from icinga[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/677656 (https://phabricator.wikimedia.org/T247966) [21:04:43] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw239[7-9].codfw.wmnet [21:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw240[0-2].codfw.wmnet [21:05:02] (03CR) 10Legoktm: "Puppet is failing on deploy1002/deploy2002 with: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: " [puppet] - 10https://gerrit.wikimedia.org/r/677228 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [21:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:46] !log mw2397 - mw2402 - scap pull [21:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:26] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul) [21:06:53] ah, I see the logic bug [21:07:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Jclark-ctr) @jijiki Racking these host i only have 2 available spots in D4 will any of the ones in this rack be 
decommissioned soon? [21:08:47] great :) [21:10:34] RECOVERY - DPKG on an-worker1082 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:12:51] admin is skipped in the upper block and then it tries to require the admin resource that was skipped [21:12:58] I assume it should be skipped in the bottom block too [21:15:20] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T275676 (10Papaul) [21:16:58] (03PS7) 10Andrew Bogott: Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) [21:17:00] (03PS6) 10Andrew Bogott: Removed an unneeded setting from nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/677360 (https://phabricator.wikimedia.org/T261137) [21:17:02] (03PS2) 10Andrew Bogott: Neutron: forward our dmz hacks to Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677647 (https://phabricator.wikimedia.org/T261137) [21:17:04] (03PS1) 10Andrew Bogott: codfw1dev designate -> OpenStack Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677658 (https://phabricator.wikimedia.org/T261137) [21:18:19] (03CR) 10jerkins-bot: [V: 04-1] Neutron: forward our dmz hacks to Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677647 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [21:20:32] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) Oh thanks. you rock. Importing now. 
[21:21:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw239[7-9].codfw.wmnet [21:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:31] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw240[0-2].codfw.wmnet [21:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:33] !log mw2397 through mw2402 - pooled as new API appservers after scap pull and all monitoring green (T278396) [21:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:41] T278396: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 [21:25:34] 10SRE, 10serviceops: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) [21:25:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) [21:25:48] 10SRE, 10serviceops: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) 05Open→03Resolved rack A3 completed [21:26:03] 10SRE, 10serviceops: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) mw2397 - mw2402 set to Active in Netbox [21:28:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) mw2397 through mw2402 set to Active in Netbox. Need new ticket and follow-up for mw2403 through mw2411 next. 
[21:29:56] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-otto-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:12] (03PS3) 10Andrew Bogott: Neutron: forward our dmz hacks to Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677647 (https://phabricator.wikimedia.org/T261137) [21:31:43] !log deployed patch for T279451 [21:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:33] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev designate -> OpenStack Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677658 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [21:38:08] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) Finished ` real 44m15.207s ` [21:39:50] RECOVERY - DPKG on an-worker1100 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:44:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) [21:45:59] !log mforns@deploy1002 Started deploy [analytics/refinery@1dbbd3d]: Regular analytics weekly train [analytics/refinery@1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3] [21:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:59] checking stat1006 [21:49:52] (03PS1) 10Andrew Bogott: Replace cloudcephmon2001-dev with cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/677663 (https://phabricator.wikimedia.org/T276509) [21:49:54] (03PS1) 10Andrew Bogott: Switch cloudcephmon2001-dev to a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/677664 [21:50:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:51:37] (03CR) 10Andrew Bogott: "David, I'm happy to role this out but adding you as a reviewer to make sure I don't break/change ceph while you're in the middle of perfor" [puppet] - 10https://gerrit.wikimedia.org/r/677663 (https://phabricator.wikimedia.org/T276509) (owner: 10Andrew Bogott) [21:52:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:52:16] PROBLEM - DPKG on stat1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:54:05] (03CR) 10Andrew Bogott: [C: 03+2] Removed an unneeded setting from nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/677360 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [21:54:14] (03CR) 10Andrew Bogott: [C: 03+2] Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [21:59:46] RECOVERY - DPKG on stat1008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:00:34] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) Still rebuilding the index. 
[22:00:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:01:12] !log deployed patch for T279451 (part 2) [22:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:03:27] !log clearing watchlist of bots in wikidatawiki (https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&oldid=1397670734#Clean_up_watchlist_of_bots) [22:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:10] (03CR) 10CRusnov: "Just a little linguistic nit inline." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/666601 (owner: 10David Caro) [22:10:54] /win 7 [22:13:55] 10SRE, 10serviceops: bring 25 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2401) - https://phabricator.wikimedia.org/T278396 (10Dzahn) [22:20:29] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) Done: ` ladsgroup@lists1002:~$ time sudo django-admin update_index_one_list wikitech-l@lists-next.wikimedia.org --pythonp... [22:20:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson @Cmjohnson I have finished up these remaining ones i can do once i get space in A7 and D4. nam... 
[22:21:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Jclark-ctr) [22:21:33] (03PS1) 10Legoktm: kubernetes: Fix requiring a resource that doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/677667 [22:21:56] (03PS2) 10Legoktm: kubernetes: Fix requiring a resource that doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/677667 [22:23:08] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28945/console" [puppet] - 10https://gerrit.wikimedia.org/r/677667 (owner: 10Legoktm) [22:23:19] (03CR) 10Dzahn: [C: 03+1] "It makes sense to me, given the existing "if $svcname != 'admin'" above. And would be nice to fix the puppet run on deploy*." [puppet] - 10https://gerrit.wikimedia.org/r/677667 (owner: 10Legoktm) [22:23:24] RECOVERY - DPKG on stat1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:23:47] mutante: I was just about to ping you, ty :) [22:24:03] legoktm: I understand it now that I see your code [22:24:07] looks good [22:24:22] I think it's safe to do now since even if the files are needed, it doesn't absent anything that was already there [22:25:08] I think it's safe to do as well, if the compiler says "only compiles with this change" on deplo* [22:26:22] "no change or only compiles with this change" [22:26:34] (03CR) 10Legoktm: [V: 03+1 C: 03+2] kubernetes: Fix requiring a resource that doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/677667 (owner: 10Legoktm) [22:28:17] (03CR) 10Legoktm: "Fixed in Change-Id: If2edb20918ed43de7c5f06b505404ec21dd53fb6" [puppet] - 10https://gerrit.wikimedia.org/r/677228 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [22:28:31] succeeded on deploy1002, running on 2002 now [22:28:54] !log mforns@deploy1002 Finished deploy [analytics/refinery@1dbbd3d]: Regular analytics 
weekly train [analytics/refinery@1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3] (duration: 42m 54s) [22:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:01] oh, it already ran there [22:29:04] !log mforns@deploy1002 Started deploy [analytics/refinery@1dbbd3d] (thin): Regular analytics weekly train THIN [analytics/refinery@1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3] [22:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:11] !log mforns@deploy1002 Finished deploy [analytics/refinery@1dbbd3d] (thin): Regular analytics weekly train THIN [analytics/refinery@1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3] (duration: 00m 07s) [22:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:34] !log mforns@deploy1002 Started deploy [analytics/refinery@1dbbd3d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3] [22:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:19] (03PS1) 10Ottomata: Release 2020.02~wmf5 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/677669 (https://phabricator.wikimedia.org/T279480) [22:31:42] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:05] legoktm: 👍 thanks for fixing it [22:33:15] :) [22:33:28] just manually ran package_builder_Clean_up_build_directory.service on deneb too to fix that alert [22:33:50] !log mforns@deploy1002 Finished deploy [analytics/refinery@1dbbd3d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3] (duration: 04m 15s) [22:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:14] oh, i see. nice! 
[22:38:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:40:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:47:20] (03PS1) 10Dzahn: site/conftool-data: assign 4 x API, 4 x app, 2 x jobrunner, rack A5 [puppet] - 10https://gerrit.wikimedia.org/r/677674 (https://phabricator.wikimedia.org/T279599) [22:48:11] !log mforns@deploy1002 Started deploy [analytics/refinery@1dbbd3d] (hadoop-test): Regular analytics weekly train TEST retry1 [analytics/refinery@1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3] [22:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:58] !log mforns@deploy1002 Finished deploy [analytics/refinery@1dbbd3d] (hadoop-test): Regular analytics weekly train TEST retry1 [analytics/refinery@1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3] (duration: 01m 47s) [22:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:17] (03CR) 10Dzahn: "for comparison, eqiad is: 63 / 63 / 24 so that's 4 more jobrunners and 2 fewer of app/API. previously we had said that 18 jobrunners of" [puppet] - 10https://gerrit.wikimedia.org/r/677674 (https://phabricator.wikimedia.org/T279599) (owner: 10Dzahn) [22:59:00] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) https://lists-next.wikimedia.org/hyperkitty/list/wikitech-l@lists-next.wikimedia.org/thread/CTYPGVR22FOHDFZOZ3RGNQNL34MIR... 
[23:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do [[Backport windows|Evening backport window]] deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210407T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:01:18] 10SRE, 10Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10TJones) Some potential nice-to-have features, from recent discussions about reindexing problems: * Catching reindexing failures in general and issuing an alert/warning. Right now the... [23:01:19] ugh [23:01:25] did I add the patch to the wrong window [23:01:55] I have a patch, not sure why jouncebot didn't pick it up. [23:02:39] RoanKattouw, Niharika, Urbanecm: ^ [23:03:15] not sure, maybe a bug [23:03:19] anyway, let's see [23:03:34] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:02] (03CR) 10Urbanecm: [C: 03+2] Wikibase: sample function call counters at 1:100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676435 (https://phabricator.wikimedia.org/T277817) (owner: 10Ori.livneh) [23:04:44] (03Merged) 10jenkins-bot: Wikibase: sample function call counters at 1:100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676435 (https://phabricator.wikimedia.org/T277817) (owner: 10Ori.livneh) [23:05:33] ori: pulled to mwdebug1001, not sure if you can do anything meaningful there though [23:06:09] Urbanecm: sure, just a sec [23:06:16] take your time [23:06:47] looks ok [23:07:10] syncing [23:10:06] PROBLEM - DPKG on an-coord1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:10:07] !log urbanecm@deploy1002 Synchronized wmf-config/Wikibase.php: 321bf91da7e823f026c2c2bdcc57d8cf60a52ba5: Wikibase: sample function call counters at 1:100 (T277817) (duration: 01m 08s) [23:10:14] ori: done [23:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:15]
T277817: incrementStatsKey() calls in Wikibase Lua are expensive - https://phabricator.wikimedia.org/T277817 [23:10:17] anything else? [23:11:24] nope [23:11:30] thank you very much [23:12:08] np [23:24:27] (03PS1) 10Brennen Bearnes: logspam: silence rare but annoying UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/677676 [23:25:43] (03CR) 10Legoktm: [C: 03+1] site/conftool-data: assign 4 x API, 4 x app, 2 x jobrunner, rack A5 [puppet] - 10https://gerrit.wikimedia.org/r/677674 (https://phabricator.wikimedia.org/T279599) (owner: 10Dzahn) [23:35:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:37:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:41:20] RECOVERY - DPKG on an-coord1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:45:50] (03CR) 10Bstorm: cloud email alerts: remove f-strings in case of stretch vms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677599 (owner: 10Bstorm)