[00:00:04] Deploy window No deploys! DC Switchover. See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201028T0000) [00:02:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:02:54] PROBLEM - Check the last execution of grafana-ldap-users-sync on grafana1002 is CRITICAL: CRITICAL: Status of the systemd unit grafana-ldap-users-sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:03:48] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:33:44] PROBLEM - Disk space on maps2002 is CRITICAL: DISK CRITICAL - free space: /srv 59838 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2002&var-datasource=codfw+prometheus/ops [01:36:40] PROBLEM - Check systemd state on an-worker1101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:09:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Reedy) [02:09:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1028 with 10G interfaces - https://phabricator.wikimedia.org/T266514 (10Reedy) [02:10:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1026 with 10G interfaces - https://phabricator.wikimedia.org/T266281 (10Reedy) [02:10:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1027 with 10G interfaces - https://phabricator.wikimedia.org/T266369 (10Reedy) [02:10:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1029 with 10G interfaces - https://phabricator.wikimedia.org/T266206 (10Reedy) [02:10:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10Reedy) [02:10:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T264806 (10Reedy) [02:10:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Reedy) [02:10:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (10Reedy) [02:10:20] 10Operations, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1014 with 10G interfaces - https://phabricator.wikimedia.org/T226188 (10Reedy) [02:10:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): 
relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Reedy) [02:10:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Reedy) [02:10:31] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1016 with 10G interfaces - https://phabricator.wikimedia.org/T228692 (10Reedy) [02:10:33] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Reedy) [02:10:35] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 (10Reedy) [02:10:38] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Reedy) [02:10:47] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Reedy) [02:10:49] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Reedy) [02:11:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Reedy) [02:11:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Reedy) [02:15:34] PROBLEM - Disk space on maps2002 is CRITICAL: DISK CRITICAL - free space: /srv 63445 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2002&var-datasource=codfw+prometheus/ops [02:17:09] (03PS1) 10Gergő Tisza: Suggested edits: Include page ID with task preview data [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636787 (https://phabricator.wikimedia.org/T266600) [02:57:54] !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-restart [02:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:43] !log T266492 Beginning rolling restart of codfw cirrus cluster, 3 nodes at a time, on `ryankemper@cumin2001` tmux session `elasticsearch_restart_codfw` [02:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:48] T266492: Restart elasticsearch clusters to apply readahead changes - https://phabricator.wikimedia.org/T266492 [03:28:42] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:32:00] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:56] PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:38:22] (03PS1) 10Ryan Kemper: cirrus: fix shard_size thresholds [puppet] - 10https://gerrit.wikimedia.org/r/636811 [04:43:12] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0) [04:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:43:45] !log T266492 Finished rolling restart of codfw cirrus cluster [04:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:43:50] T266492: Restart elasticsearch clusters to apply readahead changes - https://phabricator.wikimedia.org/T266492 [04:46:05] (03PS2) 10Ryan Kemper: cirrus: fix shard_size thresholds [puppet] - 10https://gerrit.wikimedia.org/r/636811 [04:53:28] PROBLEM - Check systemd state on an-worker1100 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:34:29] (03PS3) 10Ryan Kemper: cirrus: fix shard_size thresholds [puppet] - 10https://gerrit.wikimedia.org/r/636811 (https://phabricator.wikimedia.org/T265908) [05:35:48] (03CR) 10jerkins-bot: [V: 04-1] cirrus: fix shard_size thresholds [puppet] - 10https://gerrit.wikimedia.org/r/636811 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper) [05:37:46] (03PS4) 10Ryan Kemper: cirrus: fix shard_size thresholds [puppet] - 10https://gerrit.wikimedia.org/r/636811 (https://phabricator.wikimedia.org/T265908) [05:39:01] (03CR) 10jerkins-bot: [V: 04-1] cirrus: fix shard_size thresholds [puppet] - 10https://gerrit.wikimedia.org/r/636811 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper) [06:06:07] (03CR) 10Ryan Kemper: [C: 03+2] "Will ship this Weds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636480 (owner: 10Ebernhardson) [06:06:52] (03Merged) 10jenkins-bot: Increase cirrus morelike pool counter by 20% [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/636480 (owner: 10Ebernhardson) [06:10:10] (03PS1) 10Ryan Kemper: Revert "cirrus: Hardcode more_like to codfw cirrus cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636791 [06:11:45] (03PS5) 10Ryan Kemper: cirrus: fix shard_size thresholds [puppet] - 10https://gerrit.wikimedia.org/r/636811 (https://phabricator.wikimedia.org/T265908) [06:13:00] (03CR) 10jerkins-bot: [V: 04-1] cirrus: fix shard_size thresholds [puppet] - 10https://gerrit.wikimedia.org/r/636811 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper) [06:43:30] (03CR) 10Elukey: "@Razzi: there are still some leftovers in the webserver.yaml file (see my comments above), can you check it?" [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [06:53:53] (03PS1) 10Marostegui: db2093: Clarify it is active for orchestrator DB [puppet] - 10https://gerrit.wikimedia.org/r/636816 (https://phabricator.wikimedia.org/T266003) [06:53:57] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: add npm [puppet] - 10https://gerrit.wikimedia.org/r/636817 [06:54:21] (03CR) 10Marostegui: [C: 03+2] db2093: Clarify it is active for orchestrator DB [puppet] - 10https://gerrit.wikimedia.org/r/636816 (https://phabricator.wikimedia.org/T266003) (owner: 10Marostegui) [07:05:40] (03Abandoned) 10Elukey: admin: allow users to be removed preserving their home directories [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey) [07:05:48] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:51] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (10elukey) 05Open→03Declined Declining this since we have been following another path over the past year and it worked well, will re-op... [07:12:50] (03PS1) 10Jcrespo: mariadb-test: Move db1077 from test-s4 to test-s1 [puppet] - 10https://gerrit.wikimedia.org/r/636818 (https://phabricator.wikimedia.org/T187984) [07:14:07] (03CR) 10Marostegui: [C: 03+1] mariadb-test: Move db1077 from test-s4 to test-s1 [puppet] - 10https://gerrit.wikimedia.org/r/636818 (https://phabricator.wikimedia.org/T187984) (owner: 10Jcrespo) [07:15:37] (03CR) 10Jcrespo: [C: 03+2] mariadb-test: Move db1077 from test-s4 to test-s1 [puppet] - 10https://gerrit.wikimedia.org/r/636818 (https://phabricator.wikimedia.org/T187984) (owner: 10Jcrespo) [07:15:54] RECOVERY - Check systemd state on an-worker1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:56] RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:38] RECOVERY - Check systemd state on an-worker1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:45] !log swift codfw-prod: bump object weight for ms-be2057 - T261633 [07:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:52] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 [07:35:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
See inline" (031 comment) [software/ecs] - 10https://gerrit.wikimedia.org/r/636513 (owner: 10Cwhite) [07:36:23] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) For archival purposes, this is the (naive) code solution for... [07:40:40] !log update thanos-fe1002 to thanos 0.16.0 - T261281 [07:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:46] T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281 [07:53:15] !log upgraded python3-wmflib to 0.0.3 on the cumin hosts - T257905 [07:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:20] T257905: Spin off common Spicerack modules into a standalone Python library importable anywhere - https://phabricator.wikimedia.org/T257905 [07:53:40] (03CR) 10Volans: [C: 03+2] Remove modules migrated to wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/636000 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [07:57:03] (03Merged) 10jenkins-bot: Remove modules migrated to wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/636000 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [08:04:55] (03CR) 10Elukey: requests: add new module (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 (owner: 10Volans) [08:06:08] (03PS1) 10Elukey: zookeeper: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/636864 [08:11:10] (03PS2) 10Elukey: zookeeper: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/636864 [08:17:47] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:51] (03PS3) 10Elukey: zookeeper: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/636864 [08:20:53] (03PS1) 10Elukey: profile::java: add support for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/636865 [08:23:50] (03CR) 10Volans: "replies to questions/comments" (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 (owner: 10Volans) [08:24:39] PROBLEM - SSH on ms-be2037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:25:18] (03PS2) 10Elukey: profile::java: add support for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/636865 [08:25:20] (03PS4) 10Elukey: zookeeper: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/636864 [08:29:13] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:43] RECOVERY - SSH on ms-be2037 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:29:45] (03PS5) 10Elukey: zookeeper: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) [08:31:22] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/26180/" [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) (owner: 10Elukey) [08:32:10] (03CR) 10Elukey: "Going to quickly test it manually but overall it looks good!" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 (owner: 10Volans) [08:32:19] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:17] (03PS2) 10Volans: requests: add new module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 [08:37:29] !log updated dump grants on db2093 [08:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:32] (03PS1) 10Filippo Giunchedi: alertmanager: add dashboard url to irc messages [puppet] - 10https://gerrit.wikimedia.org/r/636868 (https://phabricator.wikimedia.org/T266017) [08:38:31] (03CR) 10Elukey: [C: 03+1] requests: add new module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 (owner: 10Volans) [08:38:48] (03CR) 10jerkins-bot: [V: 04-1] alertmanager: add dashboard url to irc messages [puppet] - 10https://gerrit.wikimedia.org/r/636868 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [08:39:01] 10Operations, 10Analytics-Clusters: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10elukey) a:05razzi→03None [08:40:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:40:22] (03CR) 10Filippo Giunchedi: "Failure is due to this wmf-style violation:" [puppet] - 10https://gerrit.wikimedia.org/r/636868 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [08:40:24] 10Operations, 10Analytics-Clusters: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10elukey) a:03elukey Going to takeover the ownership of the task since I need to do some refactoring of some code that I have written :) [08:40:37] 10Operations, 10Analytics-Clusters, 10Analytics-Kanban: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10elukey) [08:41:51] RECOVERY - Prometheus jobs reduced availability on 
alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:44:13] (03PS2) 10Filippo Giunchedi: alertmanager: add dashboard url to irc messages [puppet] - 10https://gerrit.wikimedia.org/r/636868 (https://phabricator.wikimedia.org/T266017) [08:48:11] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:43] (03PS1) 10Filippo Giunchedi: Revert "hieradata: move swiftrepl to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/636873 [08:50:58] 10Operations, 10DBA, 10User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (10Marostegui) 05Open→03Resolved a:03Kormat Going to close this as resolved as the packages are uploaded. Thank you Stevie! Per T266023#6570807, medium-term we should take a look at c... [08:51:00] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (10Marostegui) [08:56:17] (03CR) 10Gehel: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans) [09:02:16] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [09:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:57] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:07] 10Operations, 10Analytics-Clusters, 10Analytics-Kanban: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10MoritzMuehlenhoff) One gotcha: conf1* is still on jessie (and consequently Java 7), and I don't think anything accounts for Java 7 yet [09:15:35] moritzm: you have a cr for java 7 support :) [09:17:16] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: orchestrator: integrate promotion rules into puppet - https://phabricator.wikimedia.org/T266002 (10Marostegui) [09:17:25] 10Operations, 10DBA, 10Orchestrator, 10Patch-For-Review, 10User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (10Marostegui) [09:17:41] ah, "good" :-) [09:17:49] will look in a bit [09:20:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [09:22:22] (03CR) 10Muehlenhoff: "Obsoleted/duplicated by https://gerrit.wikimedia.org/r/c/operations/puppet/+/636730" [puppet] - 10https://gerrit.wikimedia.org/r/636614 (owner: 10Muehlenhoff) [09:22:32] (03Abandoned) 10Muehlenhoff: Only handle auto restart of Jenkins on active instance [puppet] - 10https://gerrit.wikimedia.org/r/636614 (owner: 10Muehlenhoff) [09:23:24] 10Operations, 10ops-eqiad: Power supply lost for analytics1072 - https://phabricator.wikimedia.org/T266644 (10elukey) [09:23:57] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:13] 10Operations, 10Analytics-Clusters, 10Analytics-Kanban: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10elukey) >>! 
In T264176#6584036, @MoritzMuehlenhoff wrote: > One gotcha: conf1* is still on jessie (and consequently Java 7), and I don't think anything accounts for Java 7... [09:24:49] 10Operations, 10SRE-Access-Requests: New prod ssh key for calbon - https://phabricator.wikimedia.org/T266498 (10ema) p:05Triage→03Medium [09:26:14] !log imported kubeyaml 0.0.3~20201027+git5f5556c-1 to buster-wikimedia [09:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:55] (03CR) 10Kormat: orchestrator: Install mariadb client [puppet] - 10https://gerrit.wikimedia.org/r/636616 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [09:29:00] (03CR) 10Kormat: [C: 03+2] orchestrator: Install mariadb client [puppet] - 10https://gerrit.wikimedia.org/r/636616 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [09:29:21] (03CR) 10Kormat: orchestrator: Search both eqiad and codfw dns [puppet] - 10https://gerrit.wikimedia.org/r/636613 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [09:29:23] (03CR) 10Kormat: [C: 03+2] orchestrator: Search both eqiad and codfw dns [puppet] - 10https://gerrit.wikimedia.org/r/636613 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [09:32:46] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: Integrate orchestrator with !log - https://phabricator.wikimedia.org/T266452 (10Marostegui) [09:34:06] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: orchestrator: Add service monitoring - https://phabricator.wikimedia.org/T266338 (10Marostegui) [09:34:15] 10Operations, 10DBA, 10Orchestrator, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Marostegui) [09:35:02] (03PS1) 10Kormat: pontoon: Add orchestrator role in mariadb104-test [puppet] - 10https://gerrit.wikimedia.org/r/636880 [09:36:22] (03CR) 10Kormat: [C: 03+2] pontoon: Add orchestrator role in mariadb104-test [puppet] - 10https://gerrit.wikimedia.org/r/636880 (owner: 10Kormat) [09:46:13] 
(03CR) 10Jbond: [C: 04-1] "see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [09:49:11] (03PS1) 10JMeybohm: Test charts/deployments for compatibility with k8s 1.19 [deployment-charts] - 10https://gerrit.wikimedia.org/r/636881 (https://phabricator.wikimedia.org/T266032) [09:49:23] (03PS1) 10Urbanecm: [cswiki] Set wgGEHomepageManualAssignmentMentorsList to Wikipedie:Potřebuji pomoc/Mentoři/Manuální [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636882 (https://phabricator.wikimedia.org/T245639) [09:49:39] (03CR) 10jerkins-bot: [V: 04-1] Test charts/deployments for compatibility with k8s 1.19 [deployment-charts] - 10https://gerrit.wikimedia.org/r/636881 (https://phabricator.wikimedia.org/T266032) (owner: 10JMeybohm) [09:49:49] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [09:49:52] (03CR) 10JMeybohm: [C: 04-1] "Depends on:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/636881 (https://phabricator.wikimedia.org/T266032) (owner: 10JMeybohm) [09:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:31] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [09:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:27] (03CR) 10Jbond: wmflib:: add data type for puppetmaster server type and use it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635660 (owner: 10Dzahn) [09:51:35] (03CR) 10Jbond: wmflib: add data type for SSLVerifyClient and use it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635658 (owner: 10Dzahn) [09:52:17] 10Operations, 10SRE-Access-Requests: New prod ssh key for calbon - https://phabricator.wikimedia.org/T266498 (10ema) I've pinged @calbon on Google Chat asking to confirm the public key, taking care of the puppet change once I hear from him. 
[09:52:51] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:34] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/634192 (owner: 10Alexandros Kosiaris) [09:54:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:08] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/636653 (https://phabricator.wikimedia.org/T266561) (owner: 10Ayounsi) [10:02:37] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:33] (03PS4) 10Filippo Giunchedi: Grafana config changes for CAS-enabled grafana-rw.w.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/629122 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [10:05:07] (03CR) 10Jbond: [C: 03+2] systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [10:05:44] (03CR) 10Filippo Giunchedi: [C: 03+2] Grafana config changes for CAS-enabled grafana-rw.w.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/629122 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [10:09:19] (03PS1) 10Muehlenhoff: Also enable cn=grafana-admin for grafana-rw.w.o [puppet] - 10https://gerrit.wikimedia.org/r/636885 (https://phabricator.wikimedia.org/T262512) [10:14:17] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/636881 (https://phabricator.wikimedia.org/T266032) (owner: 10JMeybohm) [10:14:29] 10Operations, 10Analytics-Clusters, 10vm-requests: Create a ganeti VM in eqiad: an-test-ui1001.eqiad.wmnet - https://phabricator.wikimedia.org/T266648 
(10elukey) p:05Triage→03Medium [10:17:31] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:22] 10Operations, 10Analytics-Clusters, 10vm-requests: Create a ganeti VM in eqiad: an-test-ui1001.eqiad.wmnet - https://phabricator.wikimedia.org/T266648 (10elukey) ` elukey@ganeti1011:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 36 preferred ovs=False, ssh_port=22, o... [10:19:44] (03CR) 10Gehel: [C: 04-1] "Good find! See a few style comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636811 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper) [10:20:35] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [10:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:04] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:04] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:00] !log A:cp (except cp3052, running varnish 5) upgrade libvmod-netmapper to 1.9-1 T266567 T264398 [10:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:07] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [10:25:07] T266567: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 [10:27:14] (03PS1) 10Jbond: apereo_cas: dont monitor systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636889 [10:27:55] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: dont monitor systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636889 (owner: 10Jbond) [10:28:21] (03PS2) 10Jbond: apereo_cas: dont monitor systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636889 [10:28:49] (03CR) 10DCausse: [C: 03+1] Revert "cirrus: Hardcode more_like to codfw cirrus cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636791 (owner: 10Ryan Kemper) [10:29:59] (03PS1) 10Elukey: sre.ganeti.makevm: ask to review args before DNS allocation [cookbooks] - 10https://gerrit.wikimedia.org/r/636890 [10:32:30] (03PS1) 10Jbond: helm: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636892 [10:35:06] (03PS1) 10Mvolz: Update zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/636896 [10:35:37] !log elukey@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [10:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:07] (03PS2) 10Jbond: helm: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636892 [10:38:04] !log clean up 10.64.5.7 and 2620:0:861:104:10:64:5:7 from 
Netbox (records mistakenly allocated via the makevm cookbook) - T266648 [10:38:07] 10Operations, 10Traffic: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion - https://phabricator.wikimedia.org/T266651 (10ema) [10:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:13] T266648: Create a ganeti VM in eqiad: an-test-ui1001.eqiad.wmnet - https://phabricator.wikimedia.org/T266648 [10:38:19] 10Operations, 10Traffic: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion - https://phabricator.wikimedia.org/T266651 (10ema) p:05Triage→03High [10:39:14] (03PS1) 10Jbond: profile::docker::builder: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636898 [10:39:33] !log due to T266651, cancel the entry above: A:cp upgrade libvmod-netmapper to 1.9-1 T266567 T264398 [10:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:42] T266651: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion - https://phabricator.wikimedia.org/T266651 [10:39:42] T266567: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 [10:39:42] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [10:41:53] (03PS1) 10Jbond: profile::icinga: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636899 [10:48:56] (03PS1) 10Vgutierrez: ATS: Turn ECDHE-ECDSA-AES128-SHA support off [puppet] - 10https://gerrit.wikimedia.org/r/636901 (https://phabricator.wikimedia.org/T258405) [10:50:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/636765 (https://phabricator.wikimedia.org/T266593) (owner: 10Bstorm) [10:53:59] (03PS1) 10Vgutierrez: ssl_ciphersuite: Remove CBC based cipher suites [puppet] - 10https://gerrit.wikimedia.org/r/636902 (https://phabricator.wikimedia.org/T258405) [10:57:24] (03PS3) 10Jbond: helm: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636892 [10:57:41] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds fine" [puppet] - 10https://gerrit.wikimedia.org/r/636817 (owner: 10Elukey) [10:59:28] (03CR) 10Muehlenhoff: [C: 04-2] "Yeah, this is a legitimate error and it should be handled instead. The fact that a few cases of failing restarts were showing up is simply" [puppet] - 10https://gerrit.wikimedia.org/r/636728 (owner: 10Dzahn) [11:01:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636889 (owner: 10Jbond) [11:02:44] (03CR) 10Volans: [C: 03+1] "LGTM, one nit inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/636890 (owner: 10Elukey) [11:03:03] (03CR) 10Ema: [C: 03+1] ATS: Turn ECDHE-ECDSA-AES128-SHA support off [puppet] - 10https://gerrit.wikimedia.org/r/636901 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [11:09:06] (03PS1) 10Kosta Harlan: Define scaffold_version before attempting to use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/636905 [11:09:39] (03PS2) 10Elukey: sre.ganeti.makevm: ask to review args before DNS allocation [cookbooks] - 10https://gerrit.wikimedia.org/r/636890 [11:10:24] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::packages::statistics: add npm [puppet] - 10https://gerrit.wikimedia.org/r/636817 (owner: 10Elukey) [11:11:04] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) Retro item: dealing with the date displayed on the banner, [[ 
https://meta.wikimedia.org/wiki/MediaWiki_talk:Centralnotice-tem... [11:11:09] (03CR) 10Elukey: [C: 03+2] sre.ganeti.makevm: ask to review args before DNS allocation [cookbooks] - 10https://gerrit.wikimedia.org/r/636890 (owner: 10Elukey) [11:14:20] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: Explore orchestrator hooks to integrate them with !log, irc alerts and emails - https://phabricator.wikimedia.org/T266452 (10Marostegui) [11:15:04] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: Explore orchestrator hooks to integrate them with !log, irc alerts and emails - https://phabricator.wikimedia.org/T266452 (10Marostegui) [11:15:12] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: Explore orchestrator hooks to integrate them with !log, irc alerts and emails - https://phabricator.wikimedia.org/T266452 (10Marostegui) Along with !log we should include sending an email/irc alerts on some of the most important cases like: PostUnsuccessf... [11:18:05] PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:41] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: Explore orchestrator hooks to integrate them with !log, irc alerts and emails - https://phabricator.wikimedia.org/T266452 (10Peachey88) [11:24:01] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={0,1} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-dat [11:24:01] r-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [11:24:37] (03PS1) 10Muehlenhoff: Hide the "Sign out" menu option when using CAS [puppet] - 10https://gerrit.wikimedia.org/r/636907 (https://phabricator.wikimedia.org/T262512) [11:24:51] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::icinga: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636899 (owner: 10Jbond) [11:25:15] (03CR) 10Filippo Giunchedi: [C: 03+1] Hide the "Sign out" menu option when using CAS [puppet] - 10https://gerrit.wikimedia.org/r/636907 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [11:30:04] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10ayounsi) Trying to load: https://logstash-next.wikimedia.org/app/dashboards#/view/6bcd2a10-7d21-11e7-86fb-51c84229aeb7 My laptop fan starts spinning very hard, everything tim... 
[11:30:37] (03PS3) 10Jbond: base::labs: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635905 (owner: 10Dzahn) [11:32:18] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [11:32:25] (03PS5) 10Jbond: base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [11:32:33] (03PS4) 10Jbond: base::labs: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635905 (owner: 10Dzahn) [11:32:41] (03PS1) 10Hashar: ci: run docker with debug logging [puppet] - 10https://gerrit.wikimedia.org/r/636908 (https://phabricator.wikimedia.org/T265615) [11:33:35] (03PS1) 10Hnowlan: maps: reenable eqiad OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/636909 (https://phabricator.wikimedia.org/T254014) [11:34:45] (03PS5) 10Jbond: base::labs: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635905 (owner: 10Dzahn) [11:34:47] (03PS6) 10Jbond: base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [11:35:45] 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Marostegui) [11:35:54] (03Abandoned) 10Hnowlan: maps: reenable eqiad OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/636909 (https://phabricator.wikimedia.org/T254014) (owner: 10Hnowlan) [11:36:12] (03CR) 10Jbond: [C: 03+2] base::labs: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635905 (owner: 10Dzahn) [11:36:33] (03CR) 10Jbond: [C: 03+2] "I flipped the relation chain so this can get merged first and will mereg" [puppet] - 10https://gerrit.wikimedia.org/r/635905 (owner: 10Dzahn) [11:36:36] (03CR) 10Hashar: "I have cherry picked it. 
That then requires docker to be reloaded 'systemctl reload docker'. I have no idea about the amount of logs that" [puppet] - 10https://gerrit.wikimedia.org/r/636908 (https://phabricator.wikimedia.org/T265615) (owner: 10Hashar) [11:36:53] 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Marostegui) p:05Triage→03Medium [11:36:59] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [11:38:24] 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Marostegui) [11:38:27] 10Operations, 10DBA, 10User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (10Marostegui) [11:39:03] RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:10] (03CR) 10Hnowlan: [C: 03+1] Define scaffold_version before attempting to use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/636905 (owner: 10Kosta Harlan) [11:44:25] (03CR) 10Jbond: [C: 03+1] "lgtm optional comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636865 (owner: 10Elukey) [11:46:30] !log configure urpf strict log-only on cr3-ulsfo:et-0/0/1.501 - T266561 [11:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:37] T266561: Apply uRPF strict mode on Customer links - https://phabricator.wikimedia.org/T266561 [11:47:24] (03CR) 10Hashar: [C: 04-1] ci: run docker with debug logging [puppet] - 10https://gerrit.wikimedia.org/r/636908 (https://phabricator.wikimedia.org/T265615) (owner: 10Hashar) [11:47:50] (03CR) 10Muehlenhoff: [C: 03+2] Hide the "Sign out" menu option when using CAS [puppet] - 10https://gerrit.wikimedia.org/r/636907 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [11:50:56] 
(03CR) 10Muehlenhoff: profile::java: add support for Jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636865 (owner: 10Elukey) [11:52:35] (03CR) 10Jbond: "lgtm some optional nits" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) (owner: 10Elukey) [11:54:47] (03CR) 10Ayounsi: Add uRPF strict mode to Customers links (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/636653 (https://phabricator.wikimedia.org/T266561) (owner: 10Ayounsi) [11:55:19] 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10MoritzMuehlenhoff) If this is solely about the need to bind to a privileged port, ` sudo setcap 'cap_net_bind_service=+ep' $ORCHESTRATORBINARY ` might also simply work out? [11:56:45] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 (owner: 10Volans) [11:57:06] (03CR) 10Jbond: [C: 03+2] profile::icinga: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636899 (owner: 10Jbond) [11:57:15] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636898 (owner: 10Jbond) [11:57:51] moritzm: ok to merge [11:58:58] moritzm: merging https://gerrit.wikimedia.org/r/636907 [11:59:04] ack, please go ahead [11:59:07] done [11:59:28] (03CR) 10Jbond: [C: 03+2] apereo_cas: dont monitor systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636889 (owner: 10Jbond) [12:03:17] (03CR) 10Jbond: [C: 03+2] profile::docker::builder: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636898 (owner: 10Jbond) [12:04:26] (03PS4) 10Hnowlan: Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [12:04:36] (03CR) 10JMeybohm: [C: 03+1] "Agreed, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/636892 (owner: 10Jbond) [12:07:24] 10Operations, 10netops, 10Patch-For-Review: Apply uRPF strict mode on Customer links - https://phabricator.wikimedia.org/T266561 (10ayounsi) Pushed the following: `lang=diff [edit interfaces et-0/0/1 unit 501 family inet] + rpf-check { + apply-groups-except external-links; + fail-fi... [12:09:07] (03PS5) 10Hnowlan: Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [12:10:32] (03PS2) 10Ayounsi: Add uRPF strict mode to Customers links [homer/public] - 10https://gerrit.wikimedia.org/r/636653 (https://phabricator.wikimedia.org/T266561) [12:11:00] (03CR) 10Ayounsi: "Diff on ulsfo routers:" [homer/public] - 10https://gerrit.wikimedia.org/r/636653 (https://phabricator.wikimedia.org/T266561) (owner: 10Ayounsi) [12:11:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:14:20] (03PS1) 10JMeybohm: eventrouter: deploy to codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/636912 (https://phabricator.wikimedia.org/T262675) [12:19:47] 10Operations, 10netops, 10Patch-For-Review: Apply uRPF strict mode on Customer links - https://phabricator.wikimedia.org/T266561 (10ayounsi) Another one: `Oct 28 12:08:13 cr4-ulsfo fpc0 PFE_FW_SYSLOG_ETH_IP: FW: et-0/0/1.501 A 01f5:0800 ac:1f:6b:c4:38:c8 -> ec:38:73:75:34:cf tcp 64.x 202.y 30799 54744 (1 p... [12:29:36] (03CR) 10Ayounsi: [C: 03+1] "Fun." [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans) [12:31:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:32:09] PROBLEM - SSH on ms-be2039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:33:41] (03CR) 10Ayounsi: [C: 03+1] "spicerack._lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/634056 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [12:34:23] RECOVERY - SSH on ms-be2039 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:37:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:37:38] (03CR) 10Muehlenhoff: zookeeper: use profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) (owner: 10Elukey) [12:38:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:39:58] !log installing libdatetime-timezone-perl updates [12:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:11] 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Kormat) There's no reason it needs a privileged port. It will be behind a reverse proxy anyway. The package doesn't create a user/group, so that's the first thing to fix. 
[12:47:51] 10Operations, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10jijiki) [12:48:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:56:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:03:08] (03CR) 10Filippo Giunchedi: [C: 03+1] Also enable cn=grafana-admin for grafana-rw.w.o [puppet] - 10https://gerrit.wikimedia.org/r/636885 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [13:03:38] (03PS1) 10Kosta Harlan: linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) [13:04:05] (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [13:07:26] (03PS1) 10Filippo Giunchedi: grafana: fix bytes vs string error in ldap_users_sync [puppet] - 10https://gerrit.wikimedia.org/r/636917 (https://phabricator.wikimedia.org/T265712) [13:07:35] (03CR) 10Ottomata: "FYI npm and nodejs are included in the anaconda / conda distribution. If they use anaconda, they can npm install already. 
Although...loo" [puppet] - 10https://gerrit.wikimedia.org/r/636817 (owner: 10Elukey) [13:07:56] (03CR) 10jerkins-bot: [V: 04-1] grafana: fix bytes vs string error in ldap_users_sync [puppet] - 10https://gerrit.wikimedia.org/r/636917 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [13:10:08] (03PS2) 10Filippo Giunchedi: grafana: fix bytes vs string error in ldap_users_sync [puppet] - 10https://gerrit.wikimedia.org/r/636917 (https://phabricator.wikimedia.org/T265712) [13:11:26] 10Operations, 10Puppet, 10observability, 10Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10ayounsi) Let's not over-engineer it. **Automatic**. For what I understand, that data is to be used for "Grafana dashboards and Cumin", so if Netbo... [13:12:07] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/636922 [13:15:37] (03CR) 10Matthias Mullie: "Code seems to make senses; just spotted a few things that I don't know are intentional." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [13:30:03] (03CR) 10CDanis: [C: 03+1] Revert "hieradata: move swiftrepl to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/636873 (owner: 10Filippo Giunchedi) [13:34:14] (03CR) 10Elukey: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/636817 (owner: 10Elukey) [13:36:45] (03PS2) 10Niedzielski: admin: remove niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/636671 [13:36:58] (03PS1) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [13:37:00] (03PS1) 10Jbond: java: add new java version facts [puppet] - 10https://gerrit.wikimedia.org/r/636924 [13:37:42] (03CR) 10jerkins-bot: [V: 04-1] admin: remove niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/636671 (owner: 10Niedzielski) [13:37:45] (03CR) 10jerkins-bot: [V: 04-1] java: add new java version facts [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [13:40:24] (03PS3) 10Niedzielski: admin: remove niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/636671 [13:42:41] (03CR) 10Niedzielski: admin: remove niedzielski (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636671 (owner: 10Niedzielski) [13:44:51] (03CR) 10JMeybohm: [C: 03+2] eventrouter: deploy to codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/636912 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:45:13] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [13:47:39] (03Merged) 10jenkins-bot: eventrouter: deploy to codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/636912 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:52:09] (03PS3) 10Filippo Giunchedi: grafana: fix bytes vs string error in ldap_users_sync [puppet] - 
10https://gerrit.wikimedia.org/r/636917 (https://phabricator.wikimedia.org/T265712) [13:52:11] (03PS1) 10Filippo Giunchedi: grafana: make grafana-rw dashboards link work for anonymous users [puppet] - 10https://gerrit.wikimedia.org/r/636927 (https://phabricator.wikimedia.org/T265712) [13:54:48] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . [13:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:08] (03PS2) 10Filippo Giunchedi: grafana: make grafana-rw dashboards link work for anonymous users [puppet] - 10https://gerrit.wikimedia.org/r/636927 (https://phabricator.wikimedia.org/T265712) [13:56:36] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/26182/grafana2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/636927 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [13:58:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636917 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [13:58:44] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "hieradata: move swiftrepl to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/636873 (owner: 10Filippo Giunchedi) [14:00:09] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: fix bytes vs string error in ldap_users_sync [puppet] - 10https://gerrit.wikimedia.org/r/636917 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [14:02:47] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10gsingers) @MoritzMuehlenhoff Thanks! Approved. 
[14:04:59] 10Operations, 10DBA, 10User-Kormat: Clean up role::mariadb::ferm and profile::mariadb::ferm - https://phabricator.wikimedia.org/T265901 (10LSobanski) [14:05:24] 10Operations, 10DBA, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config, 10User-Kormat: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10LSobanski) [14:05:44] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . [14:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:53] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:50] (03PS1) 10Muehlenhoff: Fold Grafana settings for CAS into the Hiera role data [puppet] - 10https://gerrit.wikimedia.org/r/636929 (https://phabricator.wikimedia.org/T265712) [14:11:52] (03PS1) 10Muehlenhoff: Point grafana-rw to grafana1002 [puppet] - 10https://gerrit.wikimedia.org/r/636930 (https://phabricator.wikimedia.org/T265712) [14:12:57] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:05] (03CR) 10C. Scott Ananian: Enable parsoid on api_appserver (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [14:16:06] (03CR) 10C. Scott Ananian: "This patch shouldn't be necessary if $wgParsoidEnableRESTAPI defaults to true?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/635095 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [14:17:24] (03Abandoned) 10Ppchelko: Enable Parsoid REST API when loading it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635095 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [14:18:39] (03PS2) 10Muehlenhoff: Fold Grafana settings for CAS into the Hiera role data [puppet] - 10https://gerrit.wikimedia.org/r/636929 (https://phabricator.wikimedia.org/T265712) [14:26:13] (03PS6) 10Ppchelko: Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) [14:27:13] (03CR) 10Ppchelko: Enable parsoid on api_appserver (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [14:31:47] (03CR) 10Subramanya Sastry: "If we can run npm install on testreduce1001 VM, we can probably drop this entire repo. But, we'll need some puppet code to init it on test" [puppet] - 10https://gerrit.wikimedia.org/r/577656 (owner: 10C. Scott Ananian) [14:36:45] (03CR) 10Ottomata: "Naw I think its ok, we still need to spend some time on making this the default way to use stat boxes, but for that we need to get rid of " [puppet] - 10https://gerrit.wikimedia.org/r/636817 (owner: 10Elukey) [14:38:35] (03CR) 10MSantos: [C: 03+2] Update mobileapps to 2020-10-26-150740-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/636496 (https://phabricator.wikimedia.org/T264024) (owner: 10Ppchelko) [14:39:08] 10Operations, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10jijiki) [14:39:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, to be merged once we're GTG" [puppet] - 10https://gerrit.wikimedia.org/r/636929 (https://phabricator.wikimedia.org/T265712) (owner: 10Muehlenhoff) [14:39:41] (03CR) 10C. 
Scott Ananian: [C: 04-2] "LGTM, although a comment here would be nice." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [14:39:46] (03PS1) 10Muehlenhoff: Update email address for Nuria [puppet] - 10https://gerrit.wikimedia.org/r/636936 [14:41:01] (03CR) 10Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [14:41:44] (03Merged) 10jenkins-bot: Update mobileapps to 2020-10-26-150740-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/636496 (https://phabricator.wikimedia.org/T264024) (owner: 10Ppchelko) [14:41:56] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Excellent, thanks! Closing this task since everything is completed now. I'll merge https://gerrit.wikimedia.org/r/c... 
[14:42:24] (03PS7) 10Ppchelko: Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) [14:42:28] (03CR) 10Ppchelko: Enable parsoid on api_appserver (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [14:45:14] (03PS1) 10Kormat: debian: Add repack script [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/636937 [14:45:28] (03PS2) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [14:46:02] (03CR) 10Ahmon Dancy: [C: 04-1] "Holding for modifications" [puppet] - 10https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [14:46:24] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [14:46:33] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Add repack script [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/636937 (owner: 10Kormat) [14:48:13] (03CR) 10Filippo Giunchedi: [C: 03+1] Point grafana-rw to grafana1002 [puppet] - 10https://gerrit.wikimedia.org/r/636930 (https://phabricator.wikimedia.org/T265712) (owner: 10Muehlenhoff) [14:50:53] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [14:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:01] (03PS6) 10Elukey: zookeeper: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) [14:57:13] (03CR) 10Elukey: zookeeper: use profile::java (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) (owner: 10Elukey) [14:57:30] 10Operations, 10observability, 10Epic: Monitor and alarm on SMART attributes [tracking] - https://phabricator.wikimedia.org/T86552 
(10fgiunchedi) [14:58:03] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:15] 10Operations, 10observability, 10Patch-For-Review: Open Phab tasks on SMART failure - https://phabricator.wikimedia.org/T196994 (10fgiunchedi) [14:59:49] (03PS3) 10Elukey: profile::java: add support for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/636865 [14:59:51] (03PS7) 10Elukey: zookeeper: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) [15:00:20] (03CR) 10Elukey: profile::java: add support for Jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636865 (owner: 10Elukey) [15:01:30] (03CR) 10jerkins-bot: [V: 04-1] profile::java: add support for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/636865 (owner: 10Elukey) [15:04:06] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul) p:05Triage→03Medium [15:04:15] (03PS4) 10Giuseppe Lavagetto: Add --force flag to safe-service-restart.py [puppet] - 10https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [15:04:17] (03PS4) 10Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) [15:04:19] (03PS4) 10Giuseppe Lavagetto: poolcounter: add client configuration classes [puppet] - 10https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055) [15:04:21] (03PS4) 10Giuseppe Lavagetto: profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - 10https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055) [15:04:23] (03PS4) 10Giuseppe Lavagetto: restbase: add poolcounter support to safe-service-restart 
scripts [puppet] - 10https://gerrit.wikimedia.org/r/635994 (https://phabricator.wikimedia.org/T266055) [15:04:25] (03CR) 10Giuseppe Lavagetto: profile::lvs::realserver: add ability to configure poolcounter for pools (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [15:04:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:06:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:07:26] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Papaul) [15:08:02] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Papaul) 05Open→03Resolved complete [15:08:32] there's lag in logstash5 codfw, I'll take a look [15:09:11] (03PS3) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [15:09:13] (03PS1) 10Jbond: pick_initscript_spec: use shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/636942 [15:10:19] 10Operations, 10netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (10ayounsi) [15:10:22] (03CR) 10Volans: "> Patch Set 3: Code-Review+1" [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans) [15:10:25] !log roll restart logstash5 in codfw [15:10:28] (03CR) 10Volans: "reply inline" [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 
(owner: 10Volans) [15:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:33] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [15:11:06] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10Papaul) [15:11:33] (03CR) 10Jbond: [C: 03+2] pick_initscript_spec: use shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/636942 (owner: 10Jbond) [15:12:01] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/634056 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:15:42] (03PS4) 10Elukey: profile::java: add support for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/636865 [15:15:44] (03PS8) 10Elukey: zookeeper: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) [15:19:42] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Cmjohnson) [15:20:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636865 (owner: 10Elukey) [15:20:43] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Cmjohnson) a:05Cmjohnson→03RobH @robh These still need the raid setup, you mentioned you could do that. If not please let me know and I will take c... [15:23:26] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[15:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:39] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10Papaul) [15:24:06] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks, looks good! I'll merge the patch after the 7th." [puppet] - 10https://gerrit.wikimedia.org/r/636671 (owner: 10Niedzielski) [15:24:53] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [15:24:53] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [15:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:28] (03PS1) 10Jbond: base: fix spec test [puppet] - 10https://gerrit.wikimedia.org/r/636943 [15:26:10] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/26185/" [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) (owner: 10Elukey) [15:27:10] (03CR) 10Jbond: [C: 03+2] base: fix spec test [puppet] - 10https://gerrit.wikimedia.org/r/636943 (owner: 10Jbond) [15:29:39] (03PS1) 10Elukey: aptrepo: add flink to the bigtop14 package list [puppet] - 10https://gerrit.wikimedia.org/r/636944 (https://phabricator.wikimedia.org/T266495) [15:33:09] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [15:33:17] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[15:33:17] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [15:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:33:46] (03CR) 10Elukey: [C: 03+2] aptrepo: add flink to the bigtop14 package list [puppet] - 10https://gerrit.wikimedia.org/r/636944 (https://phabricator.wikimedia.org/T266495) (owner: 10Elukey) [15:35:21] PROBLEM - kubelet operational latencies on kubernetes2010 is CRITICAL: instance=kubernetes2010.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:36:05] PROBLEM - kubelet operational latencies on kubernetes2008 is CRITICAL: instance=kubernetes2008.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:36:12] (03CR) 10Volans: [C: 03+2] requests: add new module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 (owner: 10Volans) [15:36:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) (owner: 10Elukey) [15:37:39] (03Merged) 10jenkins-bot: requests: add new module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 (owner: 10Volans) [15:38:51] kubelet stuff is "fine" (consequence of mobileapps deploy...) 
[15:38:52] (03PS7) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) [15:39:40] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10jijiki) [15:39:47] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10LGoto) a:03Jgiannelos [15:41:02] (03PS1) 10Jbond: node_intel_microcode: fix time spec 1h vs hourly [puppet] - 10https://gerrit.wikimedia.org/r/636968 [15:41:51] PROBLEM - kubelet operational latencies on kubernetes2007 is CRITICAL: instance=kubernetes2007.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:41:56] (03CR) 10Niedzielski: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/636671 (owner: 10Niedzielski) [15:42:07] (03CR) 10Jbond: [C: 03+2] node_intel_microcode: fix time spec 1h vs hourly [puppet] - 10https://gerrit.wikimedia.org/r/636968 (owner: 10Jbond) [15:43:27] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:43:31] RECOVERY - kubelet operational latencies on kubernetes2007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:43:39] RECOVERY - kubelet operational latencies on kubernetes2010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:44:23] PROBLEM - kubelet operational latencies on kubernetes2008 is CRITICAL: instance=kubernetes2008.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:46:05] RECOVERY - kubelet operational latencies on kubernetes2008 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:46:34] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10LGoto) a:03Jgiannelos [15:47:46] (03Merged) 10jenkins-bot: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:48:55] (03PS1) 10Volans: netbox: add dependency on python3-wmflib [puppet] - 10https://gerrit.wikimedia.org/r/636969 [15:49:47] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [15:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:20] (03PS3) 10Volans: dns: add retry logic to all Netbox API calls [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 [15:50:57] (03CR) 10Volans: "Updated to use the new feature in the wmflib package. Added the depedency in Puppet in the Depends-On patch." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 (owner: 10Volans) [15:51:25] PROBLEM - kubelet operational latencies on kubernetes1013 is CRITICAL: instance=kubernetes1013.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:51:43] !log restarting uwsgi on ores in eqiad [15:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:47] PROBLEM - kubelet operational latencies on kubernetes1012 is CRITICAL: instance=kubernetes1012.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:55:43] (03CR) 10Ppchelko: [C: 03+2] Add changeprop rules for newcomerTasksCacheRefreshJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/636078 (https://phabricator.wikimedia.org/T260758) (owner: 10Catrope) [15:56:07] RECOVERY - kubelet operational latencies on kubernetes1012 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:56:23] RECOVERY - kubelet operational latencies on kubernetes1013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:59:29] (03Merged) 10jenkins-bot: Add changeprop rules for newcomerTasksCacheRefreshJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/636078 (https://phabricator.wikimedia.org/T260758) (owner: 10Catrope) [16:01:51] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:19] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[16:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:20] (03PS1) 10Cmjohnson: Add mac addresses to dhcp for clouddb1013-1020 [puppet] - 10https://gerrit.wikimedia.org/r/636971 (https://phabricator.wikimedia.org/T260441) [16:05:40] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:05:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Cmjohnson) [16:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:54] (03CR) 10Cmjohnson: [C: 03+2] Add mac addresses to dhcp for clouddb1013-1020 [puppet] - 10https://gerrit.wikimedia.org/r/636971 (https://phabricator.wikimedia.org/T260441) (owner: 10Cmjohnson) [16:06:25] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[16:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:00] (03PS5) 10Jeena Huneidi: [DNM] Experimental King helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/634354 (https://phabricator.wikimedia.org/T258572) [16:07:35] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:09:05] (03PS6) 10Jeena Huneidi: King helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/634354 (https://phabricator.wikimedia.org/T258572) [16:09:49] (03PS1) 10Cmjohnson: updating site.pp entry for new ES servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/636972 (https://phabricator.wikimedia.org/T260370) [16:09:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:11:28] (03CR) 10Cmjohnson: [C: 03+2] updating site.pp entry for new ES servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/636972 (https://phabricator.wikimedia.org/T260370) (owner: 10Cmjohnson) [16:11:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:14:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Cmjohnson) a:05Cmjohnson→03RobH @robh these are ready for install, the raid configuration has been completed. Just need to do the fin... 
[16:15:07] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:15:08] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:37] (03CR) 10Hnowlan: [C: 03+2] Temporarily disable tilerator in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [16:16:02] !log Disabling tilerator in eqiad [16:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:12] (03Abandoned) 10Andrew Bogott: shinkengen for all projects [puppet] - 10https://gerrit.wikimedia.org/r/374897 (https://phabricator.wikimedia.org/T166845) (owner: 10Alex Monk) [16:18:03] (03CR) 10Marostegui: "This was already done a few weeks ago at https://gerrit.wikimedia.org/r/c/operations/puppet/+/620881 - going to revert this and update the" [puppet] - 10https://gerrit.wikimedia.org/r/636467 (https://phabricator.wikimedia.org/T260370) (owner: 10Cmjohnson) [16:18:54] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=kartotherian,service=kartotherian,name=maps1004.eqiad.wmnet [16:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:48] (03CR) 10Bstorm: [C: 03+2] k8s-haproxy: take steps to fix logging [puppet] - 10https://gerrit.wikimedia.org/r/636765 (https://phabricator.wikimedia.org/T266593) (owner: 10Bstorm) [16:20:58] (03PS1) 10Marostegui: site.pp: Remove duplicate external store entries. [puppet] - 10https://gerrit.wikimedia.org/r/636974 (https://phabricator.wikimedia.org/T260370) [16:21:33] 10Operations, 10DC-Ops, 10Platform Engineering, 10serviceops: Rename wtp* servers to parse* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (10Dzahn) The codfw part of this is done meanwhile. 
There are only parse2* but no wtp2*. (T247441 and others) The eqiad part though is still left t... [16:21:35] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove duplicate external store entries. [puppet] - 10https://gerrit.wikimedia.org/r/636974 (https://phabricator.wikimedia.org/T260370) (owner: 10Marostegui) [16:21:37] PROBLEM - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1004.eqiad.wmnet [16:22:04] bstorm: ok to merge your change? [16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:08] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=maps,service=kartotherian-ssl,name=maps1004.eqiad.wmnet [16:22:09] Sure! [16:22:12] I was just about to [16:22:14] bstorm: Merging! [16:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:17] Thanks! [16:27:11] PROBLEM - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:44] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: Explore orchestrator hooks to integrate them with dbctl, !log, irc alerts and emails - https://phabricator.wikimedia.org/T266452 (10Marostegui) [16:28:01] 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Marostegui) [16:28:15] ACKNOWLEDGEMENT - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Hnowlan Expected for maps resync https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:49] (03PS1) 10Bstorm: k8s-haproxy: Fix a typo in the logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/636975 (https://phabricator.wikimedia.org/T266593) [16:29:09] 10Operations, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt100[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T263151 (10Cmjohnson) [16:29:14] 10Operations, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt100[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T263151 (10Cmjohnson) 05Open→03Resolved [16:29:46] (03PS6) 10Hnowlan: Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [16:31:35] (03CR) 10Hnowlan: [C: 03+2] Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [16:31:40] 10Operations, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt1015.eqiad.wmnet - https://phabricator.wikimedia.org/T260840 (10Cmjohnson) [16:31:44] (03PS2) 10Bstorm: k8s-haproxy: Fix a typo in the logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/636975 (https://phabricator.wikimedia.org/T266593) [16:31:51] 10Operations, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt1015.eqiad.wmnet - https://phabricator.wikimedia.org/T260840 (10Cmjohnson) 05Open→03Resolved [16:32:11] (03PS3) 10Bstorm: k8s-haproxy: Fix a typo in the logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/636975 (https://phabricator.wikimedia.org/T266593) [16:34:34] (03CR) 10Bstorm: "Just saw this because it conflicts with something I was doing. Is it still a current patch?" 
[puppet] - 10https://gerrit.wikimedia.org/r/604782 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [16:34:40] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:34:41] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:34:44] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:34:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:34:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:54] (03CR) 10Bstorm: [C: 03+2] k8s-haproxy: Fix a typo in the logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/636975 (https://phabricator.wikimedia.org/T266593) (owner: 10Bstorm) [16:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:35] 10Operations, 10Patch-For-Review: logrotate for visualdiff tests on Parsoid test host (scandium) - https://phabricator.wikimedia.org/T161920 (10Dzahn) Yes, there are 56G in scandium:/srv/visualdiff/pngs but that is just over 50% so not an acute issue here. @ssastry Do you ever manually delete pngs from /srv/v... [16:38:48] (03CR) 10Dzahn: [C: 03+1] "looks good for the production part: https://puppet-compiler.wmflabs.org/compiler1002/26186/" [puppet] - 10https://gerrit.wikimedia.org/r/577043 (owner: 10C. Scott Ananian) [16:44:38] (03CR) 10Dzahn: "ready for an ACK or non-veto from WMCS team. 
Since it is opt-in it should not have any consequences until a host or role is added in Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [16:45:46] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) @Paladox Please see the change above. still interested in this? [16:46:23] (03CR) 10Andrew Bogott: [C: 03+1] "Fine with me -- since it's a no-op by default it seems harmless." [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [16:46:53] (03Abandoned) 10Dzahn: wmf-auto-restart: return 0 if service is not present or running [puppet] - 10https://gerrit.wikimedia.org/r/636728 (owner: 10Dzahn) [16:49:27] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:37] ^ was working on fixing that, will look again [16:54:48] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:34] (03PS1) 10Ppchelko: Temporary enable 'editpage' warn logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636983 (https://phabricator.wikimedia.org/T251023) [17:02:53] (03CR) 10DannyS712: Temporary enable 'editpage' warn logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636983 (https://phabricator.wikimedia.org/T251023) (owner: 10Ppchelko) [17:04:09] (03PS2) 10Ppchelko: Temporary enable 'editpage' warn logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636983 (https://phabricator.wikimedia.org/T251023) [17:11:22] (03CR) 10Jbond: "Have tested this with `bundle exec rake global:spec` and profile fails however ` bundle exec rake global:spec:profile` succeeds" [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [17:12:14] (03PS4) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [17:12:21] (03CR) 10Paladox: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [17:12:54] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Paladox) +1'd [17:13:42] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [17:15:25] (03PS2) 10Jbond: java: add new java version facts [puppet] - 10https://gerrit.wikimedia.org/r/636924 [17:15:32] (03PS1) 10Andrew Bogott: deployment-prep: unset 
profile::cassandra::metrics_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/636987 [17:16:29] (03CR) 10Andrew Bogott: "This should fix puppet on at least 6 deployment-prep instances" [puppet] - 10https://gerrit.wikimedia.org/r/636987 (owner: 10Andrew Bogott) [17:16:46] (03CR) 10Andrew Bogott: [C: 03+2] deployment-prep: unset profile::cassandra::metrics_whitelist [puppet] - 10https://gerrit.wikimedia.org/r/636987 (owner: 10Andrew Bogott) [17:17:35] 10Operations, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10jijiki) [17:18:24] 10Operations, 10Patch-For-Review: logrotate for visualdiff tests on Parsoid test host (scandium) - https://phabricator.wikimedia.org/T161920 (10ssastry) Since we aren't going to be running parsoid-vd and parsoid-vd-client on scanidum, that whole dir can be removed and unmounted (iirc, that is an external volum... [17:19:59] (03CR) 10Jbond: "ignore the previous CR in this chain (unless you know rspec then review most welcome)." [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [17:20:24] (03PS3) 10Jbond: java: add new java version facts [puppet] - 10https://gerrit.wikimedia.org/r/636924 [17:22:25] (03PS1) 10Bstorm: paws: Get haproxy logging working [puppet] - 10https://gerrit.wikimedia.org/r/636988 (https://phabricator.wikimedia.org/T266593) [17:22:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:23:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:24:24] (03CR) 10Jbond: [C: 03+2] helm: drop monitoring for systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636892 (owner: 10Jbond) [17:24:26] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) >>! In T234854#6584418, @ayounsi wrote: > Trying to load: https://logstash-next.wikimedia.org/app/dashboards#/view/6bcd2a10-7d21-11e7-86fb-51c84229aeb7 > > My laptop... [17:24:42] !log removing OSM database on maps1004 [17:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:19] RECOVERY - Host ms-be1057 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [17:28:13] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636865 (owner: 10Elukey) [17:28:16] 10Operations, 10ops-eqiad: Power supply lost for analytics1072 - https://phabricator.wikimedia.org/T266644 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson fixed [17:28:33] 10Operations, 10Platform Engineering, 10serviceops, 10User-jijiki: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (10jijiki) [17:29:13] 10Operations, 10ops-eqiad, 10DC-Ops: ms-be1057 down - cable disconnected? - https://phabricator.wikimedia.org/T266604 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson I am sorry, I am not sure how I did that but I did....it's fixed now. 
[17:30:24] !log reimporting OSM data for eqiad [17:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:42] (03CR) 10Jbond: [C: 03+1] "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) (owner: 10Elukey) [17:34:53] RECOVERY - IPMI Sensor Status on analytics1072 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:34:53] PROBLEM - Host cloudvirt1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:37:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Cmjohnson) this server does not have a 10GB nic card [17:38:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Cmjohnson) @Andrew @Bstorm Do you want me to put this back where it was? [17:38:20] (03CR) 10Bstorm: [C: 03+2] "Since this is the same change as for Toolforge, I'm going to go ahead and merge it." 
[puppet] - 10https://gerrit.wikimedia.org/r/636988 (https://phabricator.wikimedia.org/T266593) (owner: 10Bstorm) [17:38:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1028 with 10G interfaces - https://phabricator.wikimedia.org/T266514 (10Cmjohnson) @andrew @bstrom This server does not have a 10GB nic card [17:39:19] (03CR) 10Razzi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [17:39:30] 10Operations, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10Cmjohnson) [17:39:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1028 with 10G interfaces - https://phabricator.wikimedia.org/T266514 (10Bstorm) That's very surprising! [17:39:39] 10Operations, 10ops-eqiad, 10DC-Ops: fix/replace cable ID 2648 on FB peering patch - cable report error - https://phabricator.wikimedia.org/T266497 (10Cmjohnson) 05Open→03Resolved Fixed, the cable that we had labeled as 2648 is actually 2649 [17:39:47] RECOVERY - Host cloudvirt1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [17:40:27] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:40:56] (03PS2) 10Razzi: stats: Remove nginx from thorium [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) [17:42:34] (03CR) 10Dzahn: [C: 04-1] "Enum['absent', 'present'], got 'enable'" [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [17:43:54] (03PS7) 10Dzahn: base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) [17:44:14] (03CR) 10Elukey: [C: 03+1] stats: Remove 
nginx from thorium [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [17:44:26] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [17:44:29] (03CR) 10Razzi: [C: 03+2] stats: Remove nginx from thorium [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [17:44:33] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [17:46:07] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:49] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [17:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:10] nice race! let's see who wins [17:47:13] and please don't abort [17:47:28] perfect way to test the locking mechanism [17:48:59] today I am not lucky with DNS [17:49:09] (03PS5) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [17:51:25] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [17:51:55] PROBLEM - Host labstore1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1028 with 10G interfaces - https://phabricator.wikimedia.org/T266514 (10Bstorm) The quote at T201352 lists the combined QLogic 57800 NIC, which should have 10 and 1 GB ports. 
[17:52:33] PROBLEM - Host cloudvirt1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Bstorm) The quote at T201352 lists the combined QLogic 57800 NIC, which should have 10 and 1 GB ports. I do not know if that matches reality... [17:53:20] (03PS1) 10Legoktm: Look for service.template in various code directories [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/636993 (https://phabricator.wikimedia.org/T266692) [17:54:03] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26189/" [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [17:54:15] volans: I see the diff, cmjohnson1 did you get a failure? [17:54:28] both should see the diff [17:54:52] the first one that pushes the commit "wins" and the other one should fail when pushing [17:54:54] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) [17:55:12] because history changed [17:55:26] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10aaron) Regarding RedisLockManager (it only needs 2 of the 3 host to be reachable). If one of them is depooled or refuses connection... [17:55:28] volans: ok ok, so should I go forward or should I wait for cmjohnson1's feedback? 
[17:55:49] (03CR) 10DannyS712: [C: 03+1] Temporary enable 'editpage' warn logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636983 (https://phabricator.wikimedia.org/T251023) (owner: 10Ppchelko) [17:56:09] for the sake of testing if we can get a feedback better, but it's designed to DTRT whatever you do :) [17:56:59] RECOVERY - Host labstore1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201028T1800). [18:00:04] kaldari, tgr, ryankemper, and Pchelolo: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:13] here [18:00:26] o/ [18:01:11] \o [18:01:16] mine is super simple - just enabling a logging channel [18:02:00] Oh I just noticed I made a mistake last night - I forgot that +2'ing the patch auto submits [18:02:05] Should I do a quick revert of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/636480? [18:02:51] ryankemper: if it was not deployed yet, sure [18:02:51] !log elukey@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [18:02:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Bstorm) From online photos, I'd expect he NIC to have 4 ports, and 2 of those would be 10Gb [18:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:04] tgr_: happy to deploy, unless you want to lead the window? 
[18:03:21] (03PS1) 10Ryan Kemper: Revert "Increase cirrus morelike pool counter by 20%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636955 [18:03:35] (03CR) 10Ryan Kemper: [C: 03+2] Revert "Increase cirrus morelike pool counter by 20%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636955 (owner: 10Ryan Kemper) [18:03:43] PROBLEM - Host labstore1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:48] I don't think there's much point in reverting, we can just deploy it first [18:03:55] Urbanecm: works for me either way [18:04:24] tgr_: it's you deploying then :) [18:04:24] Well I'm great at bad timing :P [18:04:38] (03Merged) 10jenkins-bot: Revert "Increase cirrus morelike pool counter by 20%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636955 (owner: 10Ryan Kemper) [18:04:40] (03PS6) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [18:05:06] It's reverted so you can proceed in order of the queue [18:06:04] ok, thanks [18:06:25] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [18:06:31] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:10] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) ` dzahn@wikistats-dancing-goat:~$ sudo systemctl start cleanup_puppet_client_bucket.timer dzahn@wikista... 
[18:07:17] (03CR) 10Gergő Tisza: [C: 03+2] Removing obsolete license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619880 (owner: 10Kaldari) [18:07:19] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:35] (03PS1) 10Ryan Kemper: Revert "Revert "Increase cirrus morelike pool counter by 20%"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636956 [18:08:05] RECOVERY - Host labstore1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [18:08:07] (03Merged) 10jenkins-bot: Removing obsolete license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619880 (owner: 10Kaldari) [18:08:14] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) a:03Dzahn [18:08:31] RECOVERY - Host cloudvirt1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [18:10:21] kaldari: it's on mwdebug2001 if you want to check [18:10:28] checking.... [18:11:28] "The Wikimedia Commons database is temporarily in read-only mode for the following reason"? 
[18:11:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:59] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:12:03] (03PS2) 10Legoktm: Look for service.template in various code directories [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/636993 (https://phabricator.wikimedia.org/T266692) [18:12:22] tgr_: fetch to 1002 [18:12:22] huh [18:12:32] wrong mwdebug - we are on eqiad again [18:12:36] ohh, right, we just switched back [18:12:57] (03CR) 10Legoktm: "Tested on Toolforge: https://phabricator.wikimedia.org/P13091" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/636993 (https://phabricator.wikimedia.org/T266692) (owner: 10Legoktm) [18:13:19] (03PS7) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [18:13:34] lemme know when I should test on 1002 [18:13:35] ok, it's on 1001 [18:13:39] or 1001 :) [18:14:34] (03CR) 10Gergő Tisza: [C: 03+2] Suggested edits: Include page ID with task preview data [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636787 (https://phabricator.wikimedia.org/T266600) (owner: 10Gergő Tisza) [18:15:03] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [18:15:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Cmjohnson) @bstorm you are correct, that is the nic that is in the server but the 10G capability would require 10GB SFP transceiver. I belie... 
[18:16:14] (03CR) 10Dzahn: [V: 03+1] puppetmaster: add data types to all remaining parameters (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [18:16:34] tgr_: All good, feel free to push [18:18:31] (03PS7) 10Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 [18:19:37] !log tgr@deploy1001 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:619880|Removing obsolete license definition]] (duration: 01m 00s) [18:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:44] (03CR) 10Gergő Tisza: [C: 03+2] Revert "Revert "Increase cirrus morelike pool counter by 20%"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636956 (owner: 10Ryan Kemper) [18:20:05] (03PS1) 10Elukey: sre.dns.netbox: add link to help with --force option [cookbooks] - 10https://gerrit.wikimedia.org/r/636997 [18:20:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Cmjohnson) we will also need cat6 or 6a cable for these 2 each at 7M please. [18:20:34] ryankemper: do you want those two config patches deployed separately, or in one step? 
[18:20:35] (03CR) 10Volans: [C: 03+1] "As requested by Jbond I've reviewed the script more or less as if it was a new script, but avoiding to suggest anything drastic that would" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634572 (owner: 10Jbond) [18:21:03] (03Merged) 10jenkins-bot: Revert "Revert "Increase cirrus morelike pool counter by 20%"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636956 (owner: 10Ryan Kemper) [18:21:19] (03PS2) 10Gergő Tisza: Revert "cirrus: Hardcode more_like to codfw cirrus cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636791 (owner: 10Ryan Kemper) [18:21:32] (03CR) 10Jbond: [C: 03+1] "LGTM +1 assuming pcc is good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [18:21:36] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for creating the doc" [cookbooks] - 10https://gerrit.wikimedia.org/r/636997 (owner: 10Elukey) [18:22:15] (03CR) 10Elukey: [C: 03+2] sre.dns.netbox: add link to help with --force option [cookbooks] - 10https://gerrit.wikimedia.org/r/636997 (owner: 10Elukey) [18:22:16] tgr_: they're unrelated, but neither of them require any testing at the `mwdebug` stage before proceeding to the rest of the fleet, so if you can do together that'd be great [18:22:24] "1 apaches had sync errors" [18:22:40] but it doesn't say which one... [18:23:02] ['/usr/bin/scap', 'pull', '--no-update-l10n', '--no-php-restart', '--include', 'wmf-config', '--include', 'wmf-config/CommonSettings.php', 'mw1268.eqiad.wmnet', 'mw1319.eqiad.wmnet', 'mw1366.eqiad.wmnet', 'mw2215.codfw.wmnet', 'mw2254.codfw.wmnet', 'mw2289.codfw.wmnet', 'mw1285.eqiad.wmnet', 'mw1313.eqiad.wmnet'] on scandium.eqiad.wmnet returned [255]: Host key verification failed. [18:23:10] so I guess one of those? 
[18:23:40] > Host key verification failed [18:23:42] that's weird [18:24:16] (03PS1) 10Dzahn: phabricator: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636999 (https://phabricator.wikimedia.org/T266479) [18:24:22] I think scandium just got removed from ssh_known_hosts- no idea why but I saw it in a puppet run [18:25:04] because puppet has been disabled (10168 minutes ago) [18:25:31] it's been requested to disable it for parsoid tests [18:25:31] and so it has been evicted from puppetdb after 1 week [18:25:56] Pchelolo once your patch merges, is there any way for us to test, since it is expected not to be logging anything? [18:26:06] in fact it shows up in https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ [18:26:27] so the failed apache is scandium and all the other hostnames in that message are just distraction? [18:26:38] nope, no way to test DannyS712 we hope that code that logs this is never executed [18:26:46] if that's the case we can ignore the error [18:26:59] tgr_: no scandium has been removed from the other hosts so they can't verify scandium's identity [18:27:09] (03CR) 10Muehlenhoff: "Note this doesn't clean out the nginx* packages (not sure whether known/intentional etc, just mentioning it here)" [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [18:27:24] now is scandium a source of scap or just a target? [18:27:38] (03Merged) 10jenkins-bot: Suggested edits: Include page ID with task preview data [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636787 (https://phabricator.wikimedia.org/T266600) (owner: 10Gergő Tisza) [18:27:44] if just a target I think it could be ignored [18:27:53] yes, it's just scandium. 
note how the error message gives a list of mw hosts but then it's "on scandium" after the bracket closes [18:27:56] that message makes it sound like a source [18:28:08] 'mw1313.eqiad.wmnet'] on ... [18:28:10] yeah it's a bit confusing as a message [18:28:12] and I have no context [18:29:09] if scap pull failed on scandium, that's a meh. If fanout failed for half a dozen production apaches that would be bad [18:29:49] the list of mw servers is part of the command line [18:31:21] mutante: I'm just confused why a scap pull would be run on scandium with a bunch of server names in the parameters [18:31:25] (03CR) 10Razzi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [18:32:20] but you are saying this no other appserver than scandium is affected in any way, right? [18:33:12] tgr_: they are the scap proxies [18:33:42] yes, pretty sure it's just scandium [18:33:52] right, thanks. I see it's mentioned in scap --help, I should just have rtfm [18:33:55] and scandium is not pooled and I was just told today we can now remove it [18:34:31] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) [18:34:40] (03PS8) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [18:35:49] tgr_: ignore it.. and I will make a patch to remove that from the dsh group.. then reimage it later [18:36:17] subbu just told me all parsoid patches are merged which means we can reimage that and no more reason to keep puppet disabled after that [18:38:19] mutante, are you going to reimage scandium anytime today? or later in the week? [18:38:32] just checking so i know when to start new rt testing runs. 
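The confusion above came from reading the hostnames inside the bracketed command line as the failing hosts, when the failing host is the one named after "on". A small sketch of pulling the two apart follows. It assumes the error line always has the shape `[argv list] on HOST returned [CODE]: REASON` as in the message quoted earlier; `parse_sync_error` is a hypothetical helper and not part of scap.

```python
import ast
import re

# Assumed layout, taken from the single error line quoted in the log:
#   ['cmd', 'arg', ...] on <host> returned [<code>]: <reason>
ERROR_RE = re.compile(r"^(\[.*\]) on (\S+) returned \[(\d+)\]: (.*)$")


def parse_sync_error(line):
    """Split an error line into the command argv, the failing host, the exit
    code and the reason. Returns None if the line does not match."""
    match = ERROR_RE.match(line.strip())
    if match is None:
        return None
    argv_text, host, code, reason = match.groups()
    return {
        "argv": ast.literal_eval(argv_text),  # the list is a Python literal
        "host": host,                         # where the command actually ran
        "exit_code": int(code),
        "reason": reason,
    }


if __name__ == "__main__":
    sample = (
        "['/usr/bin/scap', 'pull', '--no-update-l10n', 'mw1268.eqiad.wmnet', "
        "'mw1313.eqiad.wmnet'] on scandium.eqiad.wmnet returned [255]: "
        "Host key verification failed."
    )
    parsed = parse_sync_error(sample)
    print(parsed["host"], parsed["exit_code"], parsed["reason"])
```

The hostnames in `argv` are arguments to the command (the scap proxies mentioned above), so only `host` identifies the machine that failed.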
[18:39:39] (03CR) 10Gergő Tisza: [C: 03+2] Revert "cirrus: Hardcode more_like to codfw cirrus cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636791 (owner: 10Ryan Kemper) [18:39:47] subbu: I can do it today.. but maybe I should first remove it from scap dsh groups to avoid that stuff above from happening [18:39:56] that was interesting timing :) [18:40:02] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.14/extensions/GrowthExperiments/includes/HomepageModules/SuggestedEdits.php: Backport: [[gerrit:636787|Suggested edits: Include page ID with task preview data (T266600)]] (duration: 00m 59s) [18:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:08] T266600: Newcomer tasks: edit tag not applying to all edits - https://phabricator.wikimedia.org/T266600 [18:40:09] ok. right. [18:40:10] "up to a week" is no problem to disable puppet but after that we run into this [18:40:44] (03Merged) 10jenkins-bot: Revert "cirrus: Hardcode more_like to codfw cirrus cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636791 (owner: 10Ryan Kemper) [18:41:05] ok, i'll hold off kicking off test runs for now. let me know once scandium is ready again. 
[18:41:08] (03PS1) 10Dzahn: remove scandium from scap dsh group [puppet] - 10https://gerrit.wikimedia.org/r/637003 [18:41:35] ok [18:43:37] !log volans@cumin1001 START - Cookbook sre.dns.netbox [18:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:08] jouncebot: next [18:44:09] In 1 hour(s) and 15 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201028T2000) [18:44:18] (03PS2) 10Dzahn: remove scandium from scap dsh group [puppet] - 10https://gerrit.wikimedia.org/r/637003 [18:44:56] (03PS3) 10Gergő Tisza: Temporary enable 'editpage' warn logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636983 (https://phabricator.wikimedia.org/T251023) (owner: 10Ppchelko) [18:45:04] !log tgr@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: Config: [[gerrit:636956|Revert "Revert "Increase cirrus morelike pool counter by 20%"" ()]] (duration: 00m 57s) [18:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:09] (03PS3) 10Dzahn: remove scandium from scap dsh group [puppet] - 10https://gerrit.wikimedia.org/r/637003 [18:46:13] (03CR) 10Gergő Tisza: [C: 03+2] Temporary enable 'editpage' warn logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636983 (https://phabricator.wikimedia.org/T251023) (owner: 10Ppchelko) [18:46:43] (03CR) 10jerkins-bot: [V: 04-1] remove scandium from scap dsh group [puppet] - 10https://gerrit.wikimedia.org/r/637003 (owner: 10Dzahn) [18:46:49] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:636791|Revert "cirrus: Hardcode more_like to codfw cirrus cluster"]] (duration: 00m 56s) [18:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:08] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:49] (03PS4) 10Dzahn: remove 
scandium from scap dsh group [puppet] - 10https://gerrit.wikimedia.org/r/637003 (https://phabricator.wikimedia.org/T257906) [18:48:05] always love the -1 for "one space missing after Bug:" [18:48:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Andrew) >>! In T266623#6585835, @Cmjohnson wrote: > @bstorm you are correct, that is the nic that is in the server but the 10G capability wo... [18:49:00] (03CR) 10Dzahn: [C: 03+2] remove scandium from scap dsh group [puppet] - 10https://gerrit.wikimedia.org/r/637003 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:49:15] ^ there, that issue should be gone for now.. [18:49:56] thx mutante [18:50:20] yep, np, the host will be reinstalled and added later, but then with running puppet and new host keys [18:51:09] (03Merged) 10jenkins-bot: Temporary enable 'editpage' warn logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636983 (https://phabricator.wikimedia.org/T251023) (owner: 10Ppchelko) [18:51:32] ^ too late, but probably should have been *temporarily* [18:51:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:51:50] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:42] I'm sure all zero people who read mw-config git log as a hobby will be sad [18:53:02] (03CR) 10Dzahn: "this can go ahead now" [puppet] - 10https://gerrit.wikimedia.org/r/634383 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:54:11] leaves a "if you can read this you won a barnstar / apply for a job" message to test that hypothesis [18:55:11] * DannyS712 does read it [18:55:21] (when really bored) [18:55:56] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 
Config: [[gerrit:636983|Temporary enable 'editpage' warn logging (T251023)]] (duration: 00m 57s) [18:55:58] tried to downtime scandium in Icinga but of course that won't work if the host is already dropped from puppet :) [18:55:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:02] T251023: EditPage::getCurrentContent unexpectedly changes $currentModel and $currentFormat - https://phabricator.wikimedia.org/T251023 [18:56:09] !log Morning deploys done [18:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:14] 10Operations, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [18:56:23] ryankemper: Pchelolo: patches are live [18:56:28] 10Operations, 10ops-eqiad, 10DC-Ops: fix/replace cable ID 2648 on FB peering patch - cable report error - https://phabricator.wikimedia.org/T266497 (10RobH) 05Resolved→03Open 2649 is also already in use, so now your fix introduced a new error: https://netbox.wikimedia.org/dcim/cables/1167/ https://netbo... [18:56:42] subbu: your -1 is now a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/634383 ? right? [18:56:47] thank you tgr_. [18:56:54] eg merging https://gerrit.wikimedia.org/r/c/mediawiki/core/+/592471 [18:57:12] (03CR) 10Subramanya Sastry: [C: 03+1] parsoid: stop using nodejs parsoid on scandium [puppet] - 10https://gerrit.wikimedia.org/r/634383 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:57:16] fixed it. 
[18:57:28] thanks [18:57:39] (03CR) 10Dzahn: [C: 03+2] parsoid: stop using nodejs parsoid on scandium [puppet] - 10https://gerrit.wikimedia.org/r/634383 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:57:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:57:44] (03PS2) 10Dzahn: parsoid: stop using nodejs parsoid on scandium [puppet] - 10https://gerrit.wikimedia.org/r/634383 (https://phabricator.wikimedia.org/T257906) [18:57:56] if you spell a word wrong, just create a page on Wiktionary that says "alternative spelling" and no one will ever know ;-) [18:58:12] (03PS1) 10Dave Pifke: webperf: move navtiming monitoring back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/637007 [18:59:12] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) @aaron If you have any insights regarding the Redis Lock Manager and file upload, it would be much appreciated (+ T265643) [19:01:23] (03PS1) 10Dzahn: Revert "remove scandium from scap dsh group" [puppet] - 10https://gerrit.wikimedia.org/r/636958 [19:01:46] (03CR) 10Gehel: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [19:04:03] (03PS3) 10Ahmon Dancy: modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) [19:05:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10wiki_willy) ++ @RobH - can you create a related procurement task and look into getting a quote for what the WMCS team needs? Much appreciat... 
[19:05:57] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` scandium.eqi... [19:06:00] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] ` Of which those **FAILED**: ` ['sc... [19:06:33] (03CR) 10Ahmon Dancy: [C: 03+1] profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - 10https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [19:07:27] (03CR) 10Ahmon Dancy: [C: 03+1] safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [19:07:34] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` scandium.eqi... [19:10:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10RobH) a:05Cmjohnson→03RobH >>! In T266623#6585634, @Cmjohnson wrote: > this server does not have a 10GB nic card These were ordered wit... 
[19:11:25] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/637011 [19:11:25] 10Operations, 10Patch-For-Review: logrotate for visualdiff tests on Parsoid test host (scandium) - https://phabricator.wikimedia.org/T161920 (10Dzahn) Thanks! Reimaging scandium right now as part of T257906. We can close this ticket then as not needed anymore. [19:20:12] (03PS1) 10Bstorm: paws-k8s: switch the ingress for https to http logging [puppet] - 10https://gerrit.wikimedia.org/r/637017 [19:20:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:40] (03PS1) 10Andrew Bogott: cloud-vps instances: include bsd-mailx on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/637018 [19:30:42] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [19:30:43] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:29] (03CR) 10Dzahn: "In the past when I had to do the same fix on prod hosts I was about to do that but ended up using the "s-nail" package instead. 
apt-cache " [puppet] - 10https://gerrit.wikimedia.org/r/637018 (owner: 10Andrew Bogott) [19:33:08] (03CR) 10Dzahn: "as far as I remember both packages provide /usr/bin/mail but the latter meant I did not have to change my existing commands and parameters" [puppet] - 10https://gerrit.wikimedia.org/r/637018 (owner: 10Andrew Bogott) [19:35:57] (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [19:36:34] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [19:36:34] !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [19:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:47] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [19:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:21] PROBLEM - Check systemd state on mw1381 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:26] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) [19:48:11] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:23] !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [19:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:24] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [19:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:34] (03CR) 10Andrew Bogott: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/637018 (owner: 10Andrew Bogott) [19:53:43] !log herron@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [19:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:37] 10Operations, 10Wikidata, 10Wikidata Query Builder: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) [19:55:56] 10Operations, 10Wikidata, 10Wikidata Query UI: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Addshore) [19:57:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10RobH) IRC Update: * These are copper based 1g/10g NICs and we in 1g racks so it wasn't an issue until now. * We'll need to swap these out e... [19:57:26] 10Operations, 10Wikidata, 10Wikidata Query UI, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Addshore) [19:57:32] 10Operations, 10Wikidata, 10Wikidata Query Builder, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) [19:58:31] (03PS2) 10Cwhite: Initial release based on ECS 1.6.0. [software/ecs] - 10https://gerrit.wikimedia.org/r/636513 [20:00:04] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201028T2000). 
[20:03:40] Operations, Platform Engineering, Wikidata, serviceops, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (jijiki)
[20:04:12] Operations, Platform Engineering, Wikidata, serviceops, and 3 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (jijiki) @aaron thank you! I updated the task description
[20:04:23] (CR) Cwhite: [C: +1] alertmanager: add dashboard url to irc messages [puppet] - https://gerrit.wikimedia.org/r/636868 (https://phabricator.wikimedia.org/T266017) (owner: Filippo Giunchedi)
[20:12:35] (CR) Cwhite: [C: +2] Initial release based on ECS 1.6.0. [software/ecs] - https://gerrit.wikimedia.org/r/636513 (owner: Cwhite)
[20:12:37] (CR) Cwhite: [V: +2 C: +2] Initial release based on ECS 1.6.0. [software/ecs] - https://gerrit.wikimedia.org/r/636513 (owner: Cwhite)
[20:14:28] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (RobH)
[20:14:39] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): relocate/reimage cloudvirt1029 with 10G interfaces - https://phabricator.wikimedia.org/T266206 (RobH)
[20:14:48] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): relocate/reimage cloudvirt1026 with 10G interfaces - https://phabricator.wikimedia.org/T266281 (RobH)
[20:15:13] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): relocate/reimage cloudvirt1027 with 10G interfaces - https://phabricator.wikimedia.org/T266369 (RobH)
[20:15:20] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): relocate/reimage cloudvirt1028 with 10G interfaces - https://phabricator.wikimedia.org/T266514 (RobH)
[20:16:00] (PS1) Ladsgroup: Change logo of Wikidata for the eighth birthday [mediawiki-config] - https://gerrit.wikimedia.org/r/637024
[20:16:11] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (RobH)
[20:17:46] (CR) Ladsgroup: [C: +2] Change logo of Wikidata for the eighth birthday [mediawiki-config] - https://gerrit.wikimedia.org/r/637024 (owner: Ladsgroup)
[20:19:13] (CR) DannyS712: "is there an on-wiki announcement of this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/637024 (owner: Ladsgroup)
[20:19:28] (Merged) jenkins-bot: Change logo of Wikidata for the eighth birthday [mediawiki-config] - https://gerrit.wikimedia.org/r/637024 (owner: Ladsgroup)
[20:22:59] !log ladsgroup@deploy1001 Synchronized static/images/project-logos: Changing logo of Wikidata for the brithday (duration: 00m 58s)
[20:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:50] (CR) Ladsgroup: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/637024 (owner: Ladsgroup)
[20:30:40] (PS2) Dzahn: httpd/puppetmaster: add data type for SSLVerifyClient and use it [puppet] - https://gerrit.wikimedia.org/r/635658
[20:32:06] (CR) jerkins-bot: [V: -1] httpd/puppetmaster: add data type for SSLVerifyClient and use it [puppet] - https://gerrit.wikimedia.org/r/635658 (owner: Dzahn)
[20:32:10] (CR) Dzahn: "thanks!:)" [puppet] - https://gerrit.wikimedia.org/r/635905 (owner: Dzahn)
[20:35:34] (CR) Dzahn: wmflib:: add data type for puppetmaster server type and use it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635660 (owner: Dzahn)
[20:36:54] (PS2) Dzahn: puppetmaster: add data type for server type and use it [puppet] - https://gerrit.wikimedia.org/r/635660
[20:37:27] (CR) Dzahn: httpd/puppetmaster: add data type for SSLVerifyClient and use it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635658 (owner: Dzahn)
[20:37:49] (CR) jerkins-bot: [V: -1] puppetmaster: add data type for server type and use it [puppet] - https://gerrit.wikimedia.org/r/635660 (owner: Dzahn)
[20:43:06] Operations, ops-eqiad, Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (RobH) p:Triage→Medium
[20:43:14] Operations, ops-eqiad, Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (RobH)
[20:44:00] (PS8) Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - https://gerrit.wikimedia.org/r/635656
[20:44:02] (PS3) Dzahn: httpd/puppetmaster: add data type for SSLVerifyClient and use it [puppet] - https://gerrit.wikimedia.org/r/635658
[20:44:06] (CR) Dzahn: httpd/puppetmaster: add data type for SSLVerifyClient and use it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635658 (owner: Dzahn)
[20:44:37] (CR) jerkins-bot: [V: -1] puppetmaster: add data types to all remaining parameters [puppet] - https://gerrit.wikimedia.org/r/635656 (owner: Dzahn)
[20:45:40] (CR) jerkins-bot: [V: -1] httpd/puppetmaster: add data type for SSLVerifyClient and use it [puppet] - https://gerrit.wikimedia.org/r/635658 (owner: Dzahn)
[20:47:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:50:41] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm
[20:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:58:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:59:44] (CR) Razzi: "https://puppet-compiler.wmflabs.org/compiler1003/26191/" [puppet] - https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: Razzi)
[21:02:56] PROBLEM - Check systemd state on kubestage1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:03:29] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/26190/" [puppet] - https://gerrit.wikimedia.org/r/633857 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[21:03:56] (CR) CDanis: [C: +1] prometheus: re-enable compaction by default [puppet] - https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[21:04:26] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:04:47] (CR) Bstorm: "We definitely are using bsd-mailx on the bastions." [puppet] - https://gerrit.wikimedia.org/r/637018 (owner: Andrew Bogott)
[21:05:01] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[21:05:03] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:42] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/637018 (owner: Andrew Bogott)
[21:09:24] (CR) CDanis: pcc: expore posting pcc to gerrit comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636652 (owner: Jbond)
[21:13:19] (CR) Dzahn: "confirmed this is working like so:" [puppet] - https://gerrit.wikimedia.org/r/633857 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[21:13:22] (CR) Jeena Huneidi: [C: -1] "This mostly looks good to me. I left a few comments on some changes that are needed." (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[21:16:13] (Abandoned) Dzahn: cdh/hiveserver2: add shebang, fix bashisms [puppet] - https://gerrit.wikimedia.org/r/631889 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[21:18:18] (CR) Dzahn: [C: -1] "stalled on https://phabricator.wikimedia.org/T264920" [puppet] - https://gerrit.wikimedia.org/r/632570 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn)
[21:18:20] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[21:19:22] Operations, Patch-For-Review: logrotate for visualdiff tests on Parsoid test host (scandium) - https://phabricator.wikimedia.org/T161920 (Dzahn) Open→Invalid
[21:19:52] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[21:22:01] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[21:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:12] (PS10) Jbond: diffscan: pyhotnify [puppet] - https://gerrit.wikimedia.org/r/634572
[21:26:50] (PS1) Ppchelko: JobQueue: Increase concurrency for cdnPurge jobs. [deployment-charts] - https://gerrit.wikimedia.org/r/637031
[21:27:45] (CR) Jbond: "made some updates but still more to come, thanks for the review as always very useful 😊" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/634572 (owner: Jbond)
[21:37:11] (CR) Dzahn: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/637018 (owner: Andrew Bogott)
[21:40:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:41:19] !log Disabled elasticsearch "saneitizer" systemd timer in eqiad due to checker jobs falling behind: `sudo systemctl disable mediawiki_job_cirrus_sanitize_jobs.timer` on `mwmaint1002`
[21:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:46:36] Operations, conftool, serviceops, Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (RLazarus)
[21:46:44] Operations, conftool, serviceops, Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (RLazarus) p:Triage→Medium
[21:47:25] Operations, SRE-Access-Requests: Requesting access to prod cluster for annet - https://phabricator.wikimedia.org/T266718 (AnneT)
[21:50:53] Operations, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] ` Of which those **FAILED**: ` ['sc...
[21:58:30] (PS1) Dzahn: site: assign insetup role to scandium, reimaging fails with prod role [puppet] - https://gerrit.wikimedia.org/r/637034 (https://phabricator.wikimedia.org/T257906)
[21:59:00] (CR) Dzahn: [C: +2] site: assign insetup role to scandium, reimaging fails with prod role [puppet] - https://gerrit.wikimedia.org/r/637034 (https://phabricator.wikimedia.org/T257906) (owner: Dzahn)