[00:01:38] RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:10] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Mailman3 is not importing list footers ("templates") as expected - https://phabricator.wikimedia.org/T281425 (10Legoktm) Documented at https://meta.wikimedia.org/wiki/Mailing_lists/Mailman3_migration#Review_custom_footer_and_other_templates (thanks to @Qu... [00:08:20] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Mailman3 is not importing list footers ("templates") as expected - https://phabricator.wikimedia.org/T281425 (10Legoktm) 05Open→03Resolved [00:20:12] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:25:00] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:27:28] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:02] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:58] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 183723216 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:38] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 578272 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:44] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The following 
units failed: curator_actions_apifeatureusage_codfw.service,curator_actions_apifeatureusage_eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:32] (03PS1) 10Papaul: DHCP Add MAC address for backup200[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/683762 (https://phabricator.wikimedia.org/T277323) [01:03:46] (03CR) 10Papaul: [C: 03+2] DHCP Add MAC address for backup200[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/683762 (https://phabricator.wikimedia.org/T277323) (owner: 10Papaul) [01:05:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [01:06:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) @jcrespo all 4 nodes are ready for OS install good luck. [01:08:12] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [01:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:22] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [01:28:31] (03PS4) 10Jforrester: Replace $wgRelatedArticlesFooterWhitelistedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680814 (owner: 10Reedy) [01:28:52] (03CR) 10Jforrester: "Assuming the train doesn't roll back from wmf.3, this is now safe to deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/680814 (owner: 10Reedy) [01:38:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:43:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:50:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:53:02] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:03:47] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) >>! In T265435#7041318, @fgiunchedi wrote: > Thank you @papaul, today I poked a little at librenms chatsworth support and it looks like the current suppo... 
[02:24:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul) @joe please remember to change the server status in Netbox to "Active" once the server is in service. https://netbox.wikimedia.org/extras/reports/r... [02:29:38] RECOVERY - Long running screen/tmux on elastic2049 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [02:32:54] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 33806440 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:35:24] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 484008 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:44:50] (03PS1) 10Legoktm: lists: Redirect listinfo pages to Postorius after migration [puppet] - 10https://gerrit.wikimedia.org/r/683775 (https://phabricator.wikimedia.org/T280893) [02:46:37] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29309/console" [puppet] - 10https://gerrit.wikimedia.org/r/683775 (https://phabricator.wikimedia.org/T280893) (owner: 10Legoktm) [02:50:48] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/ should redirect to postorius - https://phabricator.wikimedia.org/T280893 (10Legoktm) a:03Legoktm ` $ curl -I "https://polymorphic.lists.wmcloud.org/mailman/listinfo/... 
[03:30:32] RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:36] RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:42] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:31:08] RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:31:14] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:32:02] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 113566 bytes in 3.528 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [03:45:33] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [03:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:44] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [03:47:48] !log T280563 about half of codfw nodes have been rebooted before the failure caused by write queue not emptying fast enough, kicking it off again:`sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` [03:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:57] !log 
T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1010.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [03:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:05] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [03:58:08] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Legoktm) {T249678} is the task for OAuth, it needs someone to do some work upstream to add MediaWiki as an OAuth provider. [04:03:40] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE [04:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:04:48] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:14] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:16] PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:46] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE [04:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:01] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Search [04:16:17] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL 
- degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:16:37] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Search [04:16:47] PROBLEM - SSH on elastic2033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:18:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_main_delayed.service,monitor_refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:24:20] PROBLEM - Mediawiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [04:25:18] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:38] RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:58] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:24] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:48] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:45] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [04:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:52] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [04:39:54] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1003.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` [04:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:06] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [04:41:04] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [04:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:11] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [04:42:17] !log T261239 `elastic2033`, which is known to be in a state of hardware failure (we have a ticket open), is holding up the reboot of codfw. I don't think we have a good way to exclude a node currently. 
Going to just proceed to `eqiad` for now [04:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:29] T261239: Reboot (restart) Elasticsearch nodes - https://phabricator.wikimedia.org/T261239 [04:43:07] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 [04:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:43:20] !log T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` [04:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:20] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:53:22] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:54:50] PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:56:56] !log [WDQS] `ryankemper@wdqs1006:~$ sudo systemctl restart wdqs-blazegraph` [04:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1080.eqiad.wmnet [04:57:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:40] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:59:21] (03PS1) 10Marostegui: mariadb: Decommission db1080 [puppet] - 10https://gerrit.wikimedia.org/r/683784 (https://phabricator.wikimedia.org/T280121) [05:02:36] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1080 [puppet] - 10https://gerrit.wikimedia.org/r/683784 (https://phabricator.wikimedia.org/T280121) (owner: 10Marostegui) [05:08:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1080.eqiad.wmnet [05:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:30] papaul: I have merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/683762 as it was pending [05:10:34] RECOVERY - Mediawiki CirrusSearch update rate - codfw on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [05:10:35] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui) [05:10:38] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui) a:05Marostegui→03wiki_willy [05:10:59] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [05:12:12] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:02] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15663 and previous config saved to /var/cache/conftool/dbconfig/20210430-051558-marostegui.json [05:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:07] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [05:16:30] !log Upgrade kernel on db1114 [05:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:16] (03CR) 10Jcrespo: [C: 03+1] bacula: add people1003 job to monitoring ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/683732 (owner: 10Dzahn) [05:30:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15664 and previous config saved to /var/cache/conftool/dbconfig/20210430-053038-root.json [05:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:10] (03CR) 10Marostegui: [C: 03+1] "will merge next week" [puppet] - 10https://gerrit.wikimedia.org/r/683070 (https://phabricator.wikimedia.org/T263817) (owner: 10Bartosz Dziewoński) [05:45:30] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10jcrespo) We can try to force a manual run of a backup- backups can fail for many reasons- they are attempted while the host is rebooting or without network, or simply they return no files. Let me know when t... 
[05:45:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15665 and previous config saved to /var/cache/conftool/dbconfig/20210430-054542-root.json [05:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:38] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 [05:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:46] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [06:00:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15666 and previous config saved to /var/cache/conftool/dbconfig/20210430-060046-root.json [06:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15667 and previous config saved to /var/cache/conftool/dbconfig/20210430-061549-root.json [06:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:23:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:37:48] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10elukey) @Cmjohnson hi! Any news about the worker nodes? [06:45:12] (03PS1) 10Elukey: role::analytics_cluster::hadoop::ui: use the 'hue' db [puppet] - 10https://gerrit.wikimedia.org/r/683786 (https://phabricator.wikimedia.org/T280262) [06:50:08] (03CR) 10Muehlenhoff: "> Patch Set 4:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [06:52:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/683611 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [06:55:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks. I think otherwise the rsync package always get installed via rsync::server" [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [06:57:20] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hadoop::ui: use the 'hue' db [puppet] - 10https://gerrit.wikimedia.org/r/683786 (https://phabricator.wikimedia.org/T280262) (owner: 10Elukey) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210430T0700) [07:01:05] (03CR) 10Filippo Giunchedi: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676384 (owner: 10Filippo Giunchedi) [07:01:27] (03PS3) 10Filippo Giunchedi: pontoon: add hosts_for_role function [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676384 [07:07:13] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676383 (owner: 10Filippo Giunchedi) [07:22:33] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:32:04] (03PS1) 10Muehlenhoff: Remove Puppet refencences to old Buster failoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/683802 [07:33:46] log installing iputils updates from Buster point release [07:34:55] moritzm: missed ! from !log [07:38:56] ack, thx [07:38:58] !log installing iputils updates from Buster point release [07:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:03] !log remove mc1027 from debmonitor, server is broken and won't return (T276415) [08:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:11] T276415: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 [08:20:32] (03PS1) 10Hashar: gerrit: escape URI in Phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/683810 (https://phabricator.wikimedia.org/T280197) [08:26:20] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) p75 averages for the past 12 hours: |**host**|**version**|**p75** |cp3060|6.0.7|618ms |cp3052|5.1.3|628ms |cp3056|6.0.7|... 
[08:37:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:40:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:55:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10LSobanski) @papaul Thanks! [09:00:11] (03CR) 10Filippo Giunchedi: "I understand the rationale, but IMHO we'd be better off moving deployment-prep to elk7 (AFAICS logstash-beta.wmflabs.org is elk5)" [puppet] - 10https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [09:03:55] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: primary nic disconnected [09:03:56] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: primary nic disconnected [09:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:02] 10SRE, 10ops-eqiad, 10procurement, 10cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Icinga downtime set by dcaro@cumin1001 for 2:00:00 1 host(s) and their services with reason: primary nic disconnected ` cloudvirt1040.... [09:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:34] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) >>! 
In T265435#7047985, @Papaul wrote: >>>! In T265435#7041318, @fgiunchedi wrote: >> Thank you @papaul, today I poked a little at librenms chatswort... [09:20:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:22:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:27:00] (03CR) 10Jbond: [C: 03+2] gerrit: escape URI in Phabricator comment [puppet] - 10https://gerrit.wikimedia.org/r/683810 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [09:29:24] (03Abandoned) 10Jbond: Revert "debmonitor:client: switch ssl to pki" [puppet] - 10https://gerrit.wikimedia.org/r/681630 (owner: 10Jbond) [09:29:56] (03CR) 10Jbond: "merged and if i read things right looks like it worked" [puppet] - 10https://gerrit.wikimedia.org/r/683810 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [09:41:35] (03PS2) 10MMandere: admin: Add mmandere to ops group [puppet] - 10https://gerrit.wikimedia.org/r/683611 (https://phabricator.wikimedia.org/T281344) [09:43:15] (03CR) 10Ssingh: [C: 03+2] admin: Add mmandere to ops group [puppet] - 10https://gerrit.wikimedia.org/r/683611 (https://phabricator.wikimedia.org/T281344) (owner: 10MMandere) [09:49:40] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) [10:01:54] (03CR) 10Jbond: "> Patch Set 1:" [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676383 (owner: 10Filippo Giunchedi) [10:02:35] (03CR) 10Ladsgroup: [C: 03+1] "Tested too. 
Works like a charm" [puppet] - 10https://gerrit.wikimedia.org/r/683775 (https://phabricator.wikimedia.org/T280893) (owner: 10Legoktm) [10:13:47] 10SRE, 10Beta-Cluster-Infrastructure: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (10Majavah) Instead of tracking the current primary in hiera/etc, I'd propose to just set `read_only = 1` by default on all beta database servers. The replicas should b... [10:17:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:17:43] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) [10:17:48] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) @MMandere: Added to "ops" LDAP group. https://wikitech.wikimedia.org/wiki/LDAP/Groups#ops_group [10:19:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:23:04] (03CR) 10Muehlenhoff: "This looks great, a few comments inside (some a bit more largish since this could be a good starting point for better streamlining of thes" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [10:26:27] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:40:26] (03PS1) 10Jbond: O:rsync::server: refactor [puppet] - 10https://gerrit.wikimedia.org/r/683827 [10:41:31] (03CR) 10Jbond: "This is fine but see comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [10:41:51] (03CR) 10jerkins-bot: [V: 04-1] O:rsync::server: refactor [puppet] - 10https://gerrit.wikimedia.org/r/683827 (owner: 10Jbond) [10:46:26] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10xSavitar) [10:52:15] (03PS1) 10Hnowlan: eventlogging: remove mariadb profile and create log dir [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) [10:54:55] (03PS2) 10Jbond: O:rsync::server: refactor [puppet] - 10https://gerrit.wikimedia.org/r/683827 [10:58:44] 10SRE, 10Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Ladsgroup) We should rename it to lists-current.wikimedia.org ;) Adding @Marostegui for visibility. 
[11:00:48] I know it's Friday, I'm just quickly deploying the new wikitech logo [11:01:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 19): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29311/console" [puppet] - 10https://gerrit.wikimedia.org/r/683827 (owner: 10Jbond) [11:08:39] (03PS1) 10Jbond: pcc: also add ERROR to capture states [puppet] - 10https://gerrit.wikimedia.org/r/683832 [11:16:51] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) [11:17:53] (03PS2) 10Hnowlan: eventlogging: remove mariadb profile and create log dir [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) [11:19:59] (03CR) 10Jbond: [C: 03+2] pcc: also add ERROR to capture states [puppet] - 10https://gerrit.wikimedia.org/r/683832 (owner: 10Jbond) [11:24:07] (03PS1) 10Ladsgroup: Update wikitech logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683835 [11:26:03] (03CR) 10Ladsgroup: [C: 03+2] "It is compressed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683835 (owner: 10Ladsgroup) [11:26:45] (03Merged) 10jenkins-bot: Update wikitech logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683835 (owner: 10Ladsgroup) [11:29:15] RECOVERY - Disk space on mwlog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [11:30:44] (03PS3) 10Hnowlan: eventlogging: remove mariadb profile and create log dir [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) [11:30:58] (03PS1) 10Majavah: Allow controlling P::services_proxy::envoy::ensure indepently [puppet] - 10https://gerrit.wikimedia.org/r/683836 [11:31:43] !log ladsgroup@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 57s) [11:31:49] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:53] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29314/console" [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [11:32:24] (03PS1) 10Jbond: P::envoy: allow users to run tlsproxy without service proxy [puppet] - 10https://gerrit.wikimedia.org/r/683837 [11:32:54] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) [11:33:05] !log ladsgroup@deploy1002 Synchronized static/images/project-logos/wikitech.png: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 57s) [11:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:31] (03PS4) 10Hnowlan: eventlogging: remove mariadb profile and create log dir [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) [11:33:57] (03CR) 10Jbond: "I also created https://gerrit.wikimedia.org/r/c/operations/puppet/+/683837" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683836 (owner: 10Majavah) [11:34:41] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29316/console" [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [11:34:43] !log ladsgroup@deploy1002 Synchronized static/images/project-logos/wikitech-2x.png: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 57s) [11:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:33] (03PS2) 10Jbond: P::envoy: allow users to run tlsproxy without service proxy [puppet] - 10https://gerrit.wikimedia.org/r/683837 [11:36:52] !log ladsgroup@deploy1002 Synchronized static/images/project-logos/wikitech-1.5x.png: Config: [[gerrit:683835|Update wikitech logo]] 
(duration: 00m 56s) [11:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:03] !log ladsgroup@deploy1002 Synchronized static/favicon/wikitech.ico: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 56s) [11:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:32] (03PS3) 10Jbond: O:rsync::server: refactor [puppet] - 10https://gerrit.wikimedia.org/r/683827 [11:41:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 36 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29315/console" [puppet] - 10https://gerrit.wikimedia.org/r/683837 (owner: 10Jbond) [11:44:40] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) [12:12:37] (03PS1) 10Jbond: P:tlsproxy::envoy: Ensure cfssl cert is owned by envoy [puppet] - 10https://gerrit.wikimedia.org/r/683849 [12:12:44] (03Abandoned) 10Majavah: Allow controlling P::services_proxy::envoy::ensure indepently [puppet] - 10https://gerrit.wikimedia.org/r/683836 (owner: 10Majavah) [12:14:18] (03CR) 10Jbond: [C: 03+2] P:tlsproxy::envoy: Ensure cfssl cert is owned by envoy [puppet] - 10https://gerrit.wikimedia.org/r/683849 (owner: 10Jbond) [12:20:07] (03PS1) 10Jbond: P:tlsproxt::envoy: Specify specific outdir [puppet] - 10https://gerrit.wikimedia.org/r/683854 [12:20:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:21:39] (03CR) 10Jbond: [C: 03+2] P:tlsproxt::envoy: Specify specific outdir [puppet] - 10https://gerrit.wikimedia.org/r/683854 (owner: 10Jbond) [12:22:47] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: add more FQDNs in prepartion for the cloudgw migration [dns] - 10https://gerrit.wikimedia.org/r/683855 [12:23:22] (03CR) 
10jerkins-bot: [V: 04-1] wikimediacloud.org: add more FQDNs in prepartion for the cloudgw migration [dns] - 10https://gerrit.wikimedia.org/r/683855 (owner: 10Arturo Borrero Gonzalez) [12:24:09] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: add more FQDNs in prepartion for the cloudgw migration [dns] - 10https://gerrit.wikimedia.org/r/683855 [12:25:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:26:39] (03PS3) 10Majavah: profile::etcd::v3: use puppet certs for standalone cluster [puppet] - 10https://gerrit.wikimedia.org/r/668701 (owner: 10Giuseppe Lavagetto) [12:26:44] (03CR) 10Hnowlan: [C: 03+2] cqlshrc.erb: Use TLSv1.2 for cqlsh client connections [puppet] - 10https://gerrit.wikimedia.org/r/683422 (https://phabricator.wikimedia.org/T281404) (owner: 10Eevans) [12:27:15] (03CR) 10Majavah: "(manually rebased to update the cherrypick on beta)" [puppet] - 10https://gerrit.wikimedia.org/r/668701 (owner: 10Giuseppe Lavagetto) [12:27:47] (03PS1) 10Jbond: hiera - cloud: pki increase default expire [puppet] - 10https://gerrit.wikimedia.org/r/683856 [12:28:10] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera - cloud: pki increase default expire [puppet] - 10https://gerrit.wikimedia.org/r/683856 (owner: 10Jbond) [12:30:50] (03PS1) 10David Caro: wmcs.drain_hypervisor: skip all VMs in the canary project [puppet] - 10https://gerrit.wikimedia.org/r/683857 (https://phabricator.wikimedia.org/T280641) [12:32:08] (03PS1) 10Jbond: P:tlsproxy::envoy: ensure we notify tlsproxy when the cert changes [puppet] - 10https://gerrit.wikimedia.org/r/683858 [12:34:48] (03CR) 10Jbond: [C: 03+2] P:tlsproxy::envoy: ensure we notify tlsproxy when the cert changes [puppet] - 10https://gerrit.wikimedia.org/r/683858 (owner: 10Jbond) [12:39:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 23): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29320/console" [puppet] - 10https://gerrit.wikimedia.org/r/683827 (owner: 10Jbond) [12:40:58] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Marostegui) Thanks @Ladsgroup - keep me posted! [12:41:33] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:rsync::server: refactor [puppet] - 10https://gerrit.wikimedia.org/r/683827 (owner: 10Jbond) [12:42:34] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) @Ladsgroup if we are going to keep track of the testing database deletion on {T281548}, we can probably ignore T278614#7022985 and close th... [12:58:47] (03CR) 10Ottomata: [C: 03+1] "This is no longer used in deployment-prep, right? If so, +1" [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [12:59:50] (03PS2) 10David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683370 (https://phabricator.wikimedia.org/T280641) [12:59:52] (03PS2) 10David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683371 (https://phabricator.wikimedia.org/T280641) [12:59:54] (03PS1) 10David Caro: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683888 (https://phabricator.wikimedia.org/T280641) [13:00:54] (03PS1) 10Jbond: cfssl::cert: also created a chained file with the full certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/683889 [13:02:13] (03PS1) 10Ottomata: Remove absented druid data drop job from test cluster [puppet] - 10https://gerrit.wikimedia.org/r/683890 (https://phabricator.wikimedia.org/T273789) [13:02:44] (03PS1) 10Jbond: P:tlsproxy::envoy: when using cfssl use the chained file [puppet] - 
10https://gerrit.wikimedia.org/r/683891 [13:02:55] (03CR) 10jerkins-bot: [V: 04-1] wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683371 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [13:04:12] (03PS2) 10Jbond: cfssl::cert: also created a chained file with the full certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/683889 [13:05:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29322/console" [puppet] - 10https://gerrit.wikimedia.org/r/683889 (owner: 10Jbond) [13:06:38] 10SRE, 10Wikimedia-Mailing-lists: Install mailman3 on lists1001.wikimedia.org - https://phabricator.wikimedia.org/T278610 (10Ladsgroup) [13:06:50] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) 05Open→03Resolved a:03Marostegui Let's call it done "Create production databases for mailman3" is clearly done. [13:20:15] (03PS1) 10Jbond: P:pki::client: move ca-certificates managment to profile [puppet] - 10https://gerrit.wikimedia.org/r/683892 [13:20:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl::cert: also created a chained file with the full certificate chain [puppet] - 10https://gerrit.wikimedia.org/r/683889 (owner: 10Jbond) [13:20:44] (03PS2) 10Jbond: P:tlsproxy::envoy: when using cfssl use the chained file [puppet] - 10https://gerrit.wikimedia.org/r/683891 [13:21:44] (03PS2) 10Ottomata: Remove absented druid data drop job from test cluster [puppet] - 10https://gerrit.wikimedia.org/r/683890 (https://phabricator.wikimedia.org/T273789) [13:22:42] (03CR) 10Elukey: "IPs looks good, left a nit just for consistency!" 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [13:22:52] (03PS2) 10Jbond: P:pki::client: move ca-certificates managment to profile [puppet] - 10https://gerrit.wikimedia.org/r/683892 [13:23:33] (03PS1) 10Ottomata: otto - add bin dir to PATH and add kerberos-run-command wrapper [puppet] - 10https://gerrit.wikimedia.org/r/683894 [13:23:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29324/console" [puppet] - 10https://gerrit.wikimedia.org/r/683892 (owner: 10Jbond) [13:24:44] (03CR) 10Ottomata: [C: 03+2] Remove absented druid data drop job from test cluster [puppet] - 10https://gerrit.wikimedia.org/r/683890 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [13:26:05] (03PS2) 10Ottomata: otto - add bin dir to PATH and add kerberos-run-command wrapper [puppet] - 10https://gerrit.wikimedia.org/r/683894 [13:26:44] (03PS3) 10Ottomata: otto - add kerberos-run-command wrapper [puppet] - 10https://gerrit.wikimedia.org/r/683894 [13:29:04] (03CR) 10Ottomata: [C: 03+2] otto - add kerberos-run-command wrapper [puppet] - 10https://gerrit.wikimedia.org/r/683894 (owner: 10Ottomata) [13:30:46] (03CR) 10Jbond: [C: 03+2] P:tlsproxy::envoy: when using cfssl use the chained file [puppet] - 10https://gerrit.wikimedia.org/r/683891 (owner: 10Jbond) [13:32:47] (03PS3) 10David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683371 (https://phabricator.wikimedia.org/T280641) [13:35:41] (03CR) 10Hnowlan: [V: 03+1] "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [13:40:09] (03PS1) 10Jbond: P:tlsproxy::envoy: also use the chained file for the global cert [puppet] - 10https://gerrit.wikimedia.org/r/683896 [13:42:05] (03CR) 10Jbond: [C: 03+2] P:tlsproxy::envoy: 
also use the chained file for the global cert [puppet] - 10https://gerrit.wikimedia.org/r/683896 (owner: 10Jbond) [13:45:46] (03PS1) 10Jbond: cfssl::cert: dont purge outdir [puppet] - 10https://gerrit.wikimedia.org/r/683897 [13:46:21] (03CR) 10jerkins-bot: [V: 04-1] cfssl::cert: dont purge outdir [puppet] - 10https://gerrit.wikimedia.org/r/683897 (owner: 10Jbond) [13:47:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:47:05] (03PS2) 10Jbond: cfssl::cert: dont purge outdir [puppet] - 10https://gerrit.wikimedia.org/r/683897 [13:47:49] (03CR) 10Jbond: [C: 03+2] cfssl::cert: dont purge outdir [puppet] - 10https://gerrit.wikimedia.org/r/683897 (owner: 10Jbond) [13:49:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:59:21] (03PS1) 10Jbond: P::tlsproxy::envoy: go back to using the cert [puppet] - 10https://gerrit.wikimedia.org/r/683900 [14:00:48] (03CR) 10Jbond: [C: 03+2] P::tlsproxy::envoy: go back to using the cert [puppet] - 10https://gerrit.wikimedia.org/r/683900 (owner: 10Jbond) [14:00:50] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: move ca-certificates managment to profile [puppet] - 10https://gerrit.wikimedia.org/r/683892 (owner: 10Jbond) [14:02:48] (03CR) 10Ottomata: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [14:03:58] (03CR) 10Ottomata: "This will need to be done for every helmfile values that use kafka-main! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [14:04:26] (03CR) 10Ottomata: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [14:13:53] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Nova: Remove some settings from nova.conf on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/683670 (https://phabricator.wikimedia.org/T281384) (owner: 10Andrew Bogott) [14:17:27] PROBLEM - AuthDNS-over-TLS Works on authdns1001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [14:18:31] (03CR) 10Phuedx: [C: 03+1] Prepare for new configuration option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683720 (https://phabricator.wikimedia.org/T277951) (owner: 10Jdlrobson) [14:19:47] RECOVERY - AuthDNS-over-TLS Works on authdns1001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [14:22:18] (03PS1) 10Andrew Bogott: nova vendordata: another attempt to avoid puppet races [puppet] - 10https://gerrit.wikimedia.org/r/683901 [14:22:51] (03PS1) 10Hashar: Revert "gerrit: escape URI in Phabricator comment" [puppet] - 10https://gerrit.wikimedia.org/r/683879 (https://phabricator.wikimedia.org/T280197) [14:22:58] 10SRE: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [14:23:43] 10SRE: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [14:23:58] (03CR) 10Jbond: [C: 03+2] Revert "gerrit: escape URI in Phabricator comment" [puppet] - 10https://gerrit.wikimedia.org/r/683879 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [14:24:04] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: another attempt to avoid puppet races [puppet] - 10https://gerrit.wikimedia.org/r/683901 (owner: 10Andrew Bogott) [14:24:44] andrewbogott: happy for me to merge yours [14:24:57] yes please! 
[14:25:10] ack merged, hashar yours too [14:32:21] (03PS2) 10Jbond: peopleweb: ensure rsync is installed [puppet] - 10https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:33:16] (03PS3) 10Jbond: peopleweb: ensure rsync is installed [puppet] - 10https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:33:45] (03PS3) 10Jbond: rsync::quickdatacopy: ensure a destination host gets an rsync client [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:34:17] (03PS4) 10Jbond: rsync::quickdatacopy: ensure a destination host gets an rsync client [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:35:10] (03PS4) 10Jbond: peopleweb: ensure rsync is installed [puppet] - 10https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:36:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29327/console" [puppet] - 10https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:38:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29328/console" [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:41:14] (03CR) 10Herron: "> Patch Set 4:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [14:43:58] (03CR) 10Jbond: [V: 03+1 C: 03+1] "> Patch Set 4: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:45:10] (03CR) 10CDanis: [C: 03+1] rsync::quickdatacopy: ensure a destination host gets an rsync client 
[puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [14:53:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10JMeybohm) >>! In T275637#7048007, @Papaul wrote: > @joe please remember to change the server status in Netbox to "Active" once the server is in service. >... [15:07:35] 10SRE, 10Analytics: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10elukey) 05Open→03Declined Let's revisit this if anything happens again, it seems a sporadic issue. [15:25:22] !log hard rebooting cloudmetrics1002 T275605 [15:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:33] T275605: cloudmetrics1002: mysterious issue - https://phabricator.wikimedia.org/T275605 [15:29:17] PROBLEM - Prometheus cloudmetrics1002/labs restarted: beware possible monitoring artifacts on cloudmetrics1002 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [15:29:56] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1002.eqiad.wmnet with reason: Flaky host [15:29:57] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1002.eqiad.wmnet with reason: Flaky host [15:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:41] (03PS1) 10Ottomata: Remove SWAP / virtualenv based jupyterhub [puppet] - 
10https://gerrit.wikimedia.org/r/683916 (https://phabricator.wikimedia.org/T262847) [15:36:11] (03CR) 10Herron: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [15:52:12] (03PS2) 10Herron: add kafka-main[12]00[45] to existing kafka-main egress rules and broker lists [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) [15:54:36] (03CR) 10Herron: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:56:16] (03CR) 10Elukey: "Andrew: question about egress rules vs broker metadata :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:57:55] RECOVERY - Prometheus cloudmetrics1002/labs restarted: beware possible monitoring artifacts on cloudmetrics1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [16:01:43] (03PS6) 1001miki10: Disable ContentTranslation New article campaign in fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672416 (https://phabricator.wikimedia.org/T277473) [16:03:22] (03PS2) 10Herron: eventgate-logging-external: add codfw kafka-logging hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) [16:06:59] (03CR) 10Herron: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [16:15:33] (03PS1) 10Andrew Bogott: Trove: use our quay.io docker registry rather than docker hub. 
[puppet] - 10https://gerrit.wikimedia.org/r/683924 (https://phabricator.wikimedia.org/T212595) [16:27:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:29:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:29:34] (03CR) 10Ottomata: "sounds good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [16:30:46] (03CR) 10Ottomata: "Ya that should work!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [16:36:16] (03CR) 10Andrew Bogott: [C: 03+2] Trove: use our quay.io docker registry rather than docker hub. 
[puppet] - 10https://gerrit.wikimedia.org/r/683924 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [16:52:29] (03PS1) 10Andrew Bogott: Trove: specify the mysql container should come from quay.io [puppet] - 10https://gerrit.wikimedia.org/r/683928 (https://phabricator.wikimedia.org/T212595) [16:52:55] (03PS1) 10Bstorm: wikireplicas: redirect all database CNAMEs to the new system [puppet] - 10https://gerrit.wikimedia.org/r/683929 (https://phabricator.wikimedia.org/T260389) [16:54:14] (03CR) 10Andrew Bogott: [C: 03+2] Trove: specify the mysql container should come from quay.io [puppet] - 10https://gerrit.wikimedia.org/r/683928 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [16:57:56] (03CR) 10Bstorm: "There was an easy way to do this (just changing the IP addresses), but I wanted to leave the s1.analytics.db.svc.eqiad.wmflabs style addre" [puppet] - 10https://gerrit.wikimedia.org/r/683929 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:04:15] PROBLEM - Host elastic2033 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:37] (03PS3) 10Razzi: netboot: add reuse-analytics-raid1-2dev.cfg recipe for an-master and an-coord [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) [17:32:43] (03PS1) 10Ottomata: test/refine - use local schema repositories instead of schema service [puppet] - 10https://gerrit.wikimedia.org/r/683937 (https://phabricator.wikimedia.org/T280017) [17:34:18] (03CR) 10jerkins-bot: [V: 04-1] test/refine - use local schema repositories instead of schema service [puppet] - 10https://gerrit.wikimedia.org/r/683937 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:35:44] (03PS2) 10Ottomata: test/refine - use local schema repositories instead of schema service [puppet] - 10https://gerrit.wikimedia.org/r/683937 (https://phabricator.wikimedia.org/T280017) [17:37:27] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29329/console" [puppet] - 10https://gerrit.wikimedia.org/r/683937 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:38:16] (03CR) 10Ottomata: [V: 03+1 C: 03+2] test/refine - use local schema repositories instead of schema service [puppet] - 10https://gerrit.wikimedia.org/r/683937 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:44:52] 10SRE, 10Packaging, 10serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10Jdforrester-WMF) >>! In T267891#6626526, @Joe wrote: > CI can easily use the binaries from nodejs.org, as I stated above. I disagree, strongly. The purpose of CI is to simulate the... [17:47:23] (03PS1) 10Ottomata: test/refine require ::eventschemas [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) [17:48:25] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29330/console" [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:48:41] (03CR) 10jerkins-bot: [V: 04-1] test/refine require ::eventschemas [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:49:00] (03CR) 10Andrew Bogott: [C: 03+1] "fewer alarms!" 
[puppet] - 10https://gerrit.wikimedia.org/r/683739 (owner: 10Bstorm) [17:49:49] (03PS2) 10Ottomata: test/refine require ::eventschemas [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) [17:50:38] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29331/console" [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:51:00] (03PS3) 10Ottomata: test/refine require ::eventschemas [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) [17:51:07] (03CR) 10jerkins-bot: [V: 04-1] test/refine require ::eventschemas [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:51:54] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29332/console" [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:52:29] (03CR) 10Ottomata: [V: 03+1 C: 03+2] test/refine require ::eventschemas [puppet] - 10https://gerrit.wikimedia.org/r/683938 (https://phabricator.wikimedia.org/T280017) (owner: 10Ottomata) [17:53:07] (03CR) 10Andrew Bogott: [C: 03+1] "This seems right to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/683929 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:55:47] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/683929 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:56:44] (03CR) 10Bstorm: [C: 03+2] toolforge: remove spreadcheck for etcd [puppet] - 10https://gerrit.wikimedia.org/r/683739 (owner: 10Bstorm) [18:00:41] (03PS4) 10Razzi: netboot: add reuse-analytics-raid1-2dev.cfg recipe for an-master and an-coord [puppet] - 10https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) [18:04:24] (03CR) 10Andrew Bogott: "I don't understand why this won't break access for everyone who is currently using their wiki names rather than their shell names. Maybe " [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [18:04:57] (03CR) 10BryanDavis: "> Most users don't access the s[1-8] names directly, instead via a cname. I would be surprised if there isn't at least one person using th" [puppet] - 10https://gerrit.wikimedia.org/r/683929 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [18:13:50] 10SRE, 10Packaging, 10serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10Legoktm) >>! In T267891#7049437, @Jdforrester-WMF wrote: >>>! In T267891#6626526, @Joe wrote: >> CI can easily use the binaries from nodejs.org, as I stated above. > > I disagree, s... [18:19:31] 10SRE, 10Packaging, 10serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10Krinkle) I recommended Node 14 because I doubt we'll even manage to get even most production Node services to //start// using Node 12 through priorization/planning/CI/verify/beta/pro... 
[18:21:01] (03CR) 10Bstorm: [C: 04-1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [18:21:42] 10SRE, 10serviceops: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (10Legoktm) [18:27:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:30:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:46:07] (03PS1) 10Ottomata: Revert "test/refine - use local schema repositories instead of schema service" [puppet] - 10https://gerrit.wikimedia.org/r/683949 [18:47:31] (03CR) 10jerkins-bot: [V: 04-1] Revert "test/refine - use local schema repositories instead of schema service" [puppet] - 10https://gerrit.wikimedia.org/r/683949 (owner: 10Ottomata) [18:50:54] (03PS2) 10Ottomata: Revert "test/refine - use local schema repositories instead of schema service" [puppet] - 10https://gerrit.wikimedia.org/r/683949 [18:53:59] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/683929 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [18:55:32] (03CR) 10Ottomata: [C: 03+2] Revert "test/refine - use local schema repositories instead of schema service" [puppet] - 10https://gerrit.wikimedia.org/r/683949 (owner: 10Ottomata) [19:07:30] (03PS1) 10Legoktm: docker: Just have one keyring [puppet] - 10https://gerrit.wikimedia.org/r/683977 [19:07:32] (03PS1) 10Legoktm: docker: Build bullseye base image [puppet] - 10https://gerrit.wikimedia.org/r/683978 (https://phabricator.wikimedia.org/T281596) [19:07:34] (03PS1) 10Legoktm: docker: Stop copying config 
for each Debian version [puppet] - 10https://gerrit.wikimedia.org/r/683979 [19:12:40] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29335/console" [puppet] - 10https://gerrit.wikimedia.org/r/683977 (owner: 10Legoktm) [19:13:12] (03PS2) 10Legoktm: lists: Redirect listinfo pages to Postorius after migration [puppet] - 10https://gerrit.wikimedia.org/r/683775 (https://phabricator.wikimedia.org/T280893) [19:13:14] (03PS2) 10Legoktm: docker: Just have one keyring [puppet] - 10https://gerrit.wikimedia.org/r/683977 [19:13:16] (03PS2) 10Legoktm: docker: Build bullseye base image [puppet] - 10https://gerrit.wikimedia.org/r/683978 (https://phabricator.wikimedia.org/T281596) [19:13:18] (03PS2) 10Legoktm: docker: Stop copying config for each Debian version [puppet] - 10https://gerrit.wikimedia.org/r/683979 [19:13:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10wiki_willy) a:05wiki_willy→03Cmjohnson [19:14:43] (03CR) 10Legoktm: [C: 03+2] lists: Redirect listinfo pages to Postorius after migration [puppet] - 10https://gerrit.wikimedia.org/r/683775 (https://phabricator.wikimedia.org/T280893) (owner: 10Legoktm) [19:17:00] (03CR) 10Legoktm: [C: 03+2] docker: Just have one keyring [puppet] - 10https://gerrit.wikimedia.org/r/683977 (owner: 10Legoktm) [19:17:21] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/ should redirect to postorius - https://phabricator.wikimedia.org/T280893 (10Legoktm) 05Open→03Resolved Clicked on some stuff on https://lists.wikimedia.org/mailman... 
[19:24:27] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29336/console" [puppet] - 10https://gerrit.wikimedia.org/r/683978 (https://phabricator.wikimedia.org/T281596) (owner: 10Legoktm) [19:24:41] (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker: Build bullseye base image [puppet] - 10https://gerrit.wikimedia.org/r/683978 (https://phabricator.wikimedia.org/T281596) (owner: 10Legoktm) [19:28:01] (03PS1) 10Legoktm: Revert "docker: Build bullseye base image" [puppet] - 10https://gerrit.wikimedia.org/r/683954 [19:29:39] 10SRE, 10serviceops, 10Patch-For-Review: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (10Legoktm) ` root@deneb:/home/legoktm# DISTRIBUTIONS="bullseye" build-base-images Traceback (most recent call last): File "/usr/bin/bootstrap-vz", line 11, in loa... [19:29:46] (03CR) 10Legoktm: [C: 03+2] Revert "docker: Build bullseye base image" [puppet] - 10https://gerrit.wikimedia.org/r/683954 (owner: 10Legoktm) [19:32:53] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/ should redirect to postorius - https://phabricator.wikimedia.org/T280893 (10Ladsgroup) wohoooo [19:41:01] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:58] (03PS1) 10Ottomata: Finalize TranslationRecommendation refine migration [puppet] - 10https://gerrit.wikimedia.org/r/683984 (https://phabricator.wikimedia.org/T271163) [19:48:19] 10SRE, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jclark-ctr) [19:48:22] (03CR) 10Ottomata: [C: 03+2] Finalize TranslationRecommendation refine migration [puppet] - 10https://gerrit.wikimedia.org/r/683984 
(https://phabricator.wikimedia.org/T271163) (owner: 10Ottomata) [19:48:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (10Jclark-ctr) 05Open→03Resolved [19:55:31] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) added to Phabricator WMF-NDA group (https://phabricator.wikimedia.org/project/members/61/) [19:55:43] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) [19:56:58] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) added to Phabricator acl*sre-team https://phabricator.wikimedia.org/project/members/29/ [19:57:20] mutante: thanks! I was not sure about this one and thought we will tackle this on Monday [19:58:11] sukhe: np, WMF-NDA was clear to me, acl*sre-team I did not immediately think about, but.. yea.. done. should be all [19:58:43] that latter one might be used in some special custom policies [19:59:10] but not widespread I think, WMF-NDA is usually the default way to make tickets non-public [20:00:28] ah, yea, procurement tickets [20:00:28] IIRC I am just WMF-NDA and can create non-public tickets [20:01:12] one is "non-public but all of WMF" and the other is "non-public and really just SRE" [20:01:18] ah! [20:01:34] well, not sure about "all of WMF" either, but more than SRE for sure [20:03:42] also, the policies should be like "if you are the creator or someone subscribed you then you can view it even if not in other groups". so for example people can report an issue non-public and see it but doesn't mean they see all other private tickets [20:04:50] that's how it is currently though right? 
[20:05:11] I checked, you are in acl*sre as well [20:05:36] I just saw :P [20:05:49] Yea, I am saying should as in "pretty sure it is but not 100% because we once had discussions about that, but if not it -should- be" heh [20:06:04] ah :) [20:07:40] it was like that for security tickets, but they have separate ACLs from the "NDA" tickets and anyone could use that sre-team group in custom policies on individual tickets [20:12:39] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) @jcrespo It looked to me like it wasn't actually failed but just not scheduled yet, like I saw the job scheduled for May 1st and it was the 29th. I assumed the issue is the Icinga check can't distingu... [20:34:49] (03PS1) 10RLazarus: hieradata: Add pywikibot-bugs to mailman2_exclude_backups [puppet] - 10https://gerrit.wikimedia.org/r/683988 [20:47:27] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10wiki_willy) Hi @elukey - the rack space in A7 is pending on T280203. @Cmjohnson - you should be able to complete the move to A2 though - you just need to decom... [20:51:02] (03PS1) 10Ahmon Dancy: Change owner of /srv/patches to mwdeploy (from root) [puppet] - 10https://gerrit.wikimedia.org/r/683989 [21:01:13] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [21:24:10] (03CR) 10Dzahn: "wow, all of this is much more than expected. thanks" [puppet] - 10https://gerrit.wikimedia.org/r/683827 (owner: 10Jbond) [21:26:31] (03CR) 10Dzahn: "The deployment_group already has mode "7" as well, just like the owner root. 
And that is "wikidev" and I think all shell users are in wiki" [puppet] - 10https://gerrit.wikimedia.org/r/683989 (owner: 10Ahmon Dancy) [21:29:11] (03CR) 10Dzahn: [C: 03+2] peopleweb: ensure rsync is installed [puppet] - 10https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [21:33:22] (03CR) 10Bstorm: [C: 03+2] cloudstore: enable drbd on cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/683737 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [21:34:19] (03CR) 10Dzahn: "Notice: /Stage[main]/Profile::Microsites::Peopleweb/Package[rsync]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/683742 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [21:34:23] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [21:35:16] (03PS5) 10Dzahn: rsync::quickdatacopy: ensure a destination host gets an rsync client [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) [21:54:58] !log people1003 - rsyncing /home from people1002 [21:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:08] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Sergey.Trofimovsky.SF) Here it is, requesting settings review: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/gitlab-ansible/+/re... 
[22:06:33] RECOVERY - people.wikimedia.org requires authentication on people1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 1.006 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:08:32] 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5 - https://phabricator.wikimedia.org/T224747 (10Bstorm) [22:12:08] ah, syncing /home fixed the Icinga alert. makes sense.. only on this host, it's mod userdir after all [22:14:25] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29340/" [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [22:18:02] 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5 - https://phabricator.wikimedia.org/T224747 (10Bstorm) For all I know, I'm just trying the wrong port? I just figure the next port in line seems safe. [22:29:10] (03CR) 10Ahmon Dancy: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/683989 (owner: 10Ahmon Dancy) [22:30:38] 10SRE, 10Mail: Mail to root@lists1001.wikimedia.org doesn't work because of /etc/aliases file permissions - https://phabricator.wikimedia.org/T280744 (10RLazarus) Two moving parts here in the exim config, comparing lists1001's config to other hosts where root mail does work. #1 is that lists1001's config (via... [22:31:57] (03PS6) 10Dzahn: rsync::quickdatacopy: ensure a destination host gets an rsync client [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) [22:33:11] (03CR) 10RLazarus: "As discussed -- leaving this until later so we can use the mail from check_exclude_backups to conveniently debug T280744. 
This isn't urgen" [puppet] - 10https://gerrit.wikimedia.org/r/683988 (owner: 10RLazarus) [22:38:38] (03PS1) 10Jbond: P:pki::multirootca: use hardcoded sources for pki certs [puppet] - 10https://gerrit.wikimedia.org/r/683997 [22:39:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29341/console" [puppet] - 10https://gerrit.wikimedia.org/r/683997 (owner: 10Jbond) [22:41:03] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/683989 (owner: 10Ahmon Dancy) [22:41:43] (03CR) 10Dzahn: [C: 03+1] Change owner of /srv/patches to mwdeploy (from root) [puppet] - 10https://gerrit.wikimedia.org/r/683989 (owner: 10Ahmon Dancy) [22:42:04] (03CR) 10Ahmon Dancy: "> I see! Would you mind linking the ticket in commit message?" [puppet] - 10https://gerrit.wikimedia.org/r/683989 (owner: 10Ahmon Dancy) [22:42:29] RECOVERY - WDQS high update lag on wdqs2001 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.157e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:43:11] PROBLEM - AuthDNS-over-TLS Works on authdns1001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [22:43:54] (03PS2) 10Ahmon Dancy: Change owner of /srv/patches to mwdeploy (from root) [puppet] - 10https://gerrit.wikimedia.org/r/683989 (https://phabricator.wikimedia.org/T245184) [22:44:59] (03PS2) 10Jbond: P:pki::multirootca: use hardcoded sources for pki certs [puppet] - 10https://gerrit.wikimedia.org/r/683997 [22:45:27] RECOVERY - AuthDNS-over-TLS Works on authdns1001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [22:45:48] Just took that over to -traffic but already resolved. 
ok [22:51:04] (03PS3) 10Jbond: P:pki::multirootca: use hardcoded sources for pki certs [puppet] - 10https://gerrit.wikimedia.org/r/683997 [22:56:52] (03CR) 10Dzahn: rsync::quickdatacopy: ensure a destination host gets an rsync client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [23:04:07] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/29342/" [puppet] - 10https://gerrit.wikimedia.org/r/683741 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [23:18:31] 10SRE, 10Services, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) - added tokens in CI::master and deployment_server in private repo (just like for shellbox) [23:34:09] (03PS1) 10Dzahn: Add k8s dummy tokens for miscweb [labs/private] - 10https://gerrit.wikimedia.org/r/684000 (https://phabricator.wikimedia.org/T281538) [23:36:38] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "added in private repo, just like shellbox" [labs/private] - 10https://gerrit.wikimedia.org/r/684000 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)