[00:01:03] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:08] !log sudo cumin 'cp2028* or cp2036* or cp2039* or cp4022* or cp4025* or cp4028* or cp4031*' 'systemctl restart purged' -b 3 - T267865 [00:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:16] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [00:09:45] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2039 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [00:11:11] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2036 is OK: (C)5000 gt (W)3000 gt 987.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036 [00:11:27] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2028 is OK: (C)5000 gt (W)3000 gt 125.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028 [00:11:35] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4022 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022 [00:12:29] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4025 is OK: (C)5000 gt (W)3000 gt 195.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025 [00:17:07] RECOVERY - Juniper virtual chassis ports on 
asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [00:17:11] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms [00:17:11] RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [00:17:11] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms [00:17:13] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [00:17:13] RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 30.74 ms [00:17:15] RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms [00:17:15] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 30.13 ms [00:17:15] RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [00:17:15] RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [00:17:17] RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 31.23 ms [00:17:27] RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms [00:19:13] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [00:19:37] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [00:19:39] PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% [00:19:39] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [00:19:39] !log run 'systemctl mask kafka' and 'systemctl mask kafka-mirror-main-eqiad_to_main-codfw@0' on kafka-main2003 (for the brief moment when it was up) to avoid purged issues - T267865 [00:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:46] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [00:20:03] PROBLEM - Number of messages locally queued by purged for processing on cp4028 is CRITICAL: cluster=cache_text instance=cp4028 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [00:20:07] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [00:20:15] hopefully this should be enough for purged [00:20:21] PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100% [00:20:33] PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100% [00:20:33] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [00:20:55] PROBLEM - Number of messages locally queued by purged for processing on cp4031 is CRITICAL: cluster=cache_text instance=cp4031 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [00:21:09] PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [00:21:09] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [00:21:45] PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100% [00:22:13] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [00:24:37] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: (C)5000 gt (W)3000 gt 362.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [00:25:09] RECOVERY - Number of messages locally queued by purged for processing on cp4028 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [00:26:09] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 117.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [00:27:43] RECOVERY - Number of messages locally queued by purged for processing on cp4031 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [00:32:15] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ms-be2031.codfw.wmnet, peek2001.codfw.wmnet, cp2037.codfw.wmnet, deploy1002.eqiad.wmnet, wdqs1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [00:35:37] RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.25 ms [00:35:39] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [00:35:45] RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [00:35:45] RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 31.77 ms [00:35:45] RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [00:35:45] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 31.76 ms [00:35:49] RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [00:35:49] RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [00:35:51] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 34.67 ms [00:35:51] RECOVERY - Host thanos-be2003 is UP: PING OK - 
Packet loss = 0%, RTA = 33.42 ms [00:36:05] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms [00:36:07] RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms [00:36:37] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [00:36:37] PROBLEM - Check systemd state on kafka-main2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:50] PROBLEM - Kafka Broker Server #page on kafka-main2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [00:37:23] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:38:05] hi [00:39:24] hi, be at keys in five [00:39:34] elukey: is kafka on kafka-main2003 part of the work you were doing earlier? [00:40:47] context looks like it's https://phabricator.wikimedia.org/T267865, I'm reading back [00:41:05] I haven't looked at anything in a few days though, catching up as fast as I can :) [00:41:21] sunday funday huh [00:41:33] the switch has been flapping (bringing all the hosts up and down from the network’s perspective) [00:43:09] okay looking back at what elukey was doing, I believe kafka is deliberately disabled on that host [00:43:22] cdanis: seeing that too. 
the unit is masked [00:43:26] yeah that's the conclusion I'm coming to also [00:43:30] RECOVERY - Kafka Broker Server #page on kafka-main2003 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [00:43:31] https://sal.toolforge.org/log/969rznUBhxWNv8gI6EN5 [00:43:32] so just a missed downtime then? [00:43:36] yeah [00:43:55] I am going to downtime kafka-main2003 and all services [00:44:02] +1 [00:44:07] PROBLEM - Kafka Broker Replica Max Lag on kafka-main2003 is CRITICAL: 1.055e+06 ge 5e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [00:44:11] sounds good [00:44:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [00:44:23] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [00:44:57] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2003 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [00:44:57] RECOVERY - Check systemd state on kafka-main2003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:45:26] ah, and VO resolved the incident [00:46:16] anyway, schedule downtime until 15:44 UTC tomorrow, which isn't too late in the day for either timezone [00:46:35] ヘ( ^o^)ノ\(^_^ ) [00:46:38] SGTM [00:46:51] see y'all tomorrow then, thanks [00:46:56] 👋 [00:47:11] ttyl! [00:47:21] RECOVERY - Kafka Broker Replica Max Lag on kafka-main2003 is OK: (C)5e+05 ge (W)1e+05 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [00:47:23] 🤞 g’night [00:50:09] Operations, Traffic, HTTPS, Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (Krenair) I don't think we use certbot anywhere except maybe Gerrit. This ticket hasn't been updated since the acme-chief deployment, which is now being used for the... [00:50:57] hmm, it looks like Puppet unmasked kafka and restarted it [00:54:57] PROBLEM - Number of messages locally queued by purged for processing on cp2037 is CRITICAL: cluster=cache_text instance=cp2037 job=purged layer=frontend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [00:55:47] !log re-applied mask to kafka and kafka-mirror-main-eqiad_to_main-codfw@0 on kafka-main2003 and disabled puppet to prevent restart - T267865 [00:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:54] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [00:59:57] RECOVERY - Number of messages locally queued by purged for processing on cp2037 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [01:10:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [01:10:55] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [01:37:25] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [01:37:41] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [01:37:47] PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:05] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:05] PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:07] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:47] PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:47] PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:47] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:01] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [01:40:53] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:40:59] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [01:59:11] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:59] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:21:19] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2028 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [02:25:55] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:51:59] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2028 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [03:32:07] PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [03:57:13] PROBLEM - swift codfw object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [06:00:11] 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (10Krinkle) [06:01:47] (03Abandoned) 10Marostegui: core-mysql.my.cnf.erb: Change expire_log_days [puppet] - 
10https://gerrit.wikimedia.org/r/640293 (owner: 10Marostegui) [06:02:29] 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (10Krinkle) @BBlack Based on the three references you've made to this ticket over the past two years, I guess this has de-facto been accepted as-is. Should we document... [06:02:50] !log Restart mysql on db1115 (tendril/dbtree) due to memory usage [06:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1018, es1015, es1019 - T261717', diff saved to https://phabricator.wikimedia.org/P13262 and previous config saved to /var/cache/conftool/dbconfig/20201116-060624-marostegui.json [06:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:31] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:06:43] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [06:11:08] ^ expected due to db1115's reboot [06:11:25] (03PS1) 10Marostegui: mariadb: Productionize es1032-es1034 [puppet] - 10https://gerrit.wikimedia.org/r/641056 (https://phabricator.wikimedia.org/T261717) [06:11:45] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 92421 bytes in 0.801 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [06:13:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es1032-es1034 [puppet] - 10https://gerrit.wikimedia.org/r/641056 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [06:14:06] !log Stop MySQL on es1018, es1015, es1019 to clone es1032, es1033, es1034 - T261717 [06:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:12] T261717: 
Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:20:03] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:03] Operations, serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Krinkle) [06:34:27] Operations, serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Krinkle) [06:35:10] Operations, serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Krinkle) [06:35:15] !log Stop replication on s3 codfw master (db2105) for MCR schema change deployment T238966 [06:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:24] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. 
- https://phabricator.wikimedia.org/T238966 [06:38:02] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f392a6594a8: Failed to establish a new connection: [Errno 111] Connection [06:38:02] ://wikitech.wikimedia.org/wiki/Search%23Administration [06:38:59] (03PS3) 10Marostegui: wikireplicas: set up site.pp and hosts hiera for new servers [puppet] - 10https://gerrit.wikimedia.org/r/639815 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [06:45:08] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:30] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_nodes: 6, number_of_pending_tasks: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 483, delayed_unassigned_shards: 0, active_shards: 916, initializing_shards: 0, relocating_shards: 0, status: green, number_of_data_nodes: 3, timed_out: False, task_max_waiting_in_queue_mill [06:46:30] _shards: 0, cluster_name: production-logstash-eqiad, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [06:53:01] 10Operations, 10serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10Joe) 05Open→03Declined Not sure what this task rationale is. Debian buster has node10, https://packages.debian.org/buster/nodejs and will provide security updates until at least 202... 
[06:54:38] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:56:16] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:57:02] Operations, serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Krinkle) I don't think Debian provides security support for the 1,446,739 packages on npmjs.org. It won't be long before our production services or CI tooling will no longer function on a su... [07:06:09] Operations, ops-codfw, netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (elukey) Just added two days of downtime to all the hosts in the rack, hopefully it will be less spammy. As follow up of this task I think that we should prioritize T225005, having only 3 ka... [07:22:26] Operations, serviceops, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (elukey) Adding some thoughts about mc1036, in my opinion it is really flying with the new config :) With the extra +20G... 
[07:24:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::lvs::realserver: use poolcounter for guarding service restarts [puppet] - 10https://gerrit.wikimedia.org/r/640928 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [07:27:04] RECOVERY - Host elastic2048 is UP: PING WARNING - Packet loss = 50%, RTA = 216.49 ms [07:27:04] RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [07:27:04] RECOVERY - Host ms-be2036 is UP: PING WARNING - Packet loss = 60%, RTA = 30.15 ms [07:27:04] RECOVERY - Host ms-be2049 is UP: PING WARNING - Packet loss = 71%, RTA = 30.16 ms [07:27:06] RECOVERY - Host cp2038 is UP: PING WARNING - Packet loss = 33%, RTA = 30.25 ms [07:27:06] RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [07:27:06] RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms [07:27:06] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [07:27:06] RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 31.48 ms [07:27:06] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [07:27:08] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 39.34 ms [07:27:25] really weird [07:28:26] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:29:44] RECOVERY - swift codfw object availability low on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [07:30:46] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:32:20] RECOVERY - swift codfw container availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [07:37:02] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.46e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [07:40:20] (03PS1) 10Elukey: role::analytics_cluster::coordinator::replica: use kerberos [puppet] - 10https://gerrit.wikimedia.org/r/641128 [07:42:52] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator::replica: use kerberos [puppet] - 10https://gerrit.wikimedia.org/r/641128 (owner: 10Elukey) [07:43:40] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01406 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [07:47:11] 04Critical Alert for device asw-c-codfw.mgmt.codfw.wmnet - Juniper alarm active [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201116T0800) [08:04:15] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Use a single "ssh-agent" systemd unit [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/639912 (owner: 10Giuseppe Lavagetto) [08:07:20] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:08:37] !log asw-c-codfw> request system power-off member 7 - T267865 [08:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:45] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [08:08:56] XioNoX: thanks :) [08:10:56] error: timeout waiting for response from fpc7 [08:10:56] error: request-power-off failed on fpc7 [08:10:57] er [08:12:20] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:13:03] actually it might have worked, from the logs [08:13:22] !log joal@deploy1001 Started deploy [analytics/refinery@3df51cb]: Analytics special train for webrequest table update [analytics/refinery@3df51cb] [08:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:28] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:13:43] (03CR) 10Nikerabbit: [C: 04-1] Remove wgContentTranslationRESTBase config (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634956 (https://phabricator.wikimedia.org/T266213) (owner: 10KartikMistry) [08:13:52] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [08:20:44] RECOVERY - 
Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [08:21:30] ACKNOWLEDGEMENT - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms ayounsi https://phabricator.wikimedia.org/T267865 - The acknowledgement expires at: 2020-11-17 08:21:10. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:21:30] ACKNOWLEDGEMENT - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal ayounsi https://phabricator.wikimedia.org/T267865 - The acknowledgement expires at: 2020-11-17 08:21:10. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:21:30] ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal ayounsi https://phabricator.wikimedia.org/T267865 - The acknowledgement expires at: 2020-11-17 08:21:10. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:21:30] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T267865 - The acknowledgement expires at: 2020-11-17 08:21:10. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:06] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:10] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-c-codfw.mgmt.codfw.wmnet recovered from Juniper alarm active [08:23:32] !log joal@deploy1001 Finished deploy [analytics/refinery@3df51cb]: Analytics special train for webrequest table update [analytics/refinery@3df51cb] (duration: 10m 09s) [08:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:59] !log joal@deploy1001 Started deploy [analytics/refinery@3df51cb] (thin): Analytics special train for webrequest table update THIN [analytics/refinery@3df51cb] [08:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:07] !log joal@deploy1001 Finished deploy [analytics/refinery@3df51cb] (thin): Analytics special train for webrequest table update THIN [analytics/refinery@3df51cb] (duration: 00m 07s) [08:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:27:31] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10Jgiannelos) >>! In T266373#6616778, @akosiaris wrote: >>>! In T266373#6613038, @Jgiannelos wrote: >> @akosi... 
[08:27:36] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [08:30:58] RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [08:32:16] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:32:39] !log asw-c-codfw> request system power-off member 7 - T267865 [08:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:45] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [08:33:25] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10Jgiannelos) >>! In T266373#6617586, @akosiaris wrote: >> Interestingly, proton returns transfer-encoding: c... [08:35:08] alright I think it's powered down for real this time... 
[08:35:48] !log installing codemirror-js security updates [08:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:42] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [08:39:34] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:50] !log centrallog1001 move invalid config /etc/logrotate.d/logrotate-debug to /etc [08:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:10] 04Critical Alert for device asw-c-codfw.mgmt.codfw.wmnet - Juniper alarm active [08:58:53] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T267872 (10jijiki) p:05Triage→03Medium [09:00:03] 10Operations, 10Traffic: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 (10jijiki) p:05Triage→03Medium [09:03:40] (03CR) 10Filippo Giunchedi: [C: 03+1] smart: add metric to track number of devices detected [puppet] - 10https://gerrit.wikimedia.org/r/640473 (https://phabricator.wikimedia.org/T267135) (owner: 10Cwhite) [09:05:48] hashar: ok to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/640850 ? [09:06:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/640850 (owner: 10Hashar) [09:06:38] godog: yeah for sure ! :] [09:06:41] 10Operations, 10serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10hashar) 05Declined→03Open The rationale is that developers are adopting newer versions on a different timeline than the Debian releases. Either we are ahead (the case for NodeJS) and/or... 
[09:06:43] I actually tested that one locally [09:06:50] I was running the regex using a wrong assumption [09:07:00] the patterns are matched against the gerrit metrics which are slash-separated [09:07:19] while I wrote my regex using the metrics exposed by prometheus which have slashes replaced by underscores ... [09:07:41] annnd [09:08:11] I have created a bunch of dashboards last week, even imported some grafana json from upstream https://grafana.wikimedia.org/d/uXZMn9PWz/overview-upstream \o/ [09:09:24] 10Operations, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) Once the PDU are installed please let #observability know. At minimum we'd need to test librenms discovery and their SNMP MIB to snmp-exporter for pulling power da... [09:10:35] hashar: neat! [09:10:40] ok I'll merge [09:10:52] (03CR) 10Filippo Giunchedi: [C: 03+2] gerrit: fix Prometheus excludeMetrics patterns [puppet] - 10https://gerrit.wikimedia.org/r/640850 (owner: 10Hashar) [09:12:09] and last thursday I looked at having the metrics collected from a single prometheus system but could not figure it out [09:12:34] I could also use a collector for the other gerrit instance ( gerrit-replica ) but ditto, I wasn't sure how to express that in puppet without a bunch of copy pasting :-\ [09:13:00] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 123 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:13:03] yeah that's indeed the other missing bit, are the metrics available when pulled using the internal gerrit hostname ? [09:13:11] as opposed to gerrit.w.o that is [09:13:22] that would be the optimal solution IMHO [09:14:14] using the hostname instead?
I can check that [09:14:38] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 15 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:14:48] but I think they use different IP [09:16:07] yeah I think that'd be ok [09:16:09] yeah different IP :-\ [09:16:35] the prometheus metrics are exposed by the java daemon which is on gerrit.wikimedia.org [09:16:53] while the hostname is gerrit1001.wikimedia.org with different IP addresses [09:17:17] or to say it otherwise, Gerrit doesn't listen on gerrit1001.wikimedia.org ip addresses, only on the service IP assigned for gerrit.wikimedia.org [09:17:31] same for the other gerrit instance ( gerrit2001.wikimedia.org / gerrit-replica.wikimedia.org ) [09:18:22] ah I see, ok so yeah that wouldn't work indeed [09:19:15] I mean yes and no, we're asking apache for metrics not gerrit [09:19:38] via https that is, would that work when queried with the internal hostname ? [09:28:09] (03CR) 10JMeybohm: [C: 03+1] Add apache httpd base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto) [09:28:30] 10Operations, 10ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T267748 (10fgiunchedi) a:03Papaul @Papaul looks like the SSD is busted on this host. Host is OOW I think, we'll need a replacement SSD, thanks! [09:29:24] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/639824 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [09:29:52] 10Operations, 10serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10hashar) Reply from the package uploader: > indeed, all other things being equal, nodejs 12.x will be in debian 11. 
> (unless a developer starts working full time on transitioning nodejs 14... [09:29:57] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T267872 (10fgiunchedi) Replacement BBU needed, looks like the host is OOW. cc @Cmjohnson @Jclark-ctr [09:30:15] hashar: ^ WDYT? [09:30:30] godog: apache doesn't listen on gerrit1001.wikimedia.org [09:31:09] I mean, neither on the IP nor is there any virtualhost section for it in apache config [09:31:34] ack, ok so no joy [09:31:43] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [09:31:43] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=99) [09:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:12] so the tl;dr is that we'd basically need to write the gerrit.wikimedia.org hostname in a yaml file only when running in the same site as gerrit itself [09:33:16] I'll think about it a little [09:33:33] sure [09:33:45] at least we have metrics for the main gerrit service which is a large improvement ;] [09:35:03] the task also referred to a generic jmx_reporter and I could not quite understand where it is defined nor how to define it [09:35:17] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T267872 (10fgiunchedi) Nevermind, let's follow up on {T267870} [09:35:31] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/639785 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [09:35:33] I believe the JavaMelody exposes some similar metrics, though most probably under different names [09:36:08] for jmx_exporter on the JVM side it is essentially an additional argument on the jvm command line, and an optional config file [09:36:24] I have to go afk for 10 min, brb [09:36:33] godog: sure see you :] [09:44:15] 10Operations,
10Packaging, 10serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10Peachey88) [09:47:47] (03CR) 10Kormat: [C: 04-1] prometheus::mysqld_exporter::instance: Refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640210 (owner: 10Jbond) [09:49:49] hashar: I couldn't find anything in puppet to "hook" that says in which site gerrit is currently active, might as well leave it as it is [09:54:20] or maybe only eqiad and codfw, next best thing [09:54:57] up to you I guess [09:55:10] I had the dashboard configured to use Thanos as the datasource [09:55:47] and I think I filter the metrics based on {instance=gerrit.wikimedia.org, site=eqiad} [09:56:16] which also means that if we switch over gerrit.wikimedia.org to codfw, the dashboards will need to be updated [09:58:38] no they won't, site=codfw or site=eqiad doesn't matter because they all have the same data, that's the issue [09:58:41] "issue" [09:59:04] ahh [09:59:10] (03PS1) 10Filippo Giunchedi: prometheus: limit gerrit polling to codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/641144 (https://phabricator.wikimedia.org/T184086) [10:00:33] (03CR) 10jerkins-bot: [V: 04-1] prometheus: limit gerrit polling to codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/641144 (https://phabricator.wikimedia.org/T184086) (owner: 10Filippo Giunchedi) [10:02:15] (03PS2) 10Filippo Giunchedi: prometheus: limit gerrit polling to codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/641144 (https://phabricator.wikimedia.org/T184086) [10:03:29] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10observability, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10fgiunchedi) >>! In T184086#6616580, @fgiunchedi wrote: > The patch is live, unfortunately due to how our Prometheus puppetization works it...
[10:05:33] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26424" [puppet] - 10https://gerrit.wikimedia.org/r/641144 (https://phabricator.wikimedia.org/T184086) (owner: 10Filippo Giunchedi) [10:08:01] (03CR) 10Hashar: [C: 03+1] prometheus: limit gerrit polling to codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/641144 (https://phabricator.wikimedia.org/T184086) (owner: 10Filippo Giunchedi) [10:08:26] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26425" [puppet] - 10https://gerrit.wikimedia.org/r/641144 (https://phabricator.wikimedia.org/T184086) (owner: 10Filippo Giunchedi) [10:10:09] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: limit gerrit polling to codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/641144 (https://phabricator.wikimedia.org/T184086) (owner: 10Filippo Giunchedi) [10:12:28] godog: and to add a second hostname (gerrit-replica.wikimedia.org) we would want to copy paste and use tls_config => { 'server_name' => 'gerrit-replica.wikimedia.org' } ? 
[10:13:38] hashar: in that case I'd recommend writing both hostnames to "${targets_path}/gerrit.yaml" instead [10:13:48] and remove tls_config [10:15:04] (03CR) 10Jbond: [C: 03+2] P:tendril::webserver: migrate to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/639824 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:15:27] yup I have seen ${targets_path}/gerrit.yaml [10:15:36] but can't find the file nor how it is generated [10:15:47] (03CR) 10Jbond: [C: 03+2] mariadb: migrate to ensure_packages and minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/639785 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [10:16:30] you'd have to write it yourself, with a 'targets' list as mentioned here https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config [10:17:18] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: split memcached servers and port [puppet] - 10https://gerrit.wikimedia.org/r/639014 (owner: 10Filippo Giunchedi) [10:21:21] (03PS3) 10Kormat: orchestrator: Require ssl connections to db servers [puppet] - 10https://gerrit.wikimedia.org/r/639765 (https://phabricator.wikimedia.org/T267401) [10:21:35] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/640664 (owner: 10Ayounsi) [10:21:56] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01978 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:23:27] ^ that doesn't sound good [10:23:58] looking [10:24:09] jbond42: https://puppetboard.wikimedia.org/report/db1100.eqiad.wmnet/654f5a05c7bb97814f57052e468dd267bc312354 ensure_packages() failure [10:24:34] revert [10:24:46] no please dont it will take 5 mins to push a fix [10:25:55] jbond42: just missing the `[]` around the list of packages, yeah? 
[10:26:13] (03PS1) 10Jbond: mariadb: use correct number of args [puppet] - 10https://gerrit.wikimedia.org/r/641146 [10:26:15] kormat: yes thanks please see ^^ [10:26:20] on it [10:26:32] (03CR) 10Kormat: [C: 03+1] mariadb: use correct number of args [puppet] - 10https://gerrit.wikimedia.org/r/641146 (owner: 10Jbond) [10:26:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] mariadb: use correct number of args [puppet] - 10https://gerrit.wikimedia.org/r/641146 (owner: 10Jbond) [10:27:31] fixed will run puppet on failed hosts [10:27:44] thanks :) [10:30:18] (03CR) 10Effie Mouzeli: "> Patch Set 3:" [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans) [10:33:20] PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [10:35:16] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001276 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:39:05] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/638125 (owner: 10Hnowlan) [10:42:29] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [10:44:27] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime [10:44:27] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:42] (03PS1) 10Klausman: analytics: Switch stat1008 to use rocm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/641147 (https://phabricator.wikimedia.org/T264408) [10:44:44] (03CR) 
10Gehel: "Still minor comments inline. Feel free to ignore the split in multiple CR, this is trivial enough." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345) (owner: 10Ryan Kemper) [10:45:01] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime [10:45:02] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:06] 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ops-monitoring-bot) Icinga downtime for 1 day, 0:00:00 set by dcaro@cumin1001 on 1 host(s) and their services with reason: The switch it depends on is down ` cloudbackup2002.codfw.wmnet ` [10:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:30] (03CR) 10Klausman: [C: 03+2] analytics: Switch stat1008 to use rocm 3.8 [puppet] - 10https://gerrit.wikimedia.org/r/641147 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:45:38] (03CR) 10Ayounsi: [C: 03+2] Add python 3.8 to tox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/640664 (owner: 10Ayounsi) [10:46:06] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime [10:46:06] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:39] 10Operations, 10LDAP-Access-Requests: Add gmodena to wmf LDAP group - https://phabricator.wikimedia.org/T267913 (10hnowlan) [10:47:17] is the c7 switch in codfw flapping up/down ? I was looking at the swift availability alert [10:48:10] XioNoX: ^ maybe ? 
[10:48:16] godog: it was yes, then Arzhel stopped it earlier on [10:48:26] so in theory it shouldn't flap anymore [10:48:47] I think it was up and went down again just now, judging from the alert [10:48:51] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-DC=codfw&viewPanel=8&from=1605416979384&to=1605523511146 [10:48:52] lovely [10:49:22] anyways not a huge deal on the swift side, aside from the alert noise [10:49:22] godog: doesn't look like it from the logs [10:49:26] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:28] (03PS7) 10Jbond: prometheus::mysqld_exporter::instance: drop arguments paramteter [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) [10:50:12] (03CR) 10Jbond: "check experimental" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [10:50:28] godog: I guess I can get Gerrit prometheus metrics exposed to https://gerrit1001.wikimedia.org/ instead of the service host. That will make the prometheus config wayyy easier (after I have read we can query the targets from puppetdb) ;) [10:50:34] will look at that tomorrow :] [10:50:46] thx for the hints! [10:50:47] (03PS1) 10Hnowlan: ldap: add gmodena to users group [puppet] - 10https://gerrit.wikimedia.org/r/641149 (https://phabricator.wikimedia.org/T267913) [10:51:44] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:51:55] XioNoX: as in the switch didn't flap? like it was up until say 30 min ago and now down again [10:52:21] hashar: nice, thanks for taking a look, yes definitely having metrics on the internal hostnames will be easier [10:53:34] godog: last time it went down according to the switch was 08:36:02 UTC [10:54:51] XioNoX: ack, ok thank you! 
[10:55:32] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10dcaro) Isn't it better to put this one as 'depends on'? That way when we check for issues with the host cloudbackup2002 we will find an open task, that still will depend... [10:55:53] godog: https://imgflip.com/i/4mk5b1 [10:57:13] ahahahhaahhah [10:57:34] PROBLEM - swift codfw object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [10:59:36] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10observability, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) From a discussion with Filippo, the metrics are exposed on the service hostname. That makes it a bit hard to configure the Prometh... [11:01:02] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [11:02:52] XioNoX: hahahah! [11:02:59] :) [11:03:30] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [11:05:37] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10akosiaris) >>! In T266373#6623109, @Jgiannelos wrote: >>>! In T266373#6616778, @akosiaris wrote: >>>>! In T... 
[11:06:42] (03CR) 10Jbond: [C: 03+2] P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [11:06:46] (03CR) 10Jbond: [C: 03+2] O:idp_test: Enable CORS on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [11:06:56] (03PS7) 10Jbond: P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) [11:07:04] (03PS6) 10Jbond: O:idp_test: Enable CORS on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) [11:13:41] !log installing poppler security updates [11:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:34] (03PS1) 10Filippo Giunchedi: grafana: don't redirect cas (re)-auth requests [puppet] - 10https://gerrit.wikimedia.org/r/641150 (https://phabricator.wikimedia.org/T267645) [11:16:27] (03CR) 10Vgutierrez: [C: 03+2] varnish: Improve wording of the browser security error a bit [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup) [11:16:53] (03CR) 10Filippo Giunchedi: "This should address the bug referenced in the task, please let me know what you think!" 
[puppet] - 10https://gerrit.wikimedia.org/r/641150 (https://phabricator.wikimedia.org/T267645) (owner: 10Filippo Giunchedi) [11:18:37] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review, and 2 others: sec-warning page uses the term "Wikipedia" incorrectly - https://phabricator.wikimedia.org/T241656 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup [11:20:29] (03PS4) 10Volans: cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 [11:20:54] (03CR) 10Volans: "As requested/agreed I've reverted the thousands separator bit" [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans) [11:27:15] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641151 (https://phabricator.wikimedia.org/T128546) [11:49:03] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [11:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:12] !log roll restarting sessionstore for java updates [11:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:45] (03PS2) 10Gilles: Regenerate Bengali Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640816 (https://phabricator.wikimedia.org/T265553) (owner: 10Zoranzoki21) [11:53:50] (03CR) 10Gilles: "I've just applied the new recommended optimisation steps documented here: https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Chan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640816 (https://phabricator.wikimedia.org/T265553) (owner: 10Zoranzoki21) [12:00:14] (03CR) 10Gilles: [C: 03+1] coal: use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/640226 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [12:06:57] 10Operations, 10serviceops, 10Kubernetes: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) [12:10:25] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [12:10:25] !log 
hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:09] 10Operations, 10LDAP-Access-Requests: LDAP access for Tobias Schumann - https://phabricator.wikimedia.org/T267917 (10JanJaquemot) [12:23:15] 10Operations, 10LDAP-Access-Requests: LDAP access for Tobias Schumann - https://phabricator.wikimedia.org/T267917 (10MoritzMuehlenhoff) JFTR, this would be for the cn=nda LDAP group, not cn=wmf. [12:24:11] (03PS1) 10Jbond: apereo_cas: update config keys to use correct values [puppet] - 10https://gerrit.wikimedia.org/r/641157 [12:24:54] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [12:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:01] (03CR) 10Jbond: [C: 03+2] apereo_cas: update config keys to use correct values [puppet] - 10https://gerrit.wikimedia.org/r/641157 (owner: 10Jbond) [12:25:29] !log roll-restarting restbase-codfw [12:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:37] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [12:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:28] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:42] (03PS1) 10Muehlenhoff: pws: When building the keyring, read keys from local directory [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641159 [12:34:26] (03PS1) 10Jbond: idp_test: allow only alerts.wikimedia.org to do CORS [puppet] - 10https://gerrit.wikimedia.org/r/641160 [12:35:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [12:35:22] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:29] (03CR) 10Jbond: [C: 03+2] idp_test: allow only alerts.wikimedia.org to do CORS [puppet] - 10https://gerrit.wikimedia.org/r/641160 (owner: 10Jbond) [12:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:36] (03CR) 10Kormat: prometheus::mysqld_exporter::instance: drop arguments paramteter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [12:43:48] !log installing tcpdump security updates [12:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:22] (03PS1) 10Jbond: apereo_cas: dont restart tomcat on the active node [puppet] - 10https://gerrit.wikimedia.org/r/641161 [12:45:24] (03PS1) 10Jbond: idp: enable cors for alerts.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/641162 (https://phabricator.wikimedia.org/T267186) [12:46:51] (03CR) 10Jbond: "With the previous patch we can safely merge this whenever and the next time the production server is restarted i.e. 6.2.4 upgrade the new" [puppet] - 10https://gerrit.wikimedia.org/r/641162 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [12:49:15] (03CR) 10Muehlenhoff: [C: 03+1] "Ack, makes sense."
[puppet] - 10https://gerrit.wikimedia.org/r/641161 (owner: 10Jbond) [12:49:19] (03PS8) 10Jbond: prometheus::mysqld_exporter::instance: drop arguments paramteter [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) [12:49:53] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [12:50:21] (03CR) 10Jbond: [C: 03+2] apereo_cas: dont restart tomcat on the active node [puppet] - 10https://gerrit.wikimedia.org/r/641161 (owner: 10Jbond) [12:58:05] (03CR) 10Kormat: prometheus::mysqld_exporter::instance: drop arguments paramteter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [12:59:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:31] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:06] !log running schema change against s1 in codfw T259831 [13:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:12] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:01:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/641150 (https://phabricator.wikimedia.org/T267645) (owner: 10Filippo Giunchedi) [13:03:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, but let's rather merge after the 6.2 update to not entangle the two?" 
[puppet] - 10https://gerrit.wikimedia.org/r/641162 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [13:09:53] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [13:11:52] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: don't redirect cas (re)-auth requests [puppet] - 10https://gerrit.wikimedia.org/r/641150 (https://phabricator.wikimedia.org/T267645) (owner: 10Filippo Giunchedi) [13:16:09] (03PS1) 10Filippo Giunchedi: profile: redirect to grafana-rw with referer [puppet] - 10https://gerrit.wikimedia.org/r/641164 (https://phabricator.wikimedia.org/T267645) [13:16:23] (03CR) 10Filippo Giunchedi: "Attempt #2" [puppet] - 10https://gerrit.wikimedia.org/r/641164 (https://phabricator.wikimedia.org/T267645) (owner: 10Filippo Giunchedi) [13:18:35] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/641162 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [13:22:04] (03CR) 10Jbond: "LGTM might be nicer to implement in ruby but looks like pws already does a lot of shell out so..."
(031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641159 (owner: 10Muehlenhoff) [13:23:56] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) [13:23:59] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/641149 (https://phabricator.wikimedia.org/T267913) (owner: 10Hnowlan) [13:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:35] (03PS9) 10Jbond: prometheus::mysqld_exporter::instance: drop arguments paramteter [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) [13:25:47] (03CR) 10Jbond: "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [13:25:57] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [13:33:17] (03CR) 10Kormat: [C: 03+1] "LGTM, thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [13:37:17] (03CR) 10Muehlenhoff: pws: When building the keyring, read keys from local directory (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641159 (owner: 10Muehlenhoff) [13:38:09] (03CR) 10Jbond: pws: When building the keyring, read keys from local directory (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641159 (owner: 10Muehlenhoff) [13:50:38] (03PS1) 10Jbond: mariadb: create empty section mappings for cloud [puppet] - 10https://gerrit.wikimedia.org/r/641167 (https://phabricator.wikimedia.org/T267006) [13:51:22] (03PS1) 10Marostegui: db-eqiad.php: Pool pc1010 instead of pc1007. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/641168 (https://phabricator.wikimedia.org/T266483) [13:55:02] (03CR) 10Jbond: [C: 03+2] mariadb: create empty section mappings for cloud [puppet] - 10https://gerrit.wikimedia.org/r/641167 (https://phabricator.wikimedia.org/T267006) (owner: 10Jbond) [13:58:36] (03PS1) 10Jbond: profile::query_service: add default for username [puppet] - 10https://gerrit.wikimedia.org/r/641169 (https://phabricator.wikimedia.org/T267006) [14:00:08] (03CR) 10Jbond: [C: 03+2] profile::query_service: add default for username [puppet] - 10https://gerrit.wikimedia.org/r/641169 (https://phabricator.wikimedia.org/T267006) (owner: 10Jbond) [14:02:21] (03PS2) 10Marostegui: db-eqiad.php: Pool pc1010 instead of pc1007. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641168 (https://phabricator.wikimedia.org/T266483) [14:03:00] (03CR) 10Kormat: [C: 03+1] db-eqiad.php: Pool pc1010 instead of pc1007. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641168 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [14:04:04] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Pool pc1010 instead of pc1007. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641168 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [14:04:51] (03Merged) 10jenkins-bot: db-eqiad.php: Pool pc1010 instead of pc1007. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641168 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [14:05:05] (03PS1) 10Jbond: profile::query_service: add federation_user_agent to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/641171 (https://phabricator.wikimedia.org/T267006) [14:05:49] (03CR) 10Jbond: [C: 03+2] profile::query_service: add federation_user_agent to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/641171 (https://phabricator.wikimedia.org/T267006) (owner: 10Jbond) [14:06:02] (03PS1) 10Marostegui: Revert "db-eqiad.php: Pool pc1010 instead of pc1007." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/641041 [14:06:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1007 and place pc1010 instead of it T266483 (duration: 01m 00s) [14:06:40] !log Restart pc1007's mysql T266483 [14:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:45] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [14:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:48] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Pool pc1010 instead of pc1007." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641041 (owner: 10Marostegui) [14:10:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Pool pc1010 instead of pc1007." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641041 (owner: 10Marostegui) [14:12:02] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for fkaelin - https://phabricator.wikimedia.org/T267817 (10Ottomata) [14:12:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1007 in pc1 after restarting mysql T266483 (duration: 00m 59s) [14:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:15] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [14:12:34] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for fkaelin - https://phabricator.wikimedia.org/T267817 (10Ottomata) Fabian will also need analyitcs-privatedata-users. I edited the task description to say so. Approved! 
[14:13:42] (03PS1) 10Jbond: deployment-prep: add local commit c503964991f3b12523ea03c7bdea521619ca300c [puppet] - 10https://gerrit.wikimedia.org/r/641172 [14:14:10] (03CR) 10Jbond: [C: 03+2] deployment-prep: add local commit c503964991f3b12523ea03c7bdea521619ca300c [puppet] - 10https://gerrit.wikimedia.org/r/641172 (owner: 10Jbond) [14:15:07] 10Operations, 10ORES, 10Machine Learning Platform (Current): ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (10Ladsgroup) The whole ores.wmflabs.org should be downscaled and cleaned up. At its current state, it's completely broken and has lots of maintenance overhead for nex... [14:16:44] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [14:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:06] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for fkaelin - https://phabricator.wikimedia.org/T267817 (10fkaelin) Thanks. I did create a separate task for the analytics-privatedata-users group, which seemingly wasn't necessary. https://phabricator.wikimedia.org/T267816 [14:19:00] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for fkaelin - https://phabricator.wikimedia.org/T267817 (10Ottomata) [14:19:45] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for fkaelin - https://phabricator.wikimedia.org/T267817 (10Ottomata) AH! yeah one ticket is fine. I'll close that as a duplicate. 
[14:19:58] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for fkaelin - https://phabricator.wikimedia.org/T267817 (10Ottomata) [14:20:11] 10Operations, 10SRE-Access-Requests: Requesting access to researchers, analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10Ottomata) [14:20:25] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for fkaelin - https://phabricator.wikimedia.org/T267816 (10Ottomata) [14:25:12] 10Operations, 10SRE-Access-Requests: Requesting access to researchers, analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10fkaelin) Also, I noticed that there is a previous outdated entry for me in that yaml file. https://phabricator.wikimedia.org/source/operat... [14:30:56] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: The certificate for upload.beta.wmflabs.org expired on November 13, 2020. - https://phabricator.wikimedia.org/T267858 (10Vgutierrez) >>! In T267858#6622251, @Krenair wrote: > @Vgutierrez FYI in case this could happen in prod too, I haven't be... 
[14:45:14] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 245 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:46:52] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 46 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:53:46] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10Trizek-WMF) I've been blocked by a last minute change made on translation, which required me to manually change date formats in translat... [14:56:00] (03PS2) 10Muehlenhoff: pws: When building the keyring, read keys from local directory [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641159 [14:58:38] (03PS1) 10Jbond: hieradata: add data for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/641190 [14:59:06] (03CR) 10Jbond: [C: 03+2] hieradata: add data for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/641190 (owner: 10Jbond) [15:01:40] (03PS1) 10Filippo Giunchedi: profile: add Alertmanager API virtual host [puppet] - 10https://gerrit.wikimedia.org/r/641191 (https://phabricator.wikimedia.org/T266017) [15:01:42] (03PS1) 10Filippo Giunchedi: role: add Alertmanager API profile [puppet] - 10https://gerrit.wikimedia.org/r/641192 (https://phabricator.wikimedia.org/T266017) [15:02:15] (03CR) 10Alexandros Kosiaris: "Sorry for taking so long to reply on this, it feel through the cracks last quarter. 
I did leave some comments inline, but meanwhile the st" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:05:02] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26426" [puppet] - 10https://gerrit.wikimedia.org/r/641192 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [15:08:31] (03CR) 10Filippo Giunchedi: [V: 03+1] "See also: https://puppet-compiler.wmflabs.org/compiler1003/26426/alert1001.wikimedia.org/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/641192 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [15:10:35] (03CR) 10Jbond: [C: 03+2] prometheus::mysqld_exporter::instance: drop arguments paramteter [puppet] - 10https://gerrit.wikimedia.org/r/640210 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [15:13:19] (03PS7) 10Alexandros Kosiaris: Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:14:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641159 (owner: 10Muehlenhoff) [15:16:05] (03CR) 10jerkins-bot: [V: 04-1] Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:18:00] (03CR) 10RLazarus: [C: 03+1] jobrunner: add SERVERGROUP environment variable [puppet] - 10https://gerrit.wikimedia.org/r/640923 (https://phabricator.wikimedia.org/T266515) (owner: 10Giuseppe Lavagetto) [15:18:02] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] pws: When building the keyring, read keys from local directory [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641159 (owner: 10Muehlenhoff) [15:23:32] (03PS1) 10Jbond: P:lvs::realserver: only sintall 
python3-poolcounter on > jessie [puppet] - 10https://gerrit.wikimedia.org/r/641194 [15:24:30] (03CR) 10Volans: "LGTM, couple of typos and a possible improvement inline" (037 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 (owner: 10Ayounsi) [15:25:39] (03PS2) 10Jbond: P:lvs::realserver: only intall python3-poolcounter on > jessie [puppet] - 10https://gerrit.wikimedia.org/r/641194 [15:26:24] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26427/" [puppet] - 10https://gerrit.wikimedia.org/r/641194 (owner: 10Jbond) [15:28:51] (03PS3) 10Jbond: P:lvs::realserver: only sintall python3-poolcounter on > jessie [puppet] - 10https://gerrit.wikimedia.org/r/641194 [15:30:16] (03CR) 10jerkins-bot: [V: 04-1] P:lvs::realserver: only sintall python3-poolcounter on > jessie [puppet] - 10https://gerrit.wikimedia.org/r/641194 (owner: 10Jbond) [15:33:17] (03PS4) 10Jbond: P:lvs::realserver: only sintall python3-poolcounter on > jessie [puppet] - 10https://gerrit.wikimedia.org/r/641194 [15:35:55] (03PS8) 10Alexandros Kosiaris: Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:38:23] (03CR) 10jerkins-bot: [V: 04-1] Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:39:00] (03PS1) 10Elukey: cumin: change target for hadoop-worker-canary [puppet] - 10https://gerrit.wikimedia.org/r/641195 (https://phabricator.wikimedia.org/T267932) [15:39:39] (03CR) 10Elukey: [C: 03+2] cumin: change target for hadoop-worker-canary [puppet] - 10https://gerrit.wikimedia.org/r/641195 (https://phabricator.wikimedia.org/T267932) (owner: 10Elukey) [15:40:00] !log cdanis@cumin1001 START - Cookbook sre.network.cf [15:40:02] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [15:40:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26429" [puppet] - 10https://gerrit.wikimedia.org/r/641194 (owner: 10Jbond) [15:44:24] PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:50] this is our dear rack c7 --^ [15:46:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/641149 (https://phabricator.wikimedia.org/T267913) (owner: 10Hnowlan) [15:47:16] (03CR) 10Jbond: [V: 03+1] "This change doesn't fix the underlining issue it just stops puppet from complaining. As such i would recommend not merging this change un" [puppet] - 10https://gerrit.wikimedia.org/r/641194 (owner: 10Jbond) [15:49:11] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [15:50:39] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [15:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:47] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10herron) 05Open→03Resolved Since this has been awaiting input for several weeks, I'll temporarily transition it to closed d... 
[15:54:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans) [15:56:31] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [15:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:06] !log roll-restarting eqiad restbase for java security updates [15:57:10] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [15:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:26] 10Operations, 10SRE-Access-Requests: Requesting access to researchers, analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10herron) [15:58:08] 10Operations, 10SRE-Access-Requests: Requesting access to researchers, analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10herron) [15:58:24] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:08] (03CR) 10RLazarus: [C: 03+1] "Thanks! 
Optionally you could also add a comment inline saying something like" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640576 (https://phabricator.wikimedia.org/T267581) (owner: 10Krinkle) [16:00:55] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/641199 [16:01:19] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [16:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:50] !log joined maps2006 to maps codfw cassandra cluster [16:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:15] (03PS1) 10Effie Mouzeli: hiera: enable icu 63 component on appserver canaries [puppet] - 10https://gerrit.wikimedia.org/r/641200 [16:05:30] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [16:06:18] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) [16:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:40] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:06] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [16:07:41] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2037.codfw.wmnet [16:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:56] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet [16:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:16] RECOVERY - 
Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:08:35] !log disable puppet in appservers canaries to install ICU 63 - T264991 [16:08:36] (03CR) 10RLazarus: [C: 03+1] hiera: enable icu 63 component on appserver canaries [puppet] - 10https://gerrit.wikimedia.org/r/641200 (owner: 10Effie Mouzeli) [16:08:38] (03PS1) 10RLazarus: hiera: Enable icu63 component on api canaries [puppet] - 10https://gerrit.wikimedia.org/r/641203 [16:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:41] T264991: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 [16:09:06] (03PS2) 10RLazarus: hiera: Enable icu63 component on api canaries [puppet] - 10https://gerrit.wikimedia.org/r/641203 (https://phabricator.wikimedia.org/T264991) [16:09:32] (03PS2) 10Effie Mouzeli: hiera: enable icu 63 component on appserver canaries [puppet] - 10https://gerrit.wikimedia.org/r/641200 (https://phabricator.wikimedia.org/T264991) [16:09:53] (03CR) 10Effie Mouzeli: [C: 03+1] hiera: Enable icu63 component on api canaries [puppet] - 10https://gerrit.wikimedia.org/r/641203 (https://phabricator.wikimedia.org/T264991) (owner: 10RLazarus) [16:10:33] (03CR) 10Muehlenhoff: "LGTM, but you can drop the host-specific setting for mwdebug2002 now (since it also uses the role)" [puppet] - 10https://gerrit.wikimedia.org/r/641200 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [16:10:44] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Patch-For-Review, 10Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (10MSantos) [16:11:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/641203 (https://phabricator.wikimedia.org/T264991) (owner: 10RLazarus) [16:11:57] 10Operations, 10ops-codfw, 10netops: 
ripe-atlast-codfw is down - https://phabricator.wikimedia.org/T267714 (10Papaul) power cycle device, checked cable, swapped cable device is still showing down [16:12:48] 10Operations, 10ops-codfw, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10ayounsi) [16:13:50] 10Operations, 10ops-codfw, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10ayounsi) a:03faidon I think Faidon is the person who knows the most about the Atlas :) Feel free to re-assign as needed. [16:14:00] 10Operations, 10Beta-Cluster-Infrastructure, 10DBA, 10serviceops, 10Patch-For-Review: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10ArielGlenn) Note that I ran a little dumps test on a non-latin1 wiki in deployment-prep (ruwiki to be precise) and the results look... [16:14:39] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [16:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:37] !log disable puppet on A:mw-api-canary T264991 [16:16:39] !log update c7 serial in row C VC config - T267865 [16:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:43] T264991: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 [16:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:50] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [16:17:26] (03CR) 10RLazarus: [C: 03+2] hiera: Enable icu63 component on api canaries [puppet] - 10https://gerrit.wikimedia.org/r/641203 (https://phabricator.wikimedia.org/T264991) (owner: 10RLazarus) [16:19:17] (03PS3) 10Effie Mouzeli: hiera: enable icu 63 component on appserver canaries [puppet] - 10https://gerrit.wikimedia.org/r/641200 (https://phabricator.wikimedia.org/T264991) [16:20:01] (03PS4) 10Effie Mouzeli: hiera: enable icu 63 component on appserver canaries [puppet] - 
10https://gerrit.wikimedia.org/r/641200 (https://phabricator.wikimedia.org/T264991) [16:22:33] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: enable icu 63 component on appserver canaries [puppet] - 10https://gerrit.wikimedia.org/r/641200 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [16:26:58] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (10Papaul) [16:27:36] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:29:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-Ryasmeen: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10Vgutierrez) 05Resolved→03Open Re-opening as mentioned in https://phabricator.wikimedia.org/T267006#6624466 deployment-cache-upload06 has been om... [16:35:15] 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ayounsi) Spare switch configured. Old (failed): https://netbox.wikimedia.org/dcim/devices/1892/ New (spare): https://netbox.wikimedia.org/dcim/devices/235/ That's where T259166 would be use... 
[16:38:38] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 419 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:38:56] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 419 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:10] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [16:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:02] RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:43:34] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:43:52] RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:44:28] (03PS6) 10Dave Pifke: webperf: change navtiming to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639197 (https://phabricator.wikimedia.org/T267269) [16:45:46] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:46:23] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [16:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:10] RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms [16:47:14] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args 
kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [16:47:35] PROBLEM - Kafka Broker Server #page on kafka-main2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [16:47:46] oh hey that downtime expired [16:47:51] ⏰ [16:47:51] * volans here [16:47:52] PROBLEM - Check systemd state on kafka-main2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:05] cdanis: I think the host came back up? [16:48:23] both of those things happened in that order, yeah [16:48:29] ^ [16:48:32] eh [16:48:37] (03PS1) 10David Caro: last-puppet-run: don't crash if puppet has not run yet [puppet] - 10https://gerrit.wikimedia.org/r/641207 [16:48:37] so no user impact? [16:48:38] the host has been flapping, and paging when it comes up, IIUC [16:48:50] yeah, I had downtimed host+all services until like 15:something UTC [16:48:52] I think [16:49:03] I thought I did it this morning ,sorry people :( [16:49:30] RECOVERY - swift codfw object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [16:50:04] RECOVERY - swift codfw container availability low on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [16:50:57] elukey: I'm pretty sure kafka-main2003 was downtimed earlier in the day, I was checking hosts where debdeploy failed [16:51:12] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10RLazarus) We've started upgrading the canary appservers to ICU 63, so the window of category sorting disruption has officially started. [16:51:12] ah okok [16:53:44] I am acking the issue in Splunk® On-call™ [16:54:35] so forgive me but I haven't been following closely -- what's the plan & timeline for fixing/swapping the switch itself? [16:55:25] I believe it is happening / has happened as we speak, from -dcops [17:00:05] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10herron) a:05jijiki→03None [17:02:46] 10Operations, 10ORES, 10Machine Learning Platform (Current): ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (10calbon) @Ladsgroup Yeah go ahead. [17:04:57] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-Ryasmeen: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez manually ran `apt upgrade` and puppet afterwards.. everything seems ok on deployment-cache-upload06 [17:07:49] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (10Papaul) [17:09:03] 10Operations, 10User-DannyS712: Access to #mediawiki_security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10sbassett) 05Open→03Resolved a:03sbassett T233235 is done, so I think that takes care of everything? Resolving for now. 
[17:09:36] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T267672 (10ayounsi) 05Open→03Resolved a:03ayounsi This is not showing errors anymore. [17:14:29] 10Operations, 10User-DannyS712: Access to #mediawiki_security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10sbassett) 05Resolved→03Open [17:15:06] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10herron) [17:16:20] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10herron) Looping in @Nuria for review and approval of `analytics-privatedata-users` access [17:16:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:18:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:18:35] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10wiki_willy) a:03RobH [17:21:56] !log repooling cp2037 and cp2038 - T267865 [17:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:03] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [17:24:20] !log switching back from lvs2010 to lvs2007 - T267865 [17:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:08] RECOVERY - PyBal backends health check on lvs2007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:25:22] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 71, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:25:32] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 52, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:25:42] RECOVERY - pybal on lvs2007 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:29:28] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-Ryasmeen: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10hashar) thank you! [17:32:56] PROBLEM - Long running screen/tmux on an-launcher1002 is CRITICAL: CRIT: Long running SCREEN process. (user: mforns PID: 3897, 1734595s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [17:33:19] oops [17:34:13] killed the screen, sorry for the alarm [17:36:14] !log moved interfaces in Netbox from old to new switch - T267865 [17:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:21] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [17:38:12] 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10Volans) I've run this piece of code to migrate the interfaces from the old to the new device in a Netbox `nbshell`. ` import uuid request_id = uuid.uuid4() user = User.objects.get(username='... [17:40:50] 10Operations, 10ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (10ayounsi) [17:41:40] 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks, I think we're all done here. RMA in T267950. [17:44:52] XioNoX: o/ - can I start kafka on kafka-main2003? (was masked earlier on to avoid purged failures) [17:45:18] 10Operations, 10ops-eqsin, 10DC-Ops: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (10RobH) I've opened up a Juniper case via their case management tool on https://casemanager.juniper.net 2020-1116-0428 They should email me with details and how to setup the tracking/shipment. [17:47:13] elukey: dunno, I'm no kafka expert :) [17:47:22] XioNoX: no no I mean if it is all stable etc.. [17:47:26] sounds so from the task [17:47:57] elukey: are things ever stable? 
[17:48:10] jk, yeah, all good :) [17:48:45] !log enable and run puppet on kafka-main2003 (it will start kafka services) - T267865 [17:48:49] RECOVERY - Kafka Broker Server #page on kafka-main2003 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [17:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:51] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [17:49:06] RECOVERY - Check systemd state on kafka-main2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:14] ah snap it paged [17:49:18] * elukey hides in shame [17:49:20] heh [17:49:20] PROBLEM - Kafka Broker Replica Max Lag on kafka-main2003 is CRITICAL: 4.409e+07 ge 5e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [17:49:31] yes this is expected --^ [17:49:39] it will take a while before recovering [17:50:08] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2003 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [17:51:32] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [17:53:12] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [17:54:03] (03CR) 10Dzahn: [C: 03+2] "tested from @phab1001:~# mysql -u phstats -h m3-slave.eqiad.wmnet phabricator_project ...query works and is fast" [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: 10Aklapper) [17:54:56] (03PS3) 10Dzahn: phabricator weekly changes email: List stalled task stalled for years [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: 10Aklapper) [17:55:12] (03CR) 10Dzahn: [C: 04-1] "wait a sec, unexpected change to .gitreview.." [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: 10Aklapper) [17:58:38] (03PS1) 10RLazarus: hiera: Enable icu63 component on api servers [puppet] - 10https://gerrit.wikimedia.org/r/641213 (https://phabricator.wikimedia.org/T264991) [17:59:18] (03PS4) 10Dzahn: phabricator weekly changes email: List stalled task stalled for years [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: 10Aklapper) [18:01:05] (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: List stalled task stalled for years [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: 10Aklapper) [18:01:07] 10Operations, 10ops-eqiad: an-worker1113 not in librenms and doesn't show up on juno's interface description - https://phabricator.wikimedia.org/T267827 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson I am not sure how it was missed but port 19 is an-worker1114 and 18 is an-worker1113. I updated the switc... 
[18:02:37] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T267872 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson [18:03:23] (03PS1) 10Effie Mouzeli: hiera: enable icu 63 component on appservers [puppet] - 10https://gerrit.wikimedia.org/r/641214 (https://phabricator.wikimedia.org/T264991) [18:03:40] (03CR) 10Dzahn: [C: 03+2] gerrit: fix SonarQube report url discovery [puppet] - 10https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: 10Hashar) [18:04:09] (03CR) 10Effie Mouzeli: [C: 03+1] hiera: Enable icu63 component on api servers [puppet] - 10https://gerrit.wikimedia.org/r/641213 (https://phabricator.wikimedia.org/T264991) (owner: 10RLazarus) [18:04:12] (03CR) 10RLazarus: [C: 03+1] hiera: enable icu 63 component on appservers [puppet] - 10https://gerrit.wikimedia.org/r/641214 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [18:05:28] woo hoo! [18:05:32] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10Cmjohnson) @fgiunchedi The server is out of warranty, I have some decom'd HP servers and most likely can steal a bbu from one of them. I also have dec... [18:05:53] (03CR) 10Paladox: [C: 03+1] gerrit: remove Velocity log configuration [puppet] - 10https://gerrit.wikimedia.org/r/640066 (owner: 10Hashar) [18:05:55] !log disable puppet on all appservers [18:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:29] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10wiki_willy) @Jclark-ctr - can you double-check the S/N for db1139. We're getting the following Netbox error: mismatched serials: MXQ91300JF (netbox) !... 
[18:07:15] (03PS2) 10Dzahn: gerrit: remove Velocity log configuration [puppet] - 10https://gerrit.wikimedia.org/r/640066 (owner: 10Hashar) [18:08:07] (03CR) 10Dzahn: [C: 03+2] gerrit: remove Velocity log configuration [puppet] - 10https://gerrit.wikimedia.org/r/640066 (owner: 10Hashar) [18:08:45] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10RLazarus) [18:13:01] 10Operations, 10ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (10wiki_willy) Related Netbox error: https://netbox.wikimedia.org/extras/reports/coherence.Rack/ [18:13:50] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Cmjohnson) @wiki_willy I do not know what the Q number would be, all of the HP servers start with MXQ and confirmed MXQ91300JF is correct. [18:14:24] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10Ottomata) (@Herron I am the new approver since Nuria doesn't work at WMF anymore.) Approved! 
[18:15:15] !log disable puppet on 'A:mw-api and not A:mw-api-canary' T264991 [18:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:22] T264991: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 [18:16:08] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: enable icu 63 component on appservers [puppet] - 10https://gerrit.wikimedia.org/r/641214 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [18:16:10] (03CR) 10RLazarus: [C: 03+2] hiera: Enable icu63 component on api servers [puppet] - 10https://gerrit.wikimedia.org/r/641213 (https://phabricator.wikimedia.org/T264991) (owner: 10RLazarus) [18:18:24] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-deploy100[1-4] - https://phabricator.wikimedia.org/T267955 (10RobH) [18:19:11] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-deploy100[1-4] - https://phabricator.wikimedia.org/T267955 (10RobH) [18:22:26] RECOVERY - Kafka Broker Replica Max Lag on kafka-main2003 is OK: (C)5e+05 ge (W)1e+05 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [18:24:04] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (10Papaul) [18:24:35] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-depoly200[1-4] - https://phabricator.wikimedia.org/T267957 (10RobH) [18:27:54] 10Operations, 10Growth-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) [18:28:55] (03PS6) 10Dzahn: planet: replace update cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138) 
[18:36:54] PROBLEM - DPKG on mw2284 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:37:24] PROBLEM - DPKG on mw2313 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:37:33] rzl: ^ I see this but sort under race condition during package upgrade [18:38:00] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (10Andrew) I'm going to reuse an old puppetmaster as cloudcephmon2003 (T258103) -- does that server also need to be re-racked or can we just rename it in place? [18:38:11] those are both appservers, cc effie [18:38:32] or no sorry, 2284 is mine [18:38:51] that's the same host that just errored out during the apt install so I just started looking into it, thanks for the pointer [18:38:52] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01013 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:39:12] (wouldn't be surprised if that error is what caused the alert though) [18:39:36] the DPKG Icinga check translates to: dpkg -l | grep -v ^ii [18:39:50] so alerts about anything that isn't ii [18:40:15] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Andrew) [18:40:19] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Andrew) please note hostname change -- these should be cloudcephmon2001-dev and cloudcephmon2002-dev [18:40:23] on mw2284 it is currently about "iU" for php7.2-fpm and "rc" for apt-listchanges [18:41:03] the U in iU should mean "Unpacked" but not installed [18:41:19] hm okay [18:41:33] I initially thought that we might overwhelmed the repos a bit [18:41:33] the desired state 
is that it's actually installed [18:41:57] heh, it sure is [18:42:02] I'm going to try rerunning the install on mw2284 and see if that fails again [18:42:24] yea, i mean "Desired" as in output of dpkg as well :) [18:42:42] yeah I know, just laughing :) appreciate the help [18:43:01] aha [18:43:21] Configuration file '/etc/php/7.2/fpm/pool.d/www.conf' ==> Modified (by you or by a script) since installation. ==> Package distributor has shipped an updated version. [18:43:29] let's see what the "widespread puppet failures" thing has [18:43:37] unrelatedly [18:44:02] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [18:44:52] looking at icinga UI to get the host list but it's not there [18:45:41] rzl: should have a "show diff" at that point, right [18:46:03] yeah I just couldn't reach through cumin to get to it [18:46:22] ah, right [18:46:40] pastebinning it, sec [18:48:06] https://phabricator.wikimedia.org/P13266 [18:48:43] regarding "widespread puppet fail": The Icinga check used to list some hosts but now it changed. Then i manually went to puppetboard UI instead (which the alert did not link to) and it's 17 hosts and i see a lot of scb and 2 mw hosts, mw2313 and mw2255 but that's it. not more appservers [18:48:49] hm, what I *want* is the diff between the file on disk and the file installed by the *old* package version [18:51:22] * apergos peeks in [18:52:39] so... if you say "take the new version from the new package" and then puppet runs it will change that file again but for a short time it will be that default config using. and if we keep existing version and let puppet run then there should be no diff? 
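A minimal sketch of the check described above (`dpkg -l | grep -v ^ii`), assuming standard `dpkg -l` output whose header ends at the `+++-` separator line. The function name `non_ii_packages`, the sample text, and the version strings in it are hypothetical and only for illustration; the real Icinga check may be implemented differently.

```python
from typing import List, Tuple


def non_ii_packages(dpkg_output: str) -> List[Tuple[str, str]]:
    """Return (state, package) pairs for every package whose state is not 'ii'."""
    results = []
    in_body = False
    for line in dpkg_output.splitlines():
        if not in_body:
            # Skip the header; package rows start after the '+++-====' separator.
            if line.startswith("+++-"):
                in_body = True
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        state, name = fields[0], fields[1]
        # 'ii' = desired state "install", current status "installed".
        if state != "ii":
            results.append((state, name))
    return results


# Illustrative sample only (assumption): versions and descriptions are made up.
SAMPLE = """\
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name             Version      Architecture Description
+++-================-============-============-===========
ii  bash             5.0-4        amd64        GNU Bourne Again SHell
iU  php7.2-fpm       7.2.x        amd64        server-side scripting (FPM)
rc  apt-listchanges  3.19         all          package change history tool
"""

if __name__ == "__main__":
    for state, name in non_ii_packages(SAMPLE):
        print(state, name)
```

In the two-letter state, the first letter is the desired state and the second is the current status. So "iU" means install is desired but the package is only unpacked, and "rc" means the package was removed but its config files remain.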
[18:52:40] 10Operations, 10LDAP-Access-Requests: Request Superset Access - https://phabricator.wikimedia.org/T267961 (10KEchavarriqueen) [18:53:48] how about: (depool), tell it to keep old version, manually run puppet [18:55:09] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Papaul) Please in the future those changes need to be done before i have already applied the label on all the hosts now i have to go back and make those changes again [18:55:13] looking at those hosts with puppet failures [18:55:27] rzl: we can add Dpkg::Options::="--force-confold" [18:55:37] hmmm... this is the diff between mw2215 (not updated yet) and 2284 [18:55:39] https://phabricator.wikimedia.org/P13267 [18:56:11] if the request_slowlog_timeout diff rings a bell for anybody [18:56:24] ah ok max children I think it is calculated based on cpu threads [18:56:25] effie: yeah, I'm not sure if doing that across the board is the right call necessarily though, that might be too big a hammer [18:56:32] yeah for max_children that makes sense [18:56:38] I could have grabbed another host of the same model, I was just lazy [18:56:51] !log running puppet on mw2313 and mw2255 which were listed in puppetboard as failed puppet runs [18:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:07] !log mw2255 - is pooled and puppet works on next run, after it removed php 7.2 config files [18:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:40] running puppet on that one host made it remove a bunch of php 7.2 config files and tideways extension [18:59:53] and after that the puppet run is happy again. 
and the host was and is pooled [19:00:04] on the other one mw2313 something else is going on [19:00:52] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Andrew) >>! In T267378#6625238, @Papaul wrote: > Please in the future those changes need to be done before i have already applied the label on all the hosts now i have... [19:01:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [19:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:30] RECOVERY - DPKG on mw2284 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:07:49] 10Operations, 10LDAP-Access-Requests: Request for LDAP Access in order to access Superset - https://phabricator.wikimedia.org/T267962 (10IJethroBT-WMF) [19:13:31] (03PS1) 10Ebernhardson: airflow: Set webserver port in shared config file [puppet] - 10https://gerrit.wikimedia.org/r/641221 [19:14:01] (03PS2) 10Ebernhardson: airflow: Set webserver port in shared config file [puppet] - 10https://gerrit.wikimedia.org/r/641221 [19:14:06] (03PS1) 10Ladsgroup: icinga: Drop ores.wmflabs.org monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641222 (https://phabricator.wikimedia.org/T242819) [19:16:56] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10herron) [19:17:06] PROBLEM - DNS on analytics1042.mgmt is CRITICAL: Domain analytics1042.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:21:24] 10Operations, 10LDAP-Access-Requests: Request for LDAP Access in order to access Superset - https://phabricator.wikimedia.org/T267962 (10IJethroBT-WMF) [19:23:47] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel 
- https://phabricator.wikimedia.org/T267314 (10herron) Hi @KFrancis could you please verify that @Swagoel has a valid NDA on file? Thanks in advance! [19:24:02] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26431" [puppet] - 10https://gerrit.wikimedia.org/r/641221 (owner: 10Ebernhardson) [19:24:55] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] "PCC looks exactly as expected: https://puppet-compiler.wmflabs.org/compiler1001/26431/" [puppet] - 10https://gerrit.wikimedia.org/r/641221 (owner: 10Ebernhardson) [19:24:56] 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10herron) [19:28:03] 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10herron) Hi @DNdubane_WMF, could you please coordinate obtaining a comment from your manager approving this request? Also, looping in @Ottomata for analytics group revie... [19:28:32] 10Operations, 10LDAP-Access-Requests: Request for LDAP Access in order to access Superset for IJethroBT-WMF - https://phabricator.wikimedia.org/T267962 (10Peachey88) [19:28:54] 10Operations, 10LDAP-Access-Requests: Request Superset Access for KEchavarriqueen - https://phabricator.wikimedia.org/T267961 (10Peachey88) [19:32:44] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10herron) 05Open→03Resolved I'll transition this to closed for the time being due to inactivity. When ready to proceed please add a comment of manager a... 
[19:34:10] (03PS2) 10Ssingh: Initial commit of the knead-wikidough test suite [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/639838 (https://phabricator.wikimedia.org/T267424) [19:39:27] (03PS1) 10Dzahn: releases::mediawiki: only restart jenkins if it is enabled [puppet] - 10https://gerrit.wikimedia.org/r/641228 [19:42:10] 10Operations, 10LDAP-Access-Requests: LDAP access for Tobias Schumann - https://phabricator.wikimedia.org/T267917 (10herron) Hi @KFrancis, could you please confirm that we have an NDA on file for Tobias? Thanks in advance! [19:45:21] (03PS2) 10Dzahn: releases::mediawiki: do not attempt to restart masked service [puppet] - 10https://gerrit.wikimedia.org/r/641228 [19:46:07] (03CR) 10Herron: [C: 03+2] "LGTM, thanks for writing the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/641149 (https://phabricator.wikimedia.org/T267913) (owner: 10Hnowlan) [19:47:17] (03CR) 10Muehlenhoff: releases::mediawiki: do not attempt to restart masked service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641228 (owner: 10Dzahn) [19:48:57] !log disable puppet on parsoid servers - T264991 [19:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:04] T264991: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 [19:49:45] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add gmodena to wmf LDAP group - https://phabricator.wikimedia.org/T267913 (10herron) 05Open→03Resolved Hi @hnowlan `gmodena` has been added to LDAP group `wmf`, and the above patch has been merged. Thanks for that! 
[19:50:56] (03PS3) 10Dzahn: releases::mediawiki: do not attempt to restart masked service [puppet] - 10https://gerrit.wikimedia.org/r/641228 [19:51:06] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@4a953ca]: query_clicks_hourly: handle wmf.webrequest page_id change from int to bigint [19:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:12] (03PS1) 10Andrew Bogott: OpenStack: add initial config files for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641230 (https://phabricator.wikimedia.org/T261134) [19:51:14] (03PS1) 10Andrew Bogott: OpenStack: add server packages for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641231 (https://phabricator.wikimedia.org/T261134) [19:51:16] (03PS1) 10Andrew Bogott: OpenStack Designate: updates for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641232 (https://phabricator.wikimedia.org/T261134) [19:51:18] (03PS1) 10Andrew Bogott: OpenStack: add client packages for Stein [puppet] - 10https://gerrit.wikimedia.org/r/641233 (https://phabricator.wikimedia.org/T261134) [19:51:40] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: add initial config files for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641230 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [19:52:20] (03CR) 10jerkins-bot: [V: 04-1] releases::mediawiki: do not attempt to restart masked service [puppet] - 10https://gerrit.wikimedia.org/r/641228 (owner: 10Dzahn) [19:52:24] (03PS4) 10Dzahn: releases::mediawiki: do not attempt to restart masked service [puppet] - 10https://gerrit.wikimedia.org/r/641228 [19:53:09] (03PS1) 10Effie Mouzeli: hiera: enable icu 63 component on appservers [puppet] - 10https://gerrit.wikimedia.org/r/641234 (https://phabricator.wikimedia.org/T264991) [19:53:33] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@4a953ca]: query_clicks_hourly: handle wmf.webrequest page_id change from int to bigint 
(duration: 02m 27s) [19:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/641228 (owner: 10Dzahn) [19:55:44] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: The certificate for upload.beta.wmflabs.org expired on November 13, 2020. - https://phabricator.wikimedia.org/T267858 (10Krenair) 05Open→03Resolved [19:56:07] (03CR) 10Dzahn: "thanks! typo with "mask" vs "masked" but fixing it" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641228 (owner: 10Dzahn) [19:56:56] (03CR) 10RLazarus: [C: 03+1] hiera: enable icu 63 component on appservers [puppet] - 10https://gerrit.wikimedia.org/r/641234 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [19:56:58] (03PS5) 10Dzahn: releases::mediawiki: do not attempt to restart masked service [puppet] - 10https://gerrit.wikimedia.org/r/641228 [19:57:11] (03PS2) 10Effie Mouzeli: hiera: enable icu 63 component on parsoid [puppet] - 10https://gerrit.wikimedia.org/r/641234 (https://phabricator.wikimedia.org/T264991) [19:57:18] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [19:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:49] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: enable icu 63 component on parsoid [puppet] - 10https://gerrit.wikimedia.org/r/641234 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [19:58:43] (03PS2) 10Andrew Bogott: OpenStack: add initial config files for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641230 (https://phabricator.wikimedia.org/T261134) [19:58:45] (03PS2) 10Andrew Bogott: OpenStack: add server packages for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641231 (https://phabricator.wikimedia.org/T261134) [19:58:47] (03PS2) 10Andrew Bogott: OpenStack: add client packages for Stein [puppet] - 10https://gerrit.wikimedia.org/r/641233 
(https://phabricator.wikimedia.org/T261134) [19:58:49] (03PS2) 10Andrew Bogott: OpenStack Designate: updates for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641232 (https://phabricator.wikimedia.org/T261134) [19:59:05] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26435/" [puppet] - 10https://gerrit.wikimedia.org/r/641228 (owner: 10Dzahn) [19:59:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Woohoo! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/641222 (https://phabricator.wikimedia.org/T242819) (owner: 10Ladsgroup) [20:00:49] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: add initial config files for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641230 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [20:01:16] (03CR) 10Dzahn: "This isn't all though, there is more like the check_command used by this and the config for it." [puppet] - 10https://gerrit.wikimedia.org/r/641222 (https://phabricator.wikimedia.org/T242819) (owner: 10Ladsgroup) [20:03:17] (03PS3) 10Andrew Bogott: OpenStack: add initial config files for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641230 (https://phabricator.wikimedia.org/T261134) [20:03:19] (03PS3) 10Andrew Bogott: OpenStack: add server packages for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641231 (https://phabricator.wikimedia.org/T261134) [20:03:21] (03PS3) 10Andrew Bogott: OpenStack: add client packages for Stein [puppet] - 10https://gerrit.wikimedia.org/r/641233 (https://phabricator.wikimedia.org/T261134) [20:03:23] (03PS3) 10Andrew Bogott: OpenStack Designate: updates for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641232 (https://phabricator.wikimedia.org/T261134) [20:05:44] (03PS1) 10Dzahn: nagios_common: delete check_ores_workers command and config [puppet] - 10https://gerrit.wikimedia.org/r/641238 (https://phabricator.wikimedia.org/T242819) [20:06:46] !log releases2002 systemctl reset-failed should clear 
Icinga systemd alert after gerrit:641228 [20:06:47] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:12] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:08] (03PS1) 10Bstorm: Cloud NFS: remove the load alerts [puppet] - 10https://gerrit.wikimedia.org/r/641239 [20:09:11] (03CR) 10Dzahn: "This did not absent the resources in puppet so it will not be removed from Icinga config." [puppet] - 10https://gerrit.wikimedia.org/r/641222 (https://phabricator.wikimedia.org/T242819) (owner: 10Ladsgroup) [20:09:37] 10Operations, 10LDAP-Access-Requests: Add STran to `wmf` LDAF group - https://phabricator.wikimedia.org/T267968 (10STran) [20:10:06] 10Operations, 10LDAP-Access-Requests: Add STran to `wmf` LDAF group - https://phabricator.wikimedia.org/T267968 (10Tchanders) Vouching for @STran! [20:10:24] (03CR) 10Dzahn: "But only if you never want ORES monitoring in Icinga again." [puppet] - 10https://gerrit.wikimedia.org/r/641238 (https://phabricator.wikimedia.org/T242819) (owner: 10Dzahn) [20:10:59] (03CR) 10Ladsgroup: nagios_common: delete check_ores_workers command and config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641238 (https://phabricator.wikimedia.org/T242819) (owner: 10Dzahn) [20:12:23] (03CR) 10Bstorm: "I don't think these alerts are worth filling email with. Change my mind?" 
[puppet] - 10https://gerrit.wikimedia.org/r/641239 (owner: 10Bstorm) [20:13:12] (03Abandoned) 10Dzahn: nagios_common: delete check_ores_workers command and config [puppet] - 10https://gerrit.wikimedia.org/r/641238 (https://phabricator.wikimedia.org/T242819) (owner: 10Dzahn) [20:13:34] (03CR) 10Bstorm: Cloud NFS: remove the load alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641239 (owner: 10Bstorm) [20:16:29] (03PS1) 10Herron: admin: add Fabian Kaelin 'fab' account, and group memberships [puppet] - 10https://gerrit.wikimedia.org/r/641241 (https://phabricator.wikimedia.org/T267817) [20:24:58] 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (10Ottomata) Approved. Please also make sure DNdubane is in the `wmf` LDAP group. [20:26:06] 10Operations, 10ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T267748 (10Papaul) a:05Papaul→03fgiunchedi SSD was 480GB so i replaced it with a 600GB SSD since i have no 480GB on site. 
[20:28:07] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [20:28:18] RECOVERY - HP RAID on ms-be2031 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:28:39] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10jeena) [20:31:16] (03PS1) 10Dzahn: peek: don't change permissions within a git repo [puppet] - 10https://gerrit.wikimedia.org/r/641245 [20:32:18] (03CR) 10Andrew Bogott: [C: 03+1] "I definitely ignore these warnings when they come around" [puppet] - 10https://gerrit.wikimedia.org/r/641239 (owner: 10Bstorm) [20:32:22] (03CR) 10Dzahn: "Icinga alert was resolved on releases2002" [puppet] - 10https://gerrit.wikimedia.org/r/641228 (owner: 10Dzahn) [20:34:52] 10Operations, 10ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (10Papaul) [20:36:18] 10Operations, 10LDAP-Access-Requests: Add STran to `wmf` LDAP group - https://phabricator.wikimedia.org/T267968 (10STran) [20:36:36] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26436/" [puppet] - 10https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [20:37:22] 10Operations, 10LDAP-Access-Requests: Request Superset Access (LDAP group 'wmf') for KEchavarriqueen - https://phabricator.wikimedia.org/T267961 (10herron) [20:39:01] (03PS1) 10Herron: admin: add ldap-only entry for kassiameq [puppet] - 10https://gerrit.wikimedia.org/r/641248 (https://phabricator.wikimedia.org/T267961) [20:39:58] RECOVERY - DPKG on mw2313 is OK: All 
packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:40:06] !log planet1002/planet2002 - delete entire crontab of user planet, drop update cronjobs after switching to systemd timers with gerrit:636105 (T265138) [20:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:14] T265138: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 [20:41:36] PROBLEM - Check systemd state on planet2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:43] (03PS1) 10RLazarus: hiera: Enable icu63 component on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/641249 (https://phabricator.wikimedia.org/T264991) [20:41:54] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:32] ACKNOWLEDGEMENT - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn WIP - timers just created https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:32] ACKNOWLEDGEMENT - Check systemd state on planet2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn WIP - timers just created https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:49] (03PS1) 10HMonroy: Enable watchlist expiry feature on Wikidata & Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641250 (https://phabricator.wikimedia.org/T266874) [20:52:24] (03CR) 10Herron: [C: 03+1] profile: add Alertmanager API virtual host [puppet] - 10https://gerrit.wikimedia.org/r/641191 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [20:52:54] (03CR) 10Nskaggs: "Brooke provided some excellent history in an email thread a few months ago. I'll some snippets here for posterity. I'm delighted to see th" [puppet] - 10https://gerrit.wikimedia.org/r/641239 (owner: 10Bstorm) [20:53:29] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/641192 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [20:53:38] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26438" [puppet] - 10https://gerrit.wikimedia.org/r/641249 (https://phabricator.wikimedia.org/T264991) (owner: 10RLazarus) [20:54:10] (03PS1) 10Ebernhardson: cpjobqueue: Increase cirrusSearchCheckerJob concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/641251 (https://phabricator.wikimedia.org/T266762) [20:56:13] (03CR) 10Cwhite: [C: 03+2] smart: add metric to track number of devices detected [puppet] - 10https://gerrit.wikimedia.org/r/640473 (https://phabricator.wikimedia.org/T267135) (owner: 10Cwhite) [20:58:26] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:22] (03CR) 10Dzahn: "found out today this probably never worked. 
because:" [puppet] - 10https://gerrit.wikimedia.org/r/602319 (owner: 10Amire80) [21:06:24] (03PS1) 10Dzahn: planet: fix updates of UK planet, replace non-ASCII chars [puppet] - 10https://gerrit.wikimedia.org/r/641254 [21:06:26] (03CR) 10RLazarus: [V: 03+1 C: 03+2] hiera: Enable icu63 component on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/641249 (https://phabricator.wikimedia.org/T264991) (owner: 10RLazarus) [21:06:31] mutante: sorry about that, can you remove them? the web interface gives me "not authorized" error [21:07:19] search for ores.wmflabs.org and remove everything if possible [21:07:22] !log disable puppet on jobrunners T264991 [21:07:24] (03PS1) 10Effie Mouzeli: hiera: enable icu 63 component on snapshot [puppet] - 10https://gerrit.wikimedia.org/r/641255 (https://phabricator.wikimedia.org/T264991) [21:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:29] T264991: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 [21:09:30] (03CR) 10RLazarus: [C: 03+1] hiera: enable icu 63 component on snapshot [puppet] - 10https://gerrit.wikimedia.org/r/641255 (https://phabricator.wikimedia.org/T264991) (owner: 10Effie Mouzeli) [21:09:37] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10KFrancis) @herron confirming @Swagoel has a valid NDA on file. Thanks! [21:12:06] (03PS7) 10Gilles: webperf: change navtiming to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639197 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [21:12:49] (03CR) 10Gilles: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/639197 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [21:13:17] Amir1: you mean manually editing the generated icinga config? 
that would be multiple hosts and huge files and puppet would recreate them since they are still in puppetdb [21:13:41] it gets generated from exported resources [21:14:45] mutante: oh, that's complicated :( How can we remove those then? [21:19:04] RECOVERY - Device not healthy -SMART- on ms-be2031 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2031&var-datasource=codfw+prometheus/ops [21:19:21] (CR) Gilles: [C: +1] webperf: change navtiming to use Python 3 [puppet] - https://gerrit.wikimedia.org/r/639197 (https://phabricator.wikimedia.org/T267269) (owner: Dave Pifke) [21:20:22] (CR) ArielGlenn: [C: +1] "Great. Let's do this thing." [puppet] - https://gerrit.wikimedia.org/r/641255 (https://phabricator.wikimedia.org/T264991) (owner: Effie Mouzeli) [21:21:55] (PS1) Jbond: ca.labs.codfw1dev.pem: add new codfwdev labs ca file [puppet] - https://gerrit.wikimedia.org/r/641258 [21:24:12] (CR) Jbond: [C: +2] ca.labs.codfw1dev.pem: add new codfwdev labs ca file [puppet] - https://gerrit.wikimedia.org/r/641258 (owner: Jbond) [21:24:50] (CR) Effie Mouzeli: [C: +2] hiera: enable icu 63 component on snapshot [puppet] - https://gerrit.wikimedia.org/r/641255 (https://phabricator.wikimedia.org/T264991) (owner: Effie Mouzeli) [21:25:28] Amir1: set them to "absent" before removing the code ..
or manually delete from puppetdb somehow [21:26:16] mutante: okay, let me try it [21:27:57] (CR) Dzahn: nagios_common: delete check_ores_workers command and config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/641238 (https://phabricator.wikimedia.org/T242819) (owner: Dzahn) [21:30:01] !log peek2001 - mv /var/lib/peek/git to git.old ; run puppet ; let it fix git checkout [21:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:34] RECOVERY - Long running screen/tmux on an-launcher1002 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [21:36:02] (CR) Dmaza: [C: +1] Enable watchlist expiry feature on Wikidata & Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/641250 (https://phabricator.wikimedia.org/T266874) (owner: HMonroy) [21:38:02] (PS1) Ladsgroup: Revert "icinga: Drop ores.wmflabs.org monitoring" [puppet] - https://gerrit.wikimedia.org/r/641263 [21:38:44] (CR) jerkins-bot: [V: -1] Revert "icinga: Drop ores.wmflabs.org monitoring" [puppet] - https://gerrit.wikimedia.org/r/641263 (owner: Ladsgroup) [21:38:49] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet,cluster=jobrunner [21:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:26] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 239 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:40:43] !log rzl@cumin1001 conftool action : set/weight=1; selector: name=mw2250.codfw.wmnet,cluster=videoscaler,service=canary [21:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:58] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet [21:41:02] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:44] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 8 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:48:34] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [21:48:57] (CR) Dzahn: [C: +2] planet: fix updates of UK planet, replace non-ASCII chars [puppet] - https://gerrit.wikimedia.org/r/641254 (owner: Dzahn) [21:50:35] (CR) Krinkle: peek: don't change permissions within a git repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/641245 (owner: Dzahn) [21:50:55] (CR) Dzahn: "This is more of a reminder that we should do something with this and clean up the "git.old" dir vs the "git" dir... and then remove this c" [puppet] - https://gerrit.wikimedia.org/r/641245 (owner: Dzahn) [21:51:14] (PS2) Ladsgroup: Revert "icinga: Drop ores.wmflabs.org monitoring" [puppet] - https://gerrit.wikimedia.org/r/641263 [21:53:25] (CR) Dzahn: peek: don't change permissions within a git repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/641245 (owner: Dzahn) [21:55:24] (CR) Ladsgroup: "Hey, Does this look good?" [puppet] - https://gerrit.wikimedia.org/r/641263 (owner: Ladsgroup) [22:01:03] Operations, Domains, Traffic, Patch-For-Review: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (CRoslof) Open→Resolved a:CRoslof I have updated the nameservers to the ones requested.
[22:02:07] (PS1) RLazarus: hiera: Enable icu63 component on maintenance hosts [puppet] - https://gerrit.wikimedia.org/r/641265 (https://phabricator.wikimedia.org/T264991) [22:02:19] (CR) Dzahn: [C: +2] "yea, that should be fine. thanks! I'll merge and watch it on alert1001." [puppet] - https://gerrit.wikimedia.org/r/641263 (owner: Ladsgroup) [22:05:46] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:01] (PS1) Ottomata: eventgate-* - Bump eventgate-wikimedia version to 2020-11-16-212345-production [deployment-charts] - https://gerrit.wikimedia.org/r/641266 (https://phabricator.wikimedia.org/T240460) [22:06:18] !log planet - fixed updates of uk.planet which failed due to non-ASCII chars in a URL - since updates are systemd timers now that affects the entire systemd state monitoring [22:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:37] (PS2) Ottomata: eventgate-* - Bump eventgate-wikimedia version to 2020-11-16-212345-production [deployment-charts] - https://gerrit.wikimedia.org/r/641266 (https://phabricator.wikimedia.org/T240460) [22:07:48] (CR) Effie Mouzeli: [C: +1] hiera: Enable icu63 component on maintenance hosts [puppet] - https://gerrit.wikimedia.org/r/641265 (https://phabricator.wikimedia.org/T264991) (owner: RLazarus) [22:08:02] (CR) RLazarus: [C: +2] hiera: Enable icu63 component on maintenance hosts [puppet] - https://gerrit.wikimedia.org/r/641265 (https://phabricator.wikimedia.org/T264991) (owner: RLazarus) [22:08:04] RECOVERY - Check systemd state on planet2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:44] (CR) Ottomata: [C: +2] eventgate-* - Bump eventgate-wikimedia version to 2020-11-16-212345-production
[deployment-charts] - https://gerrit.wikimedia.org/r/641266 (https://phabricator.wikimedia.org/T240460) (owner: Ottomata) [22:08:46] (CR) Ottomata: [V: +2 C: +2] eventgate-* - Bump eventgate-wikimedia version to 2020-11-16-212345-production [deployment-charts] - https://gerrit.wikimedia.org/r/641266 (https://phabricator.wikimedia.org/T240460) (owner: Ottomata) [22:09:50] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [22:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:56] Amir1: it's not being removed yet ... unexpectedly.. hrmm [22:13:28] normally this is a matter of running puppet on the host and then on the icinga host. this is a virtual host though [22:15:00] ah, I see why.. [22:15:25] or not:) [22:17:20] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [22:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:11] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [22:19:11] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [22:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:05] Amir1: I guess if the resources are already gone and then you readd them but with "absent" then it never gets to a step of "removing a resource" [22:22:34] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[22:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:07] Amir1: I'll try it by creating and then removing properly [22:26:27] (CR) Ryan Kemper: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26444/" [puppet] - https://gerrit.wikimedia.org/r/640274 (https://phabricator.wikimedia.org/T259539) (owner: Ebernhardson) [22:27:20] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:48] (PS1) Dzahn: ores: temp. reenabled icinga monitoring of labs nodes [puppet] - https://gerrit.wikimedia.org/r/641269 [22:29:24] (PS2) Dzahn: ores: temp. re-enable icinga monitoring of labs nodes [puppet] - https://gerrit.wikimedia.org/r/641269 [22:31:35] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/26445/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/641269 (owner: Dzahn) [22:36:00] (PS1) Dzahn: ores: absent icinga monitoring of labs nodes [puppet] - https://gerrit.wikimedia.org/r/641270 [22:36:35] (CR) Dzahn: [C: +2] ores: absent icinga monitoring of labs nodes [puppet] - https://gerrit.wikimedia.org/r/641270 (owner: Dzahn) [22:42:02] Operations, LDAP-Access-Requests: Request for LDAP Access in order to access Superset for IJethroBT-WMF - https://phabricator.wikimedia.org/T267962 (Dzahn) a:Dzahn→None Hi, unassigning this from me personally, but that doesn't mean it won't be done, it just means we have a rotating system who ha... [22:51:13] Operations, LDAP-Access-Requests: Request for LDAP Access in order to access Superset for IJethroBT-WMF - https://phabricator.wikimedia.org/T267962 (Dzahn) But I can already confirm the wikitech user / shell user part. Don't worry about it, we have the needed information and it adds up: ` uid: ijethrob...
[23:00:26] (CR) Dzahn: "This revealed that the updates for the "uk" language version had been failing for a while. (good!)" [puppet] - https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [23:01:21] (CR) Dzahn: "After this the content for this feed should finally show up in Planet for real." [puppet] - https://gerrit.wikimedia.org/r/641254 (owner: Dzahn) [23:02:47] oh man... yes, adding a monitoring resource and then removing it again does properly remove it from icinga. but since that happened on icinga1001 and alert2001 but not alert1001... the checks are STILL in Icinga web UI that should be gone... [23:05:51] (PS2) Dzahn: peek: don't change permissions within a git repo [puppet] - https://gerrit.wikimedia.org/r/641245 [23:12:31] (PS3) Dzahn: cumin: remove stretch support and move python_version to Hiera [puppet] - https://gerrit.wikimedia.org/r/636101 [23:12:53] (CR) Dzahn: cumin: remove stretch support and move python_version to Hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636101 (owner: Dzahn) [23:16:49] (PS4) Dzahn: cumin: replace check-aliases-cron with a systemd timer [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) [23:18:42] (PS1) Dzahn: cumin: remove code for absented check aliases cron job [puppet] - https://gerrit.wikimedia.org/r/641274 [23:20:15] (PS1) Dzahn: Revert "ores: absent icinga monitoring of labs nodes" [puppet] - https://gerrit.wikimedia.org/r/641287 [23:21:38] (CR) Dzahn: [C: +2] Revert "ores: absent icinga monitoring of labs nodes" [puppet] - https://gerrit.wikimedia.org/r/641287 (owner: Dzahn) [23:22:54] (CR) Dzahn: [C: +2] cumin: replace check-aliases-cron with a systemd timer [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [23:26:50] (CR) Dzahn: "on both cumin masters, cron tab entry was removed
by puppet, confirmed. also "sudo systemctl status cumin-check-aliases" show the new time" [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [23:28:44] !log cumin1001 - sudo systemctl start cumin-check-aliases (to confirm switching cron to timer worked) T265138 [23:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:52] T265138: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 [23:31:21] Operations, Beta-Cluster-Infrastructure, DBA, serviceops, Patch-For-Review: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki) We upgraded to ICU 63: appservers, api, parsoid, jobrunners, mwmaint, and snapshot. What is left is deploy*. We are runnin... [23:32:25] Operations, Beta-Cluster-Infrastructure, DBA, serviceops, Patch-For-Review: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki) [23:37:01] Operations, serviceops, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (RLazarus) All appservers are now running ICU 63, and the collation update script is running. Earlier today should have been the moment o...
[23:37:06] (CR) Dzahn: "Do you still want to receive email or would it be sufficient if we notice failures from Icinga systemd alert and then seeing in status of " [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [23:38:26] (PS1) Dzahn: Revert "Revert "ores: absent icinga monitoring of labs nodes"" [puppet] - https://gerrit.wikimedia.org/r/641288 [23:41:18] (CR) Dzahn: [C: +2] Revert "Revert "ores: absent icinga monitoring of labs nodes"" [puppet] - https://gerrit.wikimedia.org/r/641288 (owner: Dzahn) [23:42:16] (CR) Dzahn: "[cumin1001:~] $ sudo systemctl status cumin-check-aliases" [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [23:47:05] (CR) Dzahn: "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/641254 it should now work though!" [puppet] - https://gerrit.wikimedia.org/r/602319 (owner: Amire80) [23:51:57] (CR) Dzahn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/632351 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn) [23:56:10] Operations, ORES, Machine Learning Platform (Current), Patch-For-Review: ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/641263 https://gerrit.wikimedia.org/r/c/operations/puppet/+/641269 https://gerrit.... [23:57:27] (CR) Dzahn: "it's finally gone from Icinga web UI now.. but only after all this..
when going "present -> absent" and making sure puppet ran on all 4 ho" [puppet] - https://gerrit.wikimedia.org/r/641222 (https://phabricator.wikimedia.org/T242819) (owner: Ladsgroup) [23:58:41] (CR) Andrew Bogott: "I have applied this change by hand to cloudvirt2003-dev; so far it doesn't seem to actually help (although I'm testing with Buster rather " [puppet] - https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: Ahmon Dancy) [23:59:15] Operations, ORES, Machine Learning Platform (Current), Patch-For-Review: ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (Dzahn) Open→Resolved All alerts for ores.wmflabs.org have been removed from Icinga.
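Note on the ores.wmflabs.org cleanup discussed between 21:13 and 23:57: the behaviour reported in the channel can be restated as a toy model. The sketch below is plain Python, not Puppet, PuppetDB, or Icinga code, and it only encodes what was observed or guessed in the log: re-adding an already-deleted resource as "absent" removed nothing, going "present -> absent" did remove the check, and only on hosts where puppet actually ran. `MonitoringHost`, `puppet_run`, `config`, and `managed` are hypothetical names made up for illustration; the host names and the check name are the ones mentioned in the log.

```python
# Toy model of the behaviour described in the log. This is NOT how Puppet or
# Icinga are implemented; it only mirrors the observations made in the channel.

class MonitoringHost:
    """Hypothetical stand-in for one Icinga/alert host."""

    def __init__(self, name, stale_checks=()):
        self.name = name
        self.config = set(stale_checks)  # checks visible in the generated config
        self.managed = set()             # checks this host has applied as "present"

    def puppet_run(self, declared):
        """Apply declared checks (name -> 'present' or 'absent') on this host only."""
        for check, ensure in declared.items():
            if ensure == "present":
                self.config.add(check)
                self.managed.add(check)
            elif ensure == "absent" and check in self.managed:
                # Assumption taken from the 22:22:05 guess in the log: "absent"
                # only removes something that was previously applied as "present".
                self.config.discard(check)
                self.managed.discard(check)


CHECK = "ores.wmflabs.org"
hosts = [MonitoringHost(n, [CHECK]) for n in ("icinga1001", "alert2001", "alert1001")]

# Step 1: the code was already deleted, then re-added as "absent" everywhere.
for host in hosts:
    host.puppet_run({CHECK: "absent"})
after_absent_only = [h.name for h in hosts if CHECK in h.config]

# Step 2: "present" then "absent", but puppet ran on only two of the hosts.
for host in hosts[:2]:
    host.puppet_run({CHECK: "present"})
    host.puppet_run({CHECK: "absent"})
after_partial_runs = [h.name for h in hosts if CHECK in h.config]

# Step 3: the same "present -> absent" cycle with puppet run on every host.
for host in hosts:
    host.puppet_run({CHECK: "present"})
    host.puppet_run({CHECK: "absent"})
after_full_runs = [h.name for h in hosts if CHECK in h.config]

print(after_absent_only)
print(after_partial_runs)
print(after_full_runs)
```

The three printed lists correspond to the three stages in the conversation: the stale check surviving everywhere, surviving only on the host that was skipped, and finally being gone.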