[00:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201103T0000) [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:03:51] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:04:44] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:01] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [00:06:26] (CR) Dzahn: "Found out there is still puppet 4.8 on mostly toolforge: https://phabricator.wikimedia.org/P13129" [puppet] - https://gerrit.wikimedia.org/r/633853 (owner: Dzahn) [00:07:42] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v...
[00:14:23] (PS1) Bartosz Dziewoński: Enable DiscussionTools as a beta feature on almost all Wikipedias ("phase 3") [mediawiki-config] - https://gerrit.wikimedia.org/r/638201 (https://phabricator.wikimedia.org/T266303) [00:19:39] (PS1) RobH: clouddb1020 mac address update [puppet] - https://gerrit.wikimedia.org/r/638205 (https://phabricator.wikimedia.org/T260441) [00:20:33] (CR) RobH: [C: +2] clouddb1020 mac address update [puppet] - https://gerrit.wikimedia.org/r/638205 (https://phabricator.wikimedia.org/T260441) (owner: RobH) [00:23:17] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clouddb1020.... [00:23:20] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['clouddb1020.eqiad.wmnet'] ` Of which those **FAILED**: `... [00:24:44] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] ` Of which those **FAILED**: ` ['es1032.eqiad.wmnet'] ` [00:26:24] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clouddb1020....
[00:27:03] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH) This gets to the debian loader, and halts on 'Probing EDD' which had no issues on the other hosts. I'm still investigating on what is different... [00:27:28] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:27:44] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [00:29:58] (PS1) Ebernhardson: Turn off airflow scheduler during db downtime [puppet] - https://gerrit.wikimedia.org/r/638207 [00:31:02] (CR) Dzahn: [C: -1] "and toolforge appears to be using systemd timers" [puppet] - https://gerrit.wikimedia.org/r/633853 (owner: Dzahn) [00:31:17] (PS1) Legoktm: Prevent webservice from doing anything if buildpacks are being used [software/tools-webservice] - https://gerrit.wikimedia.org/r/638208 (https://phabricator.wikimedia.org/T266901) [00:33:55] (CR) Ebernhardson: "PCC looks reasonable: https://puppet-compiler.wmflabs.org/compiler1001/26267/" [puppet] - https://gerrit.wikimedia.org/r/638207 (owner: Ebernhardson) [00:35:21] (CR) Ottomata: [C: +1] "I didn't check if there are any uses of this but if you did then +1 :)" [puppet] - https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) (owner: Razzi) [00:40:19] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [00:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:23] (CR) Dzahn: [C: -1] "The query works when executed from the "phabricator_maniphest" database but not when in the "phabricator_project" database. 
That results i" [puppet] - https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: Aklapper) [00:42:22] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:36] (CR) Dzahn: [C: -1] "P.S. I will be off for about 2 weeks, if you have a follow-up please also try to get someone else to review/merge it." [puppet] - https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: Aklapper) [00:44:59] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v... [00:45:01] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] ` Of which those **FAILED**: ` ['es1032.eqiad.wmnet'] ` [00:45:18] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v...
[00:45:33] (CR) Dzahn: [C: -1] "meanwhile, here is raw data for you to start with this month:" [puppet] - https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: Aklapper) [00:47:47] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:48:35] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) >>! In T262869#6565461, @Nemo_bis wrote: >> We'll prepare at least a lightweight incident report in the coming days. > > Did this happen? I couldn't find i... [00:48:49] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:48:57] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:22] Operations, Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Organizers Mailing List - https://phabricator.wikimedia.org/T267083 (Reedy) [00:49:25] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['clouddb1020.eqiad.wmnet'] ` and were **ALL** successful.
[00:51:53] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (RobH) [00:52:05] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (RobH) Open→Resolved all hosts installed, calling into puppet, staged in netbox. [00:53:04] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH) [00:59:17] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [00:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:18] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:46] (CR) Dzahn: cumin: replace check-aliases-cron with a systemd timer (2 comments) [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn) [01:11:39] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] ` and were **ALL** successful.
[01:20:40] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [01:22:51] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [01:24:03] (PS3) Dzahn: cumin: replace check-aliases-cron with a systemd timer [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) [01:26:16] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH) Open→Resolved >>! In T260370#6598690, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['es1032.eqiad.wmnet'] > ` > >... [01:34:11] (PS1) Dzahn: site: introduce mwdebug1003 as debug server on buster [puppet] - https://gerrit.wikimedia.org/r/638218 (https://phabricator.wikimedia.org/T245757) [01:40:04] (CR) Jeena Huneidi: [C: +1] "Changes LGTM, although the lint error I see in CI is not what I expected (it looks correct running locally)" [deployment-charts] - https://gerrit.wikimedia.org/r/638025 (owner: JMeybohm) [01:49:48] (CR) Dzahn: "low prio but it could technically be done any time and I wanted to also just leave the comments here already from learnings trying it with" [puppet] - https://gerrit.wikimedia.org/r/638218 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn) [01:55:51] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:56:14] (CR) Dzahn: [V: +1 C: +1] "I'm all for trying it, as I said on the previous attempt, we don't have much to lose here. 
But I'm currently not here to actually deploy t" [puppet] - https://gerrit.wikimedia.org/r/637852 (https://phabricator.wikimedia.org/T261031) (owner: Ladsgroup) [01:56:31] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f7e22c024e0: Failed to establish a new connection: [Errno 111] Connection [01:56:31] ://wikitech.wikimedia.org/wiki/Search%23Administration [01:57:37] Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (Dzahn) [01:58:14] Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (Dzahn) One rack done, the other rack we will continue from week of Nov 16.
[02:08:41] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.16 [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/638236 [02:10:01] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.16 [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/638236 (https://phabricator.wikimedia.org/T263182) (owner: TrainBranchBot) [02:12:39] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:13:19] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: active_shards: 916, unassigned_shards: 0, number_of_nodes: 6, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 483, task_max_waiting_in_queue_millis: 0, timed_out: False, initializing_shards: 0, number_of_in_flight_fetch: 0, num [02:13:19] : 3, cluster_name: production-logstash-eqiad, status: green, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [02:13:24] !log restart ES on logstash1009 - oom killed [02:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:39] Operations, Parsoid, Parsoid-Tests, serviceops: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) I will be afk for about 2 weeks. If this needs earlier attention (I assume not, based on low prio etc) please contact the subteam.
[02:15:25] Operations, Parsoid, Parsoid-Tests, serviceops: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) p:Triage→Medium [02:25:49] Operations, Wikidata, Wikidata Query Builder, User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (Dzahn) See progress in T266702#6592812 From deployment and other internal servers you can already talk to query.wikidata.org on miscweb, it serves what... [02:28:56] Operations, SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (Dzahn) [02:33:14] (CR) Cwhite: [C: +1] thanos: configure memcached size via hiera [puppet] - https://gerrit.wikimedia.org/r/638110 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi) [02:34:08] Operations, SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (Dzahn) - confirmed L3 signature 👍 - `[ldap-corp1001:~] $ /usr/bin/ldapsearch -x "mail=dcaro*@*" | grep -E 'employee|mail|manager'` 👍 (confirms full time employee and who is manage... [02:36:32] Operations, SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (Dzahn) [02:38:57] (CR) Cwhite: [C: +1] thanos: use systemd overrides for query/store/compact [puppet] - https://gerrit.wikimedia.org/r/638036 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi) [02:40:26] Operations, SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (Dzahn) @RobH See above, I did these things to verify the user but on vacation from tomorow. Since it's a global root access and I see you are clinic duty for week of Nov 2, could...
[02:50:30] (PS2) Dzahn: cumin: remove stretch support and move python_version to Hiera [puppet] - https://gerrit.wikimedia.org/r/636101 [02:51:07] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [03:04:02] Operations, Wikimedia-Etherpad, Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (Dzahn) Has anyone had this issue today, since it was another Monday? I hope not, given that effectively what was a 10x in allowed points/sec. I will be off for about 2 weeks, please... [03:20:35] Operations, Wikimedia-Etherpad, Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (dpifke) Etherpad worked well before/during our team meeting today, which wasn't the case last week. Thanks for the fix! [04:15:25] Operations, DBA, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Krinkle) [04:17:50] Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (Krinkle) [04:19:11] Operations, ops-eqiad, DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (Krinkle) [04:31:31] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [04:35:09] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:35:25] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:35] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [04:37:01] PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:45] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f14edd704e0: Failed to establish a new connection: [Errno 111] Connection [04:37:45] ://wikitech.wikimedia.org/wiki/Search%23Administration [04:57:13] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:57:29] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:58:37] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [05:03:25] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: number_of_nodes: 6, initializing_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, cluster_name: production-logstash-codfw, unassigned_shards: 0, active_shards: 862, active_primary_shards: 456, delayed_unassigned_shards: 0, number_of_data_nodes: 3, timed_out: False, task_max_waiting_in_ [05:03:25] active_shards_percent_as_number: 100.0, status: green, 
number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:04:21] RECOVERY - Check systemd state on logstash2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:35:36] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui) Thanks Rob, es1032 looks good now: ` Name :Virtual Disk 0 RAID Level : Primary-1, Secondary-0, RAID Level Qualifier... [05:36:59] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Beeswaxcandle) >>! In T257066#6576540, @FordPrefect42 wrote: > Since a couple of days, saving any art... [05:37:16] Operations, ops-eqiad, DC-Ops: eqiad: Spare Drive Onsite for db1091 - https://phabricator.wikimedia.org/T266988 (Marostegui) Open→Resolved Thanks for checking, I will decommission this host. There is no point spending time on it if its replacement will arrive "soon". [05:37:19] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) [05:38:13] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) [05:45:39] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:47:32] Operations, DBA, Performance-Team, Platform Engineering, and 2 others: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (nnikkhoui) Thanks @ArielGlenn! I put a really simple/generic blurb in the attached patchset, please feel free to comment/amend however you th...
[05:48:31] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [05:51:57] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [05:55:19] (PS1) Marostegui: mariadb: Productionize es1026 [puppet] - https://gerrit.wikimedia.org/r/638309 (https://phabricator.wikimedia.org/T261717) [05:56:07] (PS2) Marostegui: mariadb: Productionize es1026 [puppet] - https://gerrit.wikimedia.org/r/638309 (https://phabricator.wikimedia.org/T261717) [05:57:15] (PS3) Marostegui: mariadb: Productionize es1026 [puppet] - https://gerrit.wikimedia.org/r/638309 (https://phabricator.wikimedia.org/T261717) [05:57:45] (CR) Marostegui: [C: +2] mariadb: Productionize es1026 [puppet] - https://gerrit.wikimedia.org/r/638309 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui) [06:00:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1015 to es2 master (this is a noop) T261717', diff saved to https://phabricator.wikimedia.org/P13131 and previous config saved to /var/cache/conftool/dbconfig/20201103-060038-marostegui.json [06:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:47] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:00:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1011 to reclone es1026 T261717', diff saved to https://phabricator.wikimedia.org/P13132 and previous config saved to /var/cache/conftool/dbconfig/20201103-060054-marostegui.json [06:00:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:08] !log Stop MySQL on es1011 to clone es1026 T261717 [06:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1018 to es1 master (this is a noop) T261717', diff saved to https://phabricator.wikimedia.org/P13133 and previous config saved to /var/cache/conftool/dbconfig/20201103-060705-marostegui.json [06:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:12] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:07:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1012 to reclone es1027 T261717', diff saved to https://phabricator.wikimedia.org/P13134 and previous config saved to /var/cache/conftool/dbconfig/20201103-060727-marostegui.json [06:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:45] (PS1) Marostegui: mariadb: Productionize es1027 [puppet] - https://gerrit.wikimedia.org/r/638312 (https://phabricator.wikimedia.org/T261717) [06:10:42] (CR) Marostegui: [C: +2] mariadb: Productionize es1027 [puppet] - https://gerrit.wikimedia.org/r/638312 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui) [06:11:52] !log Stop MySQL on es1012 to clone es1027 T261717 [06:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1019 to es3 master (this is a noop) T261717', diff saved to https://phabricator.wikimedia.org/P13135 and previous config saved to /var/cache/conftool/dbconfig/20201103-061403-marostegui.json [06:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:10] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:14:23] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depool es1014 to reclone es1028 T261717', diff saved to https://phabricator.wikimedia.org/P13136 and previous config saved to /var/cache/conftool/dbconfig/20201103-061423-marostegui.json [06:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:28] (PS1) Marostegui: mariadb: Productionize es1028 [puppet] - https://gerrit.wikimedia.org/r/638317 (https://phabricator.wikimedia.org/T261717) [06:16:40] !log Stop MySQL on es1014 to clone es1028 T261717 [06:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:21] (CR) Marostegui: [C: +2] mariadb: Productionize es1028 [puppet] - https://gerrit.wikimedia.org/r/638317 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui) [06:26:00] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Marostegui) Thanks. RAID (level and stripe size), memory and CPU looks good. [06:39:24] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [06:40:44] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [06:43:11] Operations, DBA, Performance-Team, Platform Engineering, and 2 others: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (ArielGlenn) >>! In T267077#6598886, @nnikkhoui wrote: > Thanks @ArielGlenn! I put a really simple/generic blurb in the attached patchset, ple...
[06:46:35] !log Deploy schema change on s1 codfw master: T265349 [06:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:42] T265349: querycache qc_type and qc_title have different nullabality on s1 only - https://phabricator.wikimedia.org/T265349 [06:52:48] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [06:56:41] (PS1) Marostegui: instances.yaml: Remove db1091 from dbctl [puppet] - https://gerrit.wikimedia.org/r/638343 (https://phabricator.wikimedia.org/T267088) [06:57:18] (CR) Marostegui: [C: +2] instances.yaml: Remove db1091 from dbctl [puppet] - https://gerrit.wikimedia.org/r/638343 (https://phabricator.wikimedia.org/T267088) (owner: Marostegui) [06:57:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1091 from dbctl T267088', diff saved to https://phabricator.wikimedia.org/P13137 and previous config saved to /var/cache/conftool/dbconfig/20201103-065756-marostegui.json [06:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:03] T267088: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 [06:58:11] (CR) Elukey: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: Razzi) [06:59:12] !log Remove db1091 from tendril and zarcillo T267088 [06:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:01] (CR) Elukey: [C: +1] oozie: Add admin groups for authorization (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: Razzi) [07:00:35] (CR) Marostegui: [C: +1] profile::analytics::database::meta: specify 
max_connections for mariadb [puppet] - https://gerrit.wikimedia.org/r/638148 (owner: Elukey) [07:01:40] marostegui: <3 [07:01:45] good morning [07:01:52] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [07:01:54] (CR) Elukey: [C: +2] profile::analytics::database::meta: specify max_connections for mariadb [puppet] - https://gerrit.wikimedia.org/r/638148 (owner: Elukey) [07:03:22] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.034 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [07:07:54] (PS1) Marostegui: mariadb: Remove puppet entries for db1091 [puppet] - https://gerrit.wikimedia.org/r/638352 (https://phabricator.wikimedia.org/T267088) [07:12:43] Operations, ops-eqiad, Analytics-Clusters, Patch-For-Review, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) Reporting a conversation with Chris over email about when to do the maintenance: > The 4th works for me at 1130EST If pos... [07:19:26] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [07:23:24] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [07:33:09] (CR) Elukey: [C: +2] Turn off airflow scheduler during db downtime [puppet] - https://gerrit.wikimedia.org/r/638207 (owner: Ebernhardson) [07:41:57] Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (Peachey88) For information required to report connectivity issues: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [07:51:35] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) [07:52:12] (CR) Kosta Harlan: "Thanks for the changes, Jeena." [deployment-charts] - https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [07:52:58] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui) [07:56:27] (PS1) Giuseppe Lavagetto: Remove old-style helmfile structure from CI [deployment-charts] - https://gerrit.wikimedia.org/r/638415 (https://phabricator.wikimedia.org/T258572) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201103T0800) [08:04:22] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:04:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove old-style helmfile structure from CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/638415 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [08:05:12] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:47] (03Merged) 10jenkins-bot: Remove old-style helmfile structure from CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/638415 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [08:06:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/638019 (https://phabricator.wikimedia.org/T266995) (owner: 10Gehel) [08:13:12] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: configure memcached size via hiera [puppet] - 10https://gerrit.wikimedia.org/r/638110 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [08:13:23] (03PS2) 10Filippo Giunchedi: thanos: configure memcached size via hiera [puppet] - 10https://gerrit.wikimedia.org/r/638110 (https://phabricator.wikimedia.org/T261281) [08:22:11] (03PS2) 10Filippo Giunchedi: thanos: use systemd overrides for query/store/compact [puppet] - 10https://gerrit.wikimedia.org/r/638036 (https://phabricator.wikimedia.org/T261281) [08:22:13] (03PS3) 10Filippo Giunchedi: prometheus: re-enable compaction by default [puppet] - 10https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281) [08:22:15] (03PS2) 10Filippo Giunchedi: thanos: add query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/638119 
(https://phabricator.wikimedia.org/T261281) [08:22:17] (03PS2) 10Filippo Giunchedi: prometheus: add thanos query-frontend jobs [puppet] - 10https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281) [08:22:19] (03PS2) 10Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) [08:22:21] (03PS2) 10Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - 10https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281) [08:24:37] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: use systemd overrides for query/store/compact [puppet] - 10https://gerrit.wikimedia.org/r/638036 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [08:27:52] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:28:28] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [08:28:40] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:03] I'm reenabling compactions in Prometheus on the fleet, there will be expected alerts about prometheus restarted [08:31:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: re-enable compaction by default [puppet] - 10https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [08:32:33] !log Prometheus re-enable compactions - T261281 [08:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:40] T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281 [08:35:53] (03PS5) 10Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - 
10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) [08:36:17] (03PS3) 10Kormat: orchestrator: Support running as non-root [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) [08:36:19] (03PS10) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) [08:37:45] godog: speaking of prom, how much history do we keep? [08:39:51] kormat: ATM ~5-6 months locally depending on the Prometheus instance, and "as space allows" in Thanos, but currently 60w [08:40:53] * kormat nods [08:41:09] with the plan being to reduce local storage if we don't really need to keep it, or the space is scarce et [08:41:12] etc [08:41:17] 👍 [08:41:31] does thanos currently have 60w of data? [08:41:56] not yet no, we started uploading data there around June I think [08:41:56] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [08:42:16] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [08:42:40] godog: gotcha. 
thanks :) [08:42:50] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10jcrespo) A reminder that T195578 is waiting for feedback to see if it would be useful to gather query performance statistics. [08:42:56] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [08:43:08] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) Thanks a lot for the insights! So few comments: * As far of current hadoop workers/rows distribution, we have: 13 in A, 14 in B, 19 in C, 1... [08:43:10] PROBLEM - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [08:43:26] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [08:43:28] kormat: np! [08:43:59] I'll silence these alerts [08:44:22] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10hashar) Works for me as well. 
[08:50:22] PROBLEM - Thanos query has high gRPC client errors on alert1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [08:51:36] also expected ^ [08:54:46] PROBLEM - Thanos sidecar cannot connect to Prometheus on alert1001 is CRITICAL: cluster=prometheus instance=prometheus1004 job=thanos-sidecar prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [08:56:02] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1036 site=eqiad tunnel=mc2036_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [08:56:23] 10Operations, 10observability: VictorOps ~5min delay from email received to incident paging - https://phabricator.wikimedia.org/T266800 (10fgiunchedi) >>! In T266800#6596466, @Volans wrote: > @fgiunchedi should we consider converting out transport from email to API calls at this point? Should give us an immedi... [08:58:08] RECOVERY - Thanos sidecar cannot connect to Prometheus on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [08:58:48] RECOVERY - Thanos query has high gRPC client errors on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [09:00:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:00:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:00:13] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P13138 and previous config saved to /var/cache/conftool/dbconfig/20201103-090013-kormat.json [09:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:14] (03CR) 10Marostegui: [C: 03+1] orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat) [09:02:28] (03CR) 10Marostegui: orchestrator: Support running as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [09:03:59] (03CR) 10Kormat: orchestrator: Support running as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [09:05:10] RECOVERY - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [09:05:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. 
T261389', diff saved to https://phabricator.wikimedia.org/P13139 and previous config saved to /var/cache/conftool/dbconfig/20201103-090523-kormat.json [09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:29] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:06:29] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:26] (03PS1) 10Ayounsi: Core routing for WMCS via cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/638425 (https://phabricator.wikimedia.org/T265288) [09:10:04] 10Operations, 10Traffic, 10netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (10Aklapper) 05Open→03Stalled [09:10:45] (03CR) 10Ayounsi: "185.15.56.0/24: BGP_aggregate_contributors # WMCS eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/638425 (https://phabricator.wikimedia.org/T265288) (owner: 10Ayounsi) [09:10:50] (03CR) 10Ayounsi: [C: 03+2] Core routing for WMCS via cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/638425 (https://phabricator.wikimedia.org/T265288) (owner: 10Ayounsi) [09:11:17] (03Merged) 10jenkins-bot: Core routing for WMCS via cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/638425 (https://phabricator.wikimedia.org/T265288) (owner: 10Ayounsi) [09:11:26] (03CR) 10Volans: [C: 04-1] "The buster logic for the component should be kept." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636101 (owner: 10Dzahn) [09:12:41] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:12:42] (03PS2) 10Volans: netbox: add dependency on python3-wmflib [puppet] - 10https://gerrit.wikimedia.org/r/636969 [09:12:42] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:36] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10jcrespo) >>! In T234854#6440649, @colewhite wrote: > @jcrespo Thanks for bringing this to our attention. The filters on that dashboard indicate they are broken because the fi... [09:14:30] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Aklapper) +#Wikimedia-Incident if this was considered an incident [09:17:00] (03CR) 10Marostegui: [C: 03+1] orchestrator: Support running as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [09:17:30] (03CR) 10Kormat: [C: 03+2] orchestrator: Support running as non-root [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [09:17:42] (03CR) 10Kormat: [C: 03+2] orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat) [09:17:42] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [09:18:20] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [09:18:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:18:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:48] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [09:19:06] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [09:19:16] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock related patch, trivial package addition." 
[puppet] - 10https://gerrit.wikimedia.org/r/636969 (owner: 10Volans) [09:21:32] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10PLMFXA) https://www.fxa.dk [09:26:23] (03CR) 10Volans: [C: 03+1] "LGTM, compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [09:35:01] (03PS1) 10Kormat: debian: Fix templates location in .deb [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638427 [09:40:43] (03CR) 10Marostegui: [C: 03+1] debian: Fix templates location in .deb [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638427 (owner: 10Kormat) [09:40:56] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Fix templates location in .deb [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638427 (owner: 10Kormat) [09:49:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:49:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:19] (03PS1) 10Kormat: debian: Fix release name. [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638432 [09:50:27] (03CR) 10Volans: [C: 03+2] dns: add retry logic to all Netbox API calls [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 (owner: 10Volans) [09:50:38] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Fix release name. 
[debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638432 (owner: 10Kormat) [09:54:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:54:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:10] (03PS1) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572) [09:56:12] (03PS1) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572) [09:57:09] !log uploaded orchestrator 3.2.3-2 to apt [09:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:56] (03PS1) 10Kormat: orchestrator: Drop the obsolete thirdparty/orchestrator component. 
[puppet] - 10https://gerrit.wikimedia.org/r/638443 (https://phabricator.wikimedia.org/T266763) [10:05:46] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.457e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:07:15] (03PS2) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572) [10:07:17] (03PS2) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572) [10:08:11] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:08:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:43] !log elukey@deploy1001 Started deploy [analytics/refinery@cf5db74] (hadoop-test): (no justification provided) [10:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:47] 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Kormat) [10:12:50] 10Operations, 10DBA, 10Orchestrator, 10Patch-For-Review: Repackage orchestrator - https://phabricator.wikimedia.org/T266763 (10Kormat) 05Open→03Resolved a:03Kormat Done and deployed. 
[10:13:04] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (10Kormat) [10:13:12] 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Kormat) 05Open→03Resolved a:03Kormat [10:13:28] !log elukey@deploy1001 Finished deploy [analytics/refinery@cf5db74] (hadoop-test): (no justification provided) (duration: 01m 45s) [10:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:16] !log elukey@deploy1001 Started deploy [analytics/refinery@cf5db74] (hadoop-test): (no justification provided) [10:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:35] (03PS2) 10Aklapper: phabricator weekly changes email: List stalled task stalled for years [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) [10:15:07] 10Operations, 10Puppet: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10jbond) Thanks @akosiaris sounds good to me, I'll try to keep a specific eye out for missing dependencies [10:15:28] (03CR) 10Aklapper: "Garr, one day I'll learn this, sorry! 
(PS: No need to post any query results; I have CLI access to run SQL queries on the DB)" [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: 10Aklapper) [10:16:03] (03PS3) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572) [10:16:05] (03PS3) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572) [10:16:31] !log elukey@deploy1001 Finished deploy [analytics/refinery@cf5db74] (hadoop-test): (no justification provided) (duration: 02m 15s) [10:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:15] (03PS4) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572) [10:19:17] (03PS4) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572) [10:21:38] !log gilles@deploy1001 Started deploy [performance/asoranking@2a2cb05]: T266985 [10:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:44] T266985: asoranking failed its monthly run stat1007 - https://phabricator.wikimedia.org/T266985 [10:21:49] (03PS4) 10Jbond: java: add new java version facts [puppet] - 10https://gerrit.wikimedia.org/r/636924 [10:22:03] !log gilles@deploy1001 Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 26s) [10:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:20] (03CR) 10jerkins-bot: [V: 04-1] java: add new java version facts [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [10:22:43] (03CR) 10Jbond: "> 
Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [10:23:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] "Needs an update to spec helper to fix tests, overriding" [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond) [10:29:21] (03PS1) 10Ladsgroup: ores: Reduce number of requests that trigger a restart of uwsgi worker [puppet] - 10https://gerrit.wikimedia.org/r/638467 (https://phabricator.wikimedia.org/T263910) [10:36:06] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [10:37:10] (03PS3) 10Aklapper: puppet-agent: remove --show_diff from scheduled puppet-run script [puppet] - 10https://gerrit.wikimedia.org/r/434719 (https://phabricator.wikimedia.org/T1) (owner: 10Herron) [10:37:47] 10Operations, 10DBA, 10Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (10Kormat) [10:38:21] 10Operations, 10DBA, 10Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (10Kormat) When this is fixed, we also need to clean up the existing db: ` root@db2093.codfw.wmnet[orchestrator]> select * from orchestrator_db_deployments; +------------------+-------... [10:39:04] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:39:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:39:22] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:56] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:37] (03PS5) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572) [10:43:39] (03PS5) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572) [10:45:12] !log gilles@deploy1001 Started deploy [performance/asoranking@2a2cb05]: T266985 [10:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:17] (03CR) 10JMeybohm: [C: 03+1] "Nice. LGTM!" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 (owner: 10Jeena Huneidi) [10:45:19] T266985: asoranking failed its monthly run stat1007 - https://phabricator.wikimedia.org/T266985 [10:45:19] !log gilles@deploy1001 Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 07s) [10:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:50] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:50:50] (03CR) 10Ema: [C: 04-1] "The change looks good to me, though it breaks ./modules/varnish/files/tests/text/09-analytics-cookies.vtc - please update the VTC test too" [puppet] - 10https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967) (owner: 10Ladsgroup) [10:57:30] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:57:50] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [10:58:20] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:46] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:59:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:59:51] (03PS2) 10Ladsgroup: varnish: Replace "Expires" in Set-Cookie with "Max-Age" [puppet] - 10https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967) [10:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/638443 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [11:01:17] (03CR) 10Kormat: [C: 03+2] orchestrator: Drop the obsolete thirdparty/orchestrator component. [puppet] - 10https://gerrit.wikimedia.org/r/638443 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat) [11:03:15] !log rolling restart of ores [11:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:58] (03CR) 10Arturo Borrero Gonzalez: set cpu_model_extra_flags = vmx,pcid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [11:04:35] (03CR) 10Ladsgroup: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967) (owner: 10Ladsgroup) [11:05:44] (03CR) 10JMeybohm: [C: 04-1] "This does not work in CI. 
Probably writes to an RO filesystem or so" [deployment-charts] - 10https://gerrit.wikimedia.org/r/638025 (owner: 10JMeybohm) [11:08:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [11:17:33] (03PS6) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572) [11:17:43] (03PS1) 10Jbond: facter: java_version fix nil check on java_version [puppet] - 10https://gerrit.wikimedia.org/r/638473 [11:18:16] (03CR) 10jerkins-bot: [V: 04-1] facter: java_version fix nil check on java_version [puppet] - 10https://gerrit.wikimedia.org/r/638473 (owner: 10Jbond) [11:18:18] (03PS1) 10Hnowlan: Revert "Isolate eqiad master maps1004 from cluster" [puppet] - 10https://gerrit.wikimedia.org/r/638454 [11:19:25] (03PS2) 10Hnowlan: Revert "Isolate eqiad master maps1004 from cluster" [puppet] - 10https://gerrit.wikimedia.org/r/638454 [11:19:28] (03PS9) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [11:19:53] (03PS2) 10Jbond: facter: java_version fix nil check on java_version [puppet] - 10https://gerrit.wikimedia.org/r/638473 [11:20:15] (03CR) 10Hnowlan: [C: 03+2] Revert "Isolate eqiad master maps1004 from cluster" [puppet] - 10https://gerrit.wikimedia.org/r/638454 (owner: 10Hnowlan) [11:20:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [11:21:19] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [11:23:28] 
(03CR) 10Jbond: [C: 03+2] facter: java_version fix nil check on java_version [puppet] - 10https://gerrit.wikimedia.org/r/638473 (owner: 10Jbond) [11:23:33] !log resyncing postgres replica maps1001 [11:23:36] (03PS3) 10Jbond: facter: java_version fix nil check on java_version [puppet] - 10https://gerrit.wikimedia.org/r/638473 [11:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:37] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [11:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] facter: java_version fix nil check on java_version [puppet] - 10https://gerrit.wikimedia.org/r/638473 (owner: 10Jbond) [11:23:47] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [11:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:29:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:30] (03PS10) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 [11:32:32] (03PS1) 10Jbond: Rakefile: exclude stdlibg and lvm from our spec tests [puppet] - 10https://gerrit.wikimedia.org/r/638476 [11:33:15] (03PS2) 10Jbond: Rakefile: exclude stdlib and lvm from our spec tests [puppet] - 10https://gerrit.wikimedia.org/r/638476 [11:34:28] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [11:35:54] (03CR) 10Muehlenhoff: [C: 03+1] "Makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/638476 (owner: 10Jbond) [11:38:25] 
(CR) Jbond: [C: +2] Rakefile: exclude stdlib and lvm from our spec tests [puppet] - https://gerrit.wikimedia.org/r/638476 (owner: Jbond) [11:41:26] (CR) Giuseppe Lavagetto: Add apache httpd base image (8 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto) [11:46:40] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.03488 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [11:51:05] !log gilles@deploy1001 Started deploy [performance/asoranking@2a2cb05]: T266985 [11:51:09] !log gilles@deploy1001 Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 03s) [11:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:11] T266985: asoranking failed its monthly run stat1007 - https://phabricator.wikimedia.org/T266985 [11:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:24] !log running "reprepro clearvanished" to prune thirdparty/orchestrator [11:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:24] !log imported php-apcu/php-geoip/php-imagick/php-mailparse to component/php72 for buster-wikimedia [11:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:33] test [12:00:20] (PS1) MSantos: replicate osm twice a day like codfw cluster [puppet] - https://gerrit.wikimedia.org/r/638481 [12:00:37] (CR) jerkins-bot: [V: -1] replicate osm twice a day like codfw cluster [puppet] - https://gerrit.wikimedia.org/r/638481 (owner: MSantos) [12:01:22] (CR) Hnowlan: "This seems okay but please hold off on merging until all replicas in eqiad have resynced (in progress atm)" [puppet] - https://gerrit.wikimedia.org/r/638481 (owner: MSantos) [12:15:54] (CR) JMeybohm: Add
apache httpd base image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto) [12:19:47] Operations, LDAP-Access-Requests: Add msantos to wmf LDAP group - https://phabricator.wikimedia.org/T267125 (hnowlan) [12:27:40] (PS2) MSantos: replicate osm twice a day like codfw cluster [puppet] - https://gerrit.wikimedia.org/r/638481 [12:27:59] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-herron: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (Aklapper) [12:28:08] (CR) MSantos: [C: -1] "Hold while re-syncing" [puppet] - https://gerrit.wikimedia.org/r/638481 (owner: MSantos) [12:40:48] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [12:40:49] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:23] Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (AlexisJazz) >>! In T267089#6599101, @Peachey88 wrote: > For information required to report connectivity issues: https://wikitech.wikimedia.org/wiki/Reporting_a_con...
[12:57:59] (PS1) Kormat: debian: Set version field in orchestrator binary [debs/orchestrator] - https://gerrit.wikimedia.org/r/638499 (https://phabricator.wikimedia.org/T267113) [13:02:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:02:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:23] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/orchestrator] - https://gerrit.wikimedia.org/r/638499 (https://phabricator.wikimedia.org/T267113) (owner: Kormat) [13:04:59] (CR) Kormat: [V: +2 C: +2] debian: Set version field in orchestrator binary [debs/orchestrator] - https://gerrit.wikimedia.org/r/638499 (https://phabricator.wikimedia.org/T267113) (owner: Kormat) [13:05:01] (PS1) Elukey: Change analytics-test-hive CNAME to test failover [dns] - https://gerrit.wikimedia.org/r/638502 (https://phabricator.wikimedia.org/T257412) [13:11:06] (CR) Elukey: [C: +2] Change analytics-test-hive CNAME to test failover [dns] - https://gerrit.wikimedia.org/r/638502 (https://phabricator.wikimedia.org/T257412) (owner: Elukey) [13:12:00] (PS3) Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) [13:12:02] (PS3) Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281) [13:14:02] Operations, DBA, Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (Kormat) Fix deployed, and db cleaned up: ` root@db2093.codfw.wmnet[orchestrator]> delete from orchestrator_db_deployments where deployed_version="" limit 1; Query OK, 1 row affected...
[13:14:04] Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (Aklapper) See the list on https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue : A `traceroute` for example, if this is still a problem. You wrote "T... [13:14:14] Operations, DBA, Orchestrator, Patch-For-Review, User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (Kormat) [13:14:17] Operations, DBA, Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (Kormat) Open→Resolved a:Kormat [13:18:36] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:18:44] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:18:44] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:20:14] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:20:14] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:20:14] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:22:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:22:03] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:13] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:20] (CR) Giuseppe Lavagetto: Add an httpd-fcgi image (9 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto) [13:23:28] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:23:30] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:23:42] (PS7) Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) [13:23:44] (PS5) Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324) [13:24:20] (PS1) Elukey: Revert "Change analytics-test-hive CNAME to test failover" [dns] - https://gerrit.wikimedia.org/r/638511 [13:24:22] (PS1) Jbond: logrotate: migrate spec tests to shared helper [puppet] - https://gerrit.wikimedia.org/r/638505 [13:24:53] !log lsobanski@cumin1001 START - Cookbook sre.hosts.decommission [13:24:53] (CR) Elukey: [C: +2] Revert "Change analytics-test-hive CNAME to test failover" [dns] - https://gerrit.wikimedia.org/r/638511 (owner: Elukey) [13:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:58] PROBLEM - DPKG on serpens is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:32:56] (CR) Marostegui: [C: +1] "Lukasz, this looks good to me."
[puppet] - https://gerrit.wikimedia.org/r/638352 (https://phabricator.wikimedia.org/T267088) (owner: Marostegui) [13:33:23] !log lsobanski@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:32] (CR) LSobanski: [C: +2] mariadb: Remove puppet entries for db1091 [puppet] - https://gerrit.wikimedia.org/r/638352 (https://phabricator.wikimedia.org/T267088) (owner: Marostegui) [13:34:14] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:34:16] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:34:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:34:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:06] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:19] Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (CDanis) >>! In T267089#6599101, @Peachey88 wrote: > For information required to report connectivity issues: https://wikitech.wikimedia.org/wiki/Reporting_a_connect... [13:39:40] Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (CDanis) Stalled→Resolved a:CDanis Thanks for the report @AlexisJazz.
There was a known issue that occurred at the esams edge site during the time wind... [13:43:47] !log Removing db1091 from tendril and zarcillo T267088 [13:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:54] T267088: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 [13:46:11] Operations, Wikimedia-Etherpad, Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (Dzahn) Open→Resolved [13:46:48] PROBLEM - DPKG on seaborgium is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:48:44] (CR) Kormat: [C: +2] mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: Kormat) [13:50:08] (PS1) Ayounsi: Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) [13:51:18] (PS1) Jbond: rsync: update spec tests to use shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638530 [13:51:20] (PS1) Jbond: aptrepo: remove old spec_helper file [puppet] - https://gerrit.wikimedia.org/r/638531 [13:51:22] (PS1) Jbond: base: clean up old spec helper and fix broken spec tests [puppet] - https://gerrit.wikimedia.org/r/638532 [13:53:42] (PS1) Jbond: apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534 [13:53:44] (PS2) Ayounsi: Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) [13:53:53] !log imported php-mongodb/php-wmerrors/wikidiff2 to component/php72 for buster-wikimedia [13:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:05] (PS3) Ayounsi: Don't alert on missing cable ID
for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) [13:55:14] (CR) jerkins-bot: [V: -1] apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534 (owner: Jbond) [13:56:36] (PS3) Filippo Giunchedi: thanos: add query-frontend [puppet] - https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281) [13:56:38] (PS3) Filippo Giunchedi: prometheus: add thanos query-frontend jobs [puppet] - https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281) [13:56:40] (PS4) Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) [13:56:42] (PS4) Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281) [13:57:21] (PS1) Jbond: mirrors: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/638536 [13:57:34] (CR) Jbond: [C: +2] logrotate: migrate spec tests to shared helper [puppet] - https://gerrit.wikimedia.org/r/638505 (owner: Jbond) [13:57:39] (PS2) Marostegui: orchestrator.conf: Add DetectDataCenterQuery to detect DC [puppet] - https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635) [13:57:41] (CR) Jbond: [C: +2] rsync: update spec tests to use shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638530 (owner: Jbond) [13:57:43] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:44] (CR) Jbond: [C: +2] aptrepo: remove old spec_helper file [puppet] - https://gerrit.wikimedia.org/r/638531 (owner: Jbond) [13:57:49] (CR) Jbond: [C: +2] base: clean up old spec helper and fix broken spec
tests [puppet] - https://gerrit.wikimedia.org/r/638532 (owner: Jbond) [13:58:03] RECOVERY - DPKG on serpens is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:59:09] (CR) Marostegui: [C: +2] orchestrator.conf: Add DetectDataCenterQuery to detect DC [puppet] - https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635) (owner: Marostegui) [13:59:41] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [14:01:36] (PS2) Jbond: apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534 [14:01:38] (PS1) Jbond: java: commentout facte spec tests untill we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 [14:01:53] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:02:13] (CR) jerkins-bot: [V: -1] java: commentout facte spec tests untill we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 (owner: Jbond) [14:03:18] (PS2) Jbond: java: comment out fact spec tests until we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 [14:03:49] (CR) jerkins-bot: [V: -1] java: comment out fact spec tests until we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 (owner: Jbond) [14:04:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:04:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:54] (CR) Volans: [C: -1] "nits inline, looks good otherwise" (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529
(https://phabricator.wikimedia.org/T266533) (owner: Ayounsi) [14:09:48] (PS3) Jbond: java: comment out fact spec tests until we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 [14:11:52] (PS3) Jbond: apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534 [14:12:00] (PS2) Jbond: mirrors: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/638536 [14:12:56] (CR) Jbond: [C: +2] java: comment out fact spec tests until we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 (owner: Jbond) [14:13:35] (CR) Jbond: [C: +2] apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534 (owner: Jbond) [14:13:38] (CR) Jbond: [C: +2] mirrors: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/638536 (owner: Jbond) [14:14:23] (PS1) Jbond: network: migrate spec tests to shared helper [puppet] - https://gerrit.wikimedia.org/r/638543 [14:14:43] (PS4) Ayounsi: Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) [14:16:15] Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (AlexisJazz) >>! In T267089#6600271, @CDanis wrote: > Thanks for the report @AlexisJazz. > > There was a known issue that occurred at the esams edge site during th...
[14:16:21] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) (owner: Ayounsi) [14:17:19] RECOVERY - DPKG on seaborgium is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:17:36] (CR) Ayounsi: [C: +2] Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) (owner: Ayounsi) [14:19:20] (PS1) Jbond: backup: migrate specs to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638544 [14:19:22] (PS1) Jbond: osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545 [14:19:24] (PS1) Jbond: interface: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546 [14:20:00] (CR) jerkins-bot: [V: -1] osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545 (owner: Jbond) [14:20:38] (PS2) Jbond: interface/java: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546 [14:20:45] (CR) jerkins-bot: [V: -1] backup: migrate specs to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638544 (owner: Jbond) [14:21:51] (PS2) Jbond: osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545 [14:22:09] (PS3) Jbond: interface/java: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546 [14:22:31] (CR) Jbond: [C: +2] network: migrate spec tests to shared helper [puppet] - https://gerrit.wikimedia.org/r/638543 (owner: Jbond) [14:24:19] (PS2) Jbond: backup: migrate specs to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638544 [14:26:39] (CR) Jbond: [C: +2] backup: migrate specs to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638544 (owner: Jbond) [14:26:45] (CR) Jbond: [C: +2] osm:
switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545 (owner: Jbond) [14:26:51] (CR) Jbond: [C: +2] interface/java: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546 (owner: Jbond) [14:27:06] (PS3) Jbond: osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545 [14:27:14] (PS4) Jbond: interface/java: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546 [14:29:15] (PS1) Jbond: git: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638548 [14:32:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:33:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:22] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:35:23] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:23] !log imported php-apcu-bc/php-igbinary/tideways-xhprof to component/php72 for buster-wikimedia [14:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:29] (PS1) Volans: spicerack: add requests_session accessor [software/spicerack] - https://gerrit.wikimedia.org/r/638550 [14:37:30] (PS1) Volans: Use wmflib.requests.http_session everywhere [software/spicerack] - https://gerrit.wikimedia.org/r/638551 [14:40:42] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (LSobanski) a:Marostegui→wiki_willy [14:40:53] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission
db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (LSobanski) This host is ready for DC-Ops to decommission. [14:59:35] (CR) Jbond: [C: +2] git: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638548 (owner: Jbond) [15:03:36] (PS1) Jbond: monitoring: complete migration to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638559 [15:03:38] (PS1) Jbond: graphite: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638560 [15:03:40] (PS1) Jbond: jenkis: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638561 [15:03:42] (PS1) Jbond: scap:migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638562 [15:05:31] (CR) Jbond: [C: +2] monitoring: complete migration to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638559 (owner: Jbond) [15:05:33] (CR) jerkins-bot: [V: -1] scap:migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638562 (owner: Jbond) [15:05:35] (CR) Jbond: [C: +2] graphite: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638560 (owner: Jbond) [15:05:39] (CR) Jbond: [C: +2] jenkis: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638561 (owner: Jbond) [15:05:59] (PS4) Filippo Giunchedi: thanos: add query-frontend [puppet] - https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281) [15:06:01] (PS4) Filippo Giunchedi: prometheus: add thanos query-frontend jobs [puppet] - https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281) [15:06:03] (PS5) Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) [15:06:05] (PS5) Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - https://gerrit.wikimedia.org/r/638122
(https://phabricator.wikimedia.org/T261281) [15:07:46] (PS1) Jbond: apt: remove old spec helper file [puppet] - https://gerrit.wikimedia.org/r/638564 [15:08:20] !log imported php-redis/xdebug to component/php72 for buster-wikimedia [15:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:16] (PS1) Hashar: gerrit: fix SonarQube report url discovery [puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) [15:15:16] Operations, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm) [15:15:45] Operations, observability: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135 (fgiunchedi) [15:18:50] (PS1) Jbond: nginx: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638568 [15:18:52] (PS1) Jbond: elasticsearch: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638569 [15:18:54] (PS1) Jbond: puppetmaster: clean up old spec files [puppet] - https://gerrit.wikimedia.org/r/638570 [15:18:56] (PS1) Jbond: puppetmaster: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638571 [15:19:35] PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:43] (PS1) Jbond: zuul: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638583 [15:35:56] (CR) Jbond: [C: +2] apt: remove old spec helper file [puppet] - https://gerrit.wikimedia.org/r/638564 (owner: Jbond) [15:35:57] (CR) Jbond: [C: +2] nginx: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638568 (owner: Jbond) [15:36:00] (CR) Jbond: [C: +2] elasticsearch: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638569 (owner: Jbond) [15:36:03] (CR) Jbond: [C: +2] puppetmaster: clean up old spec files [puppet] - https://gerrit.wikimedia.org/r/638570 (owner: Jbond) [15:36:09] (CR) Jbond: [C: +2] puppetmaster: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638571 (owner: Jbond) [15:36:11] (CR) Jbond: [C: +2] zuul: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638583 (owner: Jbond) [15:40:31] PROBLEM - MariaDB Replica IO: analytics_meta on db1108 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:41:33] (CR) Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support (4 comments) [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto) [15:41:41] elukey: ^ [15:41:43] (PS6) Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) [15:41:50] marostegui: yep I am fixing [15:41:54] <3 [15:45:18] (PS10) Hnowlan:
postgres: set max connections in postgres based on replica count [puppet] - https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) [15:47:07] (PS1) Jbond: nagios_common: migrate to shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638589 [15:47:08] (PS1) Jbond: query_service: clean up unused spec_helper [puppet] - https://gerrit.wikimedia.org/r/638590 [15:47:11] (PS1) Jbond: contint: remove unused spec folder [puppet] - https://gerrit.wikimedia.org/r/638591 [15:47:13] (PS1) Jbond: cassandra: remove unused spec_helper [puppet] - https://gerrit.wikimedia.org/r/638592 [15:49:12] (PS1) JMeybohm: Update to 1.16.15 [debs/kubernetes] - https://gerrit.wikimedia.org/r/638595 (https://phabricator.wikimedia.org/T266766) [15:49:18] (CR) Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto) [15:49:25] (PS1) Jbond: Rakefile: add puppetdbquery to thirdparty modules [puppet] - https://gerrit.wikimedia.org/r/638596 [15:49:43] (CR) Jbond: [C: +2] nagios_common: migrate to shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638589 (owner: Jbond) [15:49:46] (CR) Jbond: [C: +2] query_service: clean up unused spec_helper [puppet] - https://gerrit.wikimedia.org/r/638590 (owner: Jbond) [15:49:49] (CR) Jbond: [C: +2] contint: remove unused spec folder [puppet] - https://gerrit.wikimedia.org/r/638591 (owner: Jbond) [15:49:51] (CR) Jbond: [C: +2] cassandra: remove unused spec_helper [puppet] - https://gerrit.wikimedia.org/r/638592 (owner: Jbond) [15:49:53] Operations, Maps, Wikimedia-Logstash, observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (MSantos) @fgiunchedi we've been working on fixing OSM replication recently
in the eqiad cluster, so we blocked deployments fo... [15:50:19] RECOVERY - MariaDB Replica IO: analytics_meta on db1108 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:50:38] (PS7) Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) [15:51:11] (PS1) Ayounsi: ImportPuppetDB should warn if it imported a SLAAC IP [software/netbox-extras] - https://gerrit.wikimedia.org/r/638598 (https://phabricator.wikimedia.org/T265905) [15:51:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:51:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:52] (CR) Jbond: [C: +2] Rakefile: add puppetdbquery to thirdparty modules [puppet] - https://gerrit.wikimedia.org/r/638596 (owner: Jbond) [15:51:55] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:51:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:50] (PS1) Arturo Borrero Gonzalez: passwords: root-authorized-keys.erb: add dcaro CloudVPS-wide root key [labs/private] - https://gerrit.wikimedia.org/r/638601 (https://phabricator.wikimedia.org/T266068) [15:53:55] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] passwords: root-authorized-keys.erb: add dcaro CloudVPS-wide root key [labs/private] - https://gerrit.wikimedia.org/r/638601 (https://phabricator.wikimedia.org/T266068) (owner: Arturo Borrero Gonzalez) [15:54:24] (CR) Ayounsi: "Tested in netbox-next."
[software/netbox-extras] - https://gerrit.wikimedia.org/r/638598 (https://phabricator.wikimedia.org/T265905) (owner: Ayounsi)
[15:54:45] (CR) Hnowlan: [C: +2] api-gateway: use envoy 1.16.0 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/637431 (owner: Hnowlan)
[15:56:26] (CR) RLazarus: [C: +1] "🚀" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[15:57:19] (Merged) jenkins-bot: api-gateway: use envoy 1.16.0 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/637431 (owner: Hnowlan)
[15:59:17] !log shutdown kafka-jumbo1006 to replace 1G with 10G nic
[15:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:36] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[16:01:36] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[16:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:47] (PS2) Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid [puppet] - https://gerrit.wikimedia.org/r/638146
[16:02:44] (CR) Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid (1 comment) [puppet] - https://gerrit.wikimedia.org/r/638146 (owner: Ahmon Dancy)
[16:02:55] PROBLEM - Host kafka-jumbo1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:03:59] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[16:03:59] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[16:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:07] (CR) MSantos: [C: +1] postgres: set max connections in postgres based on replica count [puppet] - https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: Hnowlan)
[16:08:29] RECOVERY - Host kafka-jumbo1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms
[16:09:09] (CR) Ahmon Dancy: [C: +1] safe-service-restart: add optional poolcounter support [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[16:09:15] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 98 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[16:09:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 72 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[16:09:27] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 98 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[16:10:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 106 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[16:10:05] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/638598 (https://phabricator.wikimedia.org/T265905) (owner: Ayounsi)
[16:10:45] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 97 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[16:11:25] (PS1) Jbond: wmflib: migrate to use shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638607
[16:11:26] sorry for the spam, should solve in a bit
[16:12:09] (CR) Ayounsi: [C: +2] ImportPuppetDB should warn if it imported a SLAAC IP [software/netbox-extras] - https://gerrit.wikimedia.org/r/638598 (https://phabricator.wikimedia.org/T265905) (owner: Ayounsi)
[16:13:12] (CR) Jbond: [C: +2] wmflib: migrate to use shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638607 (owner: Jbond)
[16:13:42] !log cdanis@cumin1001 START - Cookbook sre.network.cf
[16:13:43] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[16:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:22] (PS3) Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid [puppet] - https://gerrit.wikimedia.org/r/638146
[16:15:54] Operations, ops-eqiad, Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (Cmjohnson)
[16:16:02] Operations, ops-eqiad, Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (Cmjohnson)
Open→Resolved
[16:18:45] (CR) Hnowlan: [C: +2] postgres: set max connections in postgres based on replica count [puppet] - https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: Hnowlan)
[16:19:15] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[16:19:23] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[16:20:43] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[16:20:49] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[16:21:10] Operations, ops-eqiad, Analytics-Clusters, Patch-For-Review, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) Today I swapped the NIC on kafka-jumbo1006 with Chris and there was no need for /etc/network/interfaces changes, `firmware-...
[16:21:19] (PS1) Jbond: stronswan: remove spec tests for iprsolve as handled bu wmflib [puppet] - https://gerrit.wikimedia.org/r/638609
[16:21:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[16:21:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:21:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:58] (CR) Jbond: [C: +2] stronswan: remove spec tests for iprsolve as handled bu wmflib [puppet] - https://gerrit.wikimedia.org/r/638609 (owner: Jbond)
[16:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:22:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:07] (PS1) Jbond: wmflib: remove unused spec_helper [puppet] - https://gerrit.wikimedia.org/r/638610
[16:28:59] (PS2) Jbond: httpd: convert to shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638610
[16:32:51] Operations, Scap, serviceops, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (LarsWirzenius) I can't actually find the `package_builder` host and can't check if I have login access or the ability t...
[16:35:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:39] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:35:41] (PS1) Ayounsi: Restart puppetdb-microservice if config changes [puppet] - https://gerrit.wikimedia.org/r/638616
[16:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:36:00] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:51] (PS2) Ayounsi: Restart puppetdb-microservice if config changes [puppet] - https://gerrit.wikimedia.org/r/638616
[16:38:19] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/638616 (owner: Ayounsi)
[16:38:48] (PS1) Jbond: php: migrate to shared spec_test [puppet] - https://gerrit.wikimedia.org/r/638621
[16:39:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:39:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:11] (CR) Jbond: [C: +2] httpd: convert to shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638610 (owner: Jbond)
[16:40:15] (CR) Jbond: [C: +2] php: migrate to shared spec_test [puppet] - https://gerrit.wikimedia.org/r/638621 (owner: Jbond)
[16:41:23] (PS1) Jbond: php: remove unused spec_helper [puppet] - https://gerrit.wikimedia.org/r/638623
[16:41:25] (PS1) LSobanski: Fix typo in authorized_for_configuration_information permission [puppet] -
https://gerrit.wikimedia.org/r/638624
[16:41:28] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/638624 (owner: LSobanski)
[16:42:09] (CR) Volans: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/638616 (owner: Ayounsi)
[16:42:32] (CR) LSobanski: "Pretty please :)" [puppet] - https://gerrit.wikimedia.org/r/638624 (owner: LSobanski)
[16:42:34] (CR) Ayounsi: [C: +2] Restart puppetdb-microservice if config changes [puppet] - https://gerrit.wikimedia.org/r/638616 (owner: Ayounsi)
[16:42:49] (CR) Jbond: [C: +2] php: remove unused spec_helper [puppet] - https://gerrit.wikimedia.org/r/638623 (owner: Jbond)
[16:44:13] (CR) Kormat: "We generally put the module name at the start of the commit message. In this case, it would be `icinga: `" [puppet] - https://gerrit.wikimedia.org/r/638624 (owner: LSobanski)
[16:48:01] (PS2) LSobanski: icinga: fix typo in authorized_for_configuration_information permission [puppet] - https://gerrit.wikimedia.org/r/638624
[16:48:01] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia!
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/638624 (owner: LSobanski)
[16:49:08] (CR) Kormat: [C: +1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/638624 (owner: LSobanski)
[16:50:03] (CR) LSobanski: [C: +2] icinga: fix typo in authorized_for_configuration_information permission [puppet] - https://gerrit.wikimedia.org/r/638624 (owner: LSobanski)
[16:51:07] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:42] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (Cmjohnson)
[16:56:49] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Cmjohnson)
[16:56:52] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[16:56:54] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (Cmjohnson) Open→Resolved done
[16:56:57] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Cmjohnson)
[16:57:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:00] (PS2) Jbond: scap:migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638562
[17:00:20] (CR) jerkins-bot: [V: -1] scap:migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638562 (owner: Jbond)
[17:09:19] (CR) Giuseppe Lavagetto: [C: +2] safe-service-restart: add optional poolcounter support [puppet] - https://gerrit.wikimedia.org/r/635991
(https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[17:10:29] (CR) Volans: [C: +1] "LGTM" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: Ayounsi)
[17:14:43] (CR) Giuseppe Lavagetto: [C: +2] poolcounter: add client configuration classes [puppet] - https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[17:14:54] (PS5) Giuseppe Lavagetto: poolcounter: add client configuration classes [puppet] - https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055)
[17:29:59] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:30:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:06] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:30:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:49] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:31:49] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[17:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:58] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:47] (PS4) Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[17:33:49] (PS5) Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339)
[17:33:51] (PS6) Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849
[17:33:53] (PS5) Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[17:35:00] (CR) Ayounsi: ImportPuppetDB: add cable color/type (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: Ayounsi)
[17:38:18] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (Cmjohnson) The mainboard arrived
[17:43:25] (PS3) Jbond: scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - https://gerrit.wikimedia.org/r/638562
[17:43:27] (PS1) Jbond: spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - https://gerrit.wikimedia.org/r/638643
[17:43:48] (PS5) Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[17:43:50] (PS6) Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339)
[17:43:52] (PS7) Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849
[17:43:54] (PS6) Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853
(https://phabricator.wikimedia.org/T265339)
[17:43:56] (CR) jerkins-bot: [V: -1] spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - https://gerrit.wikimedia.org/r/638643 (owner: Jbond)
[17:44:00] Operations, serviceops, CommRel-Specialists-Support (Oct-Dec-2020): CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (RLazarus)
[17:44:41] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: Ayounsi)
[17:44:50] (CR) jerkins-bot: [V: -1] scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - https://gerrit.wikimedia.org/r/638562 (owner: Jbond)
[17:45:06] (PS2) Jbond: spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - https://gerrit.wikimedia.org/r/638643
[17:45:44] !log shutting elastic1063 down to reseat DIMM T265113
[17:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:51] T265113: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113
[17:46:15] (CR) Ayounsi: [C: +2] ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: Ayounsi)
[17:47:13] PROBLEM - Host elastic1063 is DOWN: PING CRITICAL - Packet loss = 100%
[17:49:39] PROBLEM - Host elastic1063.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:50] Operations, MW-on-K8s, observability, serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (lmata) Just dropping a quick update here, we should schedule some time to review options. Had a brief exchange with @akosiaris and we'll get the team together for...
[17:55:21] RECOVERY - Host elastic1063.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
[17:55:24] (CR) Jbond: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/633853 (owner: Dzahn)
[17:56:45] RECOVERY - Host elastic1063 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[17:58:08] Operations, ops-eqiad, Discovery-Search: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (Cmjohnson) Open→Resolved I reseated all the DIMM and there were several. I am not getting any Dell h/w errors. Hopefully, the reseat and flea...
[18:03:34] (PS3) Jbond: spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - https://gerrit.wikimedia.org/r/638643
[18:04:36] (CR) Jbond: [C: +2] spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - https://gerrit.wikimedia.org/r/638643 (owner: Jbond)
[18:04:44] (PS4) Jbond: scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - https://gerrit.wikimedia.org/r/638562
[18:06:00] (PS5) Jbond: scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - https://gerrit.wikimedia.org/r/638562
[18:07:21] (CR) jerkins-bot: [V: -1] scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - https://gerrit.wikimedia.org/r/638562 (owner: Jbond)
[18:15:24] (CR) Dzahn: "currently on vacation, please add some other reviewers with +2, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: Hashar)
[18:15:52] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:59] Operations, ops-eqiad, DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (Cmjohnson) | These are all 1G servers in 10G racks for row A | A2 | |A4||A7| |db1074||stat1004||mw1269| |db1075||logstash1020||mw1270| |db1079||wdqs1003||mw1271| |db1080||g...
[18:36:39] Operations, ops-eqiad, DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (Cmjohnson) These are all 1G serves in 10G racks for row B |B2||B4|B7| |db1099||elastic1050|wtp1031| |analytics1072|2U|elastic1049|wtp1032| |||conf1005|wtp1033| |||Kublog1002|...
[18:42:21] (PS1) Volans: Dependencies: remove temporary hacks [software/spicerack] - https://gerrit.wikimedia.org/r/638664
[18:42:29] Operations, ops-eqiad, DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (Cmjohnson) Row C |C2||C4|C7 |es1016|2U|ores1006|francium| |db1100||mwlog1001|polonium| |db1101||snapshot1006|scb1003| |analytics1064|2U|deploy1001|elastic1051| |analytics1065...
[18:45:41] (PS1) Jcrespo: POC: Testing interfacing with swift to gather metadata [software/wmfbackups] - https://gerrit.wikimedia.org/r/638665 (https://phabricator.wikimedia.org/T264189)
[18:46:15] Operations, ops-eqiad, DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (Cmjohnson) Racks D2 and D7 are 100% 10G but they were initially built that way.
D4 was just converted to 10G |D4| |db1114| |ores1008| |mc1033| |mc1034| |mc1035| |mc1036| |a...
[18:46:21] (CR) jerkins-bot: [V: -1] POC: Testing interfacing with swift to gather metadata [software/wmfbackups] - https://gerrit.wikimedia.org/r/638665 (https://phabricator.wikimedia.org/T264189) (owner: Jcrespo)
[18:54:57] Operations, ops-eqiad, decommission-hardware: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (Cmjohnson) Open→Resolved a:Cmjohnson→RobH @RobH This server is ready to go back to you for spares. Where are you tracking that?
[18:56:37] Operations, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): Reclaim labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (Cmjohnson) a:Cmjohnson→wiki_willy @wiki_willy @RobH Are we returning to spare or decommissioning these?
[19:01:21] Operations, Data-Persistence-Backup, SRE-swift-storage, Goal, Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (jcrespo) There are more discrepancies between swift and mediawiki db:...
[19:12:08] Operations, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (Cmjohnson) a:Cmjohnson→Jclark-ctr @Jclark-ctr Please make sure all of these switches have been restored to factory defaults, unplug, and remove the racks....
[19:32:05] !log restarting blazegraph on wdqs1007 to reset ban list
[19:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:21] (CR) Volans: "Some minor details/questions inline" (7 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) (owner: Ayounsi)
[19:34:30] (CR) Gehel: [C: +1] "LGTM, trivial enough" [software/spicerack] - https://gerrit.wikimedia.org/r/638664 (owner: Volans)
[19:35:05] (CR) Volans: [C: +2] Dependencies: remove temporary hacks [software/spicerack] - https://gerrit.wikimedia.org/r/638664 (owner: Volans)
[19:36:27] PROBLEM - Host an-presto1004 is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:43] (Merged) jenkins-bot: Dependencies: remove temporary hacks [software/spicerack] - https://gerrit.wikimedia.org/r/638664 (owner: Volans)
[19:37:50] Operations, ops-eqiad, DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (Cmjohnson) a:Jclark-ctr→Cmjohnson
[19:40:11] Operations, ops-eqiad, DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (Cmjohnson) @wiki_willy I had time to do this today while the Dell tech worked on an-presto1004. I am going to be utilizing a 2U space in A2 and B2 for the kafka-jumbo 10G up...
[19:40:41] RECOVERY - Host an-presto1004 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[19:45:29] Operations, ops-eqiad, Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (Cmjohnson) Open→Resolved @elukey the an-presto1004 motherboard has been replaced and the backplane, everything came back up as normal except I am not able to ssh into the server and fresh i...
[19:51:36] (PS1) Jbond: (WIP) profile: migrate to shared spec_test [puppet] - https://gerrit.wikimedia.org/r/638678
[19:53:35] (PS2) Jbond: (WIP) profile: migrate to shared spec_test [puppet] - https://gerrit.wikimedia.org/r/638678
[19:54:56] (Abandoned) Jbond: WIP: migrate profile spec tests to shared spec_healper [puppet] - https://gerrit.wikimedia.org/r/541245 (owner: Jbond)
[19:55:59] (Abandoned) Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - https://gerrit.wikimedia.org/r/558133 (owner: Jbond)
[19:56:01] (CR) jerkins-bot: [V: -1] (WIP) profile: migrate to shared spec_test [puppet] - https://gerrit.wikimedia.org/r/638678 (owner: Jbond)
[19:56:32] (Abandoned) Jbond: apereo_cas: update to use stunnel client [puppet] - https://gerrit.wikimedia.org/r/558590 (owner: Jbond)
[20:01:52] (CR) Jbond: "should update shared spec_test to support changing site, realm and cluster (and any other globals) which are currently hacked in as rspec_" [puppet] - https://gerrit.wikimedia.org/r/638678 (owner: Jbond)
[20:03:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:04:01] (PS3) Jbond: remote-backup-mariadb: remove cron type [puppet] - https://gerrit.wikimedia.org/r/636411 (https://phabricator.wikimedia.org/T265138)
[20:04:07] PROBLEM - MegaRAID on an-presto1004 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:04:08] ACKNOWLEDGEMENT - MegaRAID on an-presto1004 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T267160 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:04:11] Operations, ops-eqiad: Degraded RAID on an-presto1004 -
https://phabricator.wikimedia.org/T267160 (ops-monitoring-bot)
[20:04:21] (PS3) Jbond: prometheus_intel_microcode: remove cron type [puppet] - https://gerrit.wikimedia.org/r/636409 (https://phabricator.wikimedia.org/T265138)
[20:04:34] (CR) Jbond: [C: +2] cumin: remove cron type [puppet] - https://gerrit.wikimedia.org/r/636405 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[20:05:06] (Abandoned) Jbond: cumin: remove cron type [puppet] - https://gerrit.wikimedia.org/r/636405 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[20:05:12] (CR) Jbond: [C: +2] smart: remove cron type [puppet] - https://gerrit.wikimedia.org/r/636402 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[20:05:20] (PS2) Jbond: smart: remove cron type [puppet] - https://gerrit.wikimedia.org/r/636402 (https://phabricator.wikimedia.org/T265138)
[20:05:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:05:27] (CR) Jbond: [C: +2] remote-backup-mariadb: remove cron type [puppet] - https://gerrit.wikimedia.org/r/636411 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[20:05:43] (CR) Jbond: [C: +2] prometheus_intel_microcode: remove cron type [puppet] - https://gerrit.wikimedia.org/r/636409 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[20:06:18] (CR) Jbond: [C: +2] service-auto-restart: clean up cron [puppet] - https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[20:06:27] (PS3) Jbond: service-auto-restart: clean up cron [puppet] - https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138)
[20:06:33] (CR) Jbond: [V: +2 C: +2] service-auto-restart: clean up cron [puppet] - https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[20:07:23] (Abandoned) Jbond: bird: ensure bird service is running [puppet] - https://gerrit.wikimedia.org/r/625926 (owner: Jbond)
[20:07:32] (CR) Jbond: [C: +2] labstore::nfs_mount: drop support for empty string share_path [puppet] - https://gerrit.wikimedia.org/r/630589 (owner: Jbond)
[20:15:49] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 5115 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:17:27] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 39 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:20:26] (PS5) Jbond: confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] -
https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [20:21:45] (CR) jerkins-bot: [V: -1] confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) (owner: Jbond) [20:26:50] (PS4) Ladsgroup: [WIP] varnish: Improve wording of the browser security error a bit [puppet] - https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) [20:27:28] (Abandoned) Jbond: httpd: test validate_cmd [puppet] - https://gerrit.wikimedia.org/r/604765 (https://phabricator.wikimedia.org/T255124) (owner: Jbond) [20:27:34] (Abandoned) Jbond: httpd: add validate_cmd to apache configs [puppet] - https://gerrit.wikimedia.org/r/604764 (https://phabricator.wikimedia.org/T255124) (owner: Jbond) [20:28:59] (CR) Ladsgroup: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: Ladsgroup) [20:29:32] (Abandoned) Jbond: standard: move none standard class to profile::standard [puppet] - https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: Jbond) [20:29:53] (Restored) Jbond: standard: move none standard class to profile::standard [puppet] - https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: Jbond) [20:30:26] (Abandoned) Jbond: DO NOT MERGE: Testing pcc [puppet] - https://gerrit.wikimedia.org/r/613610 (owner: Jbond) [20:30:56] (Abandoned) Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/609402 (owner: Jbond) [20:31:24] (PS3) Jbond: labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) [20:31:46] (CR) jerkins-bot: [V: -1] labstore::fileserver::exports: use 
sudo::safe_wildcard_cmd [puppet] - https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) (owner: Jbond) [20:32:54] (PS4) Jbond: labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) [20:34:14] (Abandoned) Jbond: WIP: profile::mariadb::misc: create generic profile for misc classes [puppet] - https://gerrit.wikimedia.org/r/608895 (owner: Jbond) [20:34:35] (Abandoned) Jbond: role::tendril: move tendril to new profile::mariadb::misc [puppet] - https://gerrit.wikimedia.org/r/608896 (owner: Jbond) [20:35:43] (Abandoned) Jbond: profile::backup::director: increase number of open files. [puppet] - https://gerrit.wikimedia.org/r/556207 (owner: Jbond) [20:36:16] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (aapeli) The Wikimedia Maps TOS does not yet mention this change (it's linked in the attribution line on maps.wikimedia.... [20:36:59] (Abandoned) Jbond: cfssl: Ensure CSR exists before we try to sign it [puppet] - https://gerrit.wikimedia.org/r/581559 (owner: Jbond) [20:38:30] (PS3) Jbond: role::puppetmaster::standalone: add type checking to autosign [puppet] - https://gerrit.wikimedia.org/r/566512 [20:38:55] (PS7) Jbond: admin: add tests for system users [puppet] - https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) [20:39:18] (PS2) Jbond: puppetmaster: update webconfig to use correct file path [puppet] - https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) [20:41:55] (CR) Jbond: "just going through old changes, this still seems valid?" 
[puppet] - https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: Jbond) [20:42:30] (Abandoned) Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: Jbond) [20:42:38] (Abandoned) Jbond: service_definitions: add defined ports [puppet] - https://gerrit.wikimedia.org/r/559537 (https://phabricator.wikimedia.org/T241160) (owner: Jbond) [20:43:12] (PS2) Jbond: backup::director: add type checking and use lookup vs hiera [puppet] - https://gerrit.wikimedia.org/r/556211 [20:43:29] (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/556211 (owner: Jbond) [20:44:04] (Abandoned) Jbond: apereo_cas: add ability to configure basic memcached support [puppet] - https://gerrit.wikimedia.org/r/550695 (https://phabricator.wikimedia.org/T233931) (owner: Jbond) [20:45:58] (Abandoned) Jbond: raid: update check_raid to detect missing disk"" [puppet] - https://gerrit.wikimedia.org/r/510139 (owner: Jbond) [20:46:16] (Abandoned) Jbond: icinga: Add a new script and configuration to send prowl notifications [puppet] - https://gerrit.wikimedia.org/r/502993 (owner: Jbond) [21:09:05] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:09:57] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} 
(article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:03] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:29] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:41] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:45] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:11:15] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:11:15] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [21:12:11] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:12:23] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:12:25] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:13:15] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:13:16] jouncebot: now [21:13:16] For the next 10 hour(s) and 46 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201103T0800) [21:13:51] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:14:29] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:15:01] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:16:44] !log Gerrit: triggering java garbage collection # T263008 [21:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:52] T263008: Gerrit out of heap - https://phabricator.wikimedia.org/T263008 [21:17:23] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:18:55] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:18:57] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:19:39] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:22:29] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:22:55] PROBLEM - Mobileapps LVS 
eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:24:11] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:24:27] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:24:37] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:26:35] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:26:57] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a re [21:26:57] ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:27:43] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api 
[21:27:45] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:27:47] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:29:09] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:29:13] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:29:19] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:30:09] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:30:51] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:30:59] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:31:35] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:32:29] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:32:49] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:33:09] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:34:35] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Aklapper) @aapeli: Could you please file a separate task about updating https://foundation.wikimedia.org/wiki/Maps_Term... 
[21:36:03] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:36:05] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:37:03] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [21:37:39] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:37:49] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:38:39] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:39:15] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:40:05] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:41:21] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out 
before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:41:51] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:42:21] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:42:47] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:42:49] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:43:07] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:43:57] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:44:35] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:44:39] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: 
/{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a re [21:44:39] ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:44:49] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:45:31] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:45:43] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:46:11] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:46:15] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:46:27] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:48:07] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:49:41] PROBLEM - restbase endpoints health on 
restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:50:09] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (aapeli) OK, I've made a new task: T267170. [21:50:39] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:50:47] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:51:21] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:51:33] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:52:01] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:52:15] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:52:27] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:52:55] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:52:57] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:54:49] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:55:21] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:56:33] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:56:43] (CR) QChris: [C: +1] gerrit: fix SonarQube report url discovery [puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: Hashar) [21:57:25] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:57:53] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:58:17] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:59:07] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:00:35] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:01:04] (PS1) QChris: Add .gitreview [software/knead-wikidough] - https://gerrit.wikimedia.org/r/638719 [22:01:05] (CR) QChris: [V: +2 C: +2] Add .gitreview [software/knead-wikidough] - https://gerrit.wikimedia.org/r/638719 (owner: QChris) [22:01:05] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:01:33] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:01:52] (CR) Awight: gerrit: fix SonarQube report url discovery (1 comment) [puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: Hashar) [22:02:11] (PS2) Awight: gerrit: fix SonarQube report url discovery [puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: Hashar) [22:02:53] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:03:25] RECOVERY - 
recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:03:29] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:03:43] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:04:15] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:05:09] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:05:11] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:05:16] (PS2) CDanis: depool esams [dns] - https://gerrit.wikimedia.org/r/627919 [22:08:31] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:08:39] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:08:53] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:10:15] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:10:27] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:11:59] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:12:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:12:43] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:13:37] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:13:53] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:16:03] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:17:19] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description 
addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:18:09] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:19:01] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:19:01] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:21:57] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:22:11] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:22:11] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:22:13] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:23:35] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:25:30] !log ✔️ cdanis@mw1278.eqiad.wmnet ~ 🕠🍺 sudo depool [22:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:57] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:27:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:28:35] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:29:41] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:31:06] !log depool mw1276 and mw1279 also [22:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:19] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:33:17] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:33:43] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [22:33:43] received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:34:01] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:34:01] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:34:14] !log restart-php7.2-fpm and pool on mw1276 [22:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:53] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:35:21] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:35:39] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:35:39] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:35:56] !log ✔️ cdanis@mw1290.eqiad.wmnet ~ 🕠🍺 sudo 
restart-php7.2-fpm [22:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:53] !log repool mw1278 and mw1279 [22:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:12] Operations, Performance-Team, serviceops, Patch-For-Review, User-jijiki: MediaWiki to route specific keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (jijiki) @aaron is there a timeline as to when those patches will be merged? [22:39:54] Operations, Performance-Team, serviceops, Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (jijiki) [22:40:47] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:44:09] <_joe_> cdanis: did you restart more servers? [22:44:14] no [22:44:46] <_joe_> it's recovering... by itself then [22:44:51] <_joe_> cpu usage is going down [22:44:52] there are a few still pegged, but yeah, only a few [22:49:31] !log mw1342 restart-php7.2-fpm [22:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:26] <_joe_> !log depooling mw1346 [22:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:32] <_joe_> !log repooling mw1346 [22:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:09] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (AntiCompositeNumber) [23:47:09] (PS1) Krinkle: Apply bucketing to query sizes stats [extensions/TemplateData] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/638519 [23:55:23] (CR) BryanDavis: [C: +1] Use common k8s labels (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: Legoktm)