[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201103T0000)
[00:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:03:51] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:04:44] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:01] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[00:06:26] <wikibugs>	 (CR) Dzahn: "Found out there is still puppet 4.8 on mostly toolforge: https://phabricator.wikimedia.org/P13129" [puppet] - https://gerrit.wikimedia.org/r/633853 (owner: Dzahn)
[00:07:42] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v...
[00:14:23] <wikibugs>	 (PS1) Bartosz Dziewoński: Enable DiscussionTools as a beta feature on almost all Wikipedias ("phase 3") [mediawiki-config] - https://gerrit.wikimedia.org/r/638201 (https://phabricator.wikimedia.org/T266303)
[00:19:39] <wikibugs>	 (PS1) RobH: clouddb1020 mac address update [puppet] - https://gerrit.wikimedia.org/r/638205 (https://phabricator.wikimedia.org/T260441)
[00:20:33] <wikibugs>	 (CR) RobH: [C: +2] clouddb1020 mac address update [puppet] - https://gerrit.wikimedia.org/r/638205 (https://phabricator.wikimedia.org/T260441) (owner: RobH)
[00:23:17] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clouddb1020....
[00:23:20] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['clouddb1020.eqiad.wmnet'] `  Of which those **FAILED**: `...
[00:24:44] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] `  Of which those **FAILED**: ` ['es1032.eqiad.wmnet'] `
[00:26:24] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` clouddb1020....
[00:27:03] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH) This gets to the debian loader, and halts on 'Probing EDD' which had no issues on the other hosts.  I'm still investigating on what is different...
[00:27:28] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:44] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[00:29:58] <wikibugs>	 (PS1) Ebernhardson: Turn off airflow scheduler during db downtime [puppet] - https://gerrit.wikimedia.org/r/638207
[00:31:02] <wikibugs>	 (CR) Dzahn: [C: -1] "and toolforge appears to be using systemd timers" [puppet] - https://gerrit.wikimedia.org/r/633853 (owner: Dzahn)
[00:31:17] <wikibugs>	 (PS1) Legoktm: Prevent webservice from doing anything if buildpacks are being used [software/tools-webservice] - https://gerrit.wikimedia.org/r/638208 (https://phabricator.wikimedia.org/T266901)
[00:33:55] <wikibugs>	 (CR) Ebernhardson: "PCC looks reasonable: https://puppet-compiler.wmflabs.org/compiler1001/26267/" [puppet] - https://gerrit.wikimedia.org/r/638207 (owner: Ebernhardson)
[00:35:21] <wikibugs>	 (CR) Ottomata: [C: +1] "I didn't check if there are any uses of this but if you did then +1 :)" [puppet] - https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) (owner: Razzi)
[00:40:19] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[00:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:23] <wikibugs>	 (CR) Dzahn: [C: -1] "The query works when executed from the "phabricator_maniphest" database but not when in the "phabricator_project" database. That results i" [puppet] - https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: Aklapper)
[00:42:22] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:36] <wikibugs>	 (CR) Dzahn: [C: -1] "P.S. I will be off for about 2 weeks, if you have a follow-up please also try to get someone else to review/merge it." [puppet] - https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: Aklapper)
[00:44:59] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v...
[00:45:01] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] `  Of which those **FAILED**: ` ['es1032.eqiad.wmnet'] `
[00:45:18] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v...
[00:45:33] <wikibugs>	 (CR) Dzahn: [C: -1] "meanwhile, here is raw data for you to start with this month:" [puppet] - https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: Aklapper)
[00:47:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:48:35] <wikibugs>	 Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) >>! In T262869#6565461, @Nemo_bis wrote: >> We'll prepare at least a lightweight incident report in the coming days. >  > Did this happen? I couldn't find i...
[00:48:49] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:48:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:49:22] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Organizers Mailing List - https://phabricator.wikimedia.org/T267083 (Reedy)
[00:49:25] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['clouddb1020.eqiad.wmnet'] `  and were **ALL** successful.
[00:51:53] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (RobH)
[00:52:05] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (RobH) Open→Resolved all hosts installed, calling into puppet, staged in netbox.
[00:53:04] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH)
[00:59:17] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[00:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:18] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:01:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:46] <wikibugs>	 (CR) Dzahn: cumin: replace check-aliases-cron with a systemd timer (2 comments) [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:11:39] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] `  and were **ALL** successful.
[01:20:40] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[01:22:51] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[01:24:03] <wikibugs>	 (PS3) Dzahn: cumin: replace check-aliases-cron with a systemd timer [puppet] - https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138)
[01:26:16] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH) Open→Resolved >>! In T260370#6598690, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['es1032.eqiad.wmnet'] > ` >  >...
[01:34:11] <wikibugs>	 (PS1) Dzahn: site: introduce mwdebug1003 as debug server on buster [puppet] - https://gerrit.wikimedia.org/r/638218 (https://phabricator.wikimedia.org/T245757)
[01:40:04] <wikibugs>	 (CR) Jeena Huneidi: [C: +1] "Changes LGTM, although the lint error I see in CI is not what I expected (it looks correct running locally)" [deployment-charts] - https://gerrit.wikimedia.org/r/638025 (owner: JMeybohm)
[01:49:48] <wikibugs>	 (CR) Dzahn: "low prio but it could technically be done any time and I wanted to also just leave the comments here already from learnings trying it with" [puppet] - https://gerrit.wikimedia.org/r/638218 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[01:55:51] <icinga-wm>	 PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:56:14] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "I'm all for trying it, as I said on the previous attempt, we don't have much to lose here. But I'm currently not here to actually deploy t" [puppet] - https://gerrit.wikimedia.org/r/637852 (https://phabricator.wikimedia.org/T261031) (owner: Ladsgroup)
[01:56:31] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f7e22c024e0: Failed to establish a new connection: [Errno 111] Connection
[01:56:31] <icinga-wm>	 ://wikitech.wikimedia.org/wiki/Search%23Administration
[01:57:37] <wikibugs>	 Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (Dzahn)
[01:58:14] <wikibugs>	 Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (Dzahn) One rack done, the other rack we will continue from week of Nov 16.
[02:08:41] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.16 [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/638236
[02:10:01] <wikibugs>	 (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.16 [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/638236 (https://phabricator.wikimedia.org/T263182) (owner: TrainBranchBot)
[02:12:39] <icinga-wm>	 RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:19] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: active_shards: 916, unassigned_shards: 0, number_of_nodes: 6, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 483, task_max_waiting_in_queue_millis: 0, timed_out: False, initializing_shards: 0, number_of_in_flight_fetch: 0, num
[02:13:19] <icinga-wm>	 : 3, cluster_name: production-logstash-eqiad, status: green, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:13:24] <shdubsh>	 !log restart ES on logstash1009 - oom killed
[02:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:39] <wikibugs>	 Operations, Parsoid, Parsoid-Tests, serviceops: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) I will be afk for about 2 weeks.   If this needs earlier attention (I assume not, based on low prio etc) please contact the subteam.
[02:15:25] <wikibugs>	 Operations, Parsoid, Parsoid-Tests, serviceops: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) p:Triage→Medium
[02:25:49] <wikibugs>	 Operations, Wikidata, Wikidata Query Builder, User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (Dzahn) See progress in T266702#6592812 From deployment and other internal servers you can already talk to query.wikidata.org on miscweb, it serves what...
[02:28:56] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (Dzahn)
[02:33:14] <wikibugs>	 (CR) Cwhite: [C: +1] thanos: configure memcached size via hiera [puppet] - https://gerrit.wikimedia.org/r/638110 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[02:34:08] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (Dzahn) - confirmed L3 signature 👍 - `[ldap-corp1001:~] $ /usr/bin/ldapsearch -x "mail=dcaro*@*" | grep -E 'employee|mail|manager'` 👍 (confirms full time employee and who is manage...
[02:36:32] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (Dzahn)
[02:38:57] <wikibugs>	 (CR) Cwhite: [C: +1] thanos: use systemd overrides for query/store/compact [puppet] - https://gerrit.wikimedia.org/r/638036 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[02:40:26] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (Dzahn) @RobH See above, I did these things to verify the user but on vacation from tomorrow. Since it's a global root access and I see you are clinic duty for week of Nov 2, could...
[02:50:30] <wikibugs>	 (PS2) Dzahn: cumin: remove stretch support and move python_version to Hiera [puppet] - https://gerrit.wikimedia.org/r/636101
[02:51:07] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[03:04:02] <wikibugs>	 Operations, Wikimedia-Etherpad, Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (Dzahn) Has anyone had this issue today, since it was another Monday?  I hope not, given that effectively what was a 10x in allowed points/sec.  I will be off for about 2 weeks, please...
[03:20:35] <wikibugs>	 Operations, Wikimedia-Etherpad, Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (dpifke) Etherpad worked well before/during our team meeting today, which wasn't the case last week.  Thanks for the fix!
[04:15:25] <wikibugs>	 Operations, DBA, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Krinkle)
[04:17:50] <wikibugs>	 Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (Krinkle)
[04:19:11] <wikibugs>	 Operations, ops-eqiad, DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (Krinkle)
[04:31:31] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[04:35:09] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:35:25] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:35] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[04:37:01] <icinga-wm>	 PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:45] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f14edd704e0: Failed to establish a new connection: [Errno 111] Connection
[04:37:45] <icinga-wm>	 ://wikitech.wikimedia.org/wiki/Search%23Administration
[04:57:13] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:57:29] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:58:37] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[05:03:25] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: number_of_nodes: 6, initializing_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, cluster_name: production-logstash-codfw, unassigned_shards: 0, active_shards: 862, active_primary_shards: 456, delayed_unassigned_shards: 0, number_of_data_nodes: 3, timed_out: False, task_max_waiting_in_
[05:03:25] <icinga-wm>	 active_shards_percent_as_number: 100.0, status: green, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:04:21] <icinga-wm>	 RECOVERY - Check systemd state on logstash2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:35:36] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui) Thanks Rob, es1032 looks good now: ` Name                :Virtual Disk 0 RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier...
[05:36:59] <wikibugs>	 Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Beeswaxcandle) >>! In T257066#6576540, @FordPrefect42 wrote: > Since a couple of days, saving any art...
[05:37:16] <wikibugs>	 Operations, ops-eqiad, DC-Ops: eqiad: Spare Drive Onsite for db1091 - https://phabricator.wikimedia.org/T266988 (Marostegui) Open→Resolved Thanks for checking, I will decommission this host. There is no point spending time on it if its replacement will arrive "soon".
[05:37:19] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui)
[05:38:13] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui)
[05:45:39] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:47:32] <wikibugs>	 Operations, DBA, Performance-Team, Platform Engineering, and 2 others: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (nnikkhoui) Thanks @ArielGlenn! I put a really simple/generic blurb in the attached patchset, please feel free to comment/amend however you th...
[05:48:31] <icinga-wm>	 PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops
[05:51:57] <icinga-wm>	 RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops
[05:55:19] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize es1026 [puppet] - https://gerrit.wikimedia.org/r/638309 (https://phabricator.wikimedia.org/T261717)
[05:56:07] <wikibugs>	 (PS2) Marostegui: mariadb: Productionize es1026 [puppet] - https://gerrit.wikimedia.org/r/638309 (https://phabricator.wikimedia.org/T261717)
[05:57:15] <wikibugs>	 (PS3) Marostegui: mariadb: Productionize es1026 [puppet] - https://gerrit.wikimedia.org/r/638309 (https://phabricator.wikimedia.org/T261717)
[05:57:45] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize es1026 [puppet] - https://gerrit.wikimedia.org/r/638309 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:00:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1015 to es2 master (this is a noop) T261717', diff saved to https://phabricator.wikimedia.org/P13131 and previous config saved to /var/cache/conftool/dbconfig/20201103-060038-marostegui.json
[06:00:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:47] <stashbot>	 T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:00:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1011 to reclone es1026 T261717', diff saved to https://phabricator.wikimedia.org/P13132 and previous config saved to /var/cache/conftool/dbconfig/20201103-060054-marostegui.json
[06:00:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:08] <marostegui>	 !log Stop MySQL on es1011 to clone es1026 T261717
[06:04:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1018 to es1 master (this is a noop) T261717', diff saved to https://phabricator.wikimedia.org/P13133 and previous config saved to /var/cache/conftool/dbconfig/20201103-060705-marostegui.json
[06:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:12] <stashbot>	 T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:07:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1012 to reclone es1027 T261717', diff saved to https://phabricator.wikimedia.org/P13134 and previous config saved to /var/cache/conftool/dbconfig/20201103-060727-marostegui.json
[06:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:45] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize es1027 [puppet] - https://gerrit.wikimedia.org/r/638312 (https://phabricator.wikimedia.org/T261717)
[06:10:42] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize es1027 [puppet] - https://gerrit.wikimedia.org/r/638312 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:11:52] <marostegui>	 !log Stop MySQL on es1012 to clone es1027 T261717
[06:11:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1019 to es3 master (this is a noop) T261717', diff saved to https://phabricator.wikimedia.org/P13135 and previous config saved to /var/cache/conftool/dbconfig/20201103-061403-marostegui.json
[06:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:10] <stashbot>	 T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:14:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1014 to reclone es1028 T261717', diff saved to https://phabricator.wikimedia.org/P13136 and previous config saved to /var/cache/conftool/dbconfig/20201103-061423-marostegui.json
[06:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:28] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize es1028 [puppet] - https://gerrit.wikimedia.org/r/638317 (https://phabricator.wikimedia.org/T261717)
[06:16:40] <marostegui>	 !log Stop MySQL on es1014 to clone es1028 T261717
[06:16:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:21] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize es1028 [puppet] - https://gerrit.wikimedia.org/r/638317 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:26:00] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Marostegui) Thanks. RAID (level and stripe size), memory and CPU looks good.
[06:39:24] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:40:44] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:43:11] <wikibugs>	 Operations, DBA, Performance-Team, Platform Engineering, and 2 others: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (ArielGlenn) >>! In T267077#6598886, @nnikkhoui wrote: > Thanks @ArielGlenn! I put a really simple/generic blurb in the attached patchset, ple...
[06:46:35] <marostegui>	 !log Deploy schema change on s1 codfw master: T265349
[06:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:42] <stashbot>	 T265349: querycache qc_type and qc_title have different nullabality on s1 only - https://phabricator.wikimedia.org/T265349
[06:52:48] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:56:41] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db1091 from dbctl [puppet] - https://gerrit.wikimedia.org/r/638343 (https://phabricator.wikimedia.org/T267088)
[06:57:18] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db1091 from dbctl [puppet] - https://gerrit.wikimedia.org/r/638343 (https://phabricator.wikimedia.org/T267088) (owner: Marostegui)
[06:57:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1091 from dbctl T267088', diff saved to https://phabricator.wikimedia.org/P13137 and previous config saved to /var/cache/conftool/dbconfig/20201103-065756-marostegui.json
[06:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:03] <stashbot>	 T267088: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088
[06:58:11] <wikibugs>	 (CR) Elukey: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[06:59:12] <marostegui>	 !log Remove db1091 from tendril and zarcillo T267088
[06:59:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:01] <wikibugs>	 (CR) Elukey: [C: +1] oozie: Add admin groups for authorization (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[07:00:35] <wikibugs>	 (CR) Marostegui: [C: +1] profile::analytics::database::meta: specify max_connections for mariadb [puppet] - https://gerrit.wikimedia.org/r/638148 (owner: Elukey)
[07:01:40] <elukey>	 marostegui: <3
[07:01:45] <elukey>	 good morning
[07:01:52] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[07:01:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::analytics::database::meta: specify max_connections for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/638148 (owner: 10Elukey)
[07:03:22] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.034 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[07:07:54] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove puppet entries for db1091 [puppet] - 10https://gerrit.wikimedia.org/r/638352 (https://phabricator.wikimedia.org/T267088)
[07:12:43] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) Reporting a conversation with Chris over email about when to do the maintenance:  > The 4th works for me at 1130EST  If pos...
[07:19:26] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:23:24] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[07:33:09] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Turn off airflow scheduler during db downtime [puppet] - 10https://gerrit.wikimedia.org/r/638207 (owner: 10Ebernhardson)
[07:41:57] <wikibugs>	 10Operations, 10Traffic, 10netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (10Peachey88) For information required to report connectivity issues: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue
[07:51:35] <wikibugs>	 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[07:52:12] <wikibugs>	 (03CR) 10Kosta Harlan: "Thanks for the changes, Jeena." [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[07:52:58] <wikibugs>	 10Operations, 10DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Marostegui)
[07:56:27] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Remove old-style helmfile structure from CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/638415 (https://phabricator.wikimedia.org/T258572)
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201103T0800)
[08:04:22] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:04:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove old-style helmfile structure from CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/638415 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[08:05:12] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:47] <wikibugs>	 (03Merged) 10jenkins-bot: Remove old-style helmfile structure from CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/638415 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[08:06:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/638019 (https://phabricator.wikimedia.org/T266995) (owner: 10Gehel)
[08:13:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: configure memcached size via hiera [puppet] - 10https://gerrit.wikimedia.org/r/638110 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi)
[08:13:23] <wikibugs>	 (03PS2) 10Filippo Giunchedi: thanos: configure memcached size via hiera [puppet] - 10https://gerrit.wikimedia.org/r/638110 (https://phabricator.wikimedia.org/T261281)
[08:22:11] <wikibugs>	 (03PS2) 10Filippo Giunchedi: thanos: use systemd overrides for query/store/compact [puppet] - 10https://gerrit.wikimedia.org/r/638036 (https://phabricator.wikimedia.org/T261281)
[08:22:13] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: re-enable compaction by default [puppet] - 10https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281)
[08:22:15] <wikibugs>	 (03PS2) 10Filippo Giunchedi: thanos: add query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281)
[08:22:17] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: add thanos query-frontend jobs [puppet] - 10https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281)
[08:22:19] <wikibugs>	 (03PS2) 10Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281)
[08:22:21] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - 10https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281)
[08:24:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: use systemd overrides for query/store/compact [puppet] - 10https://gerrit.wikimedia.org/r/638036 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi)
[08:27:52] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:28:28] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[08:28:40] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:03] <godog>	 I'm re-enabling compactions in Prometheus on the fleet; expect alerts about Prometheus having been restarted
[08:31:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: re-enable compaction by default [puppet] - 10https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi)
[08:32:33] <godog>	 !log Prometheus re-enable compactions - T261281
[08:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:40] <stashbot>	 T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281
[08:35:53] <wikibugs>	 (03PS5) 10Kormat: mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - 10https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635)
[08:36:17] <wikibugs>	 (03PS3) 10Kormat: orchestrator: Support running as non-root [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763)
[08:36:19] <wikibugs>	 (03PS10) 10Kormat: orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657)
[08:37:45] <kormat>	 godog: speaking of prom, how much history do we keep?
[08:39:51] <godog>	 kormat: ATM ~5-6 months locally depending on the Prometheus instance, and "as space allows" in Thanos, but currently 60w
[08:40:53] * kormat nods
[08:41:09] <godog>	 with the plan being to reduce local storage if we don't really need to keep it, or the space is scarce, etc.
[08:41:17] <kormat>	 👍
[08:41:31] <kormat>	 does thanos currently have 60w of data?
[08:41:56] <godog>	 not yet no, we started uploading data there around June I think
[08:41:56] <icinga-wm>	 PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[08:42:16] <icinga-wm>	 PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[08:42:40] <kormat>	 godog: gotcha. thanks :)
[08:42:50] <wikibugs>	 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10jcrespo) A reminder that T195578 is waiting for feedback to see if it would be useful to gather query performance statistics.
[08:42:56] <icinga-wm>	 PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[08:43:08] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) Thanks a lot for the insights! So few comments:  * As far of current hadoop workers/rows distribution, we have: 13 in A, 14 in B, 19 in C, 1...
[08:43:10] <icinga-wm>	 PROBLEM - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops
[08:43:26] <icinga-wm>	 PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[08:43:28] <godog>	 kormat: np!
[08:43:59] <godog>	 I'll silence these alerts
[08:44:22] <wikibugs>	 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10hashar) Works for me as well.
[08:50:22] <icinga-wm>	 PROBLEM - Thanos query has high gRPC client errors on alert1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[08:51:36] <godog>	 also expected ^
[08:54:46] <icinga-wm>	 PROBLEM - Thanos sidecar cannot connect to Prometheus on alert1001 is CRITICAL: cluster=prometheus instance=prometheus1004 job=thanos-sidecar prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[08:56:02] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1036 site=eqiad tunnel=mc2036_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[08:56:23] <wikibugs>	 10Operations, 10observability: VictorOps ~5min delay from email received to incident paging - https://phabricator.wikimedia.org/T266800 (10fgiunchedi) >>! In T266800#6596466, @Volans wrote: > @fgiunchedi should we consider converting out transport from email to API calls at this point? Should give us an immedi...
[08:58:08] <icinga-wm>	 RECOVERY - Thanos sidecar cannot connect to Prometheus on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[08:58:48] <icinga-wm>	 RECOVERY - Thanos query has high gRPC client errors on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[09:00:10] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:11] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:00:13] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P13138 and previous config saved to /var/cache/conftool/dbconfig/20201103-090013-kormat.json
[09:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat)
[09:02:28] <wikibugs>	 (03CR) 10Marostegui: orchestrator: Support running as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat)
[09:03:59] <wikibugs>	 (03CR) 10Kormat: orchestrator: Support running as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat)
[09:05:10] <icinga-wm>	 RECOVERY - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops
[09:05:23] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T261389', diff saved to https://phabricator.wikimedia.org/P13139 and previous config saved to /var/cache/conftool/dbconfig/20201103-090523-kormat.json
[09:05:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:29] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:06:29] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:26] <wikibugs>	 (03PS1) 10Ayounsi: Core routing for WMCS via cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/638425 (https://phabricator.wikimedia.org/T265288)
[09:10:04] <wikibugs>	 10Operations, 10Traffic, 10netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (10Aklapper) 05Open→03Stalled
[09:10:45] <wikibugs>	 (03CR) 10Ayounsi: "185.15.56.0/24: BGP_aggregate_contributors  # WMCS eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/638425 (https://phabricator.wikimedia.org/T265288) (owner: 10Ayounsi)
[09:10:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Core routing for WMCS via cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/638425 (https://phabricator.wikimedia.org/T265288) (owner: 10Ayounsi)
[09:11:17] <wikibugs>	 (03Merged) 10jenkins-bot: Core routing for WMCS via cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/638425 (https://phabricator.wikimedia.org/T265288) (owner: 10Ayounsi)
[09:11:26] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "The buster logic for the component should be kept." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636101 (owner: 10Dzahn)
[09:12:41] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:12:42] <wikibugs>	 (03PS2) 10Volans: netbox: add dependency on python3-wmflib [puppet] - 10https://gerrit.wikimedia.org/r/636969
[09:12:42] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:36] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10jcrespo) >>! In T234854#6440649, @colewhite wrote: > @jcrespo Thanks for bringing this to our attention.  The filters on that dashboard indicate they are broken because the fi...
[09:14:30] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Aklapper) +#Wikimedia-Incident if this was considered an incident
[09:17:00] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] orchestrator: Support running as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat)
[09:17:30] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] orchestrator: Support running as non-root [puppet] - 10https://gerrit.wikimedia.org/r/637693 (https://phabricator.wikimedia.org/T266763) (owner: 10Kormat)
[09:17:42] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] orchestrator: Support sqlite backend [puppet] - 10https://gerrit.wikimedia.org/r/637684 (https://phabricator.wikimedia.org/T266657) (owner: 10Kormat)
[09:17:42] <icinga-wm>	 RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[09:18:20] <icinga-wm>	 RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[09:18:27] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:18:28] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:48] <icinga-wm>	 RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[09:19:06] <icinga-wm>	 RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[09:19:16] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Self-merging to unbock related patch, trivial package addition." [puppet] - 10https://gerrit.wikimedia.org/r/636969 (owner: 10Volans)
[09:21:32] <wikibugs>	 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10PLMFXA) https://www.fxa.dk
[09:26:23] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[09:35:01] <wikibugs>	 (03PS1) 10Kormat: debian: Fix templates location in .deb [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638427
[09:40:43] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] debian: Fix templates location in .deb [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638427 (owner: 10Kormat)
[09:40:56] <wikibugs>	 (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Fix templates location in .deb [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638427 (owner: 10Kormat)
[09:49:03] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:49:04] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:49:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:19] <wikibugs>	 (03PS1) 10Kormat: debian: Fix release name. [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638432
[09:50:27] <wikibugs>	 (03CR) 10Volans: [C: 03+2] dns: add retry logic to all Netbox API calls [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 (owner: 10Volans)
[09:50:38] <wikibugs>	 (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Fix release name. [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/638432 (owner: 10Kormat)
[09:54:16] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:54:17] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:10] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572)
[09:56:12] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572)
[09:57:09] <kormat>	 !log uploaded orchestrator 3.2.3-2 to apt
[09:57:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:56] <wikibugs>	 (03PS1) 10Kormat: orchestrator: Drop the obsolete thirdparty/orchestrator component. [puppet] - 10https://gerrit.wikimedia.org/r/638443 (https://phabricator.wikimedia.org/T266763)
[10:05:46] <icinga-wm>	 PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.457e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[10:07:15] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572)
[10:07:17] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572)
[10:08:11] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[10:08:11] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:43] <logmsgbot>	 !log elukey@deploy1001 Started deploy [analytics/refinery@cf5db74] (hadoop-test): (no justification provided)
[10:11:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:47] <wikibugs>	 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Kormat)
[10:12:50] <wikibugs>	 10Operations, 10DBA, 10Orchestrator, 10Patch-For-Review: Repackage orchestrator - https://phabricator.wikimedia.org/T266763 (10Kormat) 05Open→03Resolved a:03Kormat Done and deployed.
[10:13:04] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (10Kormat)
[10:13:12] <wikibugs>	 10Operations, 10DBA, 10Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (10Kormat) 05Open→03Resolved a:03Kormat
[10:13:28] <logmsgbot>	 !log elukey@deploy1001 Finished deploy [analytics/refinery@cf5db74] (hadoop-test): (no justification provided) (duration: 01m 45s)
[10:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:16] <logmsgbot>	 !log elukey@deploy1001 Started deploy [analytics/refinery@cf5db74] (hadoop-test): (no justification provided)
[10:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:35] <wikibugs>	 (03PS2) 10Aklapper: phabricator weekly changes email: List stalled task stalled for years [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522)
[10:15:07] <wikibugs>	 10Operations, 10Puppet: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10jbond) Thanks @akosiaris sounds good to me ill try to keep a specific eye out for missing dependencies
[10:15:28] <wikibugs>	 (03CR) 10Aklapper: "Garr, one day I'll learn this, sorry! (PS: No need to post any query results; I have CLI access to run SQL queries on the DB)" [puppet] - 10https://gerrit.wikimedia.org/r/638163 (https://phabricator.wikimedia.org/T252522) (owner: 10Aklapper)
[10:16:03] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572)
[10:16:05] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572)
[10:16:31] <logmsgbot>	 !log elukey@deploy1001 Finished deploy [analytics/refinery@cf5db74] (hadoop-test): (no justification provided) (duration: 02m 15s)
[10:16:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:15] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572)
[10:19:17] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572)
[10:21:38] <logmsgbot>	 !log gilles@deploy1001 Started deploy [performance/asoranking@2a2cb05]: T266985
[10:21:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:44] <stashbot>	 T266985: asoranking failed its monthly run stat1007 - https://phabricator.wikimedia.org/T266985
[10:21:49] <wikibugs>	 (03PS4) 10Jbond: java: add new java version facts [puppet] - 10https://gerrit.wikimedia.org/r/636924
[10:22:03] <logmsgbot>	 !log gilles@deploy1001 Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 26s)
[10:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] java: add new java version facts [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond)
[10:22:43] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond)
[10:23:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] "Needs an update to spec helper to fix tests, overriding" [puppet] - 10https://gerrit.wikimedia.org/r/636924 (owner: 10Jbond)
[10:29:21] <wikibugs>	 (03PS1) 10Ladsgroup: ores: Reduce number of requests that trigger a restart of uwsgi worker [puppet] - 10https://gerrit.wikimedia.org/r/638467 (https://phabricator.wikimedia.org/T263910)
[10:36:06] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[10:37:10] <wikibugs>	 (03PS3) 10Aklapper: puppet-agent: remove --show_diff from scheduled puppet-run script [puppet] - 10https://gerrit.wikimedia.org/r/434719 (https://phabricator.wikimedia.org/T1) (owner: 10Herron)
[10:37:47] <wikibugs>	 10Operations, 10DBA, 10Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (10Kormat)
[10:38:21] <wikibugs>	 10Operations, 10DBA, 10Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (10Kormat) When this is fixed, we also need to clean up the existing db:  ` root@db2093.codfw.wmnet[orchestrator]> select * from orchestrator_db_deployments; +------------------+-------...
[10:39:04] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:39:21] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[10:39:22] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:56] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:43:37] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: cleanup the old layout [puppet] - 10https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572)
[10:43:39] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - 10https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572)
[10:45:12] <logmsgbot>	 !log gilles@deploy1001 Started deploy [performance/asoranking@2a2cb05]: T266985
[10:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:17] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Nice. LGTM!" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 (owner: 10Jeena Huneidi)
[10:45:19] <stashbot>	 T266985: asoranking failed its monthly run stat1007 - https://phabricator.wikimedia.org/T266985
[10:45:19] <logmsgbot>	 !log gilles@deploy1001 Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 07s)
[10:45:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:50] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:50:50] <wikibugs>	 (CR) Ema: [C: -1] "The change looks good to me, though it breaks ./modules/varnish/files/tests/text/09-analytics-cookies.vtc - please update the VTC test too" [puppet] - https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967) (owner: Ladsgroup)
[10:57:30] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:57:50] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[10:58:20] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:46] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[10:59:47] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:59:51] <wikibugs>	 (PS2) Ladsgroup: varnish: Replace "Expires" in Set-Cookie with "Max-Age" [puppet] - https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967)
[10:59:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:03] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/638443 (https://phabricator.wikimedia.org/T266763) (owner: Kormat)
[11:01:17] <wikibugs>	 (CR) Kormat: [C: +2] orchestrator: Drop the obsolete thirdparty/orchestrator component. [puppet] - https://gerrit.wikimedia.org/r/638443 (https://phabricator.wikimedia.org/T266763) (owner: Kormat)
[11:03:15] <Amir1>	 !log rolling restart of ores
[11:03:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:58] <wikibugs>	 (CR) Arturo Borrero Gonzalez: set cpu_model_extra_flags = vmx,pcid (1 comment) [puppet] - https://gerrit.wikimedia.org/r/638146 (owner: Ahmon Dancy)
[11:04:35] <wikibugs>	 (CR) Ladsgroup: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967) (owner: Ladsgroup)
[11:05:44] <wikibugs>	 (CR) JMeybohm: [C: -1] "This does not work in CI. Probably writes to an RO filesystem or so" [deployment-charts] - https://gerrit.wikimedia.org/r/638025 (owner: JMeybohm)
[11:08:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] profile::kubernetes::deployment_server: cleanup the old layout [puppet] - https://gerrit.wikimedia.org/r/638438 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[11:17:33] <wikibugs>	 (PS6) Giuseppe Lavagetto: profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572)
[11:17:43] <wikibugs>	 (PS1) Jbond: facter: java_version fix nil check on java_version [puppet] - https://gerrit.wikimedia.org/r/638473
[11:18:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] facter: java_version fix nil check on java_version [puppet] - https://gerrit.wikimedia.org/r/638473 (owner: Jbond)
[11:18:18] <wikibugs>	 (PS1) Hnowlan: Revert "Isolate eqiad master maps1004 from cluster" [puppet] - https://gerrit.wikimedia.org/r/638454
[11:19:25] <wikibugs>	 (PS2) Hnowlan: Revert "Isolate eqiad master maps1004 from cluster" [puppet] - https://gerrit.wikimedia.org/r/638454
[11:19:28] <wikibugs>	 (PS9) Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/636923
[11:19:53] <wikibugs>	 (PS2) Jbond: facter: java_version fix nil check on java_version [puppet] - https://gerrit.wikimedia.org/r/638473
[11:20:15] <wikibugs>	 (CR) Hnowlan: [C: +2] Revert "Isolate eqiad master maps1004 from cluster" [puppet] - https://gerrit.wikimedia.org/r/638454 (owner: Hnowlan)
[11:20:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] profile::kubernetes::deployment_server::helmfile: simplify code [puppet] - https://gerrit.wikimedia.org/r/638439 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[11:21:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/636923 (owner: Jbond)
[11:23:28] <wikibugs>	 (CR) Jbond: [C: +2] facter: java_version fix nil check on java_version [puppet] - https://gerrit.wikimedia.org/r/638473 (owner: Jbond)
[11:23:33] <hnowlan>	 !log resyncing postgres replica maps1001
[11:23:36] <wikibugs>	 (PS3) Jbond: facter: java_version fix nil check on java_version [puppet] - https://gerrit.wikimedia.org/r/638473
[11:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:37] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[11:23:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:42] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] facter: java_version fix nil check on java_version [puppet] - https://gerrit.wikimedia.org/r/638473 (owner: Jbond)
[11:23:47] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[11:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:15] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[11:29:16] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:30] <wikibugs>	 (PS10) Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/636923
[11:32:32] <wikibugs>	 (PS1) Jbond: Rakefile: exclude stdlibg and lvm from our spec tests [puppet] - https://gerrit.wikimedia.org/r/638476
[11:33:15] <wikibugs>	 (PS2) Jbond: Rakefile: exclude stdlib and lvm from our spec tests [puppet] - https://gerrit.wikimedia.org/r/638476
[11:34:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/636923 (owner: Jbond)
[11:35:54] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Makes sense" [puppet] - https://gerrit.wikimedia.org/r/638476 (owner: Jbond)
[11:38:25] <wikibugs>	 (CR) Jbond: [C: +2] Rakefile: exclude stdlib and lvm from our spec tests [puppet] - https://gerrit.wikimedia.org/r/638476 (owner: Jbond)
[11:41:26] <wikibugs>	 (CR) Giuseppe Lavagetto: Add apache httpd base image (8 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto)
[11:46:40] <icinga-wm>	 RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.03488 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[11:51:05] <logmsgbot>	 !log gilles@deploy1001 Started deploy [performance/asoranking@2a2cb05]: T266985
[11:51:09] <logmsgbot>	 !log gilles@deploy1001 Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 03s)
[11:51:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:11] <stashbot>	 T266985: asoranking failed its monthly run stat1007 - https://phabricator.wikimedia.org/T266985
[11:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:24] <moritzm>	 !log running "reprepro clearvanished" to prune thirdparty/orchestrator
[11:57:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:24] <moritzm>	 !log imported php-apcu/php-geoip/php-imagick/php-mailparse to component/php72 for buster-wikimedia
[11:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:33] <arturo>	 test
[12:00:20] <wikibugs>	 (PS1) MSantos: replicate osm twice a day like codfw cluster [puppet] - https://gerrit.wikimedia.org/r/638481
[12:00:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] replicate osm twice a day like codfw cluster [puppet] - https://gerrit.wikimedia.org/r/638481 (owner: MSantos)
[12:01:22] <wikibugs>	 (CR) Hnowlan: "This seems okay but please hold off on merging until all replicas in eqiad have resynced (in progress atm)" [puppet] - https://gerrit.wikimedia.org/r/638481 (owner: MSantos)
[12:15:54] <wikibugs>	 (CR) JMeybohm: Add apache httpd base image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto)
[12:19:47] <wikibugs>	 Operations, LDAP-Access-Requests: Add msantos to wmf LDAP group - https://phabricator.wikimedia.org/T267125 (hnowlan)
[12:27:40] <wikibugs>	 (PS2) MSantos: replicate osm twice a day like codfw cluster [puppet] - https://gerrit.wikimedia.org/r/638481
[12:27:59] <wikibugs>	 Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-herron: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (Aklapper)
[12:28:08] <wikibugs>	 (CR) MSantos: [C: -1] "Hold while re-syncing" [puppet] - https://gerrit.wikimedia.org/r/638481 (owner: MSantos)
[12:40:48] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[12:40:49] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:23] <wikibugs>	 Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (AlexisJazz) >>! In T267089#6599101, @Peachey88 wrote: > For information required to report connectivity issues: https://wikitech.wikimedia.org/wiki/Reporting_a_con...
[12:57:59] <wikibugs>	 (PS1) Kormat: debian: Set version field in orchestrator binary [debs/orchestrator] - https://gerrit.wikimedia.org/r/638499 (https://phabricator.wikimedia.org/T267113)
[13:02:37] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[13:02:37] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:23] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [debs/orchestrator] - https://gerrit.wikimedia.org/r/638499 (https://phabricator.wikimedia.org/T267113) (owner: Kormat)
[13:04:59] <wikibugs>	 (CR) Kormat: [V: +2 C: +2] debian: Set version field in orchestrator binary [debs/orchestrator] - https://gerrit.wikimedia.org/r/638499 (https://phabricator.wikimedia.org/T267113) (owner: Kormat)
[13:05:01] <wikibugs>	 (PS1) Elukey: Change analytics-test-hive CNAME to test failover [dns] - https://gerrit.wikimedia.org/r/638502 (https://phabricator.wikimedia.org/T257412)
[13:11:06] <wikibugs>	 (CR) Elukey: [C: +2] Change analytics-test-hive CNAME to test failover [dns] - https://gerrit.wikimedia.org/r/638502 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[13:12:00] <wikibugs>	 (PS3) Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281)
[13:12:02] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281)
[13:14:02] <wikibugs>	 Operations, DBA, Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (Kormat) Fix deployed, and db cleaned up:  ` root@db2093.codfw.wmnet[orchestrator]> delete from orchestrator_db_deployments where deployed_version="" limit 1; Query OK, 1 row affected...
[13:14:04] <wikibugs>	 Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (Aklapper) See the list on https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue : A `traceroute` for example, if this is still a problem. You wrote "T...
[13:14:14] <wikibugs>	 Operations, DBA, Orchestrator, Patch-For-Review, User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (Kormat)
[13:14:17] <wikibugs>	 Operations, DBA, Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (Kormat) Open→Resolved a:Kormat
[13:18:36] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:18:44] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:18:44] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:20:14] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:20:14] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:20:14] <icinga-wm>	 PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:22:03] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[13:22:03] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:20] <wikibugs>	 (CR) Giuseppe Lavagetto: Add an httpd-fcgi image (9 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto)
[13:23:28] <icinga-wm>	 RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:23:30] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:23:42] <wikibugs>	 (PS7) Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324)
[13:23:44] <wikibugs>	 (PS5) Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324)
[13:24:20] <wikibugs>	 (PS1) Elukey: Revert "Change analytics-test-hive CNAME to test failover" [dns] - https://gerrit.wikimedia.org/r/638511
[13:24:22] <wikibugs>	 (PS1) Jbond: logrotate: migrate spec tests to shared helper [puppet] - https://gerrit.wikimedia.org/r/638505
[13:24:53] <logmsgbot>	 !log lsobanski@cumin1001 START - Cookbook sre.hosts.decommission
[13:24:53] <wikibugs>	 (CR) Elukey: [C: +2] Revert "Change analytics-test-hive CNAME to test failover" [dns] - https://gerrit.wikimedia.org/r/638511 (owner: Elukey)
[13:24:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:58] <icinga-wm>	 PROBLEM - DPKG on serpens is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:32:56] <wikibugs>	 (CR) Marostegui: [C: +1] "Lukasz, this looks good to me." [puppet] - https://gerrit.wikimedia.org/r/638352 (https://phabricator.wikimedia.org/T267088) (owner: Marostegui)
[13:33:23] <logmsgbot>	 !log lsobanski@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[13:33:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:32] <wikibugs>	 (CR) LSobanski: [C: +2] mariadb: Remove puppet entries for db1091 [puppet] - https://gerrit.wikimedia.org/r/638352 (https://phabricator.wikimedia.org/T267088) (owner: Marostegui)
[13:34:14] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[13:34:16] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:34:38] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[13:34:38] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:06] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:19] <wikibugs>	 Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (CDanis) >>! In T267089#6599101, @Peachey88 wrote: > For information required to report connectivity issues: https://wikitech.wikimedia.org/wiki/Reporting_a_connect...
[13:39:40] <wikibugs>	 Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (CDanis) Stalled→Resolved a:CDanis Thanks for the report @AlexisJazz.  There was a known issue that occurred at the esams edge site during the time wind...
[13:43:47] <sobanski>	 !log Removing db1091 from tendril and zarcillo T267088
[13:43:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:54] <stashbot>	 T267088: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088
[13:46:11] <wikibugs>	 Operations, Wikimedia-Etherpad, Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (Dzahn) Open→Resolved
[13:46:48] <icinga-wm>	 PROBLEM - DPKG on seaborgium is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:48:44] <wikibugs>	 (CR) Kormat: [C: +2] mariadb: (Ab)use wsrep_cluster_name for DC name [puppet] - https://gerrit.wikimedia.org/r/637702 (https://phabricator.wikimedia.org/T266635) (owner: Kormat)
[13:50:08] <wikibugs>	 (PS1) Ayounsi: Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533)
[13:51:18] <wikibugs>	 (PS1) Jbond: rsync: update spec tests to use shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638530
[13:51:20] <wikibugs>	 (PS1) Jbond: aptrepo: remove old spec_helper file [puppet] - https://gerrit.wikimedia.org/r/638531
[13:51:22] <wikibugs>	 (PS1) Jbond: base: clean up old spec helper and fix broken spec tests [puppet] - https://gerrit.wikimedia.org/r/638532
[13:53:42] <wikibugs>	 (PS1) Jbond: apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534
[13:53:44] <wikibugs>	 (PS2) Ayounsi: Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533)
[13:53:53] <moritzm>	 !log imported php-mongodb/php-wmerrors/wikidiff2 to component/php72 for buster-wikimedia
[13:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:05] <wikibugs>	 (PS3) Ayounsi: Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533)
[13:55:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534 (owner: Jbond)
[13:56:36] <wikibugs>	 (PS3) Filippo Giunchedi: thanos: add query-frontend [puppet] - https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281)
[13:56:38] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: add thanos query-frontend jobs [puppet] - https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281)
[13:56:40] <wikibugs>	 (PS4) Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281)
[13:56:42] <wikibugs>	 (PS4) Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281)
[13:57:21] <wikibugs>	 (PS1) Jbond: mirrors: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/638536
[13:57:34] <wikibugs>	 (CR) Jbond: [C: +2] logrotate: migrate spec tests to shared helper [puppet] - https://gerrit.wikimedia.org/r/638505 (owner: Jbond)
[13:57:39] <wikibugs>	 (PS2) Marostegui: orchestrator.conf: Add DetectDataCenterQuery to detect DC [puppet] - https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635)
[13:57:41] <wikibugs>	 (CR) Jbond: [C: +2] rsync: update spec tests to use shared spec_helper [puppet] - https://gerrit.wikimedia.org/r/638530 (owner: Jbond)
[13:57:43] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:57:44] <wikibugs>	 (CR) Jbond: [C: +2] aptrepo: remove old spec_helper file [puppet] - https://gerrit.wikimedia.org/r/638531 (owner: Jbond)
[13:57:49] <wikibugs>	 (CR) Jbond: [C: +2] base: clean up old spec helper and fix broken spec tests [puppet] - https://gerrit.wikimedia.org/r/638532 (owner: Jbond)
[13:58:03] <icinga-wm>	 RECOVERY - DPKG on serpens is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:59:09] <wikibugs>	 (CR) Marostegui: [C: +2] orchestrator.conf: Add DetectDataCenterQuery to detect DC [puppet] - https://gerrit.wikimedia.org/r/637715 (https://phabricator.wikimedia.org/T266635) (owner: Marostegui)
[13:59:41] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[14:01:36] <wikibugs>	 (PS2) Jbond: apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534
[14:01:38] <wikibugs>	 (PS1) Jbond: java: commentout facte spec tests untill we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538
[14:01:53] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:02:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] java: commentout facte spec tests untill we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 (owner: Jbond)
[14:03:18] <wikibugs>	 (PS2) Jbond: java: comment out fact spec tests until we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538
[14:03:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] java: comment out fact spec tests until we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 (owner: Jbond)
[14:04:05] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[14:04:06] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:04:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:54] <wikibugs>	 (CR) Volans: [C: -1] "nits inline, looks good otherwise" (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) (owner: Ayounsi)
[14:09:48] <wikibugs>	 (PS3) Jbond: java: comment out fact spec tests until we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538
[14:11:52] <wikibugs>	 (PS3) Jbond: apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534
[14:12:00] <wikibugs>	 (PS2) Jbond: mirrors: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/638536
[14:12:56] <wikibugs>	 (CR) Jbond: [C: +2] java: comment out fact spec tests until we move to rspec-mock [puppet] - https://gerrit.wikimedia.org/r/638538 (owner: Jbond)
[14:13:35] <wikibugs>	 (CR) Jbond: [C: +2] apt: move to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638534 (owner: Jbond)
[14:13:38] <wikibugs>	 (CR) Jbond: [C: +2] mirrors: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/638536 (owner: Jbond)
[14:14:23] <wikibugs>	 (PS1) Jbond: network: migrate spec tests to shared helper [puppet] - https://gerrit.wikimedia.org/r/638543
[14:14:43] <wikibugs>	 (PS4) Ayounsi: Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533)
[14:16:15] <wikibugs>	 Operations, Traffic, netops: Was unable to connect to most Wikimedia sites for a few minutes - https://phabricator.wikimedia.org/T267089 (AlexisJazz) >>! In T267089#6600271, @CDanis wrote: > Thanks for the report @AlexisJazz. >  > There was a known issue that occurred at the esams edge site during th...
[14:16:21] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) (owner: Ayounsi)
[14:17:19] <icinga-wm>	 RECOVERY - DPKG on seaborgium is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[14:17:36] <wikibugs>	 (CR) Ayounsi: [C: +2] Don't alert on missing cable ID for servers uplinks in eqiad/codfw [software/netbox-extras] - https://gerrit.wikimedia.org/r/638529 (https://phabricator.wikimedia.org/T266533) (owner: Ayounsi)
[14:19:20] <wikibugs>	 (PS1) Jbond: backup: migrate specs to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638544
[14:19:22] <wikibugs>	 (PS1) Jbond: osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545
[14:19:24] <wikibugs>	 (PS1) Jbond: interface: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546
[14:20:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545 (owner: Jbond)
[14:20:38] <wikibugs>	 (PS2) Jbond: interface/java: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546
[14:20:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] backup: migrate specs to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638544 (owner: Jbond)
[14:21:51] <wikibugs>	 (PS2) Jbond: osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545
[14:22:09] <wikibugs>	 (PS3) Jbond: interface/java: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546
[14:22:31] <wikibugs>	 (CR) Jbond: [C: +2] network: migrate spec tests to shared helper [puppet] - https://gerrit.wikimedia.org/r/638543 (owner: Jbond)
[14:24:19] <wikibugs>	 (PS2) Jbond: backup: migrate specs to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638544
[14:26:39] <wikibugs>	 (CR) Jbond: [C: +2] backup: migrate specs to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638544 (owner: Jbond)
[14:26:45] <wikibugs>	 (CR) Jbond: [C: +2] osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545 (owner: Jbond)
[14:26:51] <wikibugs>	 (CR) Jbond: [C: +2] interface/java: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546 (owner: Jbond)
[14:27:06] <wikibugs>	 (PS3) Jbond: osm: switch to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638545
[14:27:14] <wikibugs>	 (PS4) Jbond: interface/java: remove unused spec files [puppet] - https://gerrit.wikimedia.org/r/638546
[14:29:15] <wikibugs>	 (PS1) Jbond: git: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638548
[14:32:59] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[14:33:04] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:33:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:22] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[14:35:23] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:35:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:23] <moritzm>	 !log imported php-apcu-bc/php-igbinary/tideways-xhprof to component/php72 for buster-wikimedia
[14:37:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:29] <wikibugs>	 (PS1) Volans: spicerack: add requests_session accessor [software/spicerack] - https://gerrit.wikimedia.org/r/638550
[14:37:30] <wikibugs>	 (PS1) Volans: Use wmflib.requests.http_session everywhere [software/spicerack] - https://gerrit.wikimedia.org/r/638551
[14:40:42] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (LSobanski) a:Marostegui→wiki_willy
[14:40:53] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (LSobanski) This host is ready for DC-Ops to decommission.
[14:59:35] <wikibugs>	 (CR) Jbond: [C: +2] git: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638548 (owner: Jbond)
[15:03:36] <wikibugs>	 (PS1) Jbond: monitoring: complete migration to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638559
[15:03:38] <wikibugs>	 (PS1) Jbond: graphite: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638560
[15:03:40] <wikibugs>	 (PS1) Jbond: jenkis: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638561
[15:03:42] <wikibugs>	 (PS1) Jbond: scap:migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638562
[15:05:31] <wikibugs>	 (CR) Jbond: [C: +2] monitoring: complete migration to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638559 (owner: Jbond)
[15:05:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] scap:migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638562 (owner: Jbond)
[15:05:35] <wikibugs>	 (CR) Jbond: [C: +2] graphite: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638560 (owner: Jbond)
[15:05:39] <wikibugs>	 (CR) Jbond: [C: +2] jenkis: migrate to shared spec helper [puppet] - https://gerrit.wikimedia.org/r/638561 (owner: Jbond)
[15:05:59] <wikibugs>	 (PS4) Filippo Giunchedi: thanos: add query-frontend [puppet] - https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281)
[15:06:01] <wikibugs>	 (PS4) Filippo Giunchedi: prometheus: add thanos query-frontend jobs [puppet] - https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281)
[15:06:03] <wikibugs>	 (PS5) Filippo Giunchedi: role: add query_frontend to thanos frontend [puppet] - https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281)
[15:06:05] <wikibugs>	 (PS5) Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281)
[15:07:46] <wikibugs>	 (03PS1) 10Jbond: apt: remove old spec helper file [puppet] - 10https://gerrit.wikimedia.org/r/638564
[15:08:20] <moritzm>	 !log imported php-redis/xdebug to component/php72 for buster-wikimedia
[15:08:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:16] <wikibugs>	 (03PS1) 10Hashar: gerrit: fix SonarQube report url discovery [puppet] - 10https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028)
[15:15:16] <wikibugs>	 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm)
[15:15:45] <wikibugs>	 10Operations, 10observability: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135 (10fgiunchedi)
[15:18:50] <wikibugs>	 (03PS1) 10Jbond: nginx: migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638568
[15:18:52] <wikibugs>	 (03PS1) 10Jbond: elasticsearch: migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638569
[15:18:54] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: clean up old spec files [puppet] - 10https://gerrit.wikimedia.org/r/638570
[15:18:56] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638571
[15:19:35] <icinga-wm>	 PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:34:43] <wikibugs>	 (03PS1) 10Jbond: zuul: migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638583
[15:35:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] apt: remove old spec helper file [puppet] - 10https://gerrit.wikimedia.org/r/638564 (owner: 10Jbond)
[15:35:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] nginx: migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638568 (owner: 10Jbond)
[15:36:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] elasticsearch: migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638569 (owner: 10Jbond)
[15:36:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: clean up old spec files [puppet] - 10https://gerrit.wikimedia.org/r/638570 (owner: 10Jbond)
[15:36:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638571 (owner: 10Jbond)
[15:36:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] zuul: migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638583 (owner: 10Jbond)
[15:40:31] <icinga-wm>	 PROBLEM - MariaDB Replica IO: analytics_meta on db1108 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on an-coord1001.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:41:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[15:41:41] <marostegui>	 elukey: ^
[15:41:43] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055)
[15:41:50] <elukey>	 marostegui: yep I am fixing
[15:41:54] <marostegui>	 <3
[15:45:18] <wikibugs>	 (03PS10) 10Hnowlan: postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820)
[15:47:07] <wikibugs>	 (03PS1) 10Jbond: nagios_common: migrate to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638589
[15:47:08] <wikibugs>	 (03PS1) 10Jbond: query_service: clean up unused spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638590
[15:47:11] <wikibugs>	 (03PS1) 10Jbond: contint: remove unused spec folder [puppet] - 10https://gerrit.wikimedia.org/r/638591
[15:47:13] <wikibugs>	 (03PS1) 10Jbond: cassandra: remove unused spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638592
[15:49:12] <wikibugs>	 (03PS1) 10JMeybohm: Update to 1.16.15 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/638595 (https://phabricator.wikimedia.org/T266766)
[15:49:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[15:49:25] <wikibugs>	 (03PS1) 10Jbond: Rakefile: add puppetdbquery to thirdparty modules [puppet] - 10https://gerrit.wikimedia.org/r/638596
[15:49:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] nagios_common: migrate to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638589 (owner: 10Jbond)
[15:49:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] query_service: clean up unused spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638590 (owner: 10Jbond)
[15:49:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] contint: remove unused spec folder [puppet] - 10https://gerrit.wikimedia.org/r/638591 (owner: 10Jbond)
[15:49:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cassandra: remove unused spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638592 (owner: 10Jbond)
[15:49:53] <wikibugs>	 10Operations, 10Maps, 10Wikimedia-Logstash, 10observability, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10MSantos) @fgiunchedi we've been working on fixing OSM replication recently in the eqiad cluster, so we blocked deployments fo...
[15:50:19] <icinga-wm>	 RECOVERY - MariaDB Replica IO: analytics_meta on db1108 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:50:38] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055)
[15:51:11] <wikibugs>	 (03PS1) 10Ayounsi: ImportPuppetDB should warn if it imported a SLAAC IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/638598 (https://phabricator.wikimedia.org/T265905)
[15:51:23] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[15:51:26] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:51:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Rakefile: add puppetdbquery to thirdparty modules [puppet] - 10https://gerrit.wikimedia.org/r/638596 (owner: 10Jbond)
[15:51:55] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[15:51:56] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: passwords: root-authorized-keys.erb: add dcaro CloudVPS-wide root key [labs/private] - 10https://gerrit.wikimedia.org/r/638601 (https://phabricator.wikimedia.org/T266068)
[15:53:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] passwords: root-authorized-keys.erb: add dcaro CloudVPS-wide root key [labs/private] - 10https://gerrit.wikimedia.org/r/638601 (https://phabricator.wikimedia.org/T266068) (owner: 10Arturo Borrero Gonzalez)
[15:54:24] <wikibugs>	 (03CR) 10Ayounsi: "Tested in netbox-next." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/638598 (https://phabricator.wikimedia.org/T265905) (owner: 10Ayounsi)
[15:54:45] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: use envoy 1.16.0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/637431 (owner: 10Hnowlan)
[15:56:26] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "🚀" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[15:57:19] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: use envoy 1.16.0 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/637431 (owner: 10Hnowlan)
[15:59:17] <elukey>	 !log shutdown kafka-jumbo1006 to replace 1G with 10G nic
[15:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:36] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[16:01:36] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[16:01:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:47] <wikibugs>	 (03PS2) 10Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid [puppet] - 10https://gerrit.wikimedia.org/r/638146
[16:02:44] <wikibugs>	 (03CR) 10Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy)
[16:02:55] <icinga-wm>	 PROBLEM - Host kafka-jumbo1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:03:59] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[16:03:59] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[16:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:07] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan)
[16:08:29] <icinga-wm>	 RECOVERY - Host kafka-jumbo1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms
[16:09:09] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[16:09:15] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 98 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[16:09:21] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 72 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[16:09:27] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 98 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[16:10:03] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 106 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[16:10:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/638598 (https://phabricator.wikimedia.org/T265905) (owner: 10Ayounsi)
[16:10:45] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 97 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[16:11:25] <wikibugs>	 (03PS1) 10Jbond: wmflib: migrate to use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638607
[16:11:26] <elukey>	 sorry for the spam, should solve in a bit
[16:12:09] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] ImportPuppetDB should warn if it imported a SLAAC IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/638598 (https://phabricator.wikimedia.org/T265905) (owner: 10Ayounsi)
[16:13:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib: migrate to use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638607 (owner: 10Jbond)
[16:13:42] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.network.cf
[16:13:43] <logmsgbot>	 !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[16:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:22] <wikibugs>	 (03PS3) 10Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid [puppet] - 10https://gerrit.wikimedia.org/r/638146
[16:15:54] <wikibugs>	 10Operations, 10ops-eqiad, 10Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (10Cmjohnson)
[16:16:02] <wikibugs>	 10Operations, 10ops-eqiad, 10Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (10Cmjohnson) 05Open→03Resolved
[16:18:45] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] postgres: set max connections in postgres based on replica count [puppet] - 10https://gerrit.wikimedia.org/r/638143 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan)
[16:19:15] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[16:19:23] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[16:20:43] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[16:20:49] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[16:21:10] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) Today I swapped the NIC on kafka-jumbo1006 with Chris and there was no need for /etc/network/interfaces changes, `firmware-...
[16:21:19] <wikibugs>	 (03PS1) 10Jbond: stronswan: remove spec tests for iprsolve as handled bu wmflib [puppet] - 10https://gerrit.wikimedia.org/r/638609
[16:21:39] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[16:21:52] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:21:56] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] stronswan: remove spec tests for iprsolve as handled bu wmflib [puppet] - 10https://gerrit.wikimedia.org/r/638609 (owner: 10Jbond)
[16:22:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:18] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:22:18] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:07] <wikibugs>	 (03PS1) 10Jbond: wmflib: remove unused spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638610
[16:28:59] <wikibugs>	 (03PS2) 10Jbond: httpd: convert to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638610
[16:32:51] <wikibugs>	 10Operations, 10Scap, 10serviceops, 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (10LarsWirzenius) I can't actually find the `package_builder` host and can't check if I have login access or the ability t...
[16:35:34] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:39] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:35:41] <wikibugs>	 (03PS1) 10Ayounsi: Restart puppetdb-microservice if config changes [puppet] - 10https://gerrit.wikimedia.org/r/638616
[16:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:59] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:36:00] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:51] <wikibugs>	 (03PS2) 10Ayounsi: Restart puppetdb-microservice if config changes [puppet] - 10https://gerrit.wikimedia.org/r/638616
[16:38:19] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/638616 (owner: 10Ayounsi)
[16:38:48] <wikibugs>	 (03PS1) 10Jbond: php: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638621
[16:39:49] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[16:39:50] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:39:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] httpd: convert to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638610 (owner: 10Jbond)
[16:40:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] php: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638621 (owner: 10Jbond)
[16:41:23] <wikibugs>	 (03PS1) 10Jbond: php: remove unused spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638623
[16:41:25] <wikibugs>	 (03PS1) 10LSobanski: Fix typo in authorized_for_configuration_information permission [puppet] - 10https://gerrit.wikimedia.org/r/638624
[16:41:28] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/638624 (owner: 10LSobanski)
[16:42:09] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/638616 (owner: 10Ayounsi)
[16:42:32] <wikibugs>	 (03CR) 10LSobanski: "Pretty please :)" [puppet] - 10https://gerrit.wikimedia.org/r/638624 (owner: 10LSobanski)
[16:42:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Restart puppetdb-microservice if config changes [puppet] - 10https://gerrit.wikimedia.org/r/638616 (owner: 10Ayounsi)
[16:42:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] php: remove unused spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638623 (owner: 10Jbond)
[16:44:13] <wikibugs>	 (03CR) 10Kormat: "We generally put the module name at the start of the commit message. In this case, it would be `icinga: `" [puppet] - 10https://gerrit.wikimedia.org/r/638624 (owner: 10LSobanski)
[16:48:01] <wikibugs>	 (03PS2) 10LSobanski: icinga: fix typo in authorized_for_configuration_information permission [puppet] - 10https://gerrit.wikimedia.org/r/638624
[16:48:01] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/638624 (owner: 10LSobanski)
[16:49:08] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/638624 (owner: 10LSobanski)
[16:50:03] <wikibugs>	 (03CR) 10LSobanski: [C: 03+2] icinga: fix typo in authorized_for_configuration_information permission [puppet] - 10https://gerrit.wikimedia.org/r/638624 (owner: 10LSobanski)
[16:51:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:42] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (10Cmjohnson)
[16:56:49] <wikibugs>	 10Operations, 10DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Cmjohnson)
[16:56:52] <wikibugs>	 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson)
[16:56:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (10Cmjohnson) 05Open→03Resolved done
[16:56:57] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (10Cmjohnson)
[16:57:03] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:57:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:00] <wikibugs>	 (03PS2) 10Jbond: scap:migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638562
[17:00:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] scap:migrate to shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/638562 (owner: 10Jbond)
[17:09:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] safe-service-restart: add optional poolcounter support [puppet] - 10https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[17:10:29] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: 10Ayounsi)
[17:14:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] poolcounter: add client configuration classes [puppet] - 10https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[17:14:54] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: poolcounter: add client configuration classes [puppet] - 10https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055)
[17:29:59] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:30:01] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:06] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:30:06] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:30:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:49] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:31:49] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[17:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:58] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:59] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:47] <wikibugs>	 (03PS4) 10Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[17:33:49] <wikibugs>	 (03PS5) 10Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339)
[17:33:51] <wikibugs>	 (03PS6) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849
[17:33:53] <wikibugs>	 (03PS5) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[17:35:00] <wikibugs>	 (03CR) 10Ayounsi: ImportPuppetDB: add cable color/type (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: 10Ayounsi)
[17:38:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Cmjohnson) The mainboard arrived
[17:43:25] <wikibugs>	 (03PS3) 10Jbond: scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - 10https://gerrit.wikimedia.org/r/638562
[17:43:27] <wikibugs>	 (03PS1) 10Jbond: spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638643
[17:43:48] <wikibugs>	 (03PS5) 10Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[17:43:50] <wikibugs>	 (03PS6) 10Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339)
[17:43:52] <wikibugs>	 (03PS7) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849
[17:43:54] <wikibugs>	 (03PS6) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[17:43:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638643 (owner: 10Jbond)
[17:44:00] <wikibugs>	 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020): CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10RLazarus)
[17:44:41] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: 10Ayounsi)
[17:44:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - 10https://gerrit.wikimedia.org/r/638562 (owner: 10Jbond)
[17:45:06] <wikibugs>	 (03PS2) 10Jbond: spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638643
[17:45:44] <cmjohnson1>	 !log shutting elastic1063 down to reseat DIMM T265113
[17:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:51] <stashbot>	 T265113: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113
[17:46:15] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] ImportPuppetDB: add cable color/type [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: 10Ayounsi)
[17:47:13] <icinga-wm>	 PROBLEM - Host elastic1063 is DOWN: PING CRITICAL - Packet loss = 100%
[17:49:39] <icinga-wm>	 PROBLEM - Host elastic1063.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:50] <wikibugs>	 10Operations, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10lmata) Just dropping a quick update here, we should schedule some time to review options. Had a brief exchange with @akosiaris and we'll get the team together for...
[17:55:21] <icinga-wm>	 RECOVERY - Host elastic1063.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
[17:55:24] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/633853 (owner: 10Dzahn)
[17:56:45] <icinga-wm>	 RECOVERY - Host elastic1063 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[17:58:08] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) 05Open→03Resolved I reseated all the DIMM and there were several.   I am not getting any Dell h/w errors.  Hopefully, the reseat and flea...
[18:03:34] <wikibugs>	 (03PS3) 10Jbond: spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638643
[18:04:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] spec_helper: use puppet_spec_helper not module_spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/638643 (owner: 10Jbond)
[18:04:44] <wikibugs>	 (03PS4) 10Jbond: scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - 10https://gerrit.wikimedia.org/r/638562
[18:06:00] <wikibugs>	 (03PS5) 10Jbond: scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - 10https://gerrit.wikimedia.org/r/638562
[18:07:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] scap: migrate to shared spec helper and force rspec-mock where needed [puppet] - 10https://gerrit.wikimedia.org/r/638562 (owner: 10Jbond)
[18:15:24] <wikibugs>	 (03CR) 10Dzahn: "currently on vacation, please add some other reviewers with +2, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: 10Hashar)
[18:15:52] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:54] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson) |   These are all 1G servers in 10G racks for row A  | A2 | |A4||A7| |db1074||stat1004||mw1269| |db1075||logstash1020||mw1270| |db1079||wdqs1003||mw1271| |db1080||g...
[18:36:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson) These are all 1G servers in 10G racks for row B  |B2||B4|B7| |db1099||elastic1050|wtp1031| |analytics1072|2U|elastic1049|wtp1032| |||conf1005|wtp1033| |||Kublog1002|...
[18:42:21] <wikibugs>	 (03PS1) 10Volans: Dependencies: remove temporary hacks [software/spicerack] - 10https://gerrit.wikimedia.org/r/638664
[18:42:29] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson) Row C  |C2||C4|C7 |es1016|2U|ores1006|francium| |db1100||mwlog1001|polonium| |db1101||snapshot1006|scb1003| |analytics1064|2U|deploy1001|elastic1051| |analytics1065...
[18:45:41] <wikibugs>	 (03PS1) 10Jcrespo: POC: Testing interfacing with swift to gather metadata [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/638665 (https://phabricator.wikimedia.org/T264189)
[18:46:15] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson)  Racks D2 and D7 are 100% 10G but they were initially built that way.  D4 was just converted to 10G  |D4| |db1114| |ores1008| |mc1033| |mc1034| |mc1035| |mc1036| |a...
[18:46:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] POC: Testing interfacing with swift to gather metadata [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/638665 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo)
[18:54:57] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission-hardware: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10Cmjohnson) 05Open→03Resolved a:05Cmjohnson→03RobH @RobH This server is ready to go back to you for spares.  Where are you tracking that?
[18:56:37] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): Reclaim labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10Cmjohnson) a:05Cmjohnson→03wiki_willy @wiki_willy @RobH Are we returning to spare or decommissioning these?
[19:01:21] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) There are more discrepancies between swift and mediawiki db:...
[19:12:08] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr  Please make sure all of these switches have been restored to factory defaults, unplug, and remove the racks....
[19:32:05] <gehel>	 !log restarting blazegraph on wdqs1007 to reset ban list
[19:32:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:21] <wikibugs>	 (03CR) 10Volans: "Some minor details/questions inline" (037 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi)
[19:34:30] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/638664 (owner: 10Volans)
[19:35:05] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Dependencies: remove temporary hacks [software/spicerack] - 10https://gerrit.wikimedia.org/r/638664 (owner: 10Volans)
[19:36:27] <icinga-wm>	 PROBLEM - Host an-presto1004 is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:43] <wikibugs>	 (03Merged) 10jenkins-bot: Dependencies: remove temporary hacks [software/spicerack] - 10https://gerrit.wikimedia.org/r/638664 (owner: 10Volans)
[19:37:50] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson
[19:40:11] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson) @wiki_willy I had time to do this today while the Dell tech worked on an-presto1004.   I am going to be utilizing a 2U space in A2 and B2 for the kafka-jumbo 10G up...
[19:40:41] <icinga-wm>	 RECOVERY - Host an-presto1004 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[19:45:29] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10Cmjohnson) 05Open→03Resolved @elukey the an-presto1004 motherboard has been replaced and the backplane, everything came back up as normal except I am not able to ssh into the server and fresh i...
[19:51:36] <wikibugs>	 (03PS1) 10Jbond: (WIP) profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638678
[19:53:35] <wikibugs>	 (03PS2) 10Jbond: (WIP) profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638678
[19:54:56] <wikibugs>	 (03Abandoned) 10Jbond: WIP: migrate profile spec tests to shared spec_healper [puppet] - 10https://gerrit.wikimedia.org/r/541245 (owner: 10Jbond)
[19:55:59] <wikibugs>	 (03Abandoned) 10Jbond: stunnel: add stunnel module and update rsync to use it [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond)
[19:56:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638678 (owner: 10Jbond)
[19:56:32] <wikibugs>	 (03Abandoned) 10Jbond: apereo_cas: update to use stunnel client [puppet] - 10https://gerrit.wikimedia.org/r/558590 (owner: 10Jbond)
[20:01:52] <wikibugs>	 (03CR) 10Jbond: "should update shared spec_test to support changing site, realm and cluster (and any other globals) which are currently hacked in as rspec_" [puppet] - 10https://gerrit.wikimedia.org/r/638678 (owner: 10Jbond)
[20:03:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:04:01] <wikibugs>	 (03PS3) 10Jbond: remote-backup-mariadb: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636411 (https://phabricator.wikimedia.org/T265138)
[20:04:07] <icinga-wm>	 PROBLEM - MegaRAID on an-presto1004 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:04:08] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-presto1004 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T267160 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:04:11] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T267160 (10ops-monitoring-bot)
[20:04:21] <wikibugs>	 (03PS3) 10Jbond: prometheus_intel_microcode: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636409 (https://phabricator.wikimedia.org/T265138)
[20:04:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cumin: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636405 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[20:05:06] <wikibugs>	 (03Abandoned) 10Jbond: cumin: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636405 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[20:05:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] smart: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636402 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[20:05:20] <wikibugs>	 (03PS2) 10Jbond: smart: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636402 (https://phabricator.wikimedia.org/T265138)
[20:05:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:05:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] remote-backup-mariadb: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636411 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[20:05:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] prometheus_intel_microcode: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636409 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[20:06:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] service-auto-restart: clean up cron [puppet] - 10https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[20:06:27] <wikibugs>	 (03PS3) 10Jbond: service-auto-restart: clean up cron [puppet] - 10https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138)
[20:06:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] service-auto-restart: clean up cron [puppet] - 10https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[20:07:23] <wikibugs>	 (03Abandoned) 10Jbond: bird: ensure bird service is running [puppet] - 10https://gerrit.wikimedia.org/r/625926 (owner: 10Jbond)
[20:07:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond)
[20:15:49] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 5115 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:17:27] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 39 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:20:26] <wikibugs>	 (03PS5) 10Jbond: confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956)
[20:21:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond)
[20:26:50] <wikibugs>	 (03PS4) 10Ladsgroup: [WIP] varnish: Improve wording of the browser security error a bit [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656)
[20:27:28] <wikibugs>	 (03Abandoned) 10Jbond: httpd: test validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/604765 (https://phabricator.wikimedia.org/T255124) (owner: 10Jbond)
[20:27:34] <wikibugs>	 (03Abandoned) 10Jbond: httpd: add validate_cmd to apache configs [puppet] - 10https://gerrit.wikimedia.org/r/604764 (https://phabricator.wikimedia.org/T255124) (owner: 10Jbond)
[20:28:59] <wikibugs>	 (03CR) 10Ladsgroup: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup)
[20:29:32] <wikibugs>	 (03Abandoned) 10Jbond: standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond)
[20:29:53] <wikibugs>	 (03Restored) 10Jbond: standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond)
[20:30:26] <wikibugs>	 (03Abandoned) 10Jbond: DO NOT MERGE: Testing pcc [puppet] - 10https://gerrit.wikimedia.org/r/613610 (owner: 10Jbond)
[20:30:56] <wikibugs>	 (03Abandoned) 10Jbond: cas6.2: merge changes from upstream 6.2 branch [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609402 (owner: 10Jbond)
[20:31:24] <wikibugs>	 (03PS3) 10Jbond: labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943)
[20:31:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond)
[20:32:54] <wikibugs>	 (03PS4) 10Jbond: labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943)
[20:34:14] <wikibugs>	 (03Abandoned) 10Jbond: WIP: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 (owner: 10Jbond)
[20:34:35] <wikibugs>	 (03Abandoned) 10Jbond: role::tendril: move tendril to new profile::mariadb::misc [puppet] - 10https://gerrit.wikimedia.org/r/608896 (owner: 10Jbond)
[20:35:43] <wikibugs>	 (03Abandoned) 10Jbond: profile::backup::director: increase number of open files. [puppet] - 10https://gerrit.wikimedia.org/r/556207 (owner: 10Jbond)
[20:36:16] <wikibugs>	 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10aapeli) The Wikimedia Maps TOS does not yet mention this change (it's linked in the attribution line on maps.wikimedia....
[20:36:59] <wikibugs>	 (03Abandoned) 10Jbond: cfssl: Ensure CSR exists before we try to sign it [puppet] - 10https://gerrit.wikimedia.org/r/581559 (owner: 10Jbond)
[20:38:30] <wikibugs>	 (03PS3) 10Jbond: role::puppetmaster::standalone: add type checking to autosign [puppet] - 10https://gerrit.wikimedia.org/r/566512
[20:38:55] <wikibugs>	 (03PS7) 10Jbond: admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686)
[20:39:18] <wikibugs>	 (03PS2) 10Jbond: puppetmaster: update webconfig to use correct file path [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994)
[20:41:55] <wikibugs>	 (03CR) 10Jbond: "just going through old changes, this still seems valid?" [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond)
[20:42:30] <wikibugs>	 (03Abandoned) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond)
[20:42:38] <wikibugs>	 (03Abandoned) 10Jbond: service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/559537 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond)
[20:43:12] <wikibugs>	 (03PS2) 10Jbond: backup::director: add type checking and use lookup vs hiera [puppet] - 10https://gerrit.wikimedia.org/r/556211
[20:43:29] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/556211 (owner: 10Jbond)
[20:44:04] <wikibugs>	 (03Abandoned) 10Jbond: apereo_cas: add ability to configure basic memcached support [puppet] - 10https://gerrit.wikimedia.org/r/550695 (https://phabricator.wikimedia.org/T233931) (owner: 10Jbond)
[20:45:58] <wikibugs>	 (03Abandoned) 10Jbond: raid: update check_raid to detect missing disk"" [puppet] - 10https://gerrit.wikimedia.org/r/510139 (owner: 10Jbond)
[20:46:16] <wikibugs>	 (03Abandoned) 10Jbond: icinga: Add a new script and configuration to send prowl notifications [puppet] - 10https://gerrit.wikimedia.org/r/502993 (owner: 10Jbond)
[21:09:05] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:09:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:10:03] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:10:29] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:10:41] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:10:45] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:11:15] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:11:15] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[21:12:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:12:23] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:12:25] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:13:15] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:13:16] <hashar>	 jouncebot: now
[21:13:16] <jouncebot>	 For the next 10 hour(s) and 46 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201103T0800)
[21:13:51] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:14:29] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:15:01] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:16:44] <hashar>	 !log Gerrit: triggering java garbage collection # T263008
[21:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:52] <stashbot>	 T263008: Gerrit out of heap - https://phabricator.wikimedia.org/T263008
[21:17:23] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:18:55] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:18:57] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:19:39] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:22:29] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:22:55] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[21:24:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:24:27] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:24:37] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[21:26:35] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:26:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:27:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:27:45] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:27:47] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:29:09] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:29:13] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:29:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:30:09] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:30:51] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:30:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:31:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:32:29] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:32:49] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:33:09] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:34:35] <wikibugs>	 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Aklapper) @aapeli: Could you please file a separate task about updating https://foundation.wikimedia.org/wiki/Maps_Term...
[21:36:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:36:05] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:37:03] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:37:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:37:49] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:38:39] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:39:15] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:40:05] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:41:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:41:51] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:42:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:42:47] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:42:49] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:43:07] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:43:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:44:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:44:39] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:44:49] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:45:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:45:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:46:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:46:15] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:46:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:48:07] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:49:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:50:09] <wikibugs>	 Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (aapeli) OK, I've made a new task: T267170.
[21:50:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:50:47] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:51:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:51:33] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:52:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:52:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:52:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:52:55] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:52:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:54:49] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:55:21] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:56:33] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:56:43] <wikibugs>	 (CR) QChris: [C: +1] gerrit: fix SonarQube report url discovery [puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: Hashar)
[21:57:25] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:57:53] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:58:17] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:59:07] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:00:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:01:04] <wikibugs>	 (PS1) QChris: Add .gitreview [software/knead-wikidough] - https://gerrit.wikimedia.org/r/638719
[22:01:05] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Add .gitreview [software/knead-wikidough] - https://gerrit.wikimedia.org/r/638719 (owner: QChris)
[22:01:05] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:01:33] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:01:52] <wikibugs>	 (CR) Awight: gerrit: fix SonarQube report url discovery (1 comment) [puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: Hashar)
[22:02:11] <wikibugs>	 (PS2) Awight: gerrit: fix SonarQube report url discovery [puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: Hashar)
[22:02:53] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:03:25] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:03:29] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:03:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:04:15] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:05:09] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:05:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:05:16] <wikibugs>	 (PS2) CDanis: depool esams [dns] - https://gerrit.wikimedia.org/r/627919
[22:08:31] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:08:39] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:08:53] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:10:15] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:10:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:11:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:12:15] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:12:43] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:13:37] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:13:53] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:16:03] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:17:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:18:09] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:19:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:19:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:21:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:22:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:22:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:22:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:23:35] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:25:30] <cdanis>	 !log ✔️ cdanis@mw1278.eqiad.wmnet ~ 🕠🍺 sudo depool
[22:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:27:15] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:28:35] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:29:41] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:31:06] <cdanis>	 !log depool mw1276 and mw1279 also
[22:31:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:33:17] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:33:43] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:34:01] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:34:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:34:14] <cdanis>	 !log restart-php7.2-fpm and pool on mw1276
[22:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:53] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:35:21] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:35:39] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:35:39] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:35:56] <cdanis>	 !log ✔️ cdanis@mw1290.eqiad.wmnet ~ 🕠🍺 sudo restart-php7.2-fpm 
[22:36:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:53] <cdanis>	 !log repool mw1278 and mw1279
[22:37:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:12] <wikibugs>	 Operations, Performance-Team, serviceops, Patch-For-Review, User-jijiki: MediaWiki to route specific keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (jijiki) @aaron is there a timeline as to when those patches will be merged?
[22:39:54] <wikibugs>	 Operations, Performance-Team, serviceops, Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (jijiki)
[22:40:47] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:44:09] <_joe_>	 cdanis: did you restart more servers?
[22:44:14] <cdanis>	 no
[22:44:46] <_joe_>	 it's recovering... by itself then
[22:44:51] <_joe_>	 cpu usage is going down
[22:44:52] <cdanis>	 there are a few still pegged, but yeah, only a few
[22:49:31] <cdanis>	 !log mw1342 restart-php7.2-fpm
[22:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:26] <_joe_>	 !log depooling mw1346
[22:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:32] <_joe_>	 !log repooling mw1346
[22:56:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:39] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:30:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:09] <wikibugs>	 Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (AntiCompositeNumber)
[23:47:09] <wikibugs>	 (PS1) Krinkle: Apply bucketing to query sizes stats [extensions/TemplateData] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/638519
[23:55:23] <wikibugs>	 (CR) BryanDavis: [C: +1] Use common k8s labels (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: Legoktm)