[00:46:30] !log T238305 cp3053.mgmt /admin1-> racadm serveraction hardreset
[00:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:34] T238305: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305
[00:46:49] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (CDanis) `23:22:06 <+icinga-wm> PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%` nothing in logs as usual
[00:49:27] RECOVERY - Host cp3053 is UP: PING OK - Packet loss = 0%, RTA = 83.41 ms
[02:54:31] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[03:55:30] (PS1) Ammarpad: Document why we have duplicate false value [mediawiki-config] - https://gerrit.wikimedia.org/r/565751 (https://phabricator.wikimedia.org/T183549)
[04:07:47] (PS2) Ammarpad: Document why we have duplicate false value [mediawiki-config] - https://gerrit.wikimedia.org/r/565751 (https://phabricator.wikimedia.org/T183549)
[05:57:13] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) Is there any action plan to investigate these issues?
[06:06:33] Puppet, VPS-project-codesearch, Patch-For-Review: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm) docker instances aren't starting because of: ` Jan 19 06:05:30 codesearch6 dockerd[8955]: time="2020-01-19T06:05:30.388754452Z" level=error msg="Handler for POST /v1.40/contai...
[06:50:24] Puppet, VPS-project-codesearch, Patch-For-Review: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm) >>! In T242319#5815001, @Legoktm wrote: > docker instances aren't starting because of: > > ` > Jan 19 06:05:30 codesearch6 dockerd[8955]: time="2020-01-19T06:05:30.388754452Z"...
[07:07:41] Puppet, VPS-project-codesearch, Patch-For-Review: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm) After reading https://github.com/moby/moby/issues/26824#issuecomment-412309421 which said that running iptables legacy and nftables at the same time was a bad idea, I performed...
[07:23:41] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m
[07:25:20] (PS1) Legoktm: codesearch: Use iptables-legacy for docker compatibility [puppet] - https://gerrit.wikimedia.org/r/565752 (https://phabricator.wikimedia.org/T242319)
[07:27:14] (CR) Legoktm: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/20444/" [puppet] - https://gerrit.wikimedia.org/r/565752 (https://phabricator.wikimedia.org/T242319) (owner: Legoktm)
[07:29:27] Puppet, VPS-project-codesearch, Patch-For-Review: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm) OK! https://codesearch6.wmflabs.org/search/ is ready for testing as a fully puppetized codesearch buster instance.
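(Context for the codesearch/Docker thread above: on Debian buster the iptables command defaults to the nftables backend, and Docker's rules can end up split across two rule sets. The Gerrit change above puppetizes a switch to the legacy backend; the manual equivalent is roughly the sketch below. The exact steps Legoktm ran are truncated in the Phabricator quote, so treat this as an illustration rather than a record of what was done.)

```
# Minimal sketch: point the iptables alternatives at the legacy backend on a
# Debian buster host, then restart Docker so it rebuilds its rules there.
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo systemctl restart docker
```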
[07:50:25] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[07:57:49] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:03:17] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:16:07] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:17:57] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:45:31] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:54:45] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:58:25] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[09:05:45] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[09:51:47] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:04:35] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:12:53] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m
[10:13:45] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:16:53] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez) >>! In T238305#5814996, @Marostegui wrote: > Is there any action plan to investigate these issues? Currently T242579 is our only hope of getting more information about this issue
[10:21:05] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:26:37] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:26:53] * effie checking
[10:48:47] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[10:56:11] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:01:43] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:09:03] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:10:06] effie: o/ - is it change-prop related? Just realized it is about codfw
[11:10:34] it does not look like it, I am still digging here and there
[11:10:49] ack, lemme know if you need help
[11:10:55] I am around for ~30/40 mins :)
[11:11:03] traffic has not increased, but latency to monitoring urls increased sometime after 7 UTC
[11:11:14] tx :)
[11:11:35] I will give it a little more time and then open a task
[11:12:45] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:16:58] I tried to render a per host latency graph in codfw, the most heavily affected are the mwdebugs afaics
[11:17:01] weird
[11:17:15] maybe because they are vms
[11:18:59] also I am confused, I don't see any requests in the apache logs
[11:19:13] I also saw some tcp attempt fails starting
[11:19:26] so I was wondering if some network device is misbehaving
[11:20:07] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:20:11] elukey: https://w.wiki/FkK
[11:20:30] !log restart-php-fpm on mw2181 to rule out temporary php-related issues in codfw
[11:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:57] that is a good one, maybe they are coming from a single row
[11:22:28] yep rack A3 from netbox
[11:23:10] ah no also one in A4
[11:23:44] and C4
[11:24:45] very interesting from mw2181's access logs
[11:25:15] time to render the Special:Blank page is 3136583 for the last one that I checked
[11:26:10] that is in microseconds, so could it be that the latency increase is simply the health check taking ages?
[11:26:57] it appears so yes
[11:27:07] I didn't see anything funny on some random logs I checked
[11:27:13] so something between lvs and mw hosts
[11:27:27] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:27:46] I am curious
[11:32:57] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:39:53] effie: need to go, seems something not super critical, will re-check later on!
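(On the 3136583 figure quoted at 11:25 above: Apache records time-to-serve in microseconds when the %D format specifier is in the log format, so that Special:BlankPage health check took roughly 3.1 seconds instead of the usual few milliseconds, which is enough on its own to trip the average-latency alert. A quick sanity check; the access-log name and field position below are assumptions, not the actual appserver log layout:)

```
# 3,136,583 microseconds is about 3.14 seconds
echo 'scale=2; 3136583 / 1000000' | bc

# Hypothetical one-liner: flag BlankPage requests slower than 1s, assuming the
# time-to-serve field (%D, microseconds) is the last whitespace-separated field.
awk '/Special:BlankPage/ && $NF > 1000000 { printf "%.2fs %s\n", $NF/1000000, $0 }' access.log
```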
[11:40:11] yeah I will leave it here too
[11:40:14] tx
[11:40:17] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:50:01] Operations, ops-codfw, DBA: db2085 crashed - https://phabricator.wikimedia.org/T243148 (Marostegui)
[11:51:00] Operations, ops-codfw, DBA: db2085 crashed - https://phabricator.wikimedia.org/T243148 (Marostegui) ` racadm>>serveraction powerstatus racadm serveraction powerstatus Server power status: OFF racadm>> `
[11:53:11] Operations, serviceops: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (jijiki)
[11:53:51] Operations, ops-codfw, DBA: db2085 crashed - https://phabricator.wikimedia.org/T243148 (Marostegui) Nothing relevant on `centrallog1001` from db2085.
[11:56:51] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[11:57:39] Operations, ops-codfw, DBA: db2085 crashed - https://phabricator.wikimedia.org/T243148 (Marostegui) I have powered it back on - no errors on boot. MySQL hasn't been started.
[11:59:39] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:02:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3311, db2085:3318 T243148', diff saved to https://phabricator.wikimedia.org/P10210 and previous config saved to /var/cache/conftool/dbconfig/20200119-120236-marostegui.json
[12:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:41] T243148: db2085 crashed - https://phabricator.wikimedia.org/T243148
[15:00:03] Operations, Wikibugs: wikibugs needs restart almost everyday - https://phabricator.wikimedia.org/T241109 (valhallasw) Open→Resolved a: valhallasw
[15:05:33] Operations, Wikibugs: wikibugs needs restart almost everyday - https://phabricator.wikimedia.org/T241109 (valhallasw) From what I can see the bot crashes and restarts a few times per day, which -- although not great -- I think is acceptable. About 2/3rd of those are errors retrieving anchors, the rest ar...
[15:11:46] marostegui: o/
[15:12:11] elukey: o/
[15:16:05] marostegui: I was checking the codfw mw appserver latency issue that Effie was working on, and it seems that it went away as soon as you depooled db2085
[15:17:16] the high latency was seen for Special:Blank page
[15:17:18] of enwiki
[15:17:46] so I suppose that the high latency was related to db2085 being overwhelmed
[15:17:49] does it make sense?
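(For the db2085 crash handled above at 11:51–11:57: the Phabricator quotes show the power state being checked and the host powered back on through its iDRAC. A rough sketch of the usual racadm commands; the management hostname below is illustrative:)

```
# From a management/bastion host, query and restore power via the iDRAC.
ssh root@db2085.mgmt.codfw.wmnet    # illustrative mgmt hostname
racadm serveraction powerstatus     # reports "Server power status: OFF"
racadm serveraction powerup         # power the box back on
# For a wedged-but-powered host, "racadm serveraction hardreset" forces a
# reset, as was done for cp3053 at the top of this log.
```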
[15:18:01] check https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&fullscreen&panelId=9
[15:20:11] Operations, serviceops: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (elukey) This seems to be related to T243148, db2085 was overwhelmed and this explains the high latency (Special:Blank page health checks were taking ages to...
[15:20:20] ok added more info to --^
[15:22:17] all right going afk again, if anybody has a different idea please add it in the task, otherwise I think we are good to close it :)
[15:25:16] (PS1) Ayounsi: Add RPKI whitelist support [homer/public] - https://gerrit.wikimedia.org/r/565771
[15:28:18] (CR) Volans: "one question inline" (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/565771 (owner: Ayounsi)
[15:37:33] elukey: not really, that host doesn't receive reads or anything
[15:37:36] so it is weird
[15:47:20] marostegui: ah snap, then my theory is wrong
[15:47:36] timing is really weird though
[15:51:17] elukey: It is strange, that host isn't even supposed to be checked
[15:51:24] Definitely isn't getting any reads
[17:40:20] Operations, serviceops: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (Marostegui) db2085 is a s1 and s8 codfw slave (multi instance). We don't have read traffic on codfw databases, how could it cause those latency issues?
[18:13:13] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 74445528 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:13:13] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 40505016 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:15:03] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1620888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:16:53] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3992 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:17:55] PROBLEM - Host cp3061 is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:59] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[18:32:55] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 233.36 ms
[19:27:34] Operations, serviceops: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (Marostegui) And according to the graph the latency increase indeed starts when db2085 went down ` Jan 19 07:19:49 icinga1001 icinga: SERVICE ALERT: db2085;p...
[19:29:00] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) `18:17:56 <+icinga-wm> PROBLEM - Host cp3061 is DOWN: PING CRITICAL - Packet loss = 100%` Might be another case...
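(On the maps100x POSTGRES_HOT_STANDBY_DELAY alerts above: the check reports a byte lag and a time lag for the hot standby. A rough equivalent query run on the standby is sketched below; the function names are for PostgreSQL 10+, and the Icinga plugin's exact internals may differ:)

```
# Byte lag between received and replayed WAL, plus seconds since last replay.
sudo -u postgres psql -x -c "
  SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS byte_lag,
         EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS seconds_since_replay;"
```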
[19:32:58] (PS2) Legoktm: codesearch: Use iptables-legacy for docker compatibility [puppet] - https://gerrit.wikimedia.org/r/565752 (https://phabricator.wikimedia.org/T242319)
[19:34:05] (CR) jerkins-bot: [V: -1] codesearch: Use iptables-legacy for docker compatibility [puppet] - https://gerrit.wikimedia.org/r/565752 (https://phabricator.wikimedia.org/T242319) (owner: Legoktm)
[19:35:15] (PS3) Legoktm: codesearch: Use iptables-legacy for docker compatibility [puppet] - https://gerrit.wikimedia.org/r/565752 (https://phabricator.wikimedia.org/T242319)