[01:34:44] <wikibugs>	 serviceops, Beta-Cluster-Infrastructure, Operations, Scap, Release-Engineering-Team (Deployment services): On beta, scap can't clear opcache on some mw servers - https://phabricator.wikimedia.org/T237033 (Krinkle) Confirmed this is still happening on every beta deploy ([latest](https://integr...
[01:34:59] <wikibugs>	 serviceops, Beta-Cluster-Infrastructure, Operations, Scap, Release-Engineering-Team (Deployment services): Sap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (Krinkle)
[01:35:04] <wikibugs>	 serviceops, Beta-Cluster-Infrastructure, Operations, Scap, Release-Engineering-Team (Deployment services): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (Krinkle)
[01:47:06] <wikibugs>	 serviceops, Operations, Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (Krinkle) Here is another mysterious mis-call:  [Logstash single document](https://logstash.wikimedia.org/app/kiba...
[11:17:30] <wikibugs>	 serviceops, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: restrouter.svc.{eqiad,codfw}.wmnet in a failed state - https://phabricator.wikimedia.org/T242461 (akosiaris)
[14:58:45] <ottomata>	 akosiaris:  q for ya
[14:58:49] <ottomata>	 if I try to helmfile apply
[14:58:51] <ottomata>	 and it fails
[14:59:07] <ottomata>	 any further use of helmfile says No affected releases
[14:59:16] <ottomata>	 but, my previous pod (release?) is still there
[14:59:31] <ottomata>	 i just applied --selector name=canary for eventstreams in eqiad
[14:59:35] <ottomata>	 it timed out
[14:59:42] <ottomata>	 if I try to apply again
[14:59:46] <ottomata>	 it does nothing
[14:59:58] <ottomata>	 i've worked around this before by destroying the canary release
[14:59:58] <akosiaris>	 cause there's no diff
[15:00:00] <ottomata>	 but that seems harsh
[15:00:08] <akosiaris>	 you can try sync
[15:00:17] <akosiaris>	 but first figure out why it failed in the first place
[15:00:34] <ottomata>	 AHH i think sync just told me why
[15:00:40] <ottomata>	 Error: UPGRADE FAILED: no Service with the name "eventstreams-canary-debug" found
[15:00:41] <ottomata>	 or
[15:00:44] <ottomata>	 is that just the sync failing
[15:00:45] <ottomata>	 hm
[15:01:02] <ottomata>	 if I add a new service will the upgrade fail in general?
[15:01:26] <akosiaris>	 no, it shouldn't, you can add new services 
[15:01:32] <ottomata>	 hmm do I even need a service?  I can connect to the pod ip directly, right?
[15:01:41] <akosiaris>	 yup
[15:02:09] <ottomata>	 yeah i guess that is just for local dev stuff, since i can't do that locally
[15:02:12] <akosiaris>	 the service probably makes sense in development though. e.g. minikube won't allow you to connect to the pod ip directly
[15:02:17] <ottomata>	 right
[15:02:20] <ottomata>	 :)(
[15:02:21] <ottomata>	 :)
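
The exchange above turns on the difference between `helmfile apply` and `helmfile sync`: `apply` runs a diff first and only upgrades when the diff is non-empty, while `sync` upgrades unconditionally. The sketch below is a toy model of that behaviour, not helmfile's actual code. The `Release` class and all function names are hypothetical, and it assumes that a failed upgrade still records the new manifest as the latest revision, which would explain why a second `apply` finds nothing to do.

```python
from dataclasses import dataclass


@dataclass
class Release:
    """Hypothetical stand-in for the release record kept by Helm."""
    recorded_manifest: str      # manifest of the latest revision, even a failed one
    status: str = "DEPLOYED"


def upgrade(release: Release, desired: str, succeeds: bool) -> str:
    # Assumption: the desired manifest is recorded whether or not the rollout succeeds.
    release.recorded_manifest = desired
    release.status = "DEPLOYED" if succeeds else "FAILED"
    return release.status


def apply(release: Release, desired: str, succeeds: bool = True) -> str:
    """Model of `helmfile apply`: diff first, upgrade only if something changed."""
    if release.recorded_manifest == desired:
        return "No affected releases"
    return upgrade(release, desired, succeeds)


def sync(release: Release, desired: str, succeeds: bool = True) -> str:
    """Model of `helmfile sync`: upgrade regardless of the diff."""
    return upgrade(release, desired, succeeds)


canary = Release(recorded_manifest="chart-v1")
print(apply(canary, "chart-v2", succeeds=False))  # first attempt times out
print(apply(canary, "chart-v2"))                  # nothing left to diff
print(sync(canary, "chart-v2"))                   # forces another upgrade attempt
```

In this model the retry through `sync` is what surfaces the underlying error again, which matches the advice to find out why the first attempt failed before retrying.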
[15:08:51] <brennen>	 _joe_: re: https://phabricator.wikimedia.org/T247562#5974902 - what's needed here?
[15:09:31] <_joe_>	 brennen: try to re-deploy and we check what keys are overwhelming
[15:11:15] <brennen>	 _joe_: just roll wmf.23 to group2 again for a defined window, or to some partial subset?
[15:11:39] <_joe_>	 brennen: I would say 10 minutes should be more than enough
[15:11:41] <_joe_>	 even less
[15:12:03] <_joe_>	 if the problem shows up again, SRE will be ready to debug in depth to try to find which keys are responsible
[15:12:59] <brennen>	 ok, that makes sense.  i can go ahead now, if that's workable.
[15:16:16] <brennen>	 ^ _joe_
[15:16:29] <_joe_>	 brennen: uhm gimme 5 minus
[15:16:32] <_joe_>	 *minutes
[15:17:02] <brennen>	 yeah, of course - sorry to rush.
[15:21:33] <_joe_>	 brennen: let's do it then
[15:21:41] <_joe_>	 I don't see good alternatives
[15:21:56] <_joe_>	 I mean we could try just dewiki, then we find out nothing
[15:33:28] <ottomata>	 hm akosiaris  something is 'timed out waiting for the condition'
[15:33:44] <ottomata>	 i'm trying to start the service in debug mode so I can try and figure out this mem leak
[15:34:06] <ottomata>	 how can I find what is not starting?  i don't think I can get any logs from the failed container...or can I?
[15:36:03] <ottomata>	 the tiller logs don't have much useful
[15:37:00] <ottomata>	 i'd like to get logs from the new pod container as it is spawning
[15:37:06] <ottomata>	 but i don't see its pod id anywhere
[15:38:25] <ottomata>	 i assume the timeout is just the readinessProbe never succeeding
[15:38:37] <ottomata>	 but i don't have any info as to why or where the container start isn't finishing
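
One way to approach the question above is to list the pods in the namespace as JSON (`kubectl get pods -o json`) and look for containers that are not reporting ready. The sketch below assumes that output has already been loaded into a Python dict; it relies only on the standard Pod fields `metadata.name` and `status.containerStatuses`. The function name, pod names, and sample data are hypothetical and are not taken from the eventstreams deployment.

```python
def unready_containers(pod_list: dict) -> list:
    """Return (pod, container, reason) for each container that is not ready."""
    found = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("ready"):
                continue
            state = cs.get("state", {})
            waiting = state.get("waiting") or {}
            terminated = state.get("terminated") or {}
            # A container that is running but not ready has no waiting or
            # terminated reason; that usually points at the readiness probe.
            reason = (waiting.get("reason")
                      or terminated.get("reason")
                      or "running but not ready")
            found.append((pod_name, cs["name"], reason))
    return found


# Hypothetical sample shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "example-canary-abc"},
         "status": {"containerStatuses": [
             {"name": "app", "ready": False,
              "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
             {"name": "sidecar", "ready": True,
              "state": {"running": {}}},
         ]}},
        {"metadata": {"name": "example-production-xyz"},
         "status": {"containerStatuses": [
             {"name": "app", "ready": True, "state": {"running": {}}},
         ]}},
    ]
}

for row in unready_containers(sample):
    print(row)
```

Once the pod name is known, `kubectl describe pod <name>` shows probe failures in its events, and `kubectl logs <name> -c <container>` (with `--previous` for a container that has already restarted) shows what the process printed before it stopped.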
[16:29:44] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Jclark-ctr)
[16:30:41] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Jclark-ctr) Replacement Dimm has arrived
[16:33:12] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) @Jclark-ctr The server is depooled, you can do the replacement any time.
[16:33:53] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) p:Triage→Medium
[16:34:29] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Jclark-ctr) Thanks taking care of now! @Dzahn
[16:41:23] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Jclark-ctr) Replaced Failed drive host booting now
[16:45:24] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) a:Jclark-ctr→Dzahn Thanks @Jclark-ctr !  I could get it per SSH now. I'll take it to get it back into production, if you are done.
[17:15:12] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) after puppet runs host was added back in Icinga.  then: CRITICAL: 944 mismatched wikiversions  after a looong scap pull it is all green now  https://icinga.wikimedia...
[17:19:18] <wikibugs>	 serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) Open→Resolved 17:18 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
[17:24:17] <wikibugs>	 serviceops, ChangeProp, Operations, Release Pipeline, and 6 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (eprodromou)
[17:30:16] <wikibugs>	 serviceops, Operations, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn) a:Dzahn
[17:34:40] <ottomata>	 akosiaris:  guessing you are gone for the day, but if not, a pointer for my q above would be really helpful, i have no idea why this is failing. it works fine in my dev env, but when i enable debug mode in eqiad canary the service doesn't start.  I could just go for trial and error, but that would mean a lot of new chart versions just to troubleshoot in prod
[18:02:08] <wikibugs>	 serviceops, Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (Dzahn)
[18:02:50] <wikibugs>	 serviceops, Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (Dzahn)
[18:38:30] <wikibugs>	 serviceops, Operations, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1238-1239].eqiad.wmnet` -  mw1238.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga...
[18:41:52] <wikibugs>	 serviceops, Operations, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1240-1243].eqiad.wmnet` -  mw1240.eqiad.wmnet (**FAIL**)   - Host steps raised exception...
[18:45:42] <wikibugs>	 serviceops, Operations, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1240.eqiad.wmnet` -  mw1240.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga   - Found...
[20:40:13] <wikibugs>	 serviceops, MediaWiki-Cache, Operations, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (AMooney)
[22:49:25] <wikibugs>	 serviceops, Beta-Cluster-Infrastructure, Operations, Scap, Release-Engineering-Team (Deployment services): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (thcipriani) This one:  ` Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-updat...