[01:34:44] serviceops, Beta-Cluster-Infrastructure, Operations, Scap, Release-Engineering-Team (Deployment services): On beta, scap can't clear opcache on some mw servers - https://phabricator.wikimedia.org/T237033 (Krinkle) Confirmed this is still happening on every beta deploy ([latest](https://integr...
[01:34:59] serviceops, Beta-Cluster-Infrastructure, Operations, Scap, Release-Engineering-Team (Deployment services): Sap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (Krinkle)
[01:35:04] serviceops, Beta-Cluster-Infrastructure, Operations, Scap, Release-Engineering-Team (Deployment services): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (Krinkle)
[01:47:06] serviceops, Operations, Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (Krinkle) Here is another mysterious mis-call: [Logstash single document](https://logstash.wikimedia.org/app/kiba...
[11:17:30] serviceops, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: restrouter.svc.{eqiad,codfw}.wmnet in a failed state - https://phabricator.wikimedia.org/T242461 (akosiaris)
[14:58:45] akosiaris: q for ya
[14:58:49] if I try to helmfile apply
[14:58:51] and it fails
[14:59:07] any further use of helmfile says No affected releases
[14:59:16] but, my previous pod (release?) is still there
[14:59:31] i just applied --selector name=canary for eventstreams in eqiad
[14:59:35] it timed out
[14:59:42] if I try to apply again
[14:59:46] it does nothing
[14:59:58] i've worked around this before by destroying the canary release
[14:59:58] cause there's no diff
[15:00:00] but that seems harsh
[15:00:08] you can try sync
[15:00:17] but first figure out why it failed in the first place
[15:00:34] AHH i think sync just told me why
[15:00:40] Error: UPGRADE FAILED: no Service with the name "eventstreams-canary-debug" found
[15:00:41] or
[15:00:44] is that just the sync failing
[15:00:45] hm
[15:01:02] if I add a new service will the upgrade fail in general?
[15:01:26] no, it shouldn't, you can add new services
[15:01:32] hmm do I even need a service? I can connect to the pod ip directly, right?
[15:01:41] yup
[15:02:09] yeah i guess that is just for local dev stuff, since i can't do that locally
[15:02:12] the service probably makes sense in development though. e.g. minikube won't allow you to connect to the pod ip directly
[15:02:17] right
[15:02:20] :)(
[15:02:21] :)
[15:08:51] _joe_: re: https://phabricator.wikimedia.org/T247562#5974902 - what's needed here?
[15:09:31] <_joe_> brennen: try to re-deploy and we check what keys are overwhelming
[15:11:15] _joe_: just roll wmf.23 to group2 again for a defined window, or to some partial subset?
[15:11:39] <_joe_> brennen: I would say 10 minutes should be more than enough
[15:11:41] <_joe_> even less
[15:12:03] <_joe_> if the problem shows up again, SRE will be ready to debug in depth to try to find which keys are responsible
[15:12:59] ok, that makes sense. i can go ahead now, if that's workable.
[15:16:16] ^ _joe_
[15:16:29] <_joe_> brennen: uhm gimme 5 minus
[15:16:32] <_joe_> *minutes
[15:17:02] yeah, of course - sorry to rush.
[15:21:33] <_joe_> brennen: let's do it then
[15:21:41] <_joe_> I don't see good alternatives
[15:21:56] <_joe_> I mean we could try just dewiki, then we find out nothing
[15:33:28] hm akosiaris something is 'timed out waiting for the condition'
[15:33:44] i'm trying to start the service in debug mode so I can try and figure out this mem leak
[15:34:06] how can I find what is not starting? i don't think I can get any logs from the failed container...or can I?
[15:36:03] the tiller logs don't have much useful
[15:37:00] i'd like to get logs from the new pod container as it is spawning
[15:37:06] but i don't see its pod id anywhere
[15:38:25] i assume the timeout is just the readinessProbe never succeeding
[15:38:37] but i don't have any info as to why or where the container start isn't finishing
[16:29:44] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Jclark-ctr)
[16:30:41] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Jclark-ctr) Replacement Dimm has arrived
[16:33:12] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) @Jclark-ctr The server is depooled, you can do the replacement any time.
[16:33:53] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) p:Triage→Medium
[16:34:29] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Jclark-ctr) Thanks taking care of now! @Dzahn
[16:41:23] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Jclark-ctr) Replaced Failed drive host booting now
[16:45:24] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) a:Jclark-ctr→Dzahn Thanks @Jclark-ctr ! I could get it per SSH now. I'll take it to get it back into production, if you are done.
[17:15:12] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) after puppet runs host was added back in Icinga. then: CRITICAL: 944 mismatched wikiversions after a looong scap pull it is all green now https://icinga.wikimedia...
[17:19:18] serviceops, Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (Dzahn) Open→Resolved 17:18 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
[17:24:17] serviceops, ChangeProp, Operations, Release Pipeline, and 6 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (eprodromou)
[17:30:16] serviceops, Operations, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn) a:Dzahn
[17:34:40] akosiaris: guessing you are gone for the day but if not, a pointer for my q above would be really helpful, i have no idea why this is failing. it works fine in my dev env, but when i enable debug mode in eqiad canary the service doesn't start. I could just go for trial and error, but that would mean a lot of new chart versions just to troubleshoot in prod
[18:02:08] serviceops, Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (Dzahn)
[18:02:50] serviceops, Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (Dzahn)
[18:38:30] serviceops, Operations, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1238-1239].eqiad.wmnet` - mw1238.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[18:41:52] serviceops, Operations, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1240-1243].eqiad.wmnet` - mw1240.eqiad.wmnet (**FAIL**) - Host steps raised exception...
[18:45:42] serviceops, Operations, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1240.eqiad.wmnet` - mw1240.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found...
[20:40:13] serviceops, MediaWiki-Cache, Operations, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (AMooney)
[22:49:25] serviceops, Beta-Cluster-Infrastructure, Operations, Scap, Release-Engineering-Team (Deployment services): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (thcipriani) This one: ` Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-updat...