[07:45:50] 10serviceops, 10MW-on-K8s, 10SRE: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10Joe) After some more work, this is my ideas for liveness and readiness probes: # httpd: -- liveness: tcp connection to the main tcp port -- readiness: http status page (via t... [07:46:29] <_joe_> akosiaris / jayme / effie I'd like a check on https://phabricator.wikimedia.org/T276908#6916385 [07:46:58] 10serviceops, 10MW-on-K8s, 10SRE: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10Joe) p:05Triage→03High [08:19:00] _joe_: Got a major upgrade today, I 'll put it in the backlog [08:23:55] <_joe_> akosiaris: codfw? [08:26:17] yup [08:26:31] yup [08:34:13] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [08:37:22] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [08:37:27] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [08:39:06] _joe_: since changeprop in codfw is going to die, we should switch restbase-async to eqiad for latency reasons, right ? [08:40:03] akosiaris: I've killed your change to the task, sorry. Reconstructing [08:40:25] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [08:40:27] jayme, don't worry [08:40:42] it was a single line, it doesn't matter [08:40:50] ok [08:41:22] <_joe_> akosiaris: not really, no [08:41:44] <_joe_> but probably is a good idea for another reason [08:42:16] <_joe_> restbase generates events from -async [08:42:26] <_joe_> and we don't want those to lie there unprocessed [08:43:31] ok [08:45:31] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [08:48:24] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [08:51:43] jayme: sigh, all our downtime tooling assumes we are talking about a puppet host. Falling back to the icinga UI for the downtimes [08:51:59] ouch [08:52:03] we probably need to visit this better, I 'll create an action item [08:52:29] akosiaris: did you send the mail already? I did not get it... [08:52:46] yes, ops-l, wikitech-l [08:53:08] hmm...strange [08:53:55] my bad /ignore [08:54:01] <_joe_> if you need me, ping me directly, I'm trying to write code right now [08:57:10] ok, downtimes scheduled [08:57:37] nice. I could run service-route then [08:58:12] yup, go for it [08:58:59] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [08:59:25] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10ops-monitoring-bot) Icinga downtime set by akosiaris@cumin1001 for 1 day, 0:00:00 18 host(s) and their services with reason: Reinitialize codf... 
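The restbase-async flip discussed above comes down to a confctl change on the discovery object; a minimal sketch, mirroring the exact commands that appear later in this log, with the final DNS check added as an assumption (the discovery record name is derived from the dnsdisc object name and is not quoted in the log):

    # Stop routing restbase-async traffic to codfw and pool eqiad instead.
    sudo confctl --object discovery select "name=codfw,dnsdisc=restbase-async" set/pooled=false
    sudo confctl --object discovery select "name=eqiad,dnsdisc=restbase-async" set/pooled=true
    # Hypothetical check that the discovery record has converged; the TTL is short
    # (about 5 minutes, as noted below), so no recursor cache wipe should be needed.
    dig +short restbase-async.discovery.wmnet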
[09:00:44] it does not support multiple arguments, unfortunately. Do you keep action items somewhere? That must be one [09:01:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [09:02:18] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [09:02:24] yes, in the task itself at the end [09:04:27] <_joe_> I think we need a cookbook to set a k8s cluster in maintenance mode [09:04:52] <_joe_> drag all service traffic, then downtime all services [09:05:00] yup...and one for nodes etc. we lack a bunch regarding k8s [09:05:37] anyways..service-route does not seem to work at all - unfortunately [09:06:35] ok, we got the list, we can fallback to doing it manually via conftool, but add it to the action items [09:07:41] update: all services have downtimed, all hosts have been downtimed and puppet has been disabled on all hosts [09:08:03] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [09:09:38] depooling via confctl now [09:11:23] done [09:11:35] ok, doublechecking just to be sure since we fell back to manual [09:12:37] looks ok, I am adding restbase-async [09:12:42] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [09:13:23] akosiaris: what do you do there exactly? [09:13:44] and: should we wipe resolver caches to be sure? [09:15:14] sudo confctl --object discovery select "name=codfw,dnsdisc=restbase-async" set/pooled=false [09:15:20] that's all [09:15:25] ah, okay [09:15:42] actually, no that was wrong [09:15:55] it's followed by sudo confctl --object discovery select "name=eqiad,dnsdisc=restbase-async" set/pooled=true [09:16:02] restbase-async is a trick btw [09:16:26] I put it on a different list to ask stupid questions later :-) [09:16:53] we just try to send what we classify as async traffic to codfw instead of eqiad normally and rely on cassandra cross DC replication for bringing the resulting stuff back to eqiad [09:17:03] the idea being to not process expensive stuff in eqiad [09:18:44] ok. so "rec_control wipe-cache"? [09:22:19] it's a 5m thing, no need [09:22:24] it should have converged already [09:22:34] at least from deploy2002, DNs looks good already [09:23:09] yeah, we rely on that for switchdc, tested enough to not warrant cleaning up DNS caches [09:24:26] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [09:25:22] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [09:25:26] added the restbase step to the task [09:27:05] total requests in codfw pretty much dried out [09:28:19] but still a bit more than I would expect to come from monitoring...is that because LVS does still check them and does so for every node? [09:29:58] yes [09:30:11] it should be around # of pods + #of nodes [09:30:13] per service [09:33:04] jayme: I 'll poweroff the old kubernetes master then [09:33:35] akosiaris: did you already disable puppet? [09:34:09] ah, yeah. 
Sorry [09:34:15] ok, go ahead [09:35:28] should we clean up etcd and reboot the VMs ? [09:36:17] akosiaris: I can do so [09:36:37] can you prepare/regenerate the certs? [09:36:46] yes [09:36:51] cool [09:39:15] 2772 :-P [09:40:49] akosiaris: "PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers acrab.codfw.wmnet are marked down but pooled" [09:41:12] heh, we forgot about that [09:41:25] let's ACK it for now [09:41:30] do we maybe miss steps to change names there somewhere? [09:41:34] yep, on it [09:41:38] yes we do of course :-) [09:41:48] I 'll draft the patch and add it to the task [09:44:00] akosiaris: I do have a conftool data changfe in https://gerrit.wikimedia.org/r/c/operations/puppet/+/671171/3/conftool-data/node/codfw.yaml [09:44:04] *change [09:44:43] jayme: Ah, that's enough then [09:44:48] perfect! [09:47:48] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [09:48:00] added action items to downtime lvs checks [09:48:08] and prometheus :-| [09:50:05] Compilation of file '/srv/config-master/pybal/eqiad/apaches' is broken [09:50:08] 12h already, [09:50:19] that's not us, but it's worrying [09:50:19] akosiaris: we should merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/671174 and start reimaging nodes now [09:50:27] oh :-o [09:50:53] ok, let me review that and I can look into the config thing after [09:51:00] ack [09:51:34] * jayme etcd cleared, etcd nodes rebooted [09:55:23] cool [09:56:59] certs done [10:00:01] I think we can merge and start the reimaging [10:00:25] akosiaris: shouldn't we see downtimes for all k8s nodes in https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=1 ? [10:03:01] jayme: in https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=kubernetes2%2A&style=detail [10:03:10] I think the one you pasted is only for hosts that are down? [10:03:46] oh, well...maybe. [10:05:28] akosiaris: so but the nodes are still pooled in conftool. That means we need to run the reimage with "-c" to have them depooled (or we will see alerts again), right? [10:06:15] yes [10:08:24] jayme: I am adding the controller token in private [10:08:31] in private(s) for like it [10:08:48] is there a handy way to have cumin unfold something like "sudo cumin 'P{O:kubernetes::worker} and A:codfw'" to valid FQDNs? [10:09:12] it unfolds to clustershell syntax IIRC [10:09:21] AIUI auto-reimage won't accept the shortcut [10:09:36] ah yes it is not yet cumin compliant [10:11:35] jayme: get the clustershell syntax and try nodeset --expand [10:12:49] akosiaris: :D google landet me on the manual about know and also firefox tells me I read it before [10:12:52] thanks [10:12:57] lol [10:13:03] yw [10:13:59] akosiaris: so I'll merge and go for "sudo -i wmf-auto-reimage -c -p https://phabricator.wikimedia.org/T277191 --force $HOSTS" [10:14:21] --force is "override the default limit of that can be reimaged" [10:14:37] I guess it does all at once then [10:14:51] yeah it we will get alerts [10:15:06] hm? why? [10:15:11] the pybal stuff [10:15:17] but I think it's fine, proceed [10:15:27] jayme: please don't do that [10:15:34] volans? 
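Before the reimage is kicked off with the conftool depool flag (-c), the current pooled state of a worker can be checked directly; a hedged sketch that uses only the host-name selector, to avoid guessing at the cluster/service labels in conftool-data:

    # Show the conftool node objects for one of the workers about to be reimaged.
    sudo confctl select 'name=kubernetes2001.codfw.wmnet' get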
[10:15:51] if you reimage in parallel too many host, most will fail the icinga downtime and you'll have a bunch of IRC spam [10:16:03] they are already downtimed volans [10:16:04] because of too slow puppet run on icinga host and timeeout of the run-puppet-agent command [10:16:10] akosiaris: the new systems [10:16:12] they are already downtimed [10:16:13] after d-i [10:16:21] when puppet starts to run [10:16:23] ah [10:16:38] how many hosts? [10:16:42] 17 [10:17:07] we 've been kind of relying on not splitting them up in groups tbh [10:17:19] there is also a smaller issue that you might trip the rack power if there are too many in teh same rack, but tha's less probable [10:17:26] (as they are downtimed, i will add --no-downtime to the command, now being: "sudo -i wmf-auto-reimage --no-downtime -c -p https://phabricator.wikimedia.org/T277191 --force $HOSTS") [10:17:40] it will considerably slow us down (e.g. a couple of hours if we do 5 per batch) [10:18:18] volans: niah, they are spread enough across racks already, we are talking 4 machines per rack row, something like 1 per rack? [10:18:26] ok [10:18:26] we could do 8 first and start the others like 30min later or so? [10:18:33] let me check one thing in the code [10:19:01] ok, nevermind I take it back [10:19:06] myself from the recent past already added [10:19:14] lib.print_line('Splaying the start of the next reimage by 2 minutes') [10:19:22] so they will not be fully in parallel [10:19:41] go ahead and sorry for the trouble, I had forgot I had already added that bit :D [10:19:56] they will be already splayed [10:20:00] by 2 minutes each [10:20:54] jayme: ^^^ [10:20:57] okay, change is merged. If there are no objections anymore, I'll start reimaging 17 nodes with "sudo -i wmf-auto-reimage --no-downtime -c -p https://phabricator.wikimedia.org/T277191 --force $HOSTS" [10:21:17] jayme: I think I 'll also merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/671171 and do the master in the meantime [10:21:22] forget about the URL in there :D [10:21:31] no blockers for me [10:21:34] controller-manager token done [10:21:46] ack akosiaris [10:23:21] jayme: FYI what you wanted earlier is to use the 'nodeset' binary from clustershell, it's installed in the cumin hosts, using the output of the cumin query: nodeset -S ' ' -e kubernetes[2001-2016].codfw.wmnet [10:23:42] yup [10:23:52] in the next cumin release you could also pipe it (now there's other output mixed) [10:24:06] akosiaris: we forgot about another thing - the worker nodes that are VMs [10:24:22] indeed [10:24:32] wmf-auto-reimage will not handle them IIUC [10:24:46] indeed [10:24:47] kubernetes[2001-2004,2007-2014].codfw.wmnet [10:24:51] are the physical ones [10:25:03] I'll halt roll-rebooting ganeti nodes in codfw while you're reimaging [10:26:19] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes2001.codfw.wmnet', 'kubern... 
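Putting the pieces of the exchange above together, this is roughly how the host list was expanded and handed to wmf-auto-reimage; the flags are the ones quoted above, the nodeset incantation is spelled out by volans a little further down, and the Ganeti VM workers were excluded and reinstalled by hand:

    # Expand the physical-worker range quoted above into space-separated FQDNs
    # using clustershell's nodeset.
    HOSTS=$(nodeset -S ' ' -e 'kubernetes[2001-2004,2007-2014].codfw.wmnet')
    # Reimage with conftool depool (-c); --no-downtime because the hosts were already
    # downtimed earlier, --force to lift the default limit on how many can be reimaged.
    sudo -i wmf-auto-reimage --no-downtime -c -p https://phabricator.wikimedia.org/T277191 --force $HOSTS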
[10:26:54] jayme: I 'll do those 4 manually [10:27:13] akosiaris: or you teach me how to do it :) [10:28:17] jayme: https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM [10:28:27] <3 [10:29:22] hmm that doesn't point out that we need to also clean the old certs I think [10:29:33] akosiaris: it does [10:29:34] or decom + makevm :-P [10:29:50] akosiaris: "After you get the server login you need to proceed with the manual installation procedure. " links to https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Manual_installation [10:30:07] jayme: lol, ok point taken :-D [10:31:07] akosiaris: so I will do it then, if you're fine with that [10:32:03] fine by me [10:32:50] ack [10:40:40] Error: unable to load server certificate: tls: private key type does not match public key type [10:40:49] damn, I hate x509 [10:42:10] <_joe_> akosiaris / jayme we have errors in conftool for kubemaster in codfw, I guess expected? [10:42:19] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kubemaster is broken [10:42:35] oh, hi joe :) [10:42:38] yes [10:42:40] expected [10:43:09] let me fix that TLS thing and this should probably fix itself in a bit [10:44:03] akosiaris: kubernetes[2015-2016].codfw.wmnet don't have "depool" installed on them...that seems quite weird, no? [10:44:31] <_joe_> jayme: reinstalled servers or old servers? [10:44:41] old ones [10:44:56] two of the 4 VM workers we have [10:45:34] <_joe_> it looks like those servers don't include profile::conftool::client [10:45:51] <_joe_> because I think we only include it via profile::lvs::realserver elsewhere [10:46:07] I think they have that profile [10:46:16] the latter that is, not the client one perhaps? [10:46:34] <_joe_> we can look later anyways [10:46:40] <_joe_> do you need me to depool them? [10:47:26] no, I think I can just set them to inactive via confctl, right? [10:48:00] <_joe_> yes [10:48:25] then I'm fine, thanks :) [10:48:48] <_joe_> btw depool does just "pooled=no" [10:48:50] <_joe_> not inactive [10:49:20] oh...right. So thats bad anyways for reinstalling them [10:49:28] ok found my mistake I think [10:49:56] <_joe_> jayme: not really [10:50:23] _joe_: I thought they would then still be health-checked by pyball, no? [10:50:29] <_joe_> yes [10:50:54] but it would just not be alerted on? [10:51:24] <_joe_> so pybal seems to handle empty pools well [10:51:58] <_joe_> anyways [10:52:08] <_joe_> it's really not that different overall [10:54:30] <_joe_> akosiaris: can I merge your change? [10:55:21] <_joe_> done [10:55:22] _joe_: yes thanks [10:55:32] <_joe_> yeah :) [10:59:32] * jayme rebooting kubernetes2017.codfw.wmnet (just got the worker role) for kernel 4.19 [11:15:31] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [11:18:48] masters are done. I am gonna grab a quick bite while workers are being reimaged [11:19:28] ack, will do the same when I'm done with the ganeti workers [11:33:53] akosiaris: I think the last point in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Manual_installation (If you already began reinstalling the server before destroying its cert on the puppetmaster, you should clean out ON THE newserver (with care):) does not apply, right? 
The cert there seems okay to me [11:37:58] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes2009.codfw.wmnet', 'kubernetes2011.codfw.wmnet', 'kubernetes2001.codfw.wmne... [11:38:02] there is also sre.puppet.renew-cert fwiw [11:50:31] * akosiaris back [11:52:00] jayme: yeah, I don't think that's needed [11:53:02] akosiaris: ok [11:53:44] akosiaris: reimages are done, I'll repool all nodes in conftool [11:58:26] cool [12:00:14] akosiaris: so, from my point of view we're good for a helmfile sync of admin_ng [12:00:51] yeah, I was verifying where we are and it seems to me the same [12:02:17] akosiaris: nice. If you +1 me at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/671144 I'm happy to sync :) [12:02:47] Go for it [12:07:21] it's running [12:10:28] akosiaris: looks like BIRD is not getting ready (so calico-node fails) [12:10:36] BIRD is not ready: BGP not established with 208.80.153.192,208.80.153.193 [12:10:50] jayme: ah, kubernetes2017? [12:10:59] or more general? [12:11:08] looks isolated, let me check [12:11:41] akosiaris: yep, 17 only [12:13:22] ok, fixing [12:13:59] and some rbac error in kube-controller...*that* is weird. Checking that one [12:22:49] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) [12:24:29] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) p:05Medium→03High [12:24:35] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @Cmjohnson please let me know if I can help to put these 2 servers into production as soon as possible [12:24:50] jayme: cr*codfw* patch for adding kubernetes2017 being deployed right now [12:25:30] done [12:26:14] akosiaris: thanks. I'm stills staring at this weird message, wondering why that is an issue *not* and was not for the other clusters [12:26:19] maybe it's a red hering [12:26:30] User "kubelet" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope [12:26:32] that is [12:27:28] never seen that before (that I remember of) [12:28:01] Me neither...and I don't remember granting access to crd's to "system users" [12:28:44] I mean, it's probably correct...but still. I try the sync again now that you fixed 2017's BGP [12:29:00] just to see if it's missleading [12:30:25] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [12:31:08] no, it persists [12:31:10] wtf [12:32:56] what? [12:32:59] weird [12:33:45] I did paste the full error in the ticket [12:35:49] ah, so calico-node is now just fine on 2017 [12:36:00] oh, yes, yes. Sorry [12:36:08] cause that wasn't clear to me, ok. [12:36:19] hmmm weird [12:40:26] calico-cni should have that right from what I see [12:40:30] yeah [12:40:41] does the kubelet require too? And why it didnot in the other 2 clusters? [12:41:48] ah, wait [12:42:06] ah wait!!!! 
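The "BIRD is not ready" failure above is what calico-node's readiness probe reports when the core routers do not yet peer with a new node (kubernetes2017 here, fixed by a router patch shortly below). A hedged way to inspect the BGP sessions from the node itself, assuming calicoctl is available there:

    # On the affected worker, show BIRD's view of the BGP peerings.
    sudo calicoctl node status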
[12:42:10] profile::calico::kubernetes::calico_cni::token [12:42:38] grep config /etc/cni/net.d/10-calico.conflist [12:42:38] "kubeconfig": "/etc/kubernetes/kubelet_config" [12:42:41] I guess that's wrong? [12:42:59] yeah, it's running with kubelets account [12:43:19] ok, so we forgot the hiera that fixes that. Sigh [12:43:22] because we missed just another secret [12:43:40] let me guess, that's in the private repo only, right ? [12:43:48] where we don't have code review... [12:44:27] hm no...in private the token is set [12:45:35] ah no [12:45:38] hieradata/role/common/kubernetes/worker.yaml:profile::kubernetes::node::cni_config: /etc/kubernetes/kubelet_config [12:45:48] ok this needs to be deleted, removing [12:46:10] ah no, override in codfw actually [12:46:31] uh, yes please...or just write it to eqiad [12:50:49] my lunch is ready now (finally) ... I'll grab a bite for a bit. "helmfile sync" did roll back, so you can just re-run it with fill sync again when you removed cni_config [12:55:56] yup, go have lunch, I am forcing puppet changes and will retry [12:57:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10akosiaris) [12:59:30] success! [13:02:20] I 'll start syncing services [13:25:38] o/ did eventgate-* in codfw get totally depooled there for over an hour? [13:29:01] ottomata: yeah, it is since a couple of hours and it will stay depooled for longer [13:29:22] ottomata: yes, we are reinitializing the entire cluster [13:29:29] akosiaris: weeh, cool! Means we just have to do it right for it to work :D [13:30:39] I am down to r* btw. recommendation-api. Up to now it proceeds ok [13:31:06] akosiaris: Cool! [13:37:15] jayme: Done! [13:37:23] * akosiaris fixing labels for kubernetes2017 as well [13:37:52] ok thanks! [13:38:00] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm) [13:38:17] (i just sent a nice thank you email thinking this was simpler but still the thank you applies even though I misunderstood :p ) [13:38:36] akosiaris: great! I've added another action item to write a "now to add a node" guide [13:39:57] jayme: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=svc.codfw.wmnet&style=detail is a pretty greenish green [13:40:02] with the exception of modileapps [13:40:05] mobileapps* [13:41:18] cool! [13:41:31] let's take a look at mobileapps then :) [13:42:32] {"name":"mobileapps","hostname":"mobileapps-production-7c59ffcb49-rtsbh","pid":30,"level":"ERROR","message":"upstream connect error or disconnect/reset before headers. reset reason: local reset","status":503,"type":"internal_error","detail":"upstream connect error or disconnect/reset before headers. reset reason: local reset","request_id":"88c46a90 [13:42:32] -de6d-4b7d-bb13-bdbbe6a97b97","request":{"url":"/en.wikipedia.org/v1/page/talk/Talk%3ASalt","headers":{"user-agent":"ServiceChecker-WMF/0.1.2","x-request-id":"88c46a90-de6d-4b7d-bb13-bdbbe6a97b97","content-length":"0"},"method":"GET","params":{"0":"/en.wikipedia.org/v1/page/talk/Talk:Salt"},"query":{},"remoteAddress":"127.0.0.1","remotePort":33364} [13:42:32] ,"levelPath":"error/503","msg":"upstream connect error or disconnect/reset before headers. 
reset reason: local reset","time":"2021-03-16T13:31:26.698Z","v":0} [13:43:28] akosiaris: has no egress enabled [13:43:57] aaaah the revert was never merged I guess [13:44:01] * akosiaris merging that [13:44:21] jayme: you 're fast ;) [13:44:35] I have eaten :P [13:45:17] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670540 btw, waiting for CI [13:46:05] In staging it has egress enabled still...I "love" how those are out of sync all the time [14:14:59] jayme: everything green, I think we can take a step back and relax for the day [14:15:47] akosiaris: nice! Should we maybe un-downtime the services to get alerted if anything is unusual? [14:20:39] good morning ... if someone can take a look at testreduce1001 (parsing team's testing vm) .. looks like mysql is stuck because of disk space issues (Mar 16 14:12:52 testreduce1001 mysqld[15065]: 2021-03-16 14:12:52 0 [Warning] mysqld: Disk is full writing './testreduce1001-bin.000051' (Errcode: 28 "No space left on device"). Waiting for someone to free space [14:20:40] ) [14:31:25] 10serviceops, 10SRE, 10Performance-Team (Radar): Get rid of nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (10AMooney) [14:39:55] jayme: and it's back in complaining... [14:45:39] subbu: I did cleanup some space but keep in mind that the testreduce db is 70% of the disk space on that host. It's like the prime thing that would make sense to comb through and save space. [14:48:54] ok ... the disk space will only grow as we run more tests ... we can probably delete results from old test runs .. but since it is a VM, can the disk space be bumped a little so we have a bit more breathing room. [14:49:30] but, we'll add 'clear out test results from the db from tests older than 2 months' to our chore list to prevent this in the future. [14:51:15] i suppose we'll have to run optimize tables and restart mysql after doing that ... presumably that will reduce the on-disk footprints of those tables? [14:53:24] I guess so [14:55:23] actually mobileapps is flapping a bit, I 'll keep an eye on it [14:55:35] was about to ask what you mean... [14:55:43] mobileapps in the newly reinitialized cluster that is [14:55:50] Seems like I always look in the green phases :) [14:58:20] could be ;-) [14:58:28] akosiaris: what issues did you see in mobileapps? Timeouts? [14:58:34] but that makes me the unlucky one, doesn't it ? [14:59:14] e.g. [14:59:16] [2021-03-16 14:48:54] SERVICE ALERT: mobileapps.svc.codfw.wmnet;Mobileapps LVS codfw;OK;SOFT;2;All endpoints are healthy [14:59:16] Service Critical[2021-03-16 14:46:32] SERVICE ALERT: mobileapps.svc.codfw.wmnet;Mobileapps LVS codfw;CRITICAL;SOFT;1;/{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200) [14:59:16] Service Ok[2021-03-16 14:39:22] SERVICE ALERT: mobileapps.svc.codfw.wmnet;Mobileapps LVS codfw;OK;SOFT;2;All endpoints are healthy [14:59:16] Service Critical[2021-03-16 14:36:56] SERVICE ALERT: mobileapps.svc.codfw.wmnet;Mobileapps LVS codfw;CRITICAL;SOFT;1;/{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) [14:59:53] hmm...thats weird [15:00:59] akosiaris: do you have time to join the sre-interview meeting? I might have volunteered you for an interview tomorrow :-D [15:46:20] jayme: and downtimes deleted :-) [15:46:39] akosiaris: great! 
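The mobileapps 503s above came down to the egress network policy not being enabled for the codfw release (the revert had never been merged). A hypothetical way to compare what is actually applied against what the chart would render, assuming the service's namespace matches its name and the per-environment helmfile layout used elsewhere in this log:

    # NetworkPolicy objects currently applied for the service (namespace name assumed).
    kubectl -n mobileapps get networkpolicy
    # What a sync would change, without applying anything.
    helmfile -e codfw diff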
[15:47:02] let's hold the dance till tomorrow ;-) [15:47:04] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10ops-monitoring-bot) Icinga downtime set by akosiaris@cumin1001 for 16 days, 16:00:00 1 host(s) and their services with reason: Extend downtime... [15:47:09] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10ops-monitoring-bot) Icinga downtime set by akosiaris@cumin1001 for 16 days, 16:00:00 1 host(s) and their services with reason: Extend downtime... [15:47:15] yup, +1 [15:56:56] akosiaris: regarding the depool not available topic from earlier. We do include ::profile::lvs::realserver but we do set "profile::lvs::realserver::use_conftool: false" for workers [15:57:11] should we change that? [15:57:51] the script itself worked fine on the two nodes where it was accidentally installed :) [15:59:08] a234b1be37f56bb2b1d0e190b8534dad2436fb72 implies that the depool script should be around [15:59:42] maybe something changed since then [16:02:24] <_joe_> jayme: yes, that script works fine, not sure about all the others [16:03:14] jayme: we need to build some tooling I guess per our Action Items and we might just rely on the script being there. And we are anyway gonna be doing kernel upgrades on those hosts, so it sounds like it makes sense to have it installed [16:03:45] that being said, if we ever manage to do that calico bgp for service/external IPs OKR, we might end up not being behind pybal at all [16:04:04] true...but until then. [16:04:46] the justification for "use_conftool: false" is "We don't need conftool safe restart scripts on k8s." [16:05:00] but I think it does not hurt to have them, tight? [16:05:03] <_joe_> jayme: which is true, but then we need pool/depool on occasion [16:05:19] <_joe_> although I probably thought we would never act ssh'ing on the server [16:05:36] <_joe_> and cookbooks have their own libraries to interact with conftool [16:05:44] depends I guess on how we implement the cookbook ? [16:05:51] we could avoid it I think [16:06:00] <_joe_> no, cookbooks should never use "pool/depool". Never [16:06:09] <_joe_> if they do, that's an antipattern [16:06:24] ah, so the valid use case that remains is interactively calling it [16:06:28] <_joe_> pool/depool allow little to none preservation of state, and just very basic error handling [16:06:45] I was just lazily running "sudo cumin ... depool", tbh [16:06:58] that's how I figured it's not there [16:07:49] to be clear: I'm not insisting that we should have it. 
I was just wondering why we had hosts with and without [16:09:02] leftovers probably [16:09:14] if you found it on like 4 hosts only, it was the initial deployment [16:09:35] then let's just assume it was never there and leave it like that :) [16:09:57] sure ;-) [16:10:44] * akosiaris off for the day [16:10:53] o/ [16:15:47] 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team (Pipeline), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (10thcipriani) [16:16:29] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Pipeline), 10Release-Engineering-Team-TODO: Investigate how we can provide an mwdebug functionality on kubernetes - https://phabricator.wikimedia.org/T276994 (10thcipriani) [16:19:18] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Pipeline), 10Release-Engineering-Team-TODO: Progressive rollout of MediaWiki deployment on Kubernetes - https://phabricator.wikimedia.org/T276487 (10thcipriani) [16:24:14] can someone verify for me that this output from `helmfile diff` is OK? - https://phabricator.wikimedia.org/P14861 [16:24:44] the egress stuff, it was added some time ago and just wasn't deployed [16:25:17] (this is for sessionstore) [16:26:39] urandom: I'll have a look in a sec [16:27:54] jayme: thanks [16:31:19] urandom: yeah, looks good. Go ahead [16:35:11] jayme: cool; thanks [16:51:56] 10serviceops, 10Analytics-Radar, 10Cassandra, 10ContentTranslation, and 10 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (10Eevans) [17:04:50] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2226.codfw.wmnet` - mw2226.codfw.wmnet (**PASS**) - Downti... [17:11:46] 10serviceops, 10MW-on-K8s, 10SRE: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10jijiki) >>! In T276908#6916385, @Joe wrote: > After some more work, this is my ideas for liveness and readiness probes: > > # httpd: > -- liveness: tcp connection to the main... [17:37:44] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2227.codfw.wmnet` - mw2227.codfw.wmnet (**PASS**) - Downti... [17:45:31] 10serviceops, 10Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (10ssastry) [18:39:12] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) Hi @AMooney, I'd like to present this patch as the other of the two I was hoping to bring to your attention for next clinic duty... Please let me know... [18:59:14] 10serviceops, 10Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (10Dzahn) Disk space can't be simply bumped up, but what we can offer is to create a new virtual hard disk and mount it into the file system. Then you can use the new disk for mysql or whatever you like. But... 
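For the record, "set them to inactive via confctl" from earlier in the day is a one-liner, and is how the missing depool script was worked around; a sketch against the two VM workers mentioned above (as noted, the depool script only sets pooled=no, which pybal keeps health-checking, while inactive is meant to drop the host from the generated pybal config):

    # Mark both VM workers inactive for the reinstall; the node object is confctl's default.
    sudo confctl select 'name=kubernetes2015.codfw.wmnet' set/pooled=inactive
    sudo confctl select 'name=kubernetes2016.codfw.wmnet' set/pooled=inactive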
[19:08:05] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2228.codfw.wmnet` - mw2228.codfw.wmnet (**PASS**) - Downti... [19:10:26] 10serviceops, 10Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (10ssastry) Sure. [19:10:39] 10serviceops, 10Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (10ssastry) And, thanks! :) [19:17:00] 10serviceops, 10Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (10Dzahn) ` [ganeti1011:~] $ sudo gnt-instance modify --disk add:size=50G testreduce1001.eqiad.wmnet Tue Mar 16 19:15:57 2021 - INFO: Waiting for instance testreduce1001.eqiad.wmnet to sync disks Tue Mar 16... [19:18:34] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Dzahn) [19:20:36] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2229.codfw.wmnet` - mw2229.codfw.wmnet (**PASS**) - Downti... [19:45:22] 10serviceops, 10Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (10Dzahn) @ssastry The disk has been created but for the VM to detect it we have to reboot the VM just like if it was a physical machine. Can I do that at any time or do you have some tests running right now? [21:43:45] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10kostajh) Actually the 503 is reproducible with GET as well: ` time curl -s https://api.wi... [22:02:33] 10serviceops, 10Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (10ssastry) mysql is running a recovery after a previous crash .. so, let us wait for it to complete before restarting.
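Once testreduce1001 is rebooted and picks up the extra virtual disk added in T277580, making the space usable is ordinary block-device work; a rough sketch only, with the device name and mount point as assumptions (the real follow-up, such as relocating the mysql data, is not covered here):

    # Device name is an assumption; on a Ganeti/KVM guest the new disk is usually the
    # next virtio device, e.g. /dev/vdb. Confirm with lsblk before touching anything.
    lsblk
    sudo mkfs.ext4 /dev/vdb
    sudo mkdir -p /srv/extra
    sudo mount /dev/vdb /srv/extra
    # Make the mount persistent across reboots.
    echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /srv/extra ext4 defaults 0 2" | sudo tee -a /etc/fstab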