[06:21:50] 10serviceops, 10Operations, 10observability, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10elukey) >>! In T224454#6269950, @CDanis wrote: > There's no alert yet for memcache NIC saturation, and I don't believe there's...
[07:25:36] 10serviceops, 10Beta-Cluster-Infrastructure, 10Operations, 10observability, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10hashar) Just a note the Apache logs are still emitted to logstash for mw1262 and mw1276 ` name=hieradata/host...
[08:09:17] 10serviceops, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786 (10JMeybohm)
[08:12:03] 10serviceops, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786 (10JMeybohm)
[08:12:39] 10serviceops, 10Page Content Service, 10Prod-Kubernetes, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786 (10JMeybohm) p:05Triage→03High
[08:17:38] 10serviceops, 10Page Content Service, 10Prod-Kubernetes, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 (10JMeybohm)
[08:22:29] 10serviceops, 10Page Content Service, 10Prod-Kubernetes, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 (10JMeybohm) Still getting ErrImagePull in `kubectl get events`: ` 73s Normal Pulling...
[08:41:37] 10serviceops, 10Page Content Service, 10Prod-Kubernetes, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 (10JMeybohm) It's only docker that is totally sure that the certificate is not valid, so I guess it does...
[08:42:03] akosiaris: ^
[08:44:44] 10serviceops, 10Page Content Service, 10Prod-Kubernetes, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 (10JMeybohm) p:05High→03Unbreak! Raising prio as we do have the same situation on prod clusters.
[08:49:06] akosiaris: _joe_: did you ever do a dockerd restart on docker 1.12.6? I see that we have live-restore enabled, but I have a bit of a trauma about that :-)
[08:49:32] (restart while not terminating all containers ofc)
[08:49:41] jayme: we can just drain the node I think and do it
[08:50:01] we've definitely restarted all other components, not so sure about docker
[08:50:19] akosiaris: staging only has two nodes...was not sure if the workloads fit on one
[08:50:32] <_joe_> jayme: this started happening after the puppet CA expiration, right?
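A minimal sketch of the drain-then-restart sequence being discussed, assuming a kubectl context with admin rights; the node name, ssh access pattern, and drain flags are illustrative, not the exact commands that were run:

```
# Hypothetical sketch, not the real procedure: cordon and drain the node
# (calico, running as a DaemonSet, is left in place by drain), restart
# dockerd so it reloads the system CA bundle, verify, then readmit pods.
NODE=kubestage1002.eqiad.wmnet

kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-local-data

# With live-restore enabled, running containers should survive the restart.
ssh "$NODE" 'sudo systemctl restart docker'

# Confirm pulls work again before letting pods back onto the node.
ssh "$NODE" 'sudo docker pull docker-registry.discovery.wmnet/wmfdebug:0.0.3'
kubectl uncordon "$NODE"
```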
[08:50:52] yeah...I think it's dockerd not refreshing the CA
[08:50:59] <_joe_> how nice
[08:51:01] that sounds right
[08:51:15] jayme: doesn't matter, it's staging, it's ok if the workloads don't fit
[08:51:28] k
[08:51:31] <_joe_> so on staging I would just do the restart without draining the node :)
[08:51:32] we can even let them be in "pending" state, no harm
[08:51:39] true
[08:51:42] <_joe_> to see if we can do it
[08:51:50] yeah, try it out I'd say
[08:52:03] worst case kubelet will restart all pods
[08:52:09] (at least I hope)
[08:58:40] I've taken the safe path with drain and check, as drain will leave calico running
[08:59:09] ah, good idea
[08:59:13] docker restart re-creates the container :-/
[08:59:26] so "half-thumb" I guess
[09:00:08] And "docker pull" is working again ofc
[09:01:17] ok, so we need to restart dockerd everywhere it seems, and judging from this, in a rolling fashion, giving docker enough time to start all containers again
[09:01:25] did I get that right?
[09:01:45] so, cumin ftw?
[09:02:07] Not completely sure what the container restart does to kubelet ...
[09:02:57] I would suggest that I uncordon kubestage1001.eqiad.wmnet, cordon (not drain) kubestage1002.eqiad.wmnet and do the docker restart there...we'll see if kubelet goes crazy
[09:04:35] <_joe_> ok
[09:04:44] <_joe_> and btw, this seems like cookbook material
[09:05:06] <_joe_> cordon+drain worker, do some action, restore state
[09:05:33] definitely, plus "punch someone at docker inc"
[09:05:56] <_joe_> jayme: I'm not even sure who we can blame for this
[09:06:04] <_joe_> if them or us for using an ages-old version
[09:06:52] "internet" says we're not the only ones bitten by that, but yeah...might be us anyways
[09:07:31] <_joe_> yeah what I meant is maybe they already realized this and fixed it in a newer version
[09:07:52] got that
[09:09:01] the really bad thing about this is that if we don't have the images cached on "the other" nodes, kubelet will not be able to start then
[09:09:04] *them
[09:09:13] them being the pods/containers
[09:14:13] <_joe_> there is another solution
[09:14:25] <_joe_> disable ntp, turn the clock back a week, restart docker
[09:14:31] <_joe_> :P
[09:14:43] ahahahaha
[09:15:16] jayme: +1 on the plan
[09:16:38] _joe_: not sure if we need to restart docker in that case...so it sounds like a good plan!
[09:18:15] <_joe_> well we want to reenable ntp eventually
[09:18:27] <_joe_> but it could be a stopgap while we restart the first few nodes?
[09:19:57] I think we can just drain nodes 1 by 1 and restart docker (and perhaps all other components as well in the process). It's anyway going to be part of the kubernetes version upgrade cookbook
[09:20:19] I was joking actually...but now that you say it, it could be an option to not end up in a catch-22 with draining
[09:20:35] let's see how the docker restart goes first
[09:21:33] <_joe_> sure
[09:21:33] there is a good chance btw that many docker daemons on the newly provisioned hosts don't have images for a lot of the pods
[09:21:52] <_joe_> it's ironic I timed my vacation to avoid the puppet ca expiry and I still get bitten by it
[09:22:12] <_joe_> akosiaris: the new hosts should not have that problem
[09:22:21] <_joe_> they should have the new CA already
[09:22:25] <_joe_> can we confirm that?
[09:22:26] so that catch-22 would definitely catch us. On the other hand, those nodes are also probably empty and we can just restart dockerds there easily
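A rough sketch of what "cumin ftw" could look like for the rolling dockerd restart: one node per batch, with a pause so docker has time to bring its containers back up. The host expression, batch timings, and the combined command are assumptions, not what was actually run:

```
# Hypothetical cumin invocation: -b 1 restarts one node at a time,
# -s 120 sleeps two minutes between nodes; the trailing pull verifies
# the daemon picked up the refreshed CA before moving on.
sudo cumin -b 1 -s 120 \
    'kubernetes[1001-1004].eqiad.wmnet,kubernetes[2001-2004].codfw.wmnet' \
    'systemctl restart docker && docker pull docker-registry.discovery.wmnet/wmfdebug:0.0.3'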
[09:22:41] sure, easy enough
[09:22:51] <_joe_> akosiaris: those newer hosts should just work without a restart
[09:23:06] <_joe_> the puppet ca was refreshed in october 2019 IIRC
[09:25:07] yeah, docker pull was successful on 1 new node, let me run it across the entire fleet really quickly
[09:25:38] that's good news
[09:27:05] sudo docker pull docker-registry.discovery.wmnet/wmfdebug:0.0.3
[09:27:17] (8) kubernetes[2001-2004].codfw.wmnet,kubernetes[1001-1004].eqiad.wmnet
[09:27:21] those are the nodes that this failed on
[09:27:36] the expected ones. Every other node fared just fine
[09:27:54] so we can first cordon those, then drain them and just restart docker on those
[09:30:10] It looks as if the docker restart in the case of kubestage1002 did not bring down every container... kubelet was crying a lot (probably just because dockerd was down) but it seems to have settled now
[09:31:57] What's left are some "No ref for container "docker://e487a846d886322c37124d714c459f3826b9614fdbae712454cc74cc2397f901" (eventgate-*" messages from kubelet, but maybe they've been there before
[09:32:37] yeah, I've seen similar in kubernetes1001 just now, they can probably be ignored for now
[09:33:34] indeed. the containers in question have been up for two weeks
[09:41:06] 10serviceops, 10Page Content Service, 10Prod-Kubernetes, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 (10JMeybohm) This is the old Puppet CA that some docker daemons have still loaded. Unfortunately a docke...
[09:41:27] So what do you think akosiaris...drain and restart or brave-mode?
[09:41:57] My vote is for drain & restart ;-)
[09:42:06] 10serviceops, 10ChangeProp, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Changeprop config management in beta cluster - https://phabricator.wikimedia.org/T251176 (10hnowlan) 05Open→03Resolved
[09:44:08] ...as it's only a couple of nodes and we definitely won't run into trouble scheduling the pods somewhere else. If we cordon all affected nodes first ofc
[09:45:34] which is what we should be doing anyways (cordon), so I'm doing that now
[09:47:34] <_joe_> did we re-enable the restrictions for sessionstore?
[09:49:13] _joe_: yes https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/607980
[09:52:51] jayme: cordon, drain, restart
[09:53:04] akosiaris: ack
[09:53:25] the cordon part should avoid catch-22s with pods, for whatever reason, being scheduled on the other 3 affected nodes
[09:53:52] yeah. I have them cordoned already
[11:05:54] akosiaris: _joe_: we should be fine now
[11:11:13] 10serviceops, 10Page Content Service, 10Prod-Kubernetes, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Did a rolling restart on all affected nodes, we should...
[11:12:04] <_joe_> great
[11:16:19] +1
[11:16:37] actually, where is that thumbs up emoji ...
[11:16:57] 👍
[11:20:40] <_joe_> akosiaris: the correct emoji was 💯
[11:21:08] <_joe_> we should ask some millennial or gen z to teach us a class on correct emoji usage
[11:29:30] ageism straight up
[11:29:51] it so happens that both of those are acceptable :-P
[11:51:32] _joe_: I'd rather ask my niece. She should be able to make them feel old enough
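An illustrative take on the cordon-first strategy agreed above ("cordon, drain, restart") for the eight nodes the pull test failed on; the node list comes from the transcript, but the loop body and kubectl flags are assumptions:

```
# Sketch only. Cordon every affected node first, so pods evicted later
# cannot land on another still-broken node (the catch-22 above), then
# drain and restart docker one node at a time.
NODES="kubernetes1001 kubernetes1002 kubernetes1003 kubernetes1004"  # eqiad; codfw analogous

for n in $NODES; do
    kubectl cordon "${n}.eqiad.wmnet"
done

for n in $NODES; do
    kubectl drain "${n}.eqiad.wmnet" --ignore-daemonsets --delete-local-data
    ssh "${n}.eqiad.wmnet" 'sudo systemctl restart docker'
    kubectl uncordon "${n}.eqiad.wmnet"
done
```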
[11:51:58] btw, this Greek-names-for-projects thing needs to stop at some point https://github.com/bpineau/katafygio
[11:52:00] <_joe_> apergos: it's not ageism if I'm joking about myself, right?
[11:52:20] <_joe_> akosiaris: ahahahahahahah you're doomed
[11:53:32] that project btw... I mean I am pretty sure it's all good, but the use case it's solving? Saving user changes in git instead of getting users to commit changes to git? I have a feeling it's a bit backwards :P
[11:54:53] <_joe_> akosiaris: I disagree, but I'll elaborate later
[11:54:58] <_joe_> taking a break :)
[12:00:20] waste of a good name there
[12:16:15] akosiaris: I think they're a bit orthogonal
[12:32:41] to be fair
[12:33:34] katafygio means shelter/haven
[12:34:39] I think apothiki (safehouse/storage) would be more appropriate
[12:37:21] but I agree with apergos, waste of a good name
[12:51:28] +1 for apothiki, who wants to write them and suggest the name change?
[12:51:36] * apergos invokes the finger-to-nose protocol: not it!
[13:15:11] cdanis: yes, depending on the structural setup of an org, they can be pretty orthogonal. In our case, we can (and have multiple times already) reinstantiate an entire cluster from the helmfile.d part of our deployment-charts repo.
[13:15:41] so we don't really need that tool, at least for now and as long as the above stays true
[14:06:45] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10User-brennen: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10JMeybohm) Unfortunately removing all tags of an image (e.g. repository) does not remove the repository itself from the registry[1][2]....
[14:18:23] oh hey, I think I lost this window on my last IRC restart some time ago
[14:19:52] So, apiportalwiki :
[14:20:08] https://phabricator.wikimedia.org/T246945 -> 3x SRE-level patches I know of
[14:20:32] we can deal with the DNS+Varnish bits in https://gerrit.wikimedia.org/r/c/operations/dns/+/599273/2/templates/wikimedia.org + https://gerrit.wikimedia.org/r/c/operations/puppet/+/601924/1/modules/varnish/templates/text-frontend.inc.vcl.erb
[14:20:47] there's also a prod_sites patch for it at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599751/1/modules/mediawiki/manifests/web/prod_sites.pp
[14:21:10] is that fairly straightforward? it's a data change, so seems safe-ish in general.
[14:21:26] it should probably be operational before we start sending traffic there, so we don't cache dumb things
[14:22:46] * apergos peeks in
[14:24:19] at least one wiki-level patch went in yesterday with: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/608702
[14:27:40] the three patches that are up for review look reasonable enough
[14:29:34] The wiki already exists on beta too
[14:30:16] the messages patch should be fine even with nothing there, it's just setting up for localization
[14:30:24] ack
[14:30:36] The wiki probably won't be created in prod for a few more weeks anyway
[14:30:37] I'll work through merging them and getting them puppetized, etc, in a little bit
[14:30:53] without creation I guess it will still give us a generic projects page?
[14:31:15] with 1 hour cacheability IIRC
[14:31:19] so should be fine
[14:31:37] yeah. which is what happens with other wikis that require apache changes etc
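A minimal sketch of the cluster-rebuild point made at 13:15:11, that the desired state in helmfile.d makes a tool like katafygio unnecessary; the repo layout, environment name, and sync loop here are assumptions, not the real tooling:

```
# Hypothetical illustration: with every service's desired state kept
# under helmfile.d/ in deployment-charts, a cluster can be re-created
# by re-applying that state against a fresh kubernetes API.
git clone https://gerrit.wikimedia.org/r/operations/deployment-charts
cd deployment-charts/helmfile.d

# Sync each service's helmfile into the (freshly rebuilt) cluster.
for svc in services/*/; do
    (cd "$svc" && helmfile -e staging sync)
done
```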
[14:31:59] an hour definitely isn't going to kill us
[14:32:44] And even when it is cached, you can just visit other URLs to access the wiki, even if the root URL that would redirect is cached as non-existent
[14:33:08] well they probably want to send people to the main page
[14:33:19] but so they just wait 1 hour before sending the announce mail :-)
[14:33:26] It's automated
[14:33:37] But when it's created, it's not going to be for public consumption for weeks after
[14:33:44] Lots of content to create etc, so it really doesn't matter
[14:34:22] I gotta look at the new shiny automated script someday
[14:35:01] shiny new? :P
[14:35:15] automated wiki creation? yeah
[14:35:26] maybe it's old hat to you
[14:36:20] it's not really automated... it's the same old addWiki script
[14:36:33] ah bummer, I thought... well my bad
[14:36:43] amir created some wrapper around it for spitting out the commands to run etc
[14:36:47] sorry for the digression anyways
[14:37:16] :)
[14:38:03] so it looks like prod_sites will hit apache config on a bunch of appservers/api/etc
[14:38:14] is the norm to disable puppet on them and try one Just In Case?
[14:38:49] I'm checking the compiler now
[14:40:49] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (10JMeybohm) a:03akosiaris
[14:41:15] don't we want dns first? been a long time since I've poked around at these things
[14:41:51] 10serviceops, 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10User-brennen: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10JMeybohm)
[14:42:04] https://puppet-compiler.wmflabs.org/compiler1001/23606/mw1366.eqiad.wmnet/index.html
[14:42:22] ideally it's outside-in, but I think either way it would eventually sort itself out
[14:42:34] err sorry, inside-out
[14:42:55] (mediawiki support, apache support, varnish support, then dns support)
[14:43:07] so that no outer layer sees a broken inner layer, is my thinking
[14:43:19] I would have done it 100% backwards then, heh
[14:43:52] yeah, there's different viewpoints on all such things I think
[14:44:04] I tend to think of each layer from the user down to whatever database as clients of each other
[14:44:17] varnish is a client of the apache server, apache is a client of mediawiki, etc
[14:44:32] if A makes requests of B, A is a client of B, and you do B before A
[14:45:21] but honestly in a case like this, we could probably merge them randomly and worst case some redirect takes a little while to expire out of cache and correct itself or whatever
[14:49:33] lol
[14:51:36] nice to have all the things be that resilient tbh
[15:27:22] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (10akosiaris) 05Open→03Resolved Double checked across all nodes, this has been applied successfully. R...
[15:27:27] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10akosiaris)
[16:54:58] 10serviceops, 10Page Content Service, 10Prod-Kubernetes, 10Product-Infrastructure-Team-Backlog, 10Kubernetes: kubernetes unable to pull images from registry - https://phabricator.wikimedia.org/T256786 (10jeena) Thanks @JMeybohm !
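A hedged sketch of the "disable puppet, try one" canary approach raised at 14:38:14 for the prod_sites apache change. The host expressions and canary choice (mw1366, the host fed to the puppet compiler above) are assumptions; the configtest is the point of the exercise:

```
# Illustrative only: keep puppet from rolling the apache change out
# fleet-wide, converge one canary, and syntax-check its config.
sudo cumin 'A:mw' 'puppet agent --disable "apiportalwiki prod_sites rollout"'

# Enable and run puppet on the canary alone.
sudo cumin 'mw1366.eqiad.wmnet' 'puppet agent --enable'
sudo cumin 'mw1366.eqiad.wmnet' 'puppet agent -t'  # exits 2 when changes applied
sudo cumin 'mw1366.eqiad.wmnet' 'apachectl configtest'

# If the canary looks healthy, let the rest of the fleet converge.
sudo cumin 'A:mw' 'puppet agent --enable'
```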
[18:16:38] 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.35.0-wmf.38/includes/config/EtcdConfig.php:202 - https://phabricator.wikimedia.org/T256900 (10mmodell)
[18:38:50] 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.35.0-wmf.38/includes/config/EtcdConfig.php:202 - https://phabricator.wikimedia.org/T256900 (10mmodell)
[18:39:05] 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.35.0-wmf.38/includes/config/EtcdConfig.php:202 - https://phabricator.wikimedia.org/T256900 (10mmodell)
[19:01:02] 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.35.0-wmf.38/includes/config/EtcdConfig.php:202 - https://phabricator.wikimedia.org/T256900 (10Umherirrender) ` if ( $loop->invoke() !== WaitConditionLoop::CONDITION_REACHED )...
[22:02:43] 10serviceops, 10Core Platform Team, 10Operations, 10Release Pipeline, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena)
[22:12:39] 10serviceops, 10Core Platform Team, 10Operations, 10Release Pipeline, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena) We've published a cassandra image to docker-registry.wikimedia.org/releng/cassandra311:0.0.1 I've test...
[22:15:11] 10serviceops, 10Core Platform Team, 10Operations, 10Release Pipeline, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena)