[06:59:38] https://phabricator.wikimedia.org/T226675 this one is marked as doing but noone is assigned
[06:59:42] please claim it
[07:00:17] akosiaris: has this one in doing https://phabricator.wikimedia.org/T226237 is still ongoing?
[07:00:38] im doing this https://phabricator.wikimedia.org/T212130 and is ongoing BTW
[07:01:21] m.utante: could you check this task too https://phabricator.wikimedia.org/T190568 ? is still ongoing?
[07:02:27] this one https://phabricator.wikimedia.org/T226675 is also in doing and nobody has claimed it
[07:04:32] serviceops: create a public docker-registry lvs endpoint for being used behind varnish - https://phabricator.wikimedia.org/T226642 (fsero) p:Triage→Normal
[07:05:52] serviceops, Core Platform Team Backlog (Designing), Services (designing): Allow service-checker to run multiple domains for RESTBase - https://phabricator.wikimedia.org/T227198 (fsero) p:Triage→Normal
[07:06:03] fsero: I think we agreed to do this later
[07:06:52] ups sorry! i'll wait, i was just doing a minor cleanup
[07:07:00] ping me when you are around
[07:07:04] sure
[08:19:29] fsero: yup, still ongoing
[09:01:21] I am around
[09:01:41] and not dead from the heatwave
[09:21:48] akosiaris: fsero are you around to tidy things up?
[09:22:06] sure
[09:23:59] should we put the phab* reimage to the next up column since daniel is not here now and we don't when he is going to get it
[09:24:35] I would leave it to him then
[09:25:25] ok then mutante move https://phabricator.wikimedia.org/T190568 to whichever column you find appropriate
[09:25:51] Yup
[09:25:52] undeploy electron will be done by me and petr, I have delayed it
[09:26:27] wmerrors I am not sure if there are any leftovers, I will go through it and find out
[09:27:01] thumbor memory errors will go back to the pile, I don't think I will do anything for now
[09:29:04] https://phabricator.wikimedia.org/T195392 (switch crons to php7) I am putting it on next up, me and daniel will go throught it
[09:29:10] through*
[09:34:37] serviceops: Reboot rdb* cluster - https://phabricator.wikimedia.org/T227304 (jijiki)
[09:36:30] jijiki: then would you claim this task https://phabricator.wikimedia.org/T226675 ?
[09:36:40] ifyou think is not going to happen soon
[09:36:43] it should go back to backlog
[09:36:51] next week could be moved again
[09:39:02] fsero: well this is Petr's task and I am helping
[09:39:04] better left as is
[09:39:29] then should be maybe on radar?
[09:39:46] i dont like having tasks listed that we are not actively doing it leds us to pile things
[09:39:53] better few than many
[09:40:13] I think it will be ok, I will do a little less work that I would do it we were doing it
[09:40:22] than*
[09:41:06] serviceops: Reboot rdb* cluster - https://phabricator.wikimedia.org/T227304 (jijiki) p:Triage→Normal
[09:42:04] I added some reboots I need to do
[09:42:28] and we are missing the documentation tracking task
[09:42:41] I should have created last week
[09:42:46] documentation?
[09:44:00] akosiaris: remember when we had this discussion about listing what we want to document
[09:44:17] create some notes/structure/whatever on each one
[09:44:26] and then see how to improve that
[09:44:32] but at least have something
[09:45:23] our meeting notes say:
[09:45:25] Documentation
[09:45:27] Create a tracking task for it
[09:45:29] List what we will document, at least for starters
[09:45:31] migrating/deploy/operate a service on k8s
[09:45:54] In that case we already have https://phabricator.wikimedia.org/T213090 and https://phabricator.wikimedia.org/T220397
[09:47:01] then this is part of the docs we want to produce
[09:47:02] yes?
[09:51:42] serviceops, Operations, observability, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey) @fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any sugg...
[09:59:32] I guess so?
[10:00:07] ok I will create one, add it in the goals column and if we disagree
[10:00:11] we get rid of it
[10:14:17] serviceops: Missing Documentation for Service Operations - https://phabricator.wikimedia.org/T227306 (jijiki)
[10:15:39] serviceops: Missing Documentation for Service Operations - https://phabricator.wikimedia.org/T227306 (jijiki) p:Triage→Normal
[10:19:50] fsero: I am rebooting master rdb* servers
[10:19:59] it should take about 5'
[10:20:15] no worries, registry should live without them
[10:20:20] would be slower
[10:20:21] thats it
[10:22:17] k
[10:32:16] fsero: jijiki: the page I guess disagress ?
[10:32:21] disagrees?
[10:32:23] yes
[10:32:26] that is not expected
[10:36:07] it recovered when redis came up
[10:41:31] mm i think i know what happened https://github.com/docker/distribution/blob/749f6afb4572201e3c37325d0ffedb6f32be8950/health/doc.go#L8
[10:41:40] we have health probes configured for redis
[10:41:59] and the health probe returned 503
[10:42:07] which is good and bad :P
[10:43:09] anyhow yet another reason to use envoy or another redis proxy so we can configure redis proxy addr on registry and proxy moves to slave when master is down
[11:02:47] serviceops, Operations, observability, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (fgiunchedi) >>! In T224454#5307968, @elukey wrote: > @fgiunchedi I noticed that node_network_transmit_bytes_total is already u...
[11:21:51] serviceops: Reboot rdb* cluster - https://phabricator.wikimedia.org/T227304 (jijiki) Open→Resolved
[11:47:26] serviceops, Operations, ops-eqiad: Upgrade firmware on ms-be1021 (Was: Degraded RAID on ms-be1021) - https://phabricator.wikimedia.org/T227076 (jijiki) Open→Resolved a:jijiki There are still messages like ` [ 122.753602] perf: interrupt took too long (2953 > 2500), lowering kernel.per...
[12:23:31] serviceops, Operations: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (hashar)
[12:24:16] serviceops, Operations: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (hashar) Seems `krypton.eqiad.wmnet` is still using Jessie / php5.6. We could use an upgrade to Stretch to drop php5.6 support from the CI infrastructure :-]
[13:03:05] serviceops, Operations, observability, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey) >>! In T224454#5308149, @fgiunchedi wrote: >>>!
In T224454#5307968, @elukey wrote: >> @fgiunchedi I noticed that node_...
[13:04:33] serviceops, Prod-Kubernetes, Patch-For-Review: Helm packages deployment tool, at least for cluster applications. - https://phabricator.wikimedia.org/T212130 (fsero) after further testing it seems that in order to use helmfile we need to set up some environment variables i.e HELM_HOME=/etc/helm KUBECO...
[15:25:57] serviceops, Operations, Core Platform Team (Session Management Service (CDP2)), Core Platform Team Kanban (Done with CPT), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (WDoranWMF)
[15:26:16] serviceops, Core Platform Team (Session Management Service (CDP2)), Core Platform Team Kanban (Done with CPT), User-Clarakosi, User-Eevans: Plan/design a session storage service - https://phabricator.wikimedia.org/T206015 (WDoranWMF)
[15:26:24] serviceops, Excimer, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Done with CPT), and 2 others: Excimer: new profiler for PHP - https://phabricator.wikimedia.org/T205059 (WDoranWMF)
[15:28:13] serviceops, MediaWiki-History-and-Diffs, MediaWiki-Parser, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature) - https://phabricator.wikimedia.org/T216664 (WDoranWM...
[16:31:07] serviceops, MediaWiki-General-or-Unknown, Operations, Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (WDoranWMF)
[16:33:13] serviceops, Documentation: Missing Documentation for Service Operations - https://phabricator.wikimedia.org/T227306 (Aklapper)
[16:45:41] serviceops, Machine vision, Operations, Service-deployment-requests, Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (Mholloway)
[17:39:28] serviceops, Operations, Release Pipeline, Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (WDoranWMF)
[17:45:40] serviceops, MediaWiki-Logging, Operations, Wikimedia-Logstash, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (WDoranWMF)
[18:20:02] serviceops, RESTBase, Core Platform Team (RESTBase Split (CDP2)), Core Platform Team Kanban (Doing), and 2 others: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (WDoranWMF)
[18:20:10] serviceops, RESTBase, Core Platform Team (RESTBase Split (CDP2)), Core Platform Team Kanban (Team 2), and 2 others: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (WDoranWMF)