[04:07:44] serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (KartikMistry) >>! In T210704#5270565, @KartikMistry wrote: > I was able to reproduce error we saw in Production using end p...
[06:02:50] serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Joe)
[06:04:06] serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Joe) p:Triage→High
[06:04:41] serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Joe) a:Joe While I'm at it, I'll upgrade all production images with new base images too.
[06:09:43] serviceops, Release-Engineering-Team, Release-Engineering-Team-TODO: Our docker base images lack tags - https://phabricator.wikimedia.org/T218342 (Joe) Open→Resolved a:Joe This is already fixed, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/501564 I've just created the new ima...
[06:31:36] serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Joe)
[06:31:40] serviceops, Operations, Core Platform Team Backlog (Later), Patch-For-Review, Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Joe) Open→Resolved
[06:32:49] serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Joe) @KartikMistry if we trigger a rebuild of the production container, it should now use the newer nodejs10-slim image and...
[07:00:35] serviceops, Release-Engineering-Team, Release-Engineering-Team-TODO: Our docker base images lack tags - https://phabricator.wikimedia.org/T218342 (hashar) Indeed https://tools.wmflabs.org/dockerregistry/wikimedia-stretch/tags/ Thank you!
[07:10:05] <_joe_> fsero, jijiki you have to report from your breakouts on the post-event summary for the summit
[07:10:19] <_joe_> please do so before the SRE meeting today
[07:47:32] _joe_: it was on joels email yes
[07:53:55] serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (Petar.petkovic)
[08:56:10] serviceops, TechCom-RFC (TechCom-Approved): RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (Joe) Open→Resolved
[09:01:28] serviceops, Release-Engineering-Team, Release-Engineering-Team-TODO, Scap, and 3 others: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (Joe) >>! In T224857#5274663, @thcipriani wrote: > > Should we be working to implement symlink swapping in scap...
[13:51:18] hi all i plan to do a rolling restart of the conf servers in 10 minutes please let me know if you foresee any issue
[13:52:56] serviceops, Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (jijiki)
[13:53:12] jbond42: ack
[13:53:33] _joe_ fsero akosiaris should we discuss this ? https://phabricator.wikimedia.org/T226236
[13:54:15] <_joe_> that looks pretty low-priority
[13:54:44] I honestly don't know
[13:56:32] _joe_: can you confirm you're ok with the conf server restarts and that https://wikitech.wikimedia.org/wiki/Service_restarts#Zookeeper is not missing anything?
[13:57:56] also this one https://wikitech.wikimedia.org/wiki/Service_restarts#etcd
[13:58:07] <_joe_> jbond42: my ok is subject to depooling etcd in codfw first
[13:58:21] <_joe_> given it's the master, I need to do a proper etcd switchover
[13:59:36] <_joe_> I mean it's not strictly needed, but we need to do it anyways
[13:59:45] <_joe_> so I was thinking it could be a good moment to do it
[14:00:03] im not sure i follow you
[14:00:54] <_joe_> yeah sorry I need to update that documentation
[14:01:04] <_joe_> on it
[14:01:28] ack ok ill hold off
[14:06:02] <_joe_> https://wikitech.wikimedia.org/wiki/Service_restarts#etcd ok now it has less assumptions
[14:06:13] <_joe_> so, at the moment our "master" dc for etcd is codfw
[14:06:22] <_joe_> so etcdmirror is in eqiad
[14:07:49] <_joe_> the instructions as they're written now should work; anyways ping me when you restart the conf2* nodes
[14:08:56] ack ok so just to double check other than conf1005 everything can just get rebooted one at a time with all the appropriate health checks when they come back up?
[14:09:28] <_joe_> yes
[14:09:42] <_joe_> the biggest question mark is pybal tbh
[14:09:46] ack great ill let you know when eqiad is done and i start on codfw
[14:09:57] <_joe_> last time it behaved well, let's see this time :)
[14:10:05] lets hope :)
[14:10:16] <_joe_> uh we need to reboot eqiad as well?
[14:10:47] <_joe_> not a huge issue, but I'd go with codfw first, those servers are up since forever
[14:10:48] yes i was going to do eqiad first then codfw
[14:11:00] <_joe_> otoh, eqiad is the slave
[14:11:04] <_joe_> so it's less critical
[14:11:22] that was my thinking
[14:11:29] <_joe_> go with that, yes
[14:11:33] ack
[14:12:16] <_joe_> we're introducing a small "risk" factor, doing things this way, but if something bad happens because of this, we'll know we have some badly engineered part of our system
[14:13:58] ack
[14:59:26] serviceops, Annual-Report, Operations, Patch-For-Review: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (jijiki) Open→Resolved a:jijiki @LTraer please let me know if there are any issues
[15:10:04] serviceops, Release-Engineering-Team, Release-Engineering-Team-TODO, Scap, and 3 others: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (thcipriani) >>! In T224857#5277649, @Joe wrote: >>>! In T224857#5274663, @thcipriani wrote: >> >> Should we be...
[15:10:48] _joe_: about to start rebooting conf200* please confirm you are happy for me to start
[15:11:14] <_joe_> jbond42: sure, go on, let's see what breaks, if anything
[15:11:22] ack :)
[15:12:09] <_joe_> jbond42: uhm wait a sec
[15:12:16] waiting
[15:12:21] <_joe_> in eqiad we have the pybals throwing alerts
[15:12:35] <_joe_> and esams as well
[15:12:46] <_joe_> we shall probably restart them, ask ema maybe?
[15:13:18] ack ill clear alerts first
[15:13:20] <_joe_> and conf1004 has some alert as well
[15:13:55] <_joe_> beware when doing pybal restarts, you can't restart lvs1014 and 1016 in fast sequence
[15:14:08] the conf1004 one is caused because etcdmirror-conftool-codfw-wmnet.service is in a failed state. however i think this service should be masked right?
[15:14:08] <_joe_> you need to wait for the restarted one to reestablish all the bgp connections
[15:14:20] <_joe_> yeah that's kinda strange
[15:14:36] <_joe_> lemme check but that's not a blocker
[15:15:09] etcdmirror-conftool-codfw-wmnet.service enabled
[15:15:18] so init tried to start it on boot
[15:16:03] <_joe_> which is wrong, hence i want to look into it
[15:17:44] <_joe_> but that's not a blocker, so proceed
[15:18:03] ack ill look at the pybal stuff
[15:23:43] <_joe_> fsero: I think replication for the docker containers is broken
[15:24:20] <_joe_> still hasn't replicated the "latest" wikimedia-stretch
[15:24:34] <_joe_> which I uploaded this morning, 9 hours ago
[15:32:27] serviceops, Operations: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (jijiki) p:High→Normal
[15:32:29] serviceops, Editing-team, Beta-Cluster-reproducible, Core Platform Team Backlog (Watching / External), Services (next): Zotero container: Production is running candidate version, last production version is broken due to lack of ca-certificates package - https://phabricator.wikimedia.org/T223345 (...
[15:33:53] serviceops, MediaWiki-Docker: Clarify and document our docker image building process and policies. - https://phabricator.wikimedia.org/T216234 (Joe) p:Triage→Normal
[15:35:55] serviceops, Kubernetes, User-fsero: add CI job into operations/deployments-charts repo that helm lint packages and perform the helm index after merge. - https://phabricator.wikimedia.org/T216049 (Joe) p:Triage→Low
[15:36:27] serviceops, Prod-Kubernetes, User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (akosiaris) p:Triage→Normal
[15:41:04] serviceops: Find which machines will be over 5 years old during FY19-20 - https://phabricator.wikimedia.org/T217764 (jijiki) Open→Resolved
[16:12:23] _joe_ and all in general have now finished with the conf reboots. i have also done some edits to https://wikitech.wikimedia.org/wiki/Service_restarts#etcd if someone could check it and make sure i haven't added incorrect info that would be good, thanks
[16:23:53] <_joe_> sure, will do
[16:23:55] <_joe_> thanks!
[16:24:19] cheers
[16:43:27] <_joe_> jbond42: lgtm, thanks
[16:43:39] great thanks
[20:27:25] serviceops, Annual-Report, Operations, Patch-For-Review: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (Varnent) All set - thank you! Apologies for the short notice. :)
[21:11:43] serviceops, Annual-Report, Operations, Patch-For-Review: Redirects for 2018 Annual Report - https://phabricator.wikimedia.org/T226066 (LTraer) @jijiki Thank you so much!
[22:27:28] serviceops, Operations, Core Platform Team Backlog (Watching / External), Services (watching): Update nodejs10 image to use the latest version of the package - https://phabricator.wikimedia.org/T226346 (mobrovac)
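
Editor's note: the restart pattern discussed in the log above (reboot hosts one at a time, and only move on once the restarted host passes its health checks again) could be sketched roughly as follows. This is an illustrative sketch only, not the procedure documented on wikitech. The function name `rolling_restart`, its parameters, and the host names in the usage comment are all hypothetical; the real restart and health-check commands are not shown in the log and are left as caller-supplied callables.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],      # hypothetical: triggers the reboot/restart of one host
    is_healthy: Callable[[str], bool],   # hypothetical: True once the host passes its checks
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart hosts strictly one at a time, waiting for each to recover.

    Returns the hosts that were restarted and came back healthy, in order.
    Raises TimeoutError (and stops) if a host does not recover in time, so
    that no further host is touched while one is still unhealthy.
    """
    done: List[str] = []
    for host in hosts:
        restart(host)
        waited = 0.0
        # Poll until the host is healthy again before moving to the next one.
        while not is_healthy(host):
            if waited >= timeout_s:
                raise TimeoutError(
                    f"{host} not healthy after {timeout_s}s; completed so far: {done}"
                )
            sleep(poll_s)
            waited += poll_s
        done.append(host)
    return done


# Usage sketch (host names are placeholders, ordered less-critical site first,
# as agreed in the discussion above):
# rolling_restart(["eqiad-host-a", "eqiad-host-b", "codfw-host-a"], my_restart, my_check)
```

The same wait-before-continuing shape would also fit the pybal caveat raised in the log, with `is_healthy` standing in for "all BGP connections re-established"; how that is actually checked is not stated in the log.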