[03:05:49] serviceops, Operations, ops-codfw: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (Papaul) @akosiaris please see below what i am getting from kubestage2001 and kubernetes2007 ` You may use the whole vol...
[06:58:14] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Find a way to set elevated timeouts for job running - https://phabricator.wikimedia.org/T247114 (Naike) Open→Stalled
[06:58:18] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Enable MW REST API on job runners and video scalers (for the new rest.php job executor) - https://phabricator.wikimedia.org/T246389 (Naike)
[09:28:49] serviceops, Operations, Kubernetes, Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm) >>! In T235411#6153075, @JMeybohm wrote: > TLS enabled mathoid is corrently deployed in staging and codfw k8s clusters but not in eqia...
[09:43:23] serviceops, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services): upgrade MediaWiki appservers to Debian 10 (buster) - https://phabricator.wikimedia.org/T245757 (Aklapper)
[09:43:26] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin...
[09:49:11] serviceops, Operations: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (Volans) Resolved→Open I think there are still some bits in the DNS repo that point to the old instance: ` templates/wmnet:people 5M IN CNAME people1001.eqiad.wmnet. te...
[09:56:50] serviceops, Operations: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (Dzahn) These were changed in https://gerrit.wikimedia.org/r/c/operations/dns/+/595959/2/templates/wmnet
[10:09:33] serviceops, Operations: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (Volans) Open→Resolved @Dzahn my bad, I had a silent error during the update of my local git copy that lead to this mis-finding. FWIW it's also possible to ssh directly into the "ri...
[10:09:57] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin...
[10:34:33] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin...
[10:37:44] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin...
[11:14:57] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Dzahn)
[11:17:14] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Find a way to set elevated timeouts for job running - https://phabricator.wikimedia.org/T247114 (Aklapper) Stalled→Open @Naike: The previous comments don't explain what/wh...
[11:17:17] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Enable MW REST API on job runners and video scalers (for the new rest.php job executor) - https://phabricator.wikimedia.org/T246389 (Aklapper)
[11:35:19] serviceops, Operations: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (Dzahn) Resolved→Open a:jijiki→Dzahn reopening because i am decom'ing servers in T247018 and that included some canaries. so we need to assign new ones
[12:06:09] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[12:13:43] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[12:18:29] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Dzahn) @papaul 20 servers from rack C3 have been decom'ed. mw2150 through mw2169. (lower part of...
[12:19:08] serviceops, Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) Stalled→Open
[12:19:11] serviceops, Operations, ops-codfw: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn)
[12:44:29] serviceops, ChangeProp, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Changeprop config management in beta cluster - https://phabricator.wikimedia.org/T251176 (hnowlan) This work is currently blocked on us getting our configuration into beta somehow. As it stands we can't po...
[12:55:08] _joe_: i decom'ed a couple hosts and now checking for any other remnant and I notice one of them is in mcrouter::shards: proxies. How easy is it to replace it?
[12:55:30] <_joe_> quite easy, just change the server
[12:55:35] <_joe_> in the list
[12:55:46] to a random appserver though?
[12:55:53] <_joe_> and wait for a puppet run
[12:56:04] <_joe_> yes, I try to keep the "one per row" approach
[12:56:12] ok! nice
[12:57:06] well, except in this case it will be that row we tell dcops to use for their test. but yea, i'll pick one that is still active
[12:57:55] using a different rack in the same row then
[13:14:27] <_joe_> weit
[13:14:30] <_joe_> *wait
[13:14:41] <_joe_> did you decom a current mcrouter proxy?
[13:14:53] <_joe_> https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All looks like it
[13:14:57] <_joe_> :/
[13:15:13] <_joe_> we need to wipe the memory of all memcacheds in codfw now
[13:16:33] <_joe_> force-run puppet on all appservers in eqiad please
[13:16:46] <_joe_> appservers and api
[13:17:00] in eqiad? ok
[13:17:50] <_joe_> with some limited concurrency, but still
[13:18:01] <_joe_> we need to see that graph go flat to zero again
[13:19:33] elukey showed me in the other channel, ACK, running with -b5 on mw-api-eqiad right now
[13:19:36] <_joe_> mutante: for next time - always grep both the hostname and the ip in puppet
[13:19:43] <_joe_> go with -b 20
[13:19:45] <_joe_> :P
[13:20:33] ok, doing that. it's running. and yea, i will search better running the decom script next time
[13:22:42] <_joe_> volans: can we make the decom script first grep ops/puppet and ops/dns and ops/mediawiki-config when someone tries to decom a host?
[13:22:59] <_joe_> or we can just make a call to codesearch :P
[13:24:47] running on mw-eqiad now
[13:25:25] _joe_: grep for what specifically?
[13:25:40] <_joe_> hostname and IP of the host to decom
[13:25:59] the dns can't be removed "before" the decom
[13:26:03] so it will surely be in dns
[13:26:16] just operations/puppet in this case
[13:26:36] I can see people decomming stuff and then clearing hiera afterwards
[13:26:59] <_joe_> volans: like what?
[13:27:13] <_joe_> volans: clearing hiera of reference to the hostname/ip?
[13:27:19] puppet-run aborted because it failed on some .. checking why
[13:27:20] <_joe_> I *strongly* doubt that
[13:27:20] like if you remove some entry from hiera before the decom maybe icinga starts alerting, etc...
[13:27:46] <_joe_> I'm waiting for an example, I had a dozen in the other direction
[13:28:08] <_joe_> anways my proposal is to search for those and show the references and ask confirmation
[13:28:09] reason: puppet disabled on mw1261 et al because "switch tls to envoy"
[13:28:19] <_joe_> mutante: oh reenable
[13:28:24] <_joe_> mw1261-5
[13:28:24] ok!
[13:28:26] I guess we can do that sure
[13:28:27] yes, thsoe
[13:28:49] <_joe_> volans: my argument is - right now if I decom a host I do grep all those repos myself
[13:29:00] <_joe_> it would be nice if there was a failsafe before we nuke it
[13:29:02] maybe it can just warn you but let you override it
[13:29:06] indeed
[13:36:20] _joe_: all done. sorry about that. https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&from=1590154025887&to=1590154535604
[13:41:52] creating a task to monitor TKO rates, as suggested by Luca
[14:24:26] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[14:34:06] serviceops, Operations, ops-codfw: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn) Stalled→Open Hi @Papaul 23 servers from rack C3 have been decom'ed. mw2150 through mw2172. (lower part of the rack) You can: - remove these p...
[14:40:05] serviceops, Operations, decommission, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Dzahn) Technically resolved because we made more than enough room for the 5 (not 15 anymore, 10 we...
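Editor's note: the failsafe proposed between 13:22 and 13:29 (search the repos for the hostname and IP of the host, show the references, ask for confirmation, allow an override) can be sketched roughly as below. This is a minimal illustration, not the code of the decommission cookbook or of the Gerrit change linked later in the log. The function names `find_references` and `confirm_decommission`, the local checkout paths, and the example IP are all hypothetical.

```python
import os
import re
from typing import Callable, Iterable, List, Tuple

Hit = Tuple[str, int, str]  # (file path, line number, line text)


def find_references(repo_dirs: Iterable[str], needles: Iterable[str]) -> List[Hit]:
    """Scan local checkouts for lines that mention any needle (hostname or IP).

    Hypothetical helper: matches whole tokens only, so searching for mw2158
    does not flag mw21580, and 10.192.16.5 does not flag 10.192.16.50.
    """
    pattern = re.compile(
        "|".join(r"(?<![\w.])" + re.escape(n) + r"(?!\w)" for n in needles)
    )
    hits: List[Hit] = []
    for repo in repo_dirs:
        for root, dirs, files in os.walk(repo):
            dirs[:] = sorted(d for d in dirs if d != ".git")  # skip git metadata
            for name in sorted(files):
                path = os.path.join(root, name)
                try:
                    with open(path, encoding="utf-8", errors="replace") as fh:
                        for lineno, line in enumerate(fh, start=1):
                            if pattern.search(line):
                                hits.append((path, lineno, line.rstrip("\n")))
                except OSError:
                    continue  # unreadable file: ignore in this sketch
    return hits


def confirm_decommission(hits: List[Hit], ask: Callable[[str], str] = input) -> bool:
    """Warn about leftover references but let the operator override."""
    if not hits:
        return True
    for path, lineno, line in hits:
        print(f"{path}:{lineno}: {line}")
    answer = ask("References found. Continue with decommission anyway? [y/N] ")
    return answer.strip().lower() == "y"
```

A caller might run `find_references(["puppet", "mediawiki-config"], ["mw2158", "10.192.16.5"])` against local clones and pass the result to `confirm_decommission`. As noted at 13:25:59, DNS entries cannot be removed before the decom, so a DNS checkout would always produce hits and is left out of this hypothetical call.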
[14:46:14] serviceops, Operations, ops-codfw: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) @Dzahn Thanks
[15:02:38] _joe_, mutante: https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/598065
[15:26:22] Monday is a US holiday -- are we having a Europe-only serviceops meeting, rescheduling, or canceling?
[15:29:36] <_joe_> rzl: we discussed this with wkandek earlier, prbably rescheduling
[15:29:59] oh sorry, must have missed it
[15:30:04] sgtm
[15:30:50] i will be off because for me US holidays count even if i am physically here. in return i did work the German holiday yesterday
[15:31:16] <_joe_> rzl: not in public :P
[15:58:47] serviceops, Operations, ops-codfw, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris) a:Papaul→akosiaris We debugged this with @papaul, patch above resolves it. Whil...
[16:47:31] serviceops, Packaging: Please provide our special component/php72 in buster-wikimedia - https://phabricator.wikimedia.org/T250515 (Jdforrester-WMF)
[17:00:46] serviceops, Prod-Kubernetes, Kubernetes: Change helm chart default labels to k8s standard - https://phabricator.wikimedia.org/T253395 (JMeybohm)
[17:05:56] serviceops, Prod-Kubernetes, Kubernetes: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (JMeybohm)
[17:20:31] serviceops, Core Platform Team, Operations, Traffic, and 2 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (Krinkle) >>! In T250205#6154883, @aaron wrote: > I'm not fond of the idea of not sending purges for indirect edits Agreed. The proposal to st...