[00:47:01] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul)
[03:17:54] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) |servers|ready for service| |mw2310|yes| |mw2311|yes| |mw2312|yes| |mw2313|yes| |mw2314|yes| |mw2315|yes| |mw2316|yes| |mw2317|yes| |mw2318|yes| |mw2319|yes| |mw2320|yes| |m...
[07:17:54] serviceops, Operations, Performance-Team, Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (Joe) >>! In T244058#5849290, @aaron wrote: > Links to old (non-current) versions do not use the parser cache. This means that rendering will always...
[09:49:19] serviceops, Prod-Kubernetes: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 (akosiaris)
[09:49:36] serviceops, Prod-Kubernetes: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 (akosiaris) p:Triage→High
[09:50:12] serviceops, Prod-Kubernetes: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 (akosiaris) p:High→Normal
[09:55:04] serviceops, Prod-Kubernetes: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 (akosiaris)
[10:02:38] serviceops, Core Platform Team, Services: scb2003 reports 'Internal error in changeprop' - https://phabricator.wikimedia.org/T244069 (jijiki) ping!
[10:22:26] serviceops, Prod-Kubernetes: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 (akosiaris) Important release notes for 1.13.x that affect us ` kube-apiserver The deprecated etcd2 storage backend has been removed.
Before upgrading a kub...
[10:22:58] serviceops, Operations: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Joe)
[10:50:56] fyi, kubernetes staging cluster is now on 1.13.12. I'll do a rolling restart of all pods just to make sure everything is ok and then move on to codfw (eqiad can't be done yet)
[13:52:24] so on the outage(s) yesterday
[13:52:38] do we know how the maps issues are related to the appserver issues, if at all?
[13:52:53] we've seen that coincide several times now, but at least to me it's not yet clear how they are related
[13:53:33] I think we do not yet know
[14:07:22] and as we briefly discussed last week
[14:07:33] maybe we should talk about reprioritizing some of our current work to focus more on reliability and known issues
[14:07:46] e.g. the php7 https stuff
[14:08:05] that has played a part in most recent incidents since php7 as well hasn't it?
[14:50:34] _joe_: akosiaris: effie: ^
[14:51:05] <_joe_> sorry, I was working on another perf issue right now
[14:51:55] <_joe_> I expected to work on the incident doc for yesterday now, but I'm intertwined into looking at the current issue
[14:52:09] mark: I am not fully caught up with the maps incident, but the other time it was surely that
[14:53:46] <_joe_> mark: I am now taking a brief break since things have settled
[14:53:55] ok
[14:53:58] <_joe_> and I think I can work on the incident report with rlazarus once he's up
[14:53:59] maybe we can discuss this a bit tomorrow
[14:54:05] <_joe_> he was IC for most of the outage
[14:54:07] <_joe_> yes
[14:54:11] but start thinking about it
[14:54:13] <_joe_> there are at least a few things we can do
[14:54:15] we can reprioritize our work
[14:54:21] <_joe_> I even have some patches up :P
[14:54:32] <_joe_> also finally getting the servers racked
[14:54:38] <_joe_> would give us more computing power
[14:54:45] <_joe_> which won't be a bad thing
[15:07:06] i saw effie was gonna ping on a task?
[15:07:36] tag willy on it, let me know if there's no response
[15:10:43] yeah I will
[15:50:47] serviceops, Operations, ops-eqiad: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (jijiki) @wiki_willy @Jclark-ctr I understand that eqiad is overloaded, but is there a chance we can raise the priority of this? We have been sufferin...
[15:51:59] according to https://phabricator.wikimedia.org/T236437
[15:52:11] we should have the first batch in a couple of days
[16:07:34] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) All the servers on the table above are running Buster I had a chat with @MoritzMuehlenhoff and he mentioned that we need to install Stretch on those servers so I have to upd...
[16:12:00] _joe_: can we discuss a bit on https://phabricator.wikimedia.org/T244340 ?
[16:12:06] local memcached
[16:14:28] <_joe_> sure
[16:50:13] how much memory do we estimate to use for that ?
[16:51:17] the concept is understandable, I am a little concerned about how much this will cost on a server under load
[16:56:52] <_joe_> I was thinking like 10 GB
[16:57:02] <_joe_> maybe less
[16:57:16] <_joe_> It's better if we discuss on-task probably
[17:08:56] serviceops, Operations, ops-eqiad: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (wiki_willy) Hi @jijiki - @Cmjohnson is currently working on finishing up T236437, which also had a previous need by date of a month ago. Would the c...
[17:09:57] _joe_: sure
[17:35:28] serviceops, Services, Core Platform Team Workboards (Clinic Duty Team): scb2003 reports 'Internal error in changeprop' - https://phabricator.wikimedia.org/T244069 (WDoranWMF) a:Clarakosi
[18:00:07] serviceops, Operations, ops-eqiad: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (jijiki) @wiki_willy hopefully it will help, but we generally believe that we will not be able to cope well again when we have sudden request spikes....
[18:02:40] serviceops, Citoid, Operations, Core Platform Team Workboards (Clinic Duty Team): Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (jijiki)
[18:08:33] serviceops, Operations, ops-eqiad: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (wiki_willy) @jijiki - I'll talk to @Jclark-ctr and see if there's some way to expedite these. One of the current bottlenecks is getting rid of some o...
[18:38:10] serviceops, Release Pipeline, Release-Engineering-Team: Provide the official production base images for Wikimedia use - https://phabricator.wikimedia.org/T238774 (Jdforrester-WMF)
[18:38:49] serviceops, Release Pipeline, Release-Engineering-Team, Release-Engineering-Team-TODO: Provide the official production base images for Wikimedia use - https://phabricator.wikimedia.org/T238774 (Jdforrester-WMF)
[18:39:30] serviceops, Operations, ops-codfw, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2310.codfw.wmnet ` The log can be found in `/var/log...
[18:42:13] serviceops, Operations, Research: Request for a in-memory caching data set for caching research - https://phabricator.wikimedia.org/T240503 (jijiki) p:Triage→Low
[18:50:56] serviceops, Operations, Wikimedia-Mailing-lists: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (jijiki) p:Triage→Normal
[18:51:02] serviceops, Operations: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (jijiki) p:Triage→Normal
[18:53:48] serviceops, Operations, Wikimedia-Mailing-lists: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (Reedy) https://blogs.gnome.org/ovitters/2008/06/07/using-moderated-messages-to-train-the-bayes-classifier/ >I’ve added a patch to Mailman
[19:07:00] serviceops, Operations, Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (Dzahn) The following are now declared canary API appservers in site.pp: mw2215, mw2216 (rack A3) mw2244, mw2245 (rack A4)
[19:15:21] serviceops, Operations: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (jijiki) The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mwservers. We could wait to test this afte...
[19:16:12] serviceops, Operations, observability, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (jijiki)
[19:20:00] serviceops, Core Platform Team, MediaWiki-Parser, Operations, and 2 others: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (daniel)
[19:20:24] serviceops, MediaWiki-Parser, Operations, Core Platform Team Workboards (Clinic Duty Team), Wikimedia-Incident: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (daniel)
[19:22:25] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2310.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2310.codfw.wmnet'] `
[19:27:52] serviceops, Operations, Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (Dzahn)
[20:48:36] serviceops, Services, Core Platform Team Workboards (Clinic Duty Team): scb2003 reports 'Internal error in changeprop' - https://phabricator.wikimedia.org/T244069 (Pchelolo) Oh! The core reason is that a counter has overflowed int32. Our software is so stable, we can overflow int32 now! The HTCP pur...
[21:06:05] serviceops, Services, Core Platform Team Workboards (Clinic Duty Team): scb2003 reports 'Internal error in changeprop' - https://phabricator.wikimedia.org/T244069 (jijiki) Open→Resolved Achievement unlocked.
[21:06:06] serviceops, Services, Core Platform Team Workboards (Clinic Duty Team): scb2003 reports 'Internal error in changeprop' - https://phabricator.wikimedia.org/T244069 (jijiki) Open→Resolved Achievement unlocked.
[21:40:29] serviceops, Operations, Performance-Team, Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (jijiki) >>! In T244058#5851362, @Joe wrote: > Instead of caching, we should just rate-limit parsing of old revisions to N concurrent revisions per us...
[21:45:46] serviceops, Operations, Performance-Team, Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (jijiki)
[22:01:01] serviceops, Release-Engineering-Team-TODO, Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (Jdforrester-WMF) Alternatively we c...
[22:14:28] serviceops, Proton, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Profile proton memory usage for Helm chart - https://phabricator.wikimedia.org/T238830 (Mholloway)