[08:51:11] serviceops, Analytics, Operations, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) ` elukey@ganeti2001:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 34 preferred ovs=False, ssh_po...
[08:58:41] serviceops, Analytics, Operations, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (MoritzMuehlenhoff) Does this really need 8 GB RAM and 8 CPUs? The machine that this will replace (kraz) uses a single CPU (and hardly uses it) and...
[09:00:55] serviceops, Analytics, Operations, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) >>! In T244719#5875487, @MoritzMuehlenhoff wrote: > Does this really need 8 GB RAM and 8 CPUs? The machine that this will replace (kraz) u...
[09:01:04] serviceops, Analytics, Operations, vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey)
[09:18:00] serviceops, Analytics, Operations, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm codfw_B --link public --memory 8 --disk 40 --vcpus 4 irc2001.wikimedia.org START...
[09:44:11] serviceops, Operations, Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (jijiki)
[09:46:10] serviceops, Operations, Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (jijiki)
[09:47:58] serviceops, Operations, Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (jijiki) @elukey thank you for unblocking this !!!
[11:22:31] i guess we need to decide what to do with the service ops meeting today or tomorrow
[11:22:36] akosiaris: so you can't make it today, correct
[11:22:37] ?
[11:22:41] and giuseppe is also a maybe
[11:25:29] mark: lemme make sure
[11:26:09] ah today ... maybe. right
[11:26:14] i could do either one
[11:27:20] serviceops, Analytics, Operations, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (elukey) Ok current status: * irc2001.wikimedia.org is running * puppet is set to role::system::spare, waiting for a new role/cluster combinati...
[12:17:38] _joe_: ^
[12:18:12] <_joe_> today I am present, I marked it as a maybe because sometimes it conflicts with techcom
[12:19:01] ah right
[12:19:24] an alternative is friday...
[12:19:27] though not this week
[12:19:32] this week I have an annual planning meeting then
[12:20:36] so if we want you it has to be today. gotcha
[14:17:00] serviceops, Core Platform Team, MediaWiki-Cache, Operations: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (jbond) p:Triage→Medium
[15:07:56] serviceops, Core Platform Team, Operations, Performance-Team (Radar), Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (Joe) >>!
In T244058#5872430, @Anomie wrote: >>>! In T244058#5861825, @Joe wrote: >> Sure, they do, but IIRC that limi...
[15:09:41] akosiaris: so what's the verdict? :)
[15:10:16] mark: turns out I can do today
[15:10:23] just got the verdict
[15:10:25] heh I was just typing that, given the meeting is (or is not) in 90 mins
[15:10:42] ok
[15:10:47] let's see what we can do next weeks then
[15:10:48] but do it today
[15:11:02] 👍
[15:11:12] 👍
[15:22:58] serviceops, Core Platform Team, MediaWiki-General, Operations: siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (Joe) >>! In T244204#5848838, @Anomie wrote: [cut] > > As for actually implementing this: HTTP caching in the Action A...
[15:41:07] serviceops, Operations, Traffic: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (Joe)
[15:47:23] serviceops, Android-app-Bugs, Operations, Traffic, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (Joe)
[15:50:43] serviceops, Android-app-Bugs, Operations, Traffic, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (Joe)
[15:54:00] rlazarus: effie: https://phabricator.wikimedia.org/T243106#5877113
[15:54:21] _joe_: mutante: apergos ^
[15:54:23] ooh, reading
[15:54:50] I may be wrong (I damn hope I am) but have a read please.
We need to address this before it ships out to group2
[15:55:32] I also damn hope that the timeout is way higher currently than 100ms, cause otherwise, only circuit breaking like envoy does will be a suitable solution
[15:56:04] akosiaris: "Interestingly enough the fallback to redis, while probably saved the users from receiving error messages, did not"
[15:56:09] <_joe_> akosiaris: you hit two different timeouts
[15:56:24] <_joe_> yeah I wasn't sure what that meant either
[15:56:29] (rest of the sentence is missing)
[15:56:34] <_joe_> it looks like you cut out the rest yes
[15:56:37] yeah, fixing
[15:56:42] 👍
[15:56:45] <_joe_> akosiaris: so the "connection timed out" method
[15:56:56] <_joe_> that means you hit the curl connection timeout
[15:57:15] <_joe_> I'm not sure what's our setting for that (in MediaWiki, I mean)
[15:57:37] <_joe_> req timeout can be set in the Mwhttprequest class
[15:58:11] fixed
[15:58:25] yeah I am chasing down now where how to set that
[15:59:01] serviceops, Android-app-Bugs, Operations, Traffic, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (Dbrant) @Joe Thanks for that! We'll update our code asap. Since we use the `siteinfo` data for th...
[15:59:12] btw, the 2 ways of emulating the error scenarios were
[15:59:24] <_joe_> akosiaris: ask someone like Krinkle or anomie :)
[15:59:26] connection timed out one. in /etc/hosts 10.2.2.39 sessionstore.discovery.wmnet sessionstore sessionstore.svc.eqiad.wmnet sessionstore.svc.codfw.wmnet
[15:59:32] <_joe_> reject and drop right?
[15:59:48] and connection refused same line but with an actual LVS IP that just doesn't listen on that port
[16:00:02] _joe_: I thought of that and then realized I did not even need to play with iptables rules
[16:00:16] but yes, it would have done the exact same thing
[16:00:34] <_joe_> ok, anyways, I expected something like that
[16:01:00] <_joe_> I'm pretty sure you'd get a similar result if you dropped connection to redis
[16:01:16] haven't we lowered the timeout considerably for that one though ?
[16:01:26] we had an outage exactly because of high timeouts there
[16:04:05] looking (sorry, I was digging through the wancache stuff)
[16:16:04] btw, the other interesting thing to test, is to pull locally https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/570395/3/wmf-config/InitialiseSettings.php on those 2 hosts
[16:17:55] serviceops, Operations: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (RobH)
[16:17:56] serviceops, Operations: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (RobH)
[16:19:32] _joe_ akosiaris, I was under the impression from the Thu meeting
[16:19:58] what's the current mw timeout, akosiaris?
[16:20:29] apergos: for that call? I don't know tbh, /me researching
[16:21:09] that this rollout would be blocked until we'd find out why mw is doing all those requests to kask
[16:21:11] anyway
[16:21:43] <_joe_> we lowered the request timeouts
[16:21:47] <_joe_> not the connection timeouts
[16:23:11] <_joe_> CURLOPT_TIMEOUT vs CURLOPT_CONNECTTIMEOUT
[16:24:14] join #wikimedia-staff
[16:32:04] akosiaris: i'm going to jump back into helm world today
[16:32:13] would you be ok if I just went for one of the proposed solutions
[16:32:23] either service.name, or route_to_release_only?
[16:32:39] ottomata: I got reprioritized, doubt I 'll have the time to look into this anytime soon
[16:33:00] so feel free to do whatever you want, and we can refactor if needed when I manage to revisit this
[16:33:04] I don't want to block you
[16:33:05] great...! thank you
[16:33:14] i do need some help with eventstreams benching
[16:33:19] would love a comment on that one if you find a min'
[16:33:33] https://phabricator.wikimedia.org/T238658#5830499
[16:34:32] I 'll try and have a look after my meeting
[16:42:37] ty
[16:48:03] serviceops, Operations, ops-eqiad: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (mark) @wiki_willy With Chris having been ill the past few days, what's a realistic new ETA for this?
[17:08:59] serviceops, Operations, ops-eqiad: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (wiki_willy) @mark - just chatted with Chris and he's working on them now (he's back today after being sick), so ETA is end of day. Thanks, Willy
[17:46:39] ottomata: I 've replied to T238658, you are being throttled
[18:42:25] serviceops, Analytics, Operations, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (Dzahn) ` Debug: Augeas[ens5_v6_token](provider=augeas): sending command 'set' with params ["/files/etc/network/interfaces/iface[. = 'ens5']/pre...
[18:46:34] serviceops, Analytics, Operations, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (Dzahn) The primary network interface is missing from /etc/network/interfaces. There is only loopback in there. Why that is is another question....
[18:48:54] ty!
[19:39:56] serviceops, Analytics, Operations, vm-requests, User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (MoritzMuehlenhoff) Given that Luca also had an error during initial setup related to name resolution, this sounds like some error related to th...
[19:58:31] serviceops, Operations, Core Platform Team Workboards (Clinic Duty Team), Performance-Team (Radar), Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (WDoranWMF)
[19:59:28] serviceops, Operations, Core Platform Team Workboards (Clinic Duty Team), Performance-Team (Radar), Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (WDoranWMF) a:Krinkle
[19:59:58] serviceops, MediaWiki-General, Operations, Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (WDoranWMF) a:Pchelolo
[20:35:09] serviceops, Analytics, Analytics-Kanban, Patch-For-Review: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (Ottomata) > Should we use main_app.name instead of service.name? I think yes is the answer. I just updated [[...
[20:45:22] serviceops, Operations, ops-eqiad, Patch-For-Review: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (Cmjohnson)
[20:46:50] serviceops, Operations, ops-eqiad, Patch-For-Review: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (Cmjohnson) @Jclark-ctr when you're at data center later today could you look into these please. mw1351 and mw1381 have the wrong password...
[21:46:12] serviceops, OTRS, Security-Team: OTRS: CVE-2020-1765, CVE-2020-1766 - https://phabricator.wikimedia.org/T242586 (sbassett)
[21:46:35] serviceops, OTRS, Security-Team: OTRS: CVE-2020-1765, CVE-2020-1766 - https://phabricator.wikimedia.org/T242586 (sbassett) >>! In T242586#5861939, @Aklapper wrote: >>>! In T242586#5797448, @akosiaris wrote: >> @chasemp, I 'd like to make this public to avoid OTRS agents creating duplicates of it for...
[22:33:29] serviceops, Analytics, Analytics-Kanban: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (Ottomata) Ok, applied for staging eventgate-analytics. I think it works! First, because the 'analytics' release already existed, I...
[23:23:49] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2291.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[23:23:54] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2291.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2291.codfw.wmnet'] `
[23:26:04] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2291.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[23:29:47] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2292.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[23:29:51] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2292.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2292.codfw.wmnet'] `
[23:31:30] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2292.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[23:40:31] serviceops, Operations, ops-eqiad: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (wiki_willy) Latest update: there were some settings that need to be redone, so @Cmjohnson will get these fixed tomorrow, with Thursday as the new ETA. Apologies...
[23:50:07] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2291.codfw.wmnet'] ` and were **ALL** successful.
[23:51:06] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2292.codfw.wmnet'] ` and were **ALL** successful.
[23:51:20] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2293.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[23:53:31] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2294.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
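
Note on the 15:56-16:23 timeout discussion: the log distinguishes a connection timeout (CURLOPT_CONNECTTIMEOUT) from a request timeout (CURLOPT_TIMEOUT), and "connection timed out" from "connection refused". The following is only a rough sketch of that distinction, written with plain Python sockets rather than curl or MediaWiki's HTTP classes. The names classify_failure, closed_port and silent_listener are made up for illustration and appear nowhere in the log. The "connect-timeout" branch corresponds to the dropped-packet case, which cannot be reproduced reliably on loopback, so the demo only exercises the refused and read-timeout cases.

```python
import socket


def classify_failure(host, port, connect_timeout, read_timeout):
    """Report which phase of a request fails, using two separate timeouts."""
    try:
        # Phase 1: TCP connect, bounded by the connect timeout only.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except ConnectionRefusedError:
        # Something answered with a reset: nothing listens on that port.
        return "refused"
    except socket.timeout:
        # No answer at all (for example, packets silently dropped).
        return "connect-timeout"
    with sock:
        # Phase 2: wait for a reply, bounded by a separate read timeout.
        sock.settimeout(read_timeout)
        try:
            sock.sendall(b"ping\n")
            data = sock.recv(1024)
        except socket.timeout:
            return "read-timeout"
    return "ok" if data else "closed"


def closed_port():
    """Hypothetical helper: return a local port that nothing listens on."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()
    return port


def silent_listener():
    """Hypothetical helper: a socket that completes connects but never replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)  # the kernel completes the handshake; we never accept or answer
    return srv


if __name__ == "__main__":
    print(classify_failure("127.0.0.1", closed_port(), 1.0, 0.2))
    srv = silent_listener()
    print(classify_failure("127.0.0.1", srv.getsockname()[1], 1.0, 0.2))
    srv.close()
```

Lowering only read_timeout here leaves connect_timeout untouched, which is the situation _joe_ describes: the request timeouts were lowered, not the connection timeouts.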