[02:52:48] serviceops, Operations, vm-requests, Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 (Krinkle)
[02:58:03] serviceops, Operations, vm-requests, Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 (Krinkle)
[03:31:47] serviceops, Analytics, Operations, Research, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Dzahn)
[03:45:29] serviceops, Analytics, Operations, Research, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Dzahn) @Nuria I agree it seems the most likely solution is using a Ganeti VM though but due to allhands we still did not have an SRE m...
[03:55:44] it's been 2 or 3 times now that i thought of sending an email to all of my subteam .. would there be value in a mail alias so i dont have to list individuals?
[04:42:13] serviceops, Gerrit, Icinga, Operations, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) new virtual "service" host: (so also ping check for gerrit.wikimedia.org as opposed to cobalt o...
[05:39:22] <_joe_> mutante: didn't we create the serviceops@ ?
[08:20:42] there is an sre-service-ops google group, that would be the way to go (mutante)
[08:54:44] <_joe_> jijiki: ping when you're around :)
[08:56:17] <_joe_> apergos: since you're already here, can I ask for your support on php 7 work too?
[08:56:58] of course. note that if you mean 'instantly', I'm still very very braindead :-D
[08:56:58] ping
[08:57:12] _joe_:
[08:59:35] <_joe_> ok so I'm looking at https://phabricator.wikimedia.org/tag/php_7.2_support/
[09:00:19] <_joe_> jijiki: you too :)
[09:00:36] <_joe_> apergos: I thought you could help me with https://phabricator.wikimedia.org/T214984
[09:00:57] <_joe_> I think James_F is onto something there (HHVM has accepted broken json probably)
[09:01:17] <_joe_> but that looks like a serious problem if he's wrong
[09:01:47] <_joe_> jijiki: I
[09:01:56] <_joe_> heh sorry
[09:02:04] _joe_: one person at a time
[09:02:12] :p
[09:02:18] <_joe_> I was about to say - can you look into https://phabricator.wikimedia.org/T215376 ?
[09:02:29] <_joe_> I *think* it could be as simple as upgrading php-redis
[09:03:04] <_joe_> but I'm already working on another issue
[09:03:40] _joe_: gotcha. will do
[09:03:51] <_joe_> apergos: <3 wikilove
[09:03:55] today is going to be a day to get caffeinated, this is not going to work otherwise :-d
[09:03:57] sure :)
[09:04:23] <_joe_> apergos: I already did my morning intake of 2 large mokas
[09:04:23] serviceops, Operations, Wikimedia-General-or-Unknown, PHP 7.2 support, User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (jijiki)
[09:04:41] serviceops, Operations, Wikimedia-General-or-Unknown, PHP 7.2 support, User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (jijiki) p:Triage→Normal a:jijiki
[09:04:47] it's gonna be hot chocolate in a bit
[09:22:29] serviceops, Gerrit, Icinga, Operations, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Paladox) Is the script on cobalt?
[09:24:17] <_joe_> jijiki: you did depool thumbor2002 right?
[09:24:30] yes, last week
[09:24:32] <_joe_> I see some alerts on icinga, can you ack them with the ticket number if that's the case?
[09:24:51] let me check
[09:31:16] tx _joe_, I downtimed the host till we are done
[10:04:34] serviceops, CirrusSearch, Operations, Discovery-Search (Current work): Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (dcausse) a:Joe Thanks @Joe! I'll follow-up on this and prepare mw-config patches to use these new entries.
[10:06:31] <_joe_> dcausse: the change is still not applied everywhere :P
[10:06:39] :)
[10:06:52] <_joe_> I've stumbled across a problem with our setup
[10:07:12] ah I thought you meant puppet did not run everywhere yet
[10:07:18] <_joe_> nothing unsurmountable, but it takes more time to apply the change
[10:07:19] <_joe_> yes
[10:07:26] <_joe_> that's what I mean
[10:07:29] oh ok
[10:07:58] <_joe_> I need to apply the change slowly so that we don't get a ton of useless and scary alerts :P
[10:08:06] ah I see
[10:08:27] <_joe_> it's an annoying problem with tmpreaper and systemd using PrivateTmp
[10:09:22] I'll prep the patch, I'll schedule it for tomorrow, will ping you just to be sure it's ready by then
[10:12:00] serviceops, Release Pipeline, Patch-For-Review, Release-Engineering-Team (Kanban): Expose blubberoid to the public allowing CI in WMCS to be able to reach out as well to it - https://phabricator.wikimedia.org/T212251 (akosiaris) Open→Resolved a:akosiaris ` curl -s https://blubberoid.w...
[11:09:57] I think we might want this on our radar generally: https://phabricator.wikimedia.org/T213345
[11:16:44] <_joe_> apergos: sure, it's already on mine, I think I did report about it at our last team meeting?
[11:31:55] I don't remember it but that doesn't mean anything; I travelled the next day so maybe not much got retained
[11:32:18] I am subscribed so that's two of us watching, at least
[12:07:57] serviceops, Language-Team, MediaWiki-Language-converter, Parsing-Team, and 5 others: RFC: Spin off (Parsoid) language variants functionality as a Node.js microservice? - https://phabricator.wikimedia.org/T213345 (jijiki)
[12:08:23] <_joe_> thanks jijiki
[12:18:50] serviceops, Operations, Wikibase-Containers, Wikidata, and 2 others: Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (Addshore) >>! In T209292#4920883, @Ladsgroup wrote: > I would be in favor of not using nginx and turning WDQS gui to a proper nodejs applicat...
[12:27:11] _joe_: according to debmon, only mwmaint* and deploy* can be upgraded to php-redis 4.1.1
[12:27:33] do we need to do anything else apart from apt-get install?
[12:27:56] <_joe_> jijiki: not sure!
[12:28:06] <_joe_> the error seems to suggest that
[12:28:16] <_joe_> the version currently installed is old and only works on 7.0
[12:28:24] well, all mw* have the 4.1.1 version
[12:28:48] <_joe_> have you verified if the command works on the other servers?
[12:29:16] <_joe_> if that's the case, then the issue is the outdated package version
[12:29:18] I can run it, but tgr mentioned already that it works ok mwmaint*
[12:29:57] <_joe_> you mean on mwdebug*?
[12:30:04] <_joe_> ok than we've found the issue
[12:31:10] <_joe_> *then :)
[12:31:57] for some reason the mwdebug servers also have the 7.0 FPM installed, BTW, maybe leftover from the initial tests
[12:33:00] <_joe_> yes
[12:33:03] <_joe_> I have to remove it
[12:33:18] <_joe_> it's on my todo list, but if someone else wants to do it, be my guest
[12:33:31] <_joe_> moritzm: I'm going to "fix" the tmpreaper horror
[12:34:52] <_joe_> and some other bugs while I'm at it, like the fact that a basic installation of tlsproxy::instance doesn't prevent nginx to try to listen on port 80
[12:35:04] <_joe_> I thought it was already the case, that's strange
[12:36:58] ok ok wait
[12:37:03] the plot thickened
[12:37:39] mwdebug1002 works fine (tgr)
[12:37:58] ok let me go to an mw* server
[12:42:38] ack, feel free to add me to reviewers when ready
[13:35:42] serviceops, Operations, Thumbor, ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki)
[13:35:55] serviceops, Operations, Thumbor, ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki) p:Triage→Normal
[13:36:57] serviceops, Operations, Thumbor, ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki)
[14:02:34] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (jijiki) p:Triage→Normal
[14:50:23] serviceops, Analytics, Operations, Research, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (bmansurov)
[14:50:53] serviceops, Analytics, Operations, Research, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (bmansurov) Thanks, @Dzahn for the info. I've this task: {T215421}.
[15:24:50] serviceops, Analytics, Operations, Research, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Ottomata) Marco's suggestion of using mwmaint1002 is not a bad idea...
[15:53:47] serviceops, Operations, ops-codfw, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Papaul) a:Papaul→jijiki Disk replaced, server didn't boot up.
[15:59:27] serviceops, Operations, ops-codfw, User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (jijiki) @papaul Thank you! I will reimage this server, no need to spend more time on it
[16:27:09] serviceops, MediaWiki-Cache, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (EvanProdromou) I did some analysis of how we're using...
[16:45:52] serviceops, MediaWiki-Cache, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (EvanProdromou) Actually, it looks like we've got some...
[17:03:57] serviceops, Operations, Wikimedia-General-or-Unknown, PHP 7.2 support, User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (Dzahn) Do we have to install the php7.2-redis package? and / or https://stackoverflow.com/quest...
[17:15:31] <_joe_> mutante: jijiki is already looking at that ticket, FYI
[17:19:09] _joe_: I pinged mutante :)
[17:19:46] <_joe_> ahah ok :P
[18:47:29] serviceops, Operations, Wikimedia-General-or-Unknown, PHP 7.2 support, User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (Tgr) IIRC (we ran into similar issues on Vagrant in {T213016}) there is no php7.2-redis, just a si...
[18:56:13] serviceops, Operations, Wikimedia-General-or-Unknown, PHP 7.2 support, User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (Reedy) In `modules/contint/manifests/packages/php.pp` we're doing `ensure => latest` ` reedy@depl...
[19:18:36] serviceops, Operations, Wikimedia-General-or-Unknown, PHP 7.2 support, User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (Dzahn) >>! In T215376#4932577, @Reedy wrote: > In `modules/contint/manifests/packages/php.pp` we'r...
[20:30:21] serviceops, Gerrit, Icinga, Operations, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (Dzahn)
[20:30:34] serviceops, Gerrit, Icinga, Operations, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (Dzahn) p:Triage→Normal a:Dzahn
[20:35:37] serviceops, Gerrit, Icinga, Operations, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (Dzahn) So you said checking for status 200 is enough here? We don't need to bother looking for the string "Passed" or something? Have y...
[20:37:30] serviceops, Gerrit, Icinga, Operations, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (Paladox) Yup, if any of the checks fail it will return 500. See https://gerrit.googlesource.com/plugins/healthcheck/#how-to-use
[21:04:04] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (pmiazga) @Tgr I assume you're still waiting for answers from @ema? Is there anything I can help you with?
[22:07:20] serviceops, TechCom, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (Krinkle) @Joe In the TechCom meeting today I agreed to collab with you to summarise RFC in the task here with a problem statement and proposed solution.
[22:35:54] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) Yeah, but I don't think this task should be a blocker (for either handover or production switchove...
[22:36:08] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr)
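In the T215457 exchange in this log, the Icinga check for Gerrit treats HTTP 200 from the healthcheck endpoint as healthy, because the plugin returns 500 when any of its checks fails. The Python below is a minimal sketch of such a check, not the check that was actually deployed. It assumes the standard Nagios/Icinga plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The URL `https://gerrit.example.org/config/server/healthcheck~status` and the helper names `classify`, `fetch_status` and `main` are hypothetical; the log does not give the real host or path.

```python
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Hypothetical URL: the real Gerrit host and healthcheck path are assumptions.
DEFAULT_URL = "https://gerrit.example.org/config/server/healthcheck~status"


def classify(status_code):
    """Map an HTTP status code to an Icinga exit code and a one-line message.

    The healthcheck plugin returns 500 if any sub-check fails, so only
    HTTP 200 is treated as healthy; every other code is CRITICAL.
    """
    if status_code == 200:
        return OK, "OK - healthcheck returned HTTP 200"
    return CRITICAL, f"CRITICAL - healthcheck returned HTTP {status_code}"


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if it could not be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is what we report.
        return err.code
    except OSError:
        # URLError, refused connections and timeouts all derive from OSError.
        return None


def main(argv):
    """Run the check against argv[1] (or DEFAULT_URL) and return an exit code."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    status = fetch_status(url)
    if status is None:
        print(f"UNKNOWN - could not reach {url}")
        return UNKNOWN
    code, message = classify(status)
    print(message)
    return code
```

As a standalone plugin, a script entry point would call `sys.exit(main(sys.argv))`. Checking only the status code, rather than also searching the body for a string such as "Passed", follows Paladox's answer above; a stricter check could parse the response body as well.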