[08:03:26] serviceops, Operations, ops-eqiad, HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (MoritzMuehlenhoff) a:Cmjohnson
[08:04:21] serviceops, Operations, ops-eqiad, HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (MoritzMuehlenhoff) This server is still under warranty for another 6-7 weeks.
[09:00:47] <_joe_> paladox: uhm if we want to refactor those functions, we need to do a slightly better job
[09:14:55] Hmm, I only did the bare as I have little knowledge on ruby :)
[09:22:48] serviceops, Operations, Thumbor: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Gilles)
[11:19:02] serviceops, Operations, Thumbor, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` thumbor1001.eqiad.wmnet ` The log can be found in...
[11:23:03] <_joe_> akosiaris, fsero https://gerrit.wikimedia.org/r/#/c/integration/config/+/492649
[11:41:40] <_joe_> I wasn't sure if we should submit it to integration/config or to production-images tbh
[11:41:48] <_joe_> I can move it over easily
[11:51:55] _joe_, fsero: I'm building php7.2 for component/php72, will use mw2151 for testing
[11:57:51] <_joe_> moritzm: arg
[11:58:00] <_joe_> I was working on a way to do it automatically
[11:58:14] <_joe_> I guess we'll use it for the next iteration
[11:58:44] <_joe_> while trying to do it, I found a nice novelty - our CI is using php7.2 packages from sury.org, and specifically the jessie packages
[11:59:44] we can rebuild the addon packages automatically, only working on the core package ATM
[12:04:42] <_joe_> my work will take all day
[12:36:27] serviceops, Operations, Performance-Team (Radar), User-Elukey, User-jijiki: Test different growth factors for memcached (prep step for upgrade to newer versions - https://phabricator.wikimedia.org/T217020 (elukey) p:Triage→Normal
[12:36:32] serviceops, Operations, Performance-Team (Radar), User-Elukey, User-jijiki: Test different growth factors for memcached (prep step for upgrade to newer versions) - https://phabricator.wikimedia.org/T217020 (elukey)
[13:45:06] <_joe_> hashar just correctly told me to use production-images for this
[13:45:15] o/
[13:45:40] _joe_: yeah I am not quite sure how we did for python/nodejs services, but I would guess there is some kind of SRE maintained base image
[13:46:06] alternative is to use wikimedia-stretch as a base container and add all the commands in blubber config/golang.go and let blubber set up the base container
[13:46:08] <_joe_> yeah fair enough
[13:46:15] ;D
[13:46:28] <_joe_> nah, it's ok to have such an image
[13:47:27] also looking at blubber, its pipeline file has: base: golang:1.10-stretch
[13:47:34] so I guess it somehow defaults to dockerhub
[13:47:52] we might want blubber to reject that entirely
[13:48:17] (also thanks for the sury rebuild :D )
[13:53:09] so for my education which images go into that repo and which ones go into production images
[13:54:22] fsero: integration/config is for containers used by the CI system run by Zuul/Jenkins
[13:54:46] for services we are shifting toward "the pipeline" which uses Blubber to define the container
[13:54:56] the container is built automatically at build time by Jenkins
[13:55:04] and eventually promoted/published to the registry
[13:55:07] or
[13:55:14] we have two different CI systems working in parallel
[13:55:18] but
[13:55:37] the pipeline based on blubber, results in Docker images that are trusted to be run on the production environment
[13:56:04] whereas the images for the other CI system (zuul/jenkins) have less scrutiny. We don't use them for anything that is production grade
[13:56:15] too many moving parts :(
[13:56:31] <_joe_> hashar: it makes sense to have this distinction I think
[13:56:52] <_joe_> hashar: btw, are you rebuilding all php images already?
[13:57:32] for sury?
[13:57:36] yes it is complete ( https://gerrit.wikimedia.org/r/#/c/492666/ )
[13:57:49] <_joe_> let's see what breaks :P
[14:06:44] serviceops, Operations, Thumbor, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['thumbor1001.eqiad.wmnet'] `
[15:22:52] <_joe_> fsero, akosiaris, apergos, mutante friendly reminder to fill https://etherpad.wikimedia.org/p/SRE-ServiceOps-StatusMeeting
[15:37:10] akosiaris: can we merge the DNS? https://gerrit.wikimedia.org/r/#/c/operations/dns/+/491860/
[15:37:11] and
[15:37:24] training material..? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491861/
[15:37:58] ottomata: have you deployed to production?
[15:38:05] akosiaris: no
[15:38:18] but DNS change is fine to go?
[15:38:22] or you want to wait on that?
[15:38:33] the DNS change is fine, the puppet one is not
[15:38:35] right
[15:38:37] agree
[15:38:46] just wondering what you meant by training material on that one
[15:39:17] akosiaris right now we are a little blocked on beta cluster being readonly... :/
[15:39:32] ottomata: oh, have one of the new hires doing it. Also the entire new lvs service is a little bit of a delicate dance and we should create/update docs about it
[15:39:45] we can't smoke test stuff...although...we are mostly settled on one schema...so i can bake it in and build the prod image and we could deploy
[15:39:47] aka "a runbook" :-)
[15:39:53] ah ok
[15:40:11] hmmm, lemme prep an image/build, we might want to go ahead and deploy to prod
[15:40:15] won't hurt, nothing will be using it yet
[15:40:23] and we'll get the (single) schema baked in
[15:40:53] fine by me
[16:15:25] serviceops, Operations, ops-eqiad, HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (Cmjohnson) A self-dispatch ticket has been created for a new DIMM and CPU You have successfully submitted request SR986941367.
[16:20:57] fsero@registry1001:~$ envoy --version
[16:20:57] envoy version: 6fb08f24ae55c32b23e43c92508240f307bda104/1.9.0/Clean/RELEASE/BoringSSL
[16:21:01] * fsero cries
[16:28:38] serviceops, Operations, Thumbor, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` ['thumbor2001.codfw.wmnet'] ` The log can be foun...
[16:58:46] serviceops, TechCom, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (kchapman) TechCom is placing this on Last Call ending March 6 1pm PST (21:00 UTC, 22:00 CET)
[16:59:01] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (kchapman)
[18:10:43] serviceops, Operations, Thumbor, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2001.codfw.wmnet'] ` and were **ALL** successful.
[18:19:52] serviceops, Operations, Thumbor, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki)
[18:25:45] serviceops, Operations, Thumbor, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` thumbor1002.eqiad.wmnet ` The log can be found in...
[18:40:27] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (dr0ptp4kt) Would it be possible to clarify the wording on "There is no existing FLOSS software that provides the same functionality"? I believe the intent here is ab...
[19:22:12] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (WMDE-leszek) Stalled→Open Unstalling as the RFC requested above is now approved! (see: T213318) How are we going to proceed with...
[19:29:57] serviceops, Operations, Operations-Software-Development, User-Joe, User-jijiki: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (crusnov) Is it okay to use rapi for this or is there a compelling reason to use cumin+ganeti-* commands?
[19:47:03] serviceops, Operations, Thumbor, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor1002.eqiad.wmnet'] ` and were **ALL** successful.
[20:28:03] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (akosiaris)
[20:34:11] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (akosiaris) I think that the **"Collect RED metrics; be able to export those metrics according to WMF standards specified in the implementation guidelines"** should...
[20:37:29] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (akosiaris)
[20:44:44] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (akosiaris) **"Log all requests received via the production logging facilities"** Should we make this a bit more generic? e.g. **"Log actions via the production log...
[20:58:06] serviceops, Operations, Operations-Software-Development, User-Joe, User-jijiki: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (akosiaris) >>! In T203963#4982270, @crusnov wrote: > Is it okay to use rapi for this or is there a compelling reason to use cumin+g...