[09:36:17] serviceops, Commons, Multimedia, Operations, Thumbor: Only one thumbor server (thumbor1002) upgraded to librsvg 2.40.20-3 - https://phabricator.wikimedia.org/T220342 (jijiki)
[10:30:32] serviceops, Operations, Prod-Kubernetes, Kubernetes, and 2 others: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (fsero)
[10:30:51] serviceops, Operations, Prod-Kubernetes, Kubernetes, and 2 others: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (fsero) Building 1.9.1 due to CVE
[10:35:12] serviceops, Core Platform Team, MediaWiki-General-or-Unknown, Operations, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) a:Joe >>! In T219279#5092106, @Joe wrote: > I looked into the...
[10:53:03] let's create some activity on this channel
[10:53:17] godog: cdanis can we schedule the merge of this https://gerrit.wikimedia.org/r/c/operations/puppet/+/490073 ?
[10:53:48] last time we spoke we decided it was good to merge but we wanted to wait some time as we were busy with other things
[11:04:13] <_joe_> fsero: the new registry supports deleting old images, right?
[11:04:32] <_joe_> because I think we need to start thinking about image archival and GC of the live registry
[11:04:43] yes it has support for DELETE methods and registry gc
[11:04:53] <_joe_> great
[11:05:01] but never tested it using the swift backend
[11:05:09] it works using the filesystem backend though
[11:05:49] <_joe_> ok, we'll need to test it
[11:05:56] <_joe_> but that's good enough for now :P
[11:19:41] serviceops, Core Platform Team, MediaWiki-General-or-Unknown, Operations, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (jcrespo) I wonder also if collation or other character-related updates...
[11:20:53] serviceops, Commons, Multimedia, Operations, and 2 others: Only one thumbor server (thumbor1002) upgraded to librsvg 2.40.20-3 - https://phabricator.wikimedia.org/T220342 (jijiki) All servers have been upgraded to 2.40.20-3+wmf1+stretch1
[11:23:40] serviceops, Core Platform Team, MediaWiki-General-or-Unknown, Operations, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) >>! In T219279#5092959, @jcrespo wrote: > I wonder also if collat...
[11:25:29] fsero: indeed, maybe tomorrow?
[11:31:11] sounds good to me
[11:38:53] fsero: I left an additional comment/nit, but lgtm otherwise, ping me tomorrow (morning ?) and we'll deploy the patch
[11:39:09] <_joe_> \o/
[13:18:40] serviceops, Operations, Prod-Kubernetes, Kubernetes, User-Joe: publish 1.9.1 envoy docker image - https://phabricator.wikimedia.org/T220382 (fsero)
[13:23:59] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (MoritzMuehlenhoff) >>!
In T215415#5032971, @Papaul wrote: > Also the error I have here is not telling me which memory row or channel it refers to so it's difficult to tell...
[13:29:09] i would appreciate a quick +1 https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/502217
[13:37:05] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) @MoritzMuehlenhoff yes we do have some. Will replace A2 once on site.
[13:46:16] * volans has no context but could point out a trailing space :-P
[13:54:21] serviceops, Commons, Multimedia, Operations, and 2 others: Only one thumbor server (thumbor1002) upgraded to librsvg 2.40.20-3 - https://phabricator.wikimedia.org/T220342 (jijiki) Open→Resolved
[13:57:23] serviceops, Core Platform Team, MediaWiki-General-or-Unknown, Operations, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) Even when backporting the above patches, I tried to run a simple...
[13:59:58] serviceops, Analytics, EventBus, Operations, and 5 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (Ottomata)
[14:00:03] serviceops, Analytics, EventBus, Operations, and 5 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (Ottomata) Open→Resolved
[14:00:39] i am on a train now. will fly back tomorrow morning and be out. i will be back Thursday US-morning
[14:11:34] service ops people...
[14:11:41] for the SRE summit travel spread
[14:11:46] on the return date it seems ok
[14:12:00] on june 9th, traveling to the summit, everyone seems to want to fly june 9th
[14:12:06] can we perhaps figure out disjunct travel times?
[14:40:16] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (akosiaris)
[14:40:32] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (akosiaris) p:...
[14:41:40] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (akosiaris)
[14:41:45] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (akosiaris) p:Triage→Normal
[14:41:57] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (akosiaris)
[14:42:22] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: Migrate ORES to kubernetes - https://phabricator.wikimedia.org/T220400 (akosiaris)
[14:42:53] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris)
[14:43:03] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris) p:Triage→Normal
[14:43:51] serviceops, Core Platform
Team, Operations, Release Pipeline, and 2 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris)
[14:44:00] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) p:Triage→Normal
[14:44:05] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: Migrate ORES to kubernetes - https://phabricator.wikimedia.org/T220400 (akosiaris) p:Triage→Normal
[14:45:47] serviceops, Operations: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (akosiaris)
[14:46:07] serviceops, Operations: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (akosiaris)
[14:46:10] serviceops, Core Platform Team, Operations, Release Pipeline, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (akosiaris)
[14:46:18] serviceops, Operations: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (akosiaris) p:Triage→Normal
[14:47:30] serviceops, Operations: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 (akosiaris)
[14:47:38] serviceops, Operations: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 (akosiaris) p:Triage→Normal
[14:55:44] _joe_: fsero: fyi, etcd latencies for eqiad just peaked. https://grafana.wikimedia.org/d/000000435/kubernetes-api?panelId=12&fullscreen&orgId=1&from=1554732948478&to=1554735130514.
Looks like etcd1003 wasn't having a nice time per https://grafana.wikimedia.org/d/000000377/host-overview?panelId=3&fullscreen&orgId=1&var-server=etcd1003&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&from=1554732989121&to=1554735220210
[14:56:03] it hasn't harmed us in any way from what I can tell, just pointing it out
[14:56:13] it's all IOwait btw
[14:56:37] <_joe_> that's usually an issue etcd is sensitive to
[14:56:44] <_joe_> but we need to migrate to etcd3
[14:57:34] ok I think I just got the root cause https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=1554732989121&to=1554735220210&var-server=ganeti1007&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ganeti
[14:57:50] <_joe_> I'm knee deep in unicode php shit right now
[14:58:07] drbd was desynced and had to be resynced for etcd1003
[14:58:10] probably even more
[14:58:27] etcd is quite sensitive to iowait
[14:58:43] maybe a compelling case to move etcd servers from ganeti at some point
[14:59:13] oh this is not ganeti...
[14:59:18] Apr 8 14:44:33 ganeti1007 kernel: [6997824.688012] drbd resource5: Connection closed
[14:59:18] Apr 8 14:44:35 ganeti1007 kernel: [6997826.520265] drbd resource21: Connection closed
[14:59:18] Apr 8 14:47:36 ganeti1007 kernel: [6998007.729488] drbd resource7: Connection closed
[14:59:22] this is a network issue
[14:59:37] it's just that we noticed it at the etcd level
[14:59:44] due to drbd
[14:59:46] ahh ok ok
[15:00:37] next question would be what caused the network issue
[15:00:39] there is something fishy with network errors there
[15:00:51] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=11&fullscreen&orgId=1&var-server=ganeti1007&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ganeti&from=1554732989121&to=1554735220210
[15:00:57] it's not many packets
[15:01:03] but still ...
[15:01:27] why do no other hosts exhibit that?
[15:05:27] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) a:Papaul→jijiki DIMM_A2 replaced
[15:08:21] might be unrelated but i also see a lot of net activity on ganeti1001 because of drbd
[15:08:25] around 14:10
[15:08:53] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (jijiki) @Papaul thank you! Pooling ...
[15:10:13] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (jijiki) Open→Resolved Will reopen if there are issues. Thank you all!
[15:10:14] etcd being sensitive to iowait is true, I remember when we had issues because of the mdadm monthly checkarray
[15:40:24] serviceops, Core Platform Team, MediaWiki-General-or-Unknown, Operations, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) I tried to write a simplistic implementation of such a function:...
[15:47:29] serviceops, Core Platform Team, MediaWiki-General-or-Unknown, Operations, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) It should be noted that if this seems too slow, we can still "fix...
[16:45:08] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (akosiaris) @WMDE-leszek Hi, sorry for not answering any sooner, last few weeks have been crazy indeed. Q4/Q2 started We can start work on...
[16:54:25] serviceops, Core Platform Team, Operations, Release Pipeline, and 3 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (Eevans)
[16:57:42] serviceops, Core Platform Team, Operations, Release Pipeline, and 4 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (Lydia_Pintscher)
[18:42:09] hello! I'm investigating using a Go app for RPKI validation, see https://github.com/cloudflare/gortr/releases Is there doc/procedure/etc on how to proceed? Mostly evaluating the amount of work needed for now. I guess it's not "put the binary in Puppet".
[18:45:22] serviceops, Operations, Release Pipeline, Release-Engineering-Team, and 4 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (mobrovac)
[18:46:16] serviceops, Operations, Release Pipeline, Release-Engineering-Team, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (mobrovac)
[18:46:41] serviceops, Operations, Release Pipeline, Release-Engineering-Team: Migrate ORES to kubernetes - https://phabricator.wikimedia.org/T220400 (mobrovac)
[18:47:33] serviceops, Operations, Release Pipeline, Release-Engineering-Team, and 3 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (mobrovac)
[18:49:15] serviceops, Operations, Release Pipeline, Release-Engineering-Team, and 4 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (mobrovac)
[18:52:21] serviceops, ChangeProp, Operations, Release Pipeline, and 5 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (mobrovac)
[19:52:00] serviceops,
MediaWiki-General-or-Unknown, Operations, Core Platform Team (PHP7 (TEC4)), PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (kchapman)
[19:52:31] serviceops, Core Platform Team Kanban, MediaWiki-General-or-Unknown, Operations, and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (kchapman)
[19:53:11] serviceops, Core Platform Team Backlog, MediaWiki-General-or-Unknown, Operations, and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (kchapman)
[19:53:32] serviceops, MediaWiki-General-or-Unknown, Operations, Core Platform Team (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (kchapman)
[19:54:04] serviceops, MediaWiki-General-or-Unknown, Operations, Core Platform Team (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (kchapman) @Joe did you get what you needed from CPT in IRC? If...
[20:54:35] serviceops, RESTBase, Core Platform Team (RESTBase Split (CDP2)), Core Platform Team Kanban (Doing), and 2 others: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (mobrovac) p:Triage→Normal
[20:54:45] serviceops, RESTBase, Core Platform Team (RESTBase Split (CDP2)), Core Platform Team Kanban (Doing), and 3 others: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (mobrovac)
[20:58:54] I tagged you guys here ^ for visibility and so that you are in the loop
[22:45:35] serviceops, Analytics, ChangeProp, Community-Tech, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (aezell)