[08:00:58] <_joe_> effie, jayme is push-notifications not using the service proxy?
[08:01:10] <_joe_> https://phabricator.wikimedia.org/T260247#6489959
[08:01:42] serviceops, Push-Notification-Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (Joe)
[08:01:51] _joe_: No. The understanding was that it does not connect to internal services
[08:02:25] serviceops, Push-Notification-Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (Joe) Or even better: it should use the ser...
[08:02:26] <_joe_> well it does :P
[08:02:33] obviously..
[08:02:33] <_joe_> see above
[08:02:35] I was under the same impression, let us discuss it with michael
[08:03:45] <_joe_> the presence of main_app.mwapi_uri is quite telling imho
[08:46:05] <_joe_> sigh there are still connections to ORES that are not via https
[08:46:11] * _joe_ goes spelunking
[08:50:07] serviceops, User-jijiki: Improve the New Service Request documentation and template - https://phabricator.wikimedia.org/T263723 (jijiki)
[09:35:37] <_joe_> hnowlan: do you have a task for SLOs for the API gateway?
[09:38:28] serviceops, Operations, Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (Joe)
[09:38:39] _joe_: yep, T254916
[09:42:58] serviceops, Operations, observability, Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (Joe)
[09:44:04] serviceops, Operations, Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (Joe)
[09:51:45] Any ideas why helmfile is trying to sync changes for both a "production" and "staging" release in a service, regardless of whether it's a prod or a staging service? changeprop-jobqueue in this case
[09:59:44] hnowlan: with "sync" it is trying to ensure state. So I think it just ensures that there is no staging release present (e.g. it would remove existing ones from prod environments)
[10:04:22] jayme: ahh okay, the double !log for each sync made me think there was a misconfig. that makes sense
[10:13:27] <_joe_> hnowlan: !log is something we still need to improve
[10:13:33] <_joe_> contributions welcome :P
[10:49:44] <_joe_> hnowlan: so... T263727 is the thing we need to do in order to obtain information about the rest api latencies
[11:04:10] <_joe_> hnowlan: ugh I have a few issues with changeprop's config, we need to talk about it later
[11:04:23] <_joe_> but basically, it should switch to use the https uri for ORES
[11:05:03] <_joe_> and also for restbase :)
[11:05:08] <_joe_> I'll send a patch I guess
[11:06:38] _joe_: oh, thanks for the ticket - I'll have a look
[11:06:46] interestingly I am debugging issues with changeprop-jobqueue right now
[11:07:15] <_joe_> what issues?
[11:07:26] <_joe_> maybe it's due to something we did
[11:07:35] <_joe_> do you have a task?
[11:08:47] <_joe_> uhmmm although I doubt it. It just calls the jobrunners, and that didn't change
[11:08:51] not at the moment - it's just the staging instance thankfully, but the instance fails to start workers and crashloops
[11:08:54] prod instances are fine
[11:09:05] <_joe_> oh I see
[11:09:07] <_joe_> ok
[11:09:19] <_joe_> nice to see staging helps finding issues before they reach prod :)
[11:09:31] <_joe_> anyways, we can talk later, I'm taking a break
[11:09:39] sounds good
[11:31:23] oh dear, nothing quite as complex as I suspected, just an OOM kill
[13:36:40] hiya
[13:36:45] how can I delete a single pod?
[13:36:47] i used to do
[13:36:47] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Delete_a_specific_k8s_pod
[13:37:07] but now
[13:37:07] Error from server (Forbidden): pods "eventstreams-canary-6df87cd67f-8qshm" is forbidden: User "eventstreams" cannot delete resource "pods" in API group "" in the namespace "eventstreams"
[13:40:30] ottomata: you need to be "cluster admin" to delete pods (deploy users don't have the right to do so)
[13:41:11] ottomata: go for "sudo -i; kube_env admin ; kubectl -n eventstreams foo bar"
[13:41:23] ah k
[13:51:14] jayme: ....you know what would be really useful?
[13:51:22] some kind of logstash k8s service dashboard
[13:51:29] where you could select the service/app name from a drop down
[13:51:32] and get all logs for your service
[13:51:35] from the app
[13:51:48] serviceops, Maps, Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (MSantos) >>! In T238753#5678746, @MoritzMuehlenhoff wrote: > Ah, that explains, it was removed from Debian in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=9326...
[13:51:50] ...we don't have that already, do we?
[13:51:51] yeah! so nice of you that you want to create one :D
[13:51:54] haha
[13:51:57] i'll make a ticket
[13:52:02] logstash is really hard
[13:54:31] serviceops, Wikimedia-Logstash, Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (Ottomata)
[14:00:16] ottomata: mood
[14:06:12] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (Cmjohnson) a:Cmjohnson→RobH @robh power has been pulled and flea power drained
[14:33:47] serviceops, Wikimedia-Logstash, Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (Ottomata) I guess this is only possible in logstash-next: https://www.elastic.co/blog/interactive-inputs-on-kibana-dashboards
[14:38:36] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009241438_robh_29643...
[14:38:39] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] `
[14:39:16] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009241439_robh_30270...
[15:00:00] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] `
[15:00:13] serviceops, Wikimedia-Logstash, Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (Ottomata) Hm, perhaps: https://logstash-next.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=(filters%3A...
[15:22:53] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) Host is now online (reimaged) and returned to service post scap pull and repool. Set to active in netbox. However, it's not in the DSH node groups, and the directions aren't clear on where th...
[15:27:31] o/ hi all, What are the current hosting options for "static sites" that may have a build step in them before they are ready to be "static"?
[15:28:27] and also, I remember talking in the past about trying to get things like the query service UI (for wikidata) deployed via blubber etc so that we (wmde) could do our own deployments. Is that still "kind of okay"? or are there more options now?
[15:31:08] <_joe_> addshore: we're all in a meeting sorry
[15:31:13] ack! np!
[15:54:03] serviceops, Operations, Recommendation-API: recommendation-api alerting and api errors - https://phabricator.wikimedia.org/T262587 (crusnov) p:Triage→Medium
[16:00:17] serviceops, Wikidata, Wikidata-Termbox: Termbox service: unusual errors that could be from envoy - https://phabricator.wikimedia.org/T263764 (Tarrow)
[16:02:53] serviceops, Operations, observability: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (crusnov)
[16:11:22] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) After checking with @Joe via irc, it seems this should automatically be added back into DSH and clear after the puppet run and repooling, but has not. All other checks green, but I'd like to...
[16:13:24] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (Volans) It looks like it's marked as inactive on conftool: ` $ confctl select 'name=mw1360.eqiad.wmnet' get {"mw1360.eqiad.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=api_apps...
[16:30:50] serviceops, Operations, User-WDoran, User-brennen: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (brennen)
[16:36:06] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) Open→Resolved Ok, all is now green for the host in icinga and it shows in pooled/in service state.
[16:46:56] serviceops, Operations, ops-codfw: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (Papaul) Open→Resolved Disks removed from server and unrack
[18:20:17] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, and 2 others: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (Pchelolo)
[19:47:34] serviceops, Operations, Parsing-Team, Platform Engineering, and 5 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (daniel)
[19:47:54] serviceops, Operations, Parsing-Team, TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (daniel)
[19:48:40] serviceops, Operations, Parsing-Team, TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (daniel) a:holger.knust→None
[19:52:29] Is there a plan to make a buster + python3.7 base image? The current docker-registry.wikimedia.org/python3 is 3.5 + stretch.
[19:53:37] serviceops, Operations, Parsing-Team, TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (daniel) Looking into this some more, we came across a number of issues, namely: * Diffs and permalinks don...
[23:56:09] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling)