[09:02:25] any good URL to link to for "Docker Registry HA" Icinga monitor?
[09:13:30] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (akosiaris) >>! In T212189#5017053, @Tarrow wrote: >>>! In T212189#5011311, @akosiaris wrote: >> I have to say I am wondering a bit about th...
[09:15:19] tarrow: I am EU based, albeit it's damn flu season :(. I am around now if you need any help
[09:15:45] .pipeline/helm.yaml is the helm chart version the pipeline runs unit/integration tests against
[09:16:35] akosiaris: thanks! hope you're ok. Don't feel you need to add my questions to your todo list if you should be resting
[09:17:08] oh, I am the only one in the house without the flu (yet)
[09:17:20] it's the rest of the house that needs help, me tending to it
[09:17:35] :( sad times
[09:17:58] just February/March from what I hear :-)
[09:18:05] anyway, what can I help with?
[09:20:59] I don't suppose there are instructions for setting up my own "staging" environment or even the whole pipeline, are there? I'm just trying to feel more confident that everything will just drop in when we start deploying. I think in general I'm still not super clear in my own head about how everything sticks together right now
[09:22:01] well written instructions, no there aren't. Setting up your own staging is not that difficult though
[09:22:18] you need virtualbox && minikube && helm
[09:22:52] hmm I think I have some preliminary draft docs somewhere in the wiki, lemme see
[09:23:31] ah https://wikitech.wikimedia.org/wiki/User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps
[09:24:07] so this should allow you to test your container image, your helm chart and even do some basic benchmarking against it
[09:24:31] great! that sounds good
[09:24:46] you can bypass the pipeline for now for fast development by reusing the docker daemon in minikube
[09:25:26] you can get the dockerfile the pipeline uses by talking to https://blubberoid.wikimedia.org/
[09:25:43] and POSTing the blubber.yaml file
[09:27:00] one question I had when making the helm chart is: where is the name of the production image decided? e.g. I saw both wikimedia/blubber and wikimedia/mediawiki-services-citoid
[09:27:35] it's the name of the repo in gerrit with the slashes turned into dashes
[09:28:04] cool
[09:31:24] Thanks for all the hand holding :)
[09:34:06] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (WMDE-leszek) > Final question and just for verification, this ain't going to be exposed directly to the internet after all, right? Rather t...
[10:46:10] i noticed in lvs::monitor_services we have Icinga checks for all the services, mobileapps, graphoid, citoid and so on. all of them have one eqiad and one codfw check. just kartotherian only has codfw. is that expected?
[10:58:28] ah it was back then
[10:58:38] but that no longer holds, so it should be fixed
[10:58:56] akosiaris: ok, i will upload a fix
[14:29:40] serviceops, CX-cxserver, Citoid, Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (Pchelolo)
[14:34:09] serviceops, Operations, RESTBase, RESTBase-API, and 2 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (Pchelolo)
[14:36:08] serviceops, Operations, RESTBase-API, TechCom, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (Pchelolo) Open→Resolved a:Pchelolo I think we have a consensus to go spec-compliant here. I've...
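[editor's note] The naming rule given at 09:27:35 (Gerrit repo name with the slashes turned into dashes) can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the function name is hypothetical, the `wikimedia` namespace prefix is inferred only from the two image names quoted in the chat, and the repo path `mediawiki/services/citoid` is assumed from the image name `wikimedia/mediawiki-services-citoid`.

```python
def image_name_for_repo(gerrit_repo: str, namespace: str = "wikimedia") -> str:
    """Derive an image name from a Gerrit repo path.

    Hypothetical helper: slashes in the repo path become dashes, and the
    result is placed under an assumed registry namespace.
    """
    return f"{namespace}/{gerrit_repo.strip('/').replace('/', '-')}"


# Assumed repo path, inferred from the image name mentioned in the chat.
print(image_name_for_repo("mediawiki/services/citoid"))
```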
[14:57:12] serviceops, Operations, RESTBase, RESTBase-API, and 2 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (Pchelolo)
[14:58:21] serviceops, Operations, RESTBase-API, TechCom, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (Pchelolo)
[14:58:42] serviceops, Operations, RESTBase-API, TechCom, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (Pchelolo)
[14:59:42] serviceops, CX-cxserver, Citoid, Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (Pchelolo)
[15:00:25] serviceops, Operations, RESTBase, RESTBase-API, and 2 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (Pchelolo)
[15:26:05] ottomata: so, gerrit is no longer required to be accessible by eventgate-analytics, right? Should I remove the outgoing firewall hole?
[15:26:23] ah, yes let's remove.
[15:26:29] thanks!
[15:27:06] and to answer your question the other day. You no longer need to wait > 3 mins between merging something in the charts repo and it being available on deploy1001
[15:27:14] I've lowered the crons to 1 min
[15:30:38] ok cool!
[15:48:48] 'role::webperf::profiling_tools' includes passwords::ldap::production which is neither a role nor a profile <-- just using
[15:48:57] "class { '::passwords::ldap::production': }" instead
[16:09:07] serviceops, Operations, RESTBase, RESTBase-API, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (mobrovac)
[16:15:16] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki)
[16:19:43] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki) Server has been depooled and downtimed on icinga for 48 hours, @Cmjohnson you can power it down any time, tx :)
[16:20:54] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (jijiki) a:RobH
[16:29:41] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (RobH) next steps: * verify there is no ongoing SWAT or config change windows being pushed to the mw cluster * update the firmware of the bios * clear all memory errors fr...
[16:55:22] serviceops, CX-cxserver, Citoid, Graphoid, and 11 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (mobrovac)
[17:03:12] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (RobH) a:RobH→Papaul ` Record: 1 Date/Time: 01/15/2015 23:20:17 Source: system Severity: Ok Description: Log cleared. -------------------------------...
[17:04:16] hm, akosiaris i wonder if there is some way we can indicate the k8s cluster of DC in the logstash logs. perhaps the best way would be to include the cluster name in the pod host name?
[17:04:27] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (RobH) Actually, this is out of warranty as of Jan. 19, 2018. So we may just want to decommission this host, unless we want to try just swapping the memory with an on site...
[17:04:33] looking at messages in logstash, i don't have any way of telling which logs come from which DC cluster
[17:11:09] ottomata: yeah we will do that once the logging pipeline is in place
[17:11:19] which is almost there, but no time
[17:21:19] heya
[17:21:26] Pchelolo: and I just upgraded the eventgate-analytics chart
[17:21:32] it kinda works, but also it does not
[17:21:37] and we are a bit at a loss atm.
[17:21:52] sometimes there are errors about produce requests to kafka timing out
[17:21:53] but
[17:22:01] there are also ones directly from mediawiki about not being able to reach the service
[17:22:02] e.g.
[17:22:04] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.03.13/mediawiki?id=AWl4DxZMaxeebUfQn_Au&_g=()
[17:22:26] there are events that get through tho
[17:22:47] hmm and actually
[17:22:53] it seems to be only one pod causing this problem.....
[17:22:59] i'm going to delete that offending pod and see what happens
[17:23:03] pod is eventgate-analytics-production-5d866bc9dd-9nhbg
[17:23:14] https://logstash.wikimedia.org/goto/84cffca0d114a72962fb863861cfa1fa
[17:25:26] Pchelolo i think that might have fixed it.
[17:25:28] but is very strange
[17:25:32] ottomata: it has indeed fixed the eventgate->kafka, but there's still a bunch of 'couldn't connect to the server' on MW side
[17:25:37] oh?
[17:25:42] i need to have a dash with that...
[17:25:57] i wish we could tell which host was responding to the lvs request...
[17:26:06] or
[17:26:11] where it was being routed.
[17:26:27] https://logstash.wikimedia.org/goto/4e1b1843d875f7033fe8b9fdf10bdaa1
[17:26:32] oh ya i have that thank you
[17:26:59] regarding eventgate itself issue - I might have an idea what happened
[17:27:07] oh?
[17:27:18] I'll verify later
[17:27:30] that was the pod that had 'broker transport failure'
[17:30:11] it must be a single pod problem
[17:30:27] most of the time eventgate-analytics.svc.eqiad.wmnet.31192 responds correctly
[17:30:31] i'm tcpdumping on a mw host now
[17:30:37] one that's failed before
[17:32:54] akosiaris: is there a way I can send an http request directly to a pod port?
[17:33:40] ottomata: look at eventgate-analytics-production-5d866bc9dd-kwflb
[17:34:19] nah, that's not it..
[17:34:36] yeah, seems ok, nothing on eventgate side is logging errors now.
[17:34:42] i can get logs from that host
[17:34:45] pod*
[17:34:53] but i don't know how to target it with http req
[17:35:04] i don't think i can.
[17:35:14] i can target a kube host on the service port
[17:35:22] but k8s will route that to any pod
[17:36:33] oh but why would that host even have logs about timeouts, if it was just the one pod i thought was the offender
[17:46:58] found it.
[17:47:16] well
[17:47:20] the offending pod at least
[17:47:24] still investigating.
[17:51:14] ottomata: what's the name of the pod?
[17:51:26] well i have the IP
[17:51:31] figuring out how to map that to the name...
[17:52:49] eventgate-analytics-production-5d866bc9dd-kbhlw
[17:53:33] ah Pchelolo
[17:53:36] ha! that's the one I found too - it had some logs regarding message time outs
[17:53:37] that's the one with broker transport failure
[17:53:45] its logs say
[17:53:46] first worker died during startup, continue startup
[17:53:49] oh, no..
[17:53:58] but it doesn't restart the service? or the pod?
[17:54:08] i would expect the pod to be marked as failed
[17:54:12] i can't curl it.
[17:54:41] curl 10.64.64.96:8192/_info
[17:54:41] vs
[17:54:45] ottomata: I think it's a bug in service-runner
[17:54:48] curl 10.64.64.96:8192/_info
[17:54:49] oh ya?
[17:55:04] but even so, k8s should restart it?
[17:56:05] I don't think so.. the master was still alive, the worker failed
[17:56:10] ya but it has a readiness check
[17:56:16] k8s does
[17:56:21] that is requesting /?spec
[17:56:39] Pchelolo: we have num_workers: 1
[17:56:42] perhaps we should have 0?
[17:57:09] nono, I think the idea was to have 1 worker since it's easier/more lightweight to restart a worker than a pod
[17:59:37] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Cmjohnson) I moved DIMM from B side to A side and cleared the log...let's give it a day or so and see if the error follows.
[18:00:57] Pchelolo: i just did a kill -HUP 1 in the pod
[18:01:12] and it seemed to restart the service runner inside
[18:01:17] the pod didn't itself restart
[18:01:44] i'm expecting the mw eventbus logs to chill out now
[18:01:48] but. ya that shouldn't happen
[18:01:53] two weird things there shouldn't happen.
[18:02:46] 1: why was eventgate pod A not able to talk to kafka
[18:02:46] 2: why did eventgate pod B fail talking to kafka on startup
[18:02:46] 3: why did eventgate pod B not get its worker restarted
[18:02:51] that's 3 things :[
[18:02:53] :p
[18:05:14] still timeouts!
[18:08:28] HMMMM now that pod is timing out to kafka
[18:08:32] very strange
[18:08:33] deleting it
[18:11:43] looking ok....
[18:14:34] ottomata: ok, there is a bug in service-runner for sure
[18:16:48] that + something very weird k8s side.
[18:19:03] yeye, k8s did not behave correctly either
[18:19:09] fixing the service-runner
[18:19:22] oh really?
[18:19:29] cool.
[18:22:42] wow phew that was weird.
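[editor's note] The step at 17:51:31 (mapping a pod IP back to a pod name) can be sketched as follows. This is a minimal illustration, not what was actually run in the channel: it assumes the pod list has been obtained as JSON, for example from `kubectl get pods -o json`, and reads the standard `status.podIP` and `metadata.name` fields. The function name, pod names and IPs below are hypothetical sample data.

```python
import json
from typing import Optional


def pod_name_for_ip(pod_list: dict, ip: str) -> Optional[str]:
    """Return the name of the pod whose status.podIP equals ip, else None."""
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("podIP") == ip:
            return pod["metadata"]["name"]
    return None


# Hypothetical sample, shaped like the output of `kubectl get pods -o json`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "example-pod-a"}, "status": {"podIP": "10.0.0.11"}},
    {"metadata": {"name": "example-pod-b"}, "status": {"podIP": "10.0.0.12"}},
    {"metadata": {"name": "example-pod-pending"}, "status": {}}
  ]
}
""")

print(pod_name_for_ip(sample, "10.0.0.12"))
```

For a quick manual check, `kubectl get pods -o wide` prints the pod IP next to each pod name, which gives the same mapping without any scripting.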
[18:22:47] Pchelolo: i'd be ok with group1 now
[18:27:38] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) IDRAC firmware complete please depool server for BIOS upgrade
[18:27:47] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Cmjohnson) @MoritzMuehlenhoff can you please depool the server
[18:28:06] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) Power State ON System Model PowerEdge R420 System Revision I System Host Name Operating System Operating System Version Service Tag B14K842 Express Service...
[18:35:59] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (RobH) So attempting to boot this system shows: Error: Memory initialization warning detected. I cannot get what dimm has error, since it is overwritten on serial...
[18:36:56] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (RobH) ` Error: Memory initialization warning detected. Management Engine Mode : Active ManagementeEngineeFirmwaretVersiongent: 0002.0001 Copyright (C)...
[18:38:16] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (RobH) Next steps: * Chris attach crash cart ** output on crash cart won't be overwritten like serial console, so the POST error will denote which dimm is bad. ** if...
[18:48:37] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (jijiki) @Cmjohnson Server has been depooled, ping us to pool it back, tx!
[18:52:52] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (jijiki) @Papaul Server has been depooled, thank you!
[18:55:07] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Cmjohnson) DIMM A1 is now showing bad so it looks like a DIMM replacement is needed.
[18:56:03] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Cmjohnson) Record: 42 Date/Time: 03/10/2019 07:43:40 Source: system Severity: Non-Critical Description: Correctable memory error rate exceeded for DIMM_B1. -----------------------------------...
[18:59:59] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Cmjohnson) Swapped DIMM B1 with A1 cleared idrac log.
[19:42:22] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) Service Tag B14K842 Express Service Code 24012733970 BIOS Version 2.6.0 Firmware Version 2.61.60.60 IP Address(es) 10.193.2.106 iDRAC MAC Address
[19:45:20] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) System log clear Server can be repool Monitoring DIMM error
[21:06:59] serviceops, Operations, Wikimedia-Incident: Influx of service errors Mar 13: 19:12-19:14 UTC and 20:12-20:15 - https://phabricator.wikimedia.org/T218255 (jijiki)
[21:10:12] serviceops, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Kanban (Doing), and 2 others: Influx of service errors Mar 13: 19:12-19:14 UTC and 20:12-20:15 - https://phabricator.wikimedia.org/T218255 (mobrovac)
[21:41:03] serviceops, Analytics, EventBus, Operations, Prod-Kubernetes: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (Ottomata)
[21:41:24] serviceops, Analytics, EventBus, Operations, Prod-Kubernetes: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (Ottomata) @akosiaris let's try to figure this out tomorrow. :)