[06:15:42] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10Nikerabbit) Per https://integration.wikimedia.org/ci/job/tr...
[06:52:12] 10serviceops, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (10greg)
[08:03:51] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10akosiaris) p:05Normal→03High >>! In T228196#5342893, @t...
[08:27:32] 10serviceops, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Ores hosts: mwparserfromhell tokenizer random segfault - https://phabricator.wikimedia.org/T222866 (10akosiaris)
[08:27:50] 10serviceops, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Ores hosts: mwparserfromhell tokenizer random segfault - https://phabricator.wikimedia.org/T222866 (10akosiaris) Indeed. done. Thanks!
[09:06:00] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) i've uploaded the missing layers from a backup, it w...
[09:07:36] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) fixes also docker-registry.wikimedia.org/releng/comp...
[10:40:14] 10serviceops, 10CirrusSearch, 10Discovery-Search, 10Patch-For-Review: Intermittent connect timeout for CirrusSearch connections - https://phabricator.wikimedia.org/T228063 (10jijiki) 05Open→03Resolved Let's reopen if the issue persists.
[13:16:12] 10serviceops, 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) 05Resolved→03Open Thanks, I can confirm the component is around and it addresses the concern of mixing up up...
[14:21:03] I don't suppose I could get anyone to take another glance at https://phabricator.wikimedia.org/T226814? It would be super awesome to get some feel for the timescale it might take :)
[14:30:14] 10serviceops, 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10akosiaris) >>! In T226236#5345617, @hashar wrote: > Thanks, I can confirm the component is around and it addresses the c...
[14:33:01] tarrow: was busy with other things I hope to take a look tomorrow
[14:33:37] fsero: Cheers! I know some of them were other problems I brought up so thanks! :)
[14:49:58] 10serviceops, 10Wikibase-Termbox-Iteration-20, 10Wikidata-Termbox-Iteration-19, 10Patch-For-Review: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10akosiaris) I 've brought this up in the weekly SRE meeting. Overall there's a number of concerns. I 'll be listing...
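A minimal sketch of how one could check that a restored layer (T228196 above) is actually back in the registry, assuming the standard Docker Registry v2 HTTP API; the image name and digest below are placeholders, not values taken from the task.

```sh
# Hypothetical image/digest; substitute real ones from the affected image's manifest.
IMAGE="releng/some-image"
DIGEST="sha256:<layer-digest>"
# A HEAD request on the blob returns HTTP 200 when the layer exists in the registry.
curl -sI "https://docker-registry.wikimedia.org/v2/${IMAGE}/blobs/${DIGEST}" | head -n1
# Pulling the image end to end is the simpler confirmation that all layers are intact.
docker pull "docker-registry.wikimedia.org/${IMAGE}"
```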
[14:56:32] tarrow: I did answer
[14:56:41] and saw your ping just right now :-)
[14:59:00] akosiaris: thanks! I think we need a new helm release not just pointing it at staging; I'm more than happy to skip LVS though
[14:59:20] The reason is I think we need staging for actually staging before going to www.wikidata.org
[14:59:44] makes sense to me, that's why I offered both alternatives
[15:01:02] akosiaris: cool; are the patches that I hacked together for it basically sufficient then if we just abandon the LVS ones?
[15:05:39] tarrow: in a meeting, will reply in a bit
[15:05:54] Thanks! :)
[15:18:43] 10serviceops, 10Wikibase-Termbox-Iteration-20, 10Wikidata-Termbox-Iteration-19, 10Patch-For-Review: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10Tarrow) This sounds great! Actually: > ... create some DNS hostname like termbox-test.staging.svc.eqiad.wmnet poin...
[16:09:03] tarrow: the patches you hacked probably will require some changes, I have the patchset range, I 'll try and comment on them today
[16:10:36] akosiaris: awesome! thank you. I added how I anticipated they would need changing to the ticket
[16:29:53] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, 10Wikimedia-production-error: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10jijiki)
[16:39:18] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, 10Wikimedia-production-error: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) >>! From IRC: > [2019-07-17 15:02:…] nc_redis.c...
[16:59:54] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle)
[17:00:17] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) Re-blocking the train. Turns out that while wmf.13 is also affected, it is...
[17:02:45] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10greg) p:05Triage→03Unbreak!
[17:24:20] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10elukey) I have isolated on logstash mw1345 and it seems that the nutcracker errors h...
[17:48:55] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) p:05Unbreak!→03High Prod recovered.
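A hedged sketch (not taken from T228303) of how one might reproduce the Redis error above from an affected appserver such as mw1345: MediaWiki talks to Redis through a local nutcracker (twemproxy) unix socket, so connectivity can be checked directly. It assumes the systemd unit is named nutcracker and uses the socket path quoted in the task title.

```sh
# Is the nutcracker (twemproxy) unix socket present?
ls -l /var/run/nutcracker/redis_eqiad.sock
# redis-cli can talk to the socket directly; a healthy proxy answers PONG.
redis-cli -s /var/run/nutcracker/redis_eqiad.sock ping
# Check whether the proxy itself restarted or is logging nc_redis.c errors.
sudo systemctl status nutcracker
sudo journalctl -u nutcracker --since "1 hour ago"
```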
[17:48:59] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle)
[18:01:12] 10serviceops, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (10Dzahn) A side effect of this is also that when reimaging hosts fails now. wmf-au...
[18:01:51] 10serviceops, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (10Dzahn) p:05Triage→03High
[18:48:37] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Dzahn) ran wmf-auto-reimage-host on it. OS is freshly installed though the first puppet run fails because it tries to run scap pull and this is currently broken (T228328) so this...
[18:48:58] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Dzahn) 05Open→03Stalled
[19:24:38] When following https://wikitech.wikimedia.org/wiki/Migrating_from_scap-helm, how long would one assume that `helmfile apply` would block on "Upgrading staging/kask" ?
[19:25:49] How long should I wait before assuming it's failed, and are there any recommended steps if it has?
[19:26:16] fsero: akosiaris , this helmfile stuff is WAY nicer
[19:26:19] thank you for this!
[19:33:39] urandom: it's configured to wait for 300 secs
[19:33:57] If the deployment change does not progress helmfile will fail
[19:34:18] ottomata: thanks :) we are glad you like it :)
[19:34:41] i just did my first deployment to staging with it
[19:34:55] i love that I don't have to think too hard about editing CLI params
[19:35:03] and the diff makes it clear what is happening
[19:35:23] AND the values are finally in git :)
[19:38:44] fsero: always been confused about this:
[19:38:56] where does the 'stable' come from in chart: stable/eventgate
[19:38:56] ?
[19:47:55] It's part of the helm configuration on deploy1001
[19:48:22] We override stable to point to our releases helm repo instead of the Kubernetes one
[19:48:56] Because we do not allow helm in the cluster or the deployment servers to contact the default helm repo
[19:49:20] But fetching releases from stable is part of the helm workflow when doing an install or upgrade
[19:49:30] That is the way to circumvent that issue
[19:49:45] Because we want helm to only use our releases
[19:50:26] ah ok, weird. so its kinda just a workaround
[19:50:29] to keep helm from reaching out
[19:50:30] ?
[19:53:47] fsero: any insight into my error?
[19:53:50] https://www.irccloud.com/pastebin/5Evkm25q/
[19:54:00] should I just... try again?
[19:54:20] is there a step that can be taken at this point to yield more information? logs perhaps?
[19:54:54] yikes
[19:55:00] https://www.irccloud.com/pastebin/DgHNUqcC/
[19:55:46] `helmfile diff` seems to show nothing anymore, and `status` shows it in a failed state
[19:57:54] urandom: check out cluster status
[19:58:05] There is maybe a crashloopbackoff
[19:58:19] Or another error that could tell you why it failed
[19:58:57] fsero: cluster status?
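A sketch of the checks fsero walks through next for a failed `helmfile apply`, using the pod name from the pastes above; the namespace (sessionstore) is an assumption based on later messages, and helmfile's ~300s wait means a release whose pods never become Ready ends up in this failed state.

```sh
# Which pods exist, and are they Ready or stuck in CrashLoopBackOff?
kubectl get pods -n sessionstore
# Events for the pod: failed liveness/readiness probes, restarts, image pull errors.
kubectl describe pod kask-staging-6d45df6bcf-kdslw -n sessionstore
# Logs from the previous (crashed) container instance.
kubectl logs kask-staging-6d45df6bcf-kdslw -n sessionstore --previous
# Full object, including lastState and the container exit code (as used below).
kubectl get pod kask-staging-6d45df6bcf-kdslw -n sessionstore -o yaml
```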
[19:59:01] fsero:
[19:59:01] does
[19:59:02] https://wikitech.wikimedia.org/wiki/Migrating_from_scap-helm#Rolling_back_changes
[19:59:19] mean that you'd prefer rollbacks to previous releases didn't happen?
[19:59:25] https://www.irccloud.com/pastebin/7zRgxpf9/
[20:00:25] a revert to a helmfile.d is a new release revision, right?
[20:00:35] urandom: not literal! :P
[20:00:41] https://www.irccloud.com/pastebin/F6bXc71a/
[20:00:50] your pod is crashing sir
[20:01:06] well, there is a command called "cluster", so I thought I'd try
[20:01:21] https://www.irccloud.com/pastebin/w0QoPJTF/
[20:01:42] this is the lastState of the container
[20:01:48] the process ended with an error code of 2
[20:01:53] so something went wrong
[20:01:57] i got that running kubectl get pods kask-staging-6d45df6bcf-kdslw -o yaml
[20:02:15] ottomata: yes, a revert is a new release revision
[20:02:31] before it was a deploy of a previous release revision with helm, right?
[20:02:34] we are just discouraging that now?
[20:02:53] since the revert of the helmfile commit should mean that the new release revision should be identical to the previous one?
[20:03:20] for helmfile anything that modifies the values.yaml is a new revision
[20:03:28] so if you have image: a in values at T1
[20:03:36] change it to image: b at T2
[20:03:45] and revert to image: a at T3
[20:03:53] there would be 3 deploys
[20:04:46] so we are not discouraging it, before we would run scap-helm upgrade bla bla point_to_old_values
[20:04:50] now it's managed via helmfile
[20:04:56] right, but previously we could also do
[20:05:06] scap helm upgrade --version
[20:05:08] to rollback
[20:05:16] or something
[20:05:23] but ya i see that it is discouraged and for emergencies
[20:05:24] +1
[20:05:30] i'm just updating my eventgate docs
[20:05:38] based on your helmfile ones
[20:05:47] that is still possible but discouraged
[20:05:51] aye
[20:05:52] cool
[20:05:57] if you really really want to do it
[20:06:38] https://www.irccloud.com/pastebin/7NNDu5vX/
[20:06:42] you can do that
[20:06:54] problem is there would not be any commit in deployment-charts
[20:07:10] so i'd say only use it for life-and-death emergencies
[20:07:15] usually go through git
[20:07:31] aye
[20:07:36] (im using kask in the example but i think you get it)
[20:07:39] i like it
[20:11:43] you can ask as many questions as you like or need, but i'll answer them in an async fashion now
[20:11:47] ttyl :=
[20:12:30] laters :)
[20:16:35] 10serviceops, 10RESTBase, 10Core Platform Team (RESTBase Split (CDP2)), 10Epic, 10User-mobrovac: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (10WDoranWMF) Note: This needs to be scheduled as work with SRE in order to complete.
[20:23:28] 10serviceops, 10Operations, 10Release Pipeline, 10Core Platform Team (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Pchelolo)
[20:25:43] 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)): Allow service-checker to run multiple domains for RESTBase - https://phabricator.wikimedia.org/T227198 (10WDoranWMF)
[20:33:21] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Workboards (Team 2): k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10WDoranWMF)
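The rollback discussion above (20:02-20:12) boils down to two paths; this is a hedged sketch of both, with the repository path, release name and chart layout chosen for illustration rather than taken from the elided pastes.

```sh
# Preferred path: revert the values change in deployment-charts so the rollback
# is itself a recorded revision, then deploy it with helmfile.
cd /srv/deployment-charts/helmfile.d/services/staging/kask   # illustrative path
git revert <bad-commit>       # restores the previous values.yaml (image: a again)
helmfile diff                 # confirm the diff puts back the old values
helmfile apply                # deploys revision N+1, identical in content to the old release
# Emergency-only path (leaves no commit in deployment-charts, hence discouraged):
helm history kask-staging               # illustrative release name
helm rollback kask-staging <REVISION>
```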
[20:33:40] 10serviceops, 10Core Platform Team (Multi-DC (TEC1)): k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10WDoranWMF)
[20:36:14] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10aaron)
[20:37:00] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10aaron) 05Open→03Resolved a:03aaron
[20:39:55] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)): Problems deploying sessionstore service (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10WDoranWMF) 05Open→03Resolved a:03WDoranWMF
[20:55:10] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Workboards (Clinic Duty Team), 10User-Clarakosi, 10User-Eevans: Package table_properties utility for Debian - https://phabricator.wikimedia.org/T226551 (10WDoranWMF)
[20:58:38] expected that the catalogue would contain Monitoring::Service[something] with notes_url set to "https://wikitech.wikimedia.org/wiki/Monitoring" but it is set to "'https://wikitech.wikimedia.org/wiki/Monitoring'"
[20:59:24] akosiaris: ^ re: nrpe spec / not sure how to fix that. if i don't quote it there is an error and if i quote it with ' there is this other one
[21:02:01] that's in line 29 of https://gerrit.wikimedia.org/r/c/operations/puppet/+/496830/21/modules/nrpe/spec/defines/monitor_service_spec.rb for
[21:02:10] "nrpe::monitor_service with ensure present contains normal resources"
[21:22:50] 10serviceops, 10Operations, 10Core Platform Team (Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MSantos)
[21:44:17] 10serviceops, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (10greg) a:03thcipriani
[21:50:15] fsero: (while noting that you may not be around for a while), so to clarify, the app running in the container exited with status code 2?
[21:50:22] yes
[21:50:26] because that doesn't seem right
[21:50:32] why not?
[21:50:45] well, by the logs, startup seems to have been normal
[21:51:43] I'm not even sure how a status code 2 would be returned
[21:53:02] fsero: for example, I know of an error scenario that would return status 1 (and log loudly)
[21:53:36] well that error code refers to the container, not only the app (i think that was your first comment)
[21:53:40] and checking events.. i do see
[21:53:53] https://www.irccloud.com/pastebin/Yqj2ropl/
[21:54:00] aha!
[21:54:05] I was about to ask
[21:54:24] so a readiness probe failure would explain this?
[21:54:26] you should be able to see that either with kubectl get events
[21:54:30] liveness more likely
[21:54:42] but yeah
[21:54:58] Error from server (Forbidden): events is forbidden: User "sessionstore" cannot list resource "events" in API group "" in the namespace "sessionstore"
[21:55:28] yeah that's something to fix on our end
[21:55:40] and if you create a phab task i'd appreciate it
[21:56:00] it's just a matter of authorization, changing the RBAC rules
[21:59:11] fsero: will do.
[22:00:57] holy crap, I think it worked
[22:02:03] it did
[22:24:42] 10serviceops, 10Scap, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): Deploy scap 3.11.1-1 - https://phabricator.wikimedia.org/T228482 (10thcipriani)
[23:03:29] [mw2250:~] $ scap pull
[23:03:39] File "/usr/lib/python2.7/dist-packages/scap/opcache_manager.py", line 3, in
[23:03:41] from concurrent.futures import ThreadPoolExecutor
[23:03:44] ImportError: No module named concurrent.futures
[23:03:56] that's new .. the scap pull issue is gone but this is there now
[23:26:44] mutante: that was a bug in the scap deb package... I think thcipriani wrote a fix?
[23:27:17] the deb manifest was missing a require for python-concurrent.futures
[23:28:51] bd808: yep, he showed that to me. i actually built the new scap version deb just now. only fighting with a key issue to publish it
[23:29:28] i import the signing key but reprepro can't find it
[23:51:12] fixed that.. new scap version on APT repo now
[23:51:21] trying on mw2250
[23:53:18] indices exported for buster and jessie.. just on stretch it doesn't know yet
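A small sketch of checking the scap fix described above on a host like mw2250: the ImportError points at the Python 2 backport package python-concurrent.futures, which the scap deb apparently did not declare as a dependency. The package and file names here are the standard Debian ones, not quoted from the actual patch.

```sh
# Does the backport import on this host?
python2.7 -c 'from concurrent.futures import ThreadPoolExecutor; print("ok")'
# If not, installing the backport package unblocks scap pull immediately.
sudo apt-get install -y python-concurrent.futures
# The durable fix lives in the package metadata, roughly:
#   debian/control:  Depends: ..., python-concurrent.futures
# After the new deb is published, confirm the dependency is declared:
apt-cache depends scap | grep concurrent
```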