[06:15:42] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10Nikerabbit) Per https://integration.wikimedia.org/ci/job/tr...
[06:52:12] 10serviceops, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (10greg)
[08:03:51] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10akosiaris) p:05Normal→03High >>! In T228196#5342893, @t...
[08:27:32] 10serviceops, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Ores hosts: mwparserfromhell tokenizer random segfault - https://phabricator.wikimedia.org/T222866 (10akosiaris)
[08:27:50] 10serviceops, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Ores hosts: mwparserfromhell tokenizer random segfault - https://phabricator.wikimedia.org/T222866 (10akosiaris) Indeed. done. Thanks!
[09:06:00] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) i've uploaded the missing layers from a backup, it w...
[09:07:36] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) fixes also docker-registry.wikimedia.org/releng/comp...
[10:40:14] 10serviceops, 10CirrusSearch, 10Discovery-Search, 10Patch-For-Review: Intermittent connect timeout for CirrusSearch connections - https://phabricator.wikimedia.org/T228063 (10jijiki) 05Open→03Resolved Let's reopen if the issue persists.
[13:16:12] 10serviceops, 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) 05Resolved→03Open Thanks, I can confirm the component is around and it addresses the concern of mixing up up...
[14:21:03] I don't suppose I could get anyone to take another glance at https://phabricator.wikimedia.org/T226814? It would be super awesome to get some feel for the timescale it might take :)
[14:30:14] 10serviceops, 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10akosiaris) >>! In T226236#5345617, @hashar wrote: > Thanks, I can confirm the component is around and it addresses the c...
[14:33:01] tarrow: was busy with other things I hope to take a look tomorrow
[14:33:37] fsero: Cheers! I know some of them were other problems I brought up so thanks! :)
[14:49:58] 10serviceops, 10Wikibase-Termbox-Iteration-20, 10Wikidata-Termbox-Iteration-19, 10Patch-For-Review: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10akosiaris) I 've brought this up in the weekly SRE meeting. Overall there's a number of concerns. I 'll be listing...
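A minimal sketch of how one could check that a restored layer (T228196 above) is actually back in the registry, assuming the standard Docker Registry v2 HTTP API; the image name and digest below are placeholders, not values taken from the task.

```sh
# Hypothetical image/digest; substitute real ones from the affected image's manifest.
IMAGE="releng/some-image"
DIGEST="sha256:<layer-digest>"
# A HEAD request on the blob returns HTTP 200 when the layer exists in the registry.
curl -sI "https://docker-registry.wikimedia.org/v2/${IMAGE}/blobs/${DIGEST}" | head -n1
# Pulling the image end to end is the simpler confirmation that all layers are intact.
docker pull "docker-registry.wikimedia.org/${IMAGE}"
```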
[14:56:32] tarrow: I did answer
[14:56:41] and saw your ping just right now :-)
[14:59:00] akosiaris: thanks! I think we need a new helm release not just pointing it at staging; I'm more than happy to skip LVS though
[14:59:20] The reason is I think we need staging for actually staging before going to www.wikidata.org
[14:59:44] makes sense to me, that's why I offered both alternatives
[15:01:02] akosiaris: cool; are the patches that I hacked together for it basically sufficient then if we just abandon the LVS ones?
[15:05:39] tarrow: in a meeting, will reply in a bit
[15:05:54] Thanks! :)
[15:18:43] 10serviceops, 10Wikibase-Termbox-Iteration-20, 10Wikidata-Termbox-Iteration-19, 10Patch-For-Review: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10Tarrow) This sounds great! Actually: > ... create some DNS hostname like termbox-test.staging.svc.eqiad.wmnet poin...
[16:09:03] tarrow: the patches you hacked probably will require some changes, I have the patchset range, I 'll try and comment on them today
[16:10:36] akosiaris: awesome! thank you. I added how I anticipated they would need changing to the ticket
[16:29:53] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, 10Wikimedia-production-error: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10jijiki)
[16:39:18] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, 10Wikimedia-production-error: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) >>! From IRC: > [2019-07-17 15:02:…] nc_redis.c...
[16:59:54] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle)
[17:00:17] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) Re-blocking the train. Turns out that while wmf.13 is also affected, it is...
[17:02:45] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10greg) p:05Triage→03Unbreak!
[17:24:20] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10elukey) I have isolated on logstash mw1345 and it seems that the nutcracker errors h...
[17:48:55] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle) p:05Unbreak!→03High Prod recovered.
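A hedged sketch (not taken from T228303) of how one might reproduce the Redis error above from an affected appserver such as mw1345: MediaWiki talks to Redis through a local nutcracker (twemproxy) unix socket, so connectivity can be checked directly. It assumes the systemd unit is named nutcracker and uses the socket path quoted in the task title.

```sh
# Is the nutcracker (twemproxy) unix socket present?
ls -l /var/run/nutcracker/redis_eqiad.sock
# redis-cli can talk to the socket directly; a healthy proxy answers PONG.
redis-cli -s /var/run/nutcracker/redis_eqiad.sock ping
# Check whether the proxy itself restarted or is logging nc_redis.c errors.
sudo systemctl status nutcracker
sudo journalctl -u nutcracker --since "1 hour ago"
```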
[17:48:59] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10Krinkle)
[18:01:12] 10serviceops, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (10Dzahn) A side effect of this is also that when reimaging hosts fails now. wmf-au...
[18:01:51] 10serviceops, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (10Dzahn) p:05Triage→03High
[18:48:37] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Dzahn) ran wmf-auto-reimage-host on it. OS is freshly installed though the first puppet run fails because it tries to run scap pull and this is currently broken (T228328) so this...
[18:48:58] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Dzahn) 05Open→03Stalled
[19:24:38] When following https://wikitech.wikimedia.org/wiki/Migrating_from_scap-helm, how long would one assume that `helmfile apply` would block on "Upgrading staging/kask" ?
[19:25:49] How long should I wait before assuming it's failed, and are there any recommended steps if it has?
[19:26:16] fsero: akosiaris , this helmfile stuff is WAY nicer
[19:26:19] thank you for this!
[19:33:39] urandom: it's configured to wait for 300 secs
[19:33:57] If the deployment change does not progress helmfile will fail
[19:34:18] ottomata: thanks :) we are glad you like it :)
[19:34:41] i just did my first deployment to staging with it
[19:34:55] i love that I don't have to think too hard about editing CLI params
[19:35:03] and the diff makes it clear what is happening
[19:35:23] AND the values are finally in git :)
[19:38:44] fsero: always been confused about this:
[19:38:56] where does the 'stable' come from in chart: stable/eventgate
[19:38:56] ?
[19:47:55] It's part of the helm configuration on deploy1001
[19:48:22] We override stable to point to our releases helm repo instead of the Kubernetes one
[19:48:56] Because we do not allow helm in the cluster or the deployment servers to contact the default helm repo
[19:49:20] But fetching releases from stable is part of the helm workflow when doing an install or upgrade
[19:49:30] That is the way to circumvent that issue
[19:49:45] Because we want helm to only use our releases
[19:50:26] ah ok, weird. so its kinda just a workaround
[19:50:29] to keep helm from reaching out
[19:50:30] ?
[19:53:47] fsero: any insight into my error?
[19:53:50] https://www.irccloud.com/pastebin/5Evkm25q/
[19:54:00] should I just... try again?
[19:54:20] is there a step that can be taken at this point to yield more information? logs perhaps?
[19:54:54] yikes
[19:55:00] https://www.irccloud.com/pastebin/DgHNUqcC/
[19:55:46] `helmfile diff` seems to show nothing anymore, and `status` shows it in a failed state
[19:57:54] urandom: check out cluster status
[19:58:05] There is maybe a crashloopbackoff
[19:58:19] Or another error that could tell you why it failed
[19:58:57] fsero: cluster status?
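A sketch of the checks fsero walks through next for a failed `helmfile apply`, using the pod name from the pastes above; the namespace (sessionstore) is an assumption based on later messages, and helmfile's ~300s wait means a release whose pods never become Ready ends up in this failed state.

```sh
# Which pods exist, and are they Ready or stuck in CrashLoopBackOff?
kubectl get pods -n sessionstore
# Events for the pod: failed liveness/readiness probes, restarts, image pull errors.
kubectl describe pod kask-staging-6d45df6bcf-kdslw -n sessionstore
# Logs from the previous (crashed) container instance.
kubectl logs kask-staging-6d45df6bcf-kdslw -n sessionstore --previous
# Full object, including lastState and the container exit code (as used below).
kubectl get pod kask-staging-6d45df6bcf-kdslw -n sessionstore -o yaml
```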
[19:59:01] fsero:
[19:59:01] does
[19:59:02] https://wikitech.wikimedia.org/wiki/Migrating_from_scap-helm#Rolling_back_changes
[19:59:19] mean that you'd prefer rollbacks to previous releases didn't happen?
[19:59:25] https://www.irccloud.com/pastebin/7zRgxpf9/
[20:00:25] a revert to a helmfile.d is a new release revision, right?
[20:00:35] urandom: not literal! :P
[20:00:41] https://www.irccloud.com/pastebin/F6bXc71a/
[20:00:50] your pod is crashing sir
[20:01:06] well, there is a command called "cluster", so I thought I'd try
[20:01:21] https://www.irccloud.com/pastebin/w0QoPJTF/
[20:01:42] this is the lastState of the container
[20:01:48] the process ended with an error code of 2
[20:01:53] so something went wrong
[20:01:57] i got that running kubectl get pods kask-staging-6d45df6bcf-kdslw -o yaml
[20:02:15] ottomata: yes, a revert is a new release revision
[20:02:31] before it was a deploy of a previous release revision with helm, right?
[20:02:34] we are just discouraging that now?
[20:02:53] since the revert of the helmfile commit should mean that the new release revision should be identical to the previous one?
[20:03:20] for helmfile anything that modifies the values.yaml is a new revision
[20:03:28] so if you have image: a in values at T1
[20:03:36] change it to image: b at T2
[20:03:45] and revert to image: a at T3
[20:03:53] there would be 3 deploys
[20:04:46] so we are not discouraging it, before we would run scap-helm upgrade bla bla point_to_old_values
[20:04:50] now it's managed via helmfile
[20:04:56] right, but previously we could also do
[20:05:06] scap helm upgrade --version
[20:05:08] to rollback
[20:05:16] or something
[20:05:23] but ya i see that it is discouraged and for emergencies
[20:05:24] +1
[20:05:30] i'm just updating my eventgate docs
[20:05:38] based on your helmfile ones
[20:05:47] that is still possible but discouraged
[20:05:51] aye
[20:05:52] cool
[20:05:57] if you really really want to do it
[20:06:38] https://www.irccloud.com/pastebin/7NNDu5vX/
[20:06:42] you can do that
[20:06:54] problem is there would not be any commit in deployment-charts
[20:07:10] so i'd say only use it for life-and-death emergencies
[20:07:15] usually go through git
[20:07:31] aye
[20:07:36] (im using kask in the example but i think you get it)
[20:07:39] i like it
[20:11:43] you can ask as many questions as you like or need, but i'll answer them in an async fashion now
[20:11:47] ttyl :=
[20:12:30] laters :)
[20:16:35] 10serviceops, 10RESTBase, 10Core Platform Team (RESTBase Split (CDP2)), 10Epic, 10User-mobrovac: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (10WDoranWMF) Note: This needs to be scheduled as work with SRE in order to complete.
[20:23:28] 10serviceops, 10Operations, 10Release Pipeline, 10Core Platform Team (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Pchelolo)
[20:25:43] 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)): Allow service-checker to run multiple domains for RESTBase - https://phabricator.wikimedia.org/T227198 (10WDoranWMF)
[20:33:21] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Workboards (Team 2): k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10WDoranWMF)
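The rollback discussion above (20:02-20:12) boils down to two paths; this is a hedged sketch of both, with the repository path, release name and chart layout chosen for illustration rather than taken from the elided pastes.

```sh
# Preferred path: revert the values change in deployment-charts so the rollback
# is itself a recorded revision, then deploy it with helmfile.
cd /srv/deployment-charts/helmfile.d/services/staging/kask   # illustrative path
git revert <bad-commit>       # restores the previous values.yaml (image: a again)
helmfile diff                 # confirm the diff puts back the old values
helmfile apply                # deploys revision N+1, identical in content to the old release
# Emergency-only path (leaves no commit in deployment-charts, hence discouraged):
helm history kask-staging               # illustrative release name
helm rollback kask-staging <REVISION>
```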
[20:33:40] 10serviceops, 10Core Platform Team (Multi-DC (TEC1)): k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10WDoranWMF)
[20:36:14] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10aaron)
[20:37:00] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team, and 2 others: Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection - https://phabricator.wikimedia.org/T228303 (10aaron) 05Open→03Resolved a:03aaron
[20:39:55] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)): Problems deploying sessionstore service (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10WDoranWMF) 05Open→03Resolved a:03WDoranWMF
[20:55:10] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Workboards (Clinic Duty Team), 10User-Clarakosi, 10User-Eevans: Package table_properties utility for Debian - https://phabricator.wikimedia.org/T226551 (10WDoranWMF)
[20:58:38] expected that the catalogue would contain Monitoring::Service[something] with notes_url set to "https://wikitech.wikimedia.org/wiki/Monitoring" but it is set to "'https://wikitech.wikimedia.org/wiki/Monitoring'"
[20:59:24] akosiaris: ^ re: nrpe spec / not sure how to fix that. if i don't quote it there is an error and if i quote it with ' there is this other one
[21:02:01] that's in line 29 of https://gerrit.wikimedia.org/r/c/operations/puppet/+/496830/21/modules/nrpe/spec/defines/monitor_service_spec.rb for
[21:02:10] "nrpe::monitor_service with ensure present contains normal resources"
[21:22:50] 10serviceops, 10Operations, 10Core Platform Team (Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MSantos)
[21:44:17] 10serviceops, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (10greg) a:03thcipriani
[21:50:15] fsero: (while noting that you may not be around for a while), so to clarify, the app running in the container exited with status code 2?
[21:50:22] yes
[21:50:26] because that doesn't seem right
[21:50:32] why not?
[21:50:45] well, by the logs, startup seems to have been normal
[21:51:43] I'm not even sure how a status code 2 would be returned
[21:53:02] fsero: for example, I know of an error scenario that would return status 1 (and log loudly)
[21:53:36] well that error code refers to the container, not only the app (i think that was your first comment)
[21:53:40] and checking events.. i do see
[21:53:53] https://www.irccloud.com/pastebin/Yqj2ropl/
[21:54:00] aha!
[21:54:05] I was about to ask
[21:54:24] so a readiness probe failure would explain this?
[21:54:26] you should be able to see that either with kubectl get events
[21:54:30] liveness more likely
[21:54:42] but yeah
[21:54:58] Error from server (Forbidden): events is forbidden: User "sessionstore" cannot list resource "events" in API group "" in the namespace "sessionstore"
[21:55:28] yeah that's something to fix on our end
[21:55:40] and if you create a phab task i'd appreciate it
[21:56:00] it's just a matter of authorization, changing the RBAC rules
[21:59:11] fsero: will do.
[22:00:57] holy crap, I think it worked
[22:02:03] it did
[22:24:42] 10serviceops, 10Scap, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (201907): Deploy scap 3.11.1-1 - https://phabricator.wikimedia.org/T228482 (10thcipriani)
[23:03:29] [mw2250:~] $ scap pull
[23:03:39] File "/usr/lib/python2.7/dist-packages/scap/opcache_manager.py", line 3, in
[23:03:41] from concurrent.futures import ThreadPoolExecutor
[23:03:44] ImportError: No module named concurrent.futures
[23:03:56] that's new .. the scap pull issue is gone but this is there now
[23:26:44] mutante: that was a bug in the scap deb package... I think thcipriani wrote a fix?
[23:27:17] the deb manifest was missing a require for python-concurrent.futures
[23:28:51] bd808: yep, he showed that to me. i actually built the new scap version deb just now. only fighting with a key issue to publish it
[23:29:28] i import the signing key but reprepro can't find it
[23:51:12] fixed that.. new scap version on APT repo now
[23:51:21] trying on mw2250
[23:53:18] indices exported for buster and jessie.. just on stretch it doesn't know yet
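A small sketch of checking the scap fix described above on a host like mw2250: the ImportError points at the Python 2 backport package python-concurrent.futures, which the scap deb apparently did not declare as a dependency. The package and file names here are the standard Debian ones, not quoted from the actual patch.

```sh
# Does the backport import on this host?
python2.7 -c 'from concurrent.futures import ThreadPoolExecutor; print("ok")'
# If not, installing the backport package unblocks scap pull immediately.
sudo apt-get install -y python-concurrent.futures
# The durable fix lives in the package metadata, roughly:
#   debian/control:  Depends: ..., python-concurrent.futures
# After the new deb is published, confirm the dependency is declared:
apt-cache depends scap | grep concurrent
```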