[01:34:56] !help I can't start my webservice and error.log says toolsws.proxy.ProxyException: Port registration failed!
[01:34:56] If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-kanban
[01:37:22] 2020-06-11T01:36:30.483638 Throttled for 3 restarts in last 3600 seconds
[01:37:27] :\
[01:38:07] SigmaWP: which tool? I can try to see if I can figure out what's going on
[01:38:29] the throttle thing would be the grid engine watchdog process
[01:38:47] bd808: it's named sigma
[01:39:06] I accidentally ctrl+c'd a webservice restart earlier
[01:39:08] which i think is the issue
[01:41:44] SigmaWP: do you mind if I try some things?
[01:42:23] bd808: go for it
[01:43:18] SigmaWP: just curious, does your python webservice shell out to run some commands or something? I'm wondering if it could run on kubernetes rather than the job grid
[01:43:39] I tried to move it to kubernetes a few months ago but things blew up
[01:43:53] I'm running uwsgi on Python using a setup that uwsgi-python doesn't like
[01:43:57] so I just use uwsgi-plain
[01:45:19] !log tools.sigma Made $HOME/service.template to simplify webservice commands
[01:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sigma/SAL
[01:46:32] that is a really lame error message we give when the proxy doesn't work :/
[01:47:01] :(
[01:49:12] oh... I think it may be starting this time...
[01:49:41] SigmaWP: it's alive!
[01:50:05] I'm not 100% sure what cleared the error.
[01:50:14] thank you bd808
[01:51:07] I did `webservice stop` and then `rm service.manifest`. That strange dance has been known to help fix problems from the grid watchdog process before
[01:51:28] then I started it back up (my 2nd try) and it worked
[01:51:53] huh.
[01:52:02] interesting
[01:53:00] SigmaWP: I made you a $HOME/service.template file. That will make it easier for you to start in the future. You just have to `webservice start` and the `--backend=gridengine uwsgi-plain` options will be read from the template file
[01:53:06] bd808: lol i ran webservice restart and it's happening again
[01:53:09] oh, handy
[01:53:20] lemme try to fix this myself
[01:54:31] yay
[01:54:32] it worked
[01:54:35] thanks bd808
[01:54:35] if you'd like some help trying to move to kubernetes I would be glad to help with that too. Webservices on the job grid are not my favorite thing to debug. It's a hacked-together system, at least for the proxy part.
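A rough sketch of the two pieces discussed above, for anyone hitting the same restart throttle later. The template key names are an assumption based on the `--backend=gridengine uwsgi-plain` options mentioned in the chat, not a copy of the file bd808 actually wrote:

```
# hypothetical $HOME/service.template so that a plain `webservice start`
# picks up the backend and type options (key names are an assumption):
cat > "$HOME/service.template" <<'EOF'
backend: gridengine
type: uwsgi-plain
EOF

# the "strange dance" described above when the grid watchdog keeps throttling restarts:
webservice stop
rm "$HOME/service.manifest"   # drop the state the watchdog process acts on
webservice start              # options now come from service.template
```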
[01:54:45] maybe another time [01:54:49] sure :) [01:54:53] good night friend [10:14:52] !log tools.zppixbot-test tools.zppixbot-test@tools-sgebastion-08:~$ grep -r -D skip "last_event_at" (in case anything seems slow, may take a while, please don't kill anything while I do it) [10:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL [10:15:29] !log paws added role (just a label) for ingress nodes: `kubectl label node paws-k8s-ingress-1 kubernetes.io/role=ingress` (T195217) [10:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL [10:15:32] T195217: Sane ingress methods for PAWS - https://phabricator.wikimedia.org/T195217 [10:18:57] !log tools.zppixbot-test tools.zppixbot-test@tools-sgebastion-08:~$ grep -r -D skip "last_event_at" (in case anything seems slow, may take a while, please don't kill anything while I do it) END [10:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL [10:29:30] * RhinosF1 got nowhere doing that [11:04:27] !log codesearch restarting everything after gerrit-replica 502s fixed T255094 T255125 [11:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL [11:04:30] T255125: Wikibase not included in codesearch’s “everything” group - https://phabricator.wikimedia.org/T255125 [11:04:30] T255094: gerrit-replica 502/OOM again - https://phabricator.wikimedia.org/T255094 [11:11:06] !log paws deployed nginx-ingress for some early testing (not definitive) with code https://github.com/crookedstorm/paws/commit/bee62b3fd57f9804aa27e7b8b41fde50bd93df94 (T195217) [11:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL [11:11:08] T195217: Sane ingress methods for PAWS - https://phabricator.wikimedia.org/T195217 [11:21:45] !log toolsbeta puppet not working bc puppetdb, run `aborrero@toolsbeta-puppetdb-02:~ $ sudo systemctl restart puppetdb` [11:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [11:27:13] !log toolsbeta puppetdb wasn't the problem. 
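The `kubectl label` above only attaches metadata to the node; something still has to schedule onto it. A minimal sketch of how such a role label is typically checked and then consumed by an ingress controller (the nodeSelector placement is an assumption about the PAWS setup, not taken from the linked commit):

```
# confirm which nodes now carry the ingress role label
kubectl get nodes -l kubernetes.io/role=ingress

# a deployment such as the nginx-ingress controller would then pin itself to
# those nodes with a nodeSelector; hypothetical YAML fragment, shown as comments:
#   spec:
#     template:
#       spec:
#         nodeSelector:
#           kubernetes.io/role: ingress
```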
[11:21:45] !log toolsbeta puppet not working bc puppetdb, run `aborrero@toolsbeta-puppetdb-02:~ $ sudo systemctl restart puppetdb`
[11:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:27:13] !log toolsbeta puppetdb wasn't the problem. The problem is puppet-enc segfaulting in toolsbeta-puppetmaster-03
[11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:32:38] !log toolsbeta apparently every python script segfaults in toolsbeta-puppetmaster-03
[11:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:33:22] !log toolsbeta reboot toolsbeta-puppetmaster-03 to try cleaning up potential kernel/filesystem problems
[11:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:35:59] !log toolsbeta try reinstalling the python3 stack in toolsbeta-puppetmaster-03, because everything python-related segfaults
[11:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:53:20] !log toolsbeta create VM toolsbeta-puppetmaster-04
[11:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:12:04] !log toolsbeta copy over labs/private from toolsbeta-puppetmaster-03 to toolsbeta-puppetmaster-04
[12:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:14:06] !log toolsbeta poweroff toolsbeta-puppetmaster-03
[12:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:14:21] !log toolsbeta try switching all VMs to toolsbeta-puppetmaster-04
[12:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:35:25] !log toolsbeta according to `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:toolsbeta}' 'run-puppet-agent'` we are mostly back in business
[12:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:39:09] !log toolsbeta for the record, k8s etcd servers certificate changed (puppet based) and k8s just kept working
[12:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[12:42:18] !log toolsbeta introduce puppet profile 'toolsbeta-docker-registry' and relocate some hiera config there
[12:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
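For context on the "switching all VMs to toolsbeta-puppetmaster-04" step: on Cloud VPS this is usually a Hiera change plus a certificate reset on each agent. A hedged sketch of what that generally looks like (hostnames from the log above; the exact paths and whether cert signing is needed, versus autosigning, are assumptions):

```
# on each toolsbeta instance, after pointing the project/instance "puppetmaster"
# Hiera key at toolsbeta-puppetmaster-04 (via Horizon):
sudo rm -rf /var/lib/puppet/ssl    # discard certs issued by the dead puppetmaster (path assumed)
sudo run-puppet-agent              # same wrapper the cumin run above invokes

# on toolsbeta-puppetmaster-04, sign the new agent certs unless autosigning handles it:
sudo puppet cert sign --all        # roughly; depends on the puppet version in use
```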
[13:16:06] !log tools.zppixbot auto-update@website: Synced website repo in 45.s
[13:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[13:16:11] there it is
[13:16:38] yey
[13:17:22] MacFan4000: seems to have worked
[13:20:53] * RhinosF1 broke something
[13:33:24] !log tools.zppixbot auto-update@website: Synced website repo in 95.s
[13:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[13:34:46] 95s - wow
[13:42:49] MacFan4000: slow!
[13:43:02] I'll check + add to search console soon
[13:45:01] !log tools.zppixbot added sitemap.xml to search console
[13:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[14:48:25] Hey, I don't know what's not working but https://meet-auth.wmflabs.org/create times out while the DNS proxy (in meet project) is set to meet-auth.eqiad.wmflabs (port 5000) and locally curl works and security groups are okay too (someone said iptables seems to have a problem) but I don't know well enough to fix it :(
[14:49:29] Amir1: I'll take a look
[14:49:38] Thanks
[14:49:50] Is iptables installed on that host on purpose? Typically we just rely on security groups for firewalling
[14:50:06] and, it's the whole sight, not just that one endpoint right?
[14:50:18] sight/site/I never know which of these to use
[14:50:26] haha
[14:50:30] let me check
[14:50:46] it has a puppet module that might install iptables
[14:51:56] https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/meet/accountmanager.pp
[14:51:58] hmm
[14:52:08] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/meet/accountmanager.pp
[14:52:24] the firewall there might be the reason
[14:52:28] it is
[14:52:58] should we just delete it?
[14:53:25] or we can add something to fix it like ferm
[14:53:51] if this role is only ever used on cloud-vps, then just remove the '::profile::base::firewall' line from that
[14:54:44] and if it's used in both prod and cloud-vps, best to fork the role and have one for each I think
[14:55:09] Amir1: no, don't remove it. let's open the port
[14:55:36] mutante: can I leave this in your hands?
[14:55:38] let's avoid 2 separate roles
[14:55:42] andrewbogott: yea
[14:55:48] ok!
[14:56:30] mutante: Thanks. How can I open it? Just a pointer to some doc or example is enough
[14:58:23] mutante: if you open up port 5000 to the cloud-vps proxies doesn't that require some way of having cloud-specific logic anyway?
[14:58:39] I guess you could create a separate role or profile and apply that in addition
[15:02:41] Amir1: https://phabricator.wikimedia.org/P11471 we just need a hostname for the source (replace SOME_HOSTNAME)
[15:03:40] andrewbogott: yea, we would set the srange in Hiera when moving to production. or second best do a $realm check, but definitely avoid separate roles
[15:03:57] ok :)
[15:04:00] i can do it.. just multi-tasking
[15:08:41] no worries. That's why there are multiple people for multiple things :D
[15:19:06] so yea.. we just need to know the hostname (or IP, but hostname is better) that the connection comes from
[15:32:43] mutante: I think it's either proxy-01.project-proxy.eqiad.wmflabs or proxy-02...
[15:34:16] it's both, as they are a hot/warm HA pair
[15:35:37] Majavah: bd808: we just found the existing "cache_hosts" Hiera key
[15:35:58] it has proxy-01, proxy-02 and deployment-cache-text06.deployment-prep for some reason
[15:36:16] so i guess best is we just reuse that and allow all these
[15:36:23] that sounds like something that would be used in deployment-prep, so yeah
[15:36:54] it is in hieradata/cloud/eqiad1.yaml, so not project-specific
[15:36:58] so yea, we can use that
[15:37:00] Amir1: ^
[15:37:18] the cache-text nodes in beta cluster are the ingress varnish/ATS/whatever hosts
[15:38:38] bd808: while you're here, any idea on when you would have time to review my Striker patch? :P
[15:39:40] bd808: since you're here too, how can we put secrets in horizon? Just a hiera config?
[15:40:16] Majavah: "someday" :(( My work life is not good right now
[15:40:54] Amir1: you have to have a project local puppetmaster and apply secrets using local commits to labs/private.git on that host
[15:41:31] I see
[15:41:33] Thanks!
[15:43:07] secret management is a really ugly problem in Puppet in general and it's worse in Cloud VPS projects. Puppet has no concept of multi-tenancy
[15:55:08] * AntiComposite waits for someone to start downloading a data dump to toolforge
[15:56:31] andrewbogott: lookup('cache_hosts') would simply work in both prod and cloud because it's already in hiera in the right places :)
[15:56:57] fair enough :)
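A rough sketch of the diagnosis implied in the thread above, assuming the hostnames and port mentioned in the chat. The actual fix was a puppet change (a ferm rule for port 5000 whose srange comes from the existing cache_hosts Hiera key), which is not reproduced here:

```
# on the meet-auth instance itself, the service answers locally...
curl -sI http://localhost:5000/create

# ...but profile::base::firewall manages iptables via ferm, so only ports that
# puppet explicitly opens are reachable from outside. Check what is allowed:
sudo iptables -L INPUT -n | grep -w 5000 || echo "port 5000 not open in iptables"

# once the ferm rule lands, the web proxies (proxy-01/proxy-02.project-proxy)
# should be able to reach it; from one of them, roughly:
curl -sI http://meet-auth.eqiad.wmflabs:5000/create
```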
[16:15:54] !log admin failing over NFS for labstore1004 to labstore1005 T224582
[16:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:15:57] T224582: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582
[17:17:36] !log admin failing NFS back to labstore1004 to complete the upgrade process T224582
[17:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:17:39] T224582: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582
[17:22:15] !log admin delaying failback labstore1004 for drive syncs T224582
[17:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[18:47:26] !log tools.zppixbot-test kubectl delete pods --all
[18:47:31] Texas: Failed to log message to wiki. Somebody should check the error logs.
[18:47:39] !log tools.zppixbot-test kubectl delete pods --all
[18:47:44] Texas: Failed to log message to wiki. Somebody should check the error logs.
[18:47:57] bd808, arturo: ^
[18:48:34] Texas: incident
[18:48:39] See -operations
[18:48:41] @texas login/logout sessions are on fire right now, see T255179
[18:48:42] T255179: Session failures preventing edits, login, logout, etc - https://phabricator.wikimedia.org/T255179
[18:48:45] oh
[19:15:34] !log tools.stashbot Testing wikitech logging
[19:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[19:16:01] !log tools.zppixbot-test Testing wikitech logging
[19:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[19:16:32] !log tools.zppixbot-test restarted a few times while wikidata was down to deploy various stuff
[19:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[19:17:47] !log tools.zppixbot-test i meant wikimedia/wikitech - my brain was down as well apparently
[19:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[19:17:56] Hi andre__
[19:18:04] o/
[19:18:44] I haven't caught up on backscroll in -operations, but the failure for stashbot was 'Invalid CSRF token.' Not sure if that was MediaWiki deploy train related or not
[19:18:53] no, different outage
[19:19:01] https://phabricator.wikimedia.org/T255179
[19:19:05] bd808: k8s got upset with kask/session
[19:19:20] it got turned off and turned back on again
[19:19:26] !log admin proceeding with failback to labstore1004 now that DRBD devices are consistent T224582
[19:19:30] no, they moved it to another DC :P
[19:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[19:19:31] T224582: Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582
[19:20:27] we're back on eqiad
[19:20:57] kask things shouldn't mess with wikitech. it is still detached from the "production" wiki farm. /me shrugs
[19:21:52] https://github.com/wikimedia/operations-mediawiki-config/commit/fc25ef3b46cd2e7b2ef5ccdf1fa721f673ec382a#diff-42462e209bc9b5d106353a9fff29cd27
[19:22:03] wgSessionCacheType => 'labswiki' => 'kask-session',
[19:23:26] uhh... ok then
[19:23:41] heh
[19:23:53] I'm all for wikitech being "normal", but maybe they didn't actually think about that
[19:23:53] I'm guessing it got dragged over in one of the previous migrations
[19:24:33] T237773 -- we really could finish that any time
[19:25:09] Sorry about the NFS failback not going great. There's still a bug in there :(
[19:25:27] Don't worry
[19:26:32] it is much better than it was, and I think I can fix the bugs fairly tidily
[20:44:01] !log tools.stewardbots tools.stewardbots@tools-sgebastion-07:~$ bash ./stewardbots/StewardBot/restart_stewardbot.sh
[20:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[21:26:46] Hey folks, does anyone here happen to know whether it's Grid Engine itself or some part of the WMCS-specific configuration that does the memory limiting for GE jobs on Toolforge?
[21:36:31] Naypta: yes, the grid engine itself enforces based on limits set and managed by the jsub commands
[21:36:47] Naypta: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Allocating_additional_memory
[21:38:00] bd808: cheers! is the best place to report issues with it upstream at SGE then?
[21:38:19] iirc there's little to no maintenance of that project anymore
[21:38:23] heh, well... sure. But it's mostly abandonware as a project
[21:38:41] Naypta: do you have a particular bug that you think you have found?
[21:39:01] I'm hesitant to call it a bug so much as an "incompatibility"
[21:39:35] effectively the way that golang compiles to binary causes all golang programs to require >500mb assigned memory, even a trivial program, when running on an amd64 platform that uses ulimit -v to limit memory
[21:40:01] there's an open issue on this here https://github.com/golang/go/issues/38010 but it looks like it's going to be closed as a WONTFIX
[21:40:22] or rather, more accurately, the system "thinks" it requires >500mb memory when it doesn't
[21:40:29] try a jvm based tool :) I don't think you can even do a "hello world" to the console without reserving 2G on the grid
[21:42:00] oh good lord
[21:42:18] but on the other hand, that's JVM - it's not exactly a language designed for its memory efficiency :p
[21:42:35] nor is golang ;)
[21:43:16] tell that to google's devs and you'll get some very askance looks ;)
[21:43:45] "But I have 128GB of memory in my work machine. 2GB is nothing"
[21:44:21] I would be glad to tell them that "ulimit -v is an out-dated mechanism for limiting memory use" is a very biased concept
[21:44:55] but that's not going to fix anything. Naypta: just play with -mem xxx until you find the sweet spot you need
[21:45:32] or you could try using the Kubernetes job features, but those are pretty raw and ugly to work with
[21:45:38] yeah, that's what I've been doing - it just frustrates me on a spiritual level to have to muck about with memory limits when running something that ought not to need it :p
[21:45:51] otoh anything but kubernetes
[21:46:06] heh. if you are into go and not k8s...
[21:46:26] you haven't drank all of the goog kool-aid yet
[21:46:37] in fairness, i'm not sure "into" is the best word for it - i *use* go mostly because a potential employer wanted experience with it... :p
[21:46:52] ruby will always have its place in my heart <3
[21:47:00] thanks very much for all the help as always bd808 :)
[21:47:10] yw
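To make the "-mem xxx" advice above concrete, a hedged example for a trivial Go binary on the job grid; the 600m figure is only an illustrative guess at the virtual-memory headroom the Go runtime maps up front (per the golang/go issue linked above), and the binary path is hypothetical:

```
# reserve more virtual memory than the program "really" uses, because the grid
# enforces the limit via ulimit -v and Go reserves a large address space at startup
jsub -N hello-go -mem 600m /data/project/mytool/bin/hello

# if the job still dies, inspect the limits the grid recorded for it, roughly:
qstat -j hello-go | grep -i mem
```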
[22:31:02] !log tools.zppixbot auto-update@website: Synced website repo in 39.s
[22:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[22:32:59] Texas: ^
[22:33:11] K
[22:49:49] bstorm_: you've given me some fun errors to debate handling
[22:50:28] Uh oh. What did I do?
[22:51:13] Besides briefly wreck NFS
[22:51:17] bstorm_: our bot had some right fun during the nfs maintenance. It did things that should be impossible.
[22:51:25] In those few minutes
[22:51:27] Oh dear
[22:51:52] Well, the next failover will be better! I'm going to say that a lot from here on in 😅
[22:52:09] It's been a long-term bug
[22:52:15] But we discovered more
[22:52:30] is that bot doing real-time financial trading or something? The amount of work going into a sopel irc bot is a bit amazing to me
[22:52:39] And now we're debating what on earth it got into its little head
[22:52:45] bd808: we are bot mad
[22:55:04] It turns out a lot of sopel's jobs backend might be broken
[22:55:22] Well, dumber than we thought
[23:03:20] * RhinosF1 is off to bed before he digs too deep
[23:30:13] !log tools.zppixbot-test restarting for config/code changes
[23:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[23:35:02] !log tools rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough
[23:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:47:16] !log tools.zppixbot restarting for config/code changes
[23:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[23:53:26] !log tools.zppixbot restarting again
[23:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[23:58:48] !log tools.zppixbot-test restarting again
[23:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL