[11:20:36] arturo: if you're around: two NFS shares are acting up, which locks up prometheus, which brings the load average to insane values, which means SGE is no longer scheduling jobs -- see https://phabricator.wikimedia.org/T217472
[12:01:56] labstore1006 and labstore1003 seem to have returned, and load averages are dropping quick
[12:05:57] hello - I could login to login-stretch but I got an error message I never saw before: "groups: cannot find name for group ID 50062"
[12:06:02] any idea?
[12:08:55] hauskatze: that's.. odd.
[12:10:27] hauskatze: group 50062 is `project-bastion`
[12:10:48] and I indeed get the same warning when I login
[12:11:37] valhallasw`cloud: hi, long time no see :) - well, I have not much idea about the internals of tools so I though I should inform :)
[12:12:20] hauskatze: https://phabricator.wikimedia.org/T217280 seems related
[12:13:06] looks like it, yes
[12:13:11] I'll try to become a tool
[12:13:49] looks like I could, but I got the same error
[12:14:01] hi, please see T217473
[12:14:02] T217473: labstore1006 spontaneous reboot - https://phabricator.wikimedia.org/T217473
[12:14:34] gtirloni: thanks!
[12:14:59] np
[12:15:08] ah, I was just about to ask if we should find an op :)
[12:15:12] labstore1003 seems fine though, I wonder if it's related. do you see any errors for scratch?
[12:16:54] gtirloni: not anymore, but `ls /mnt/nfs/labstore1003-scratch` was hanging as well (but seems healthy now)
[12:18:57] (which labstore1003.eqiad.wmnet:/scratch is mounted on)
[12:19:36] Krenair: bd_808 is oncall but we didn't get any pages for this O_o - created T217474
[12:19:36] T217474: labstore1006 nfsd not started after reboot - https://phabricator.wikimedia.org/T217474
[12:19:58] valhallasw`cloud: got it, that's weird. I wonder if there's a dependency on labstore1006 somehow... really odd.
[12:20:27] gtirloni: prometheus was hanging as it was trying to access NFS... so I think no stats were pushed to graphite et al
[12:21:17] (that doesn't explain why labstore1006 going down did not trigger any alerts, though)
[12:26:56] got it, hopefully on monday we can discuss this with the rest of the WMCS team and figure out why everything would go down because of labstore1006.. I'd expect jobs to at least get scheduled, maybe it's another SPOF
[12:31:57] gtirloni: as I understand it, SGE (by default) schedules jobs based on load averages. If the load average is above some value (I'm guessing the number of CPUs), no jobs will get scheduled
[12:32:24] oh I see, so it could be working as designed then
[12:32:43] but jobs waiting for NFS are counted in the load average, so 160 Prometheus threads waiting for NFS causes a load average of 160, etc.
[12:33:13] it might be possible to reconfigure SGE to ignore waiting-for-NFS-jobs
[16:05:56] It's not possible, or desirable, to reconfigure SGE to ignore high load, though. It'd be best to configure prometheus to somehow timeout instead of filling up the process table.
[16:09:27] bstorm_: my understanding was that SGE could also be configured to use a different load metric (e.g. CPU usage rather than 'load'). But I agree that holding off on submitting jobs on high loads is probably still a good idea if it's NFS-induced.
[16:09:56] It sorta can.
[16:10:01] Some are hard coded in.
[16:10:08] But you can screw round with it a lot
[16:10:13] obviously
[16:10:28] However, SGE is tightly integrated with NFS, so...
[16:11:16] * bstorm_ goes back in time and tells the would-be inventors of NFS to go outside and enjoy the natural world instead.
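Editor's note on the load-average point above: on Linux the load average counts runnable tasks (state R) plus tasks in uninterruptible sleep (state D), and threads blocked on a hung NFS mount typically sit in D. The sketch below is not something run during the incident. It is a minimal illustration, assuming a Linux-style /proc, and the helper names (`task_state`, `count_states`, `load_contributors`, `read_task_stats`) are made up for this note.

```python
import os
from collections import Counter


def task_state(stat_text: str) -> str:
    """Return the one-letter state from the text of a /proc/.../stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST ')' instead of on whitespace.
    return stat_text[stat_text.rindex(")") + 1:].split()[0]


def count_states(stat_texts) -> Counter:
    """Tally task states (R, S, D, ...) over an iterable of stat texts."""
    return Counter(task_state(text) for text in stat_texts)


def load_contributors(states: Counter) -> int:
    """Tasks that count toward the Linux load average: running + uninterruptible."""
    return states["R"] + states["D"]


def read_task_stats(proc_root: str = "/proc"):
    """Yield the stat text of every thread; tasks that vanish are skipped."""
    try:
        pids = [name for name in os.listdir(proc_root) if name.isdigit()]
    except OSError:
        return  # no /proc here (not Linux), nothing to report
    for pid in pids:
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were iterating
        for tid in tids:
            try:
                path = os.path.join(task_dir, tid, "stat")
                with open(path, errors="replace") as handle:
                    text = handle.read()
            except OSError:
                continue  # thread exited
            if text:
                yield text


if __name__ == "__main__":
    states = count_states(read_task_stats())
    print(f"running (R): {states['R']}, uninterruptible (D): {states['D']}")
    print(f"tasks counted toward load right now: {load_contributors(states)}")
```

Threads are read from /proc/<pid>/task/<tid>/stat rather than /proc/<pid>/stat because the load average is per task, so 160 blocked threads of one process show up as 160 entries.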
[18:55:54] !log quarry block spammer https://quarry.wmflabs.org/Twc93521 `INSERT INTO user_group (user_id, group_name) VALUES (3734, "blocked");`
[18:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[19:00:01] spammers on quarry, now I've seen everything
[19:31:00] !log tools.nlwikibots Cleaning up old (2017-2018) log files
[19:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.nlwikibots/SAL
[21:37:09] !log tools.nlwikibots Converted tvpupdater & archivering. Also upgraded the latter to Python 3 & pywikibot-core.
[21:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.nlwikibots/SAL