[00:15:33] !log tools.stashbot Restarted irc bot
[00:16:42] ugh. now the elasticsearch cluster is freaking out?
[00:16:59] bd808: Failed to log message to wiki. Somebody should check the error logs.
[00:28:15] !log tools.stashbot Restarted irc bot again
[00:28:16] bd808: Failed to log message to wiki. Somebody should check the error logs.
[00:37:41] !log tools.stashbot test
[00:37:42] bd808: Failed to log message to wiki. Somebody should check the error logs.
[00:44:28] !log tools.stashbot test
[00:44:29] bd808: Failed to log message to wiki. Somebody should check the error logs.
[01:11:31] Umm. How do I log in to the old environment? Seems ssh login.tools.wmflabs.org already points to the new one.
[01:12:26] Geerlings: login-trusty.tools.wmflabs.org
[01:12:50] Thanks. I should have remembered that. Lazyweb. ;)
[01:16:52] !log nlwikibots Removed the last cronjob from login-trusty (tvpmelder, replaced with tbpmelder on stretch)
[01:16:53] Geerlings: Unknown project "nlwikibots"
[01:16:53] Geerlings: Did you mean to say "tools.nlwikibots" instead?
[01:17:08] !log tools.nlwikibots Removed the last cronjob from login-trusty (tvpmelder, replaced with tbpmelder on stretch)
[01:17:08] Geerlings: Failed to log message to wiki. Somebody should check the error logs.
[01:17:20] Ok I give up. :)
[01:18:13] It's broken
[02:02:12] !log tools.stashbot test
[02:02:13] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:03:38] !log tools.stashbot test
[02:03:38] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:03:50] !log tools.stashbot test
[02:03:50] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:03:51] !log tools.stashbot test
[02:03:51] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:03:52] !log tools.stashbot test
[02:03:53] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:03:53] !log tools.stashbot test
[02:03:53] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:08:26] !log tools.stashbot test
[02:08:27] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:08:42] !log tools.stashbot test
[02:08:42] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:24:34] !log tools.stashbot test
[02:24:35] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:29:09] !log tools.stashbot test
[02:29:10] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:32:11] !log tools.stashbot test
[02:32:12] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:35:25] !log tools.stashbot test
[02:35:25] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:36:36] !log tools.stashbot test
[02:36:37] tgr: Failed to log message to wiki. Somebody should check the error logs.
[02:37:41] !log tools.stashbot test
[02:37:41] tgr: Failed to log message to wiki. Somebody should check the error logs.
[03:02:55] !log wmflabsdotorg Update wmflabs.org A record itself to point at eqiad1-r proxy IP which appears to be redirecting it properly - old main region IP no longer responding
[03:02:56] Krenair: Failed to log message to wiki. Somebody should check the error logs.
[07:03:37] FYI en-wiki is loading slowly for me, pings are hit-and-miss - timing out...
[09:52:55] Oshwah: this is not the right place to report about that, perhaps better #wikimedia-operations
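For context on the two bastion hostnames mentioned above, a minimal sketch of reaching the old Trusty environment versus the default one, assuming an existing developer shell account; "shelluser" is a placeholder, not a name from the log:

    # Default bastion, which now points at the new (Stretch) environment:
    ssh shelluser@login.tools.wmflabs.org
    # Old Trusty environment, still reachable via its dedicated bastion:
    ssh shelluser@login-trusty.tools.wmflabs.org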
[10:57:22] Strange! None of my (persondata) cronjobs got started this night
[12:55:21] inserting a job on the grid seems also fskd
[12:55:38] ls and become commands are very slow as well
[12:56:37] everything is borked lately it seems
[12:56:57] arturo: can you take a look?
[12:57:11] not even `tail`
[12:57:16] well
[12:57:20] hey
[12:57:48] what's wrong with toolforge? is there anything specific I can test to reproduce the issue?
[12:58:08] hauskatze: ^
[12:58:14] arturo: so, `become` is very slow, you can't insert a job to the grid
[12:58:15] hauskatze: let's start by which cluster. stretch ?
[12:58:24] sge07 yes
[12:58:33] sgebastion-07
[12:58:54] arturo: there's also a task about cron jobs not being launched since midnight
[12:59:09] even Ctrl + C to cancel something takes a lifetime
[13:00:03] https://www.irccloud.com/pastebin/nR4E7lzD/
[13:00:38] God what is wrong with NFS
[13:00:45] Everything
[13:00:50] tools.mabot@tools-sgebastion-07:~$ qstat <-- after 5', no response yet
[13:00:51] Zppix: ok.. lemme guess..
[13:00:57] maps.wmflabs.org
[13:01:17] which basically recently regained write access to nfs, so now it is processing tons of tiles...
[13:01:32] if that is not fully isolated from tools...
[13:01:33] also, I see the OOM killer entered the stage several times
[13:01:44] maybe maps needs its own nfs server?
[13:02:05] even better, maybe we need to get rid of NFS servers :-)
[13:02:13] didn't we just switch to NFS
[13:02:25] No
[13:02:46] OOM killer is killing jobs from `uid=13778(ibrahemqasim)` by means of systemd slice limits
[13:02:53] If we got rid of NFS what would we switch to?
[13:03:19] We've been collectively hoping to get rid of NFS for a while
[13:03:28] Zppix: we are looking forward to some improved shared/network storage mechanisms, including ceph
[13:03:44] what's blocking that?
[13:03:53] i can stop rendering on maps and see if that changes something.
[13:04:03] but I'd rather do that if we have some sort of graph that we can observe.
[13:04:37] i have graphs of the maps cluster, but i'd rather see a graph that shows the problems that people experience.
[13:04:44] Zppix: a few things, but the team being under-resourced would be my first answer
[13:05:05] Zppix: ^^^ what chicocvenancio said
[13:05:37] "ibrahemqasim" euh.. who's that?
[13:06:45] considering we also have gerrit/phab stuff, maybe we should also keep an eye on tool compromises as potential causes.. just sayin.
[13:07:07] thedj: sure, also, could you please briefly stop the maps thing and see what happens?
[13:07:23] arturo: are you able to observe results ?
[13:07:29] cause then sure.
[13:07:48] let me check for the proper grafana dashboard
[13:08:06] maps is here: https://grafana-labs.wikimedia.org/dashboard/db/cloud-vps-project-board?orgId=1&var-project=maps&var-server=maps-tiles1
[13:08:23] but that likely won't tell the full story.
[13:08:32] this is for labstore1004 (the nfs-tools-project.svc.eqiad.wmnet server)
[13:08:33] https://grafana.wikimedia.org/d/000000568/labstore1004-1005?orgId=1
[13:08:52] k. stopping renderer daemon and apache host.
[13:09:43] but I see some toolforge nodes consuming a lot of bandwidth
[13:09:56] let me see if I can trace it back to a single tool
[13:15:46] well that did drop the 1min loadavg to almost 0 ;)
[13:16:58] the maps thing, thedj?
[13:18:05] also, I see the montage-dev tool is writing heavily to the NFS server
[13:18:27] arturo: yeah i stopped it at 13:08 utc and labstore1004 loadavg has been dropping noticeably since that timeslot.
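A rough sketch of how the "trace it back to a single tool" step above might be done, assuming admin (root) access to the NFS server and the Toolforge exec nodes; the interface name is illustrative, and iftop/iotop are generic Linux utilities rather than anything Toolforge-specific:

    # On the NFS server (e.g. labstore1004): see which client IPs push the most traffic.
    sudo iftop -n -i eth0
    # On a suspect client node: list only the processes currently doing I/O,
    # in batch mode, sampled three times.
    sudo iotop -o -b -n 3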
[13:19:00] yes, I see that
[13:19:25] r/w ops seem similar to before though.. that's weird.
[13:19:38] but the NFS server shouldn't allow tools to hammer it this way
[13:19:58] also.. r/w ops on 04 are much lower than on 05, yet 04 has much higher loadavg
[13:20:05] that's weird
[13:20:19] saturated connection ?
[13:20:48] I thought it was a per-node limit with nfs, arturo
[13:20:50] you can't trust that r/w graph, because it includes drbd
[13:21:22] drbd?
[13:21:48] drbd is a mechanism to sync storage data between a pair of servers
[13:22:03] and we use it on the nfs servers to implement redundancy
[13:22:35] chicocvenancio: we have several mechanisms, one being network throttling per VM instance
[13:23:01] there just was a pretty big spike, and that sure wasn't maps..
[13:23:16] 33.7 Mbps
[13:23:25] on 1005
[13:23:25] but some r/w operations don't require big network bandwidth, yet they can hammer the NFS server
[13:23:57] thedj: 1005 is currently the backup, so network traffic there is mostly drbd
[13:24:00] AFAIK
[13:24:06] ah k.
[13:24:28] that makes sense then
[13:24:47] too bad Marco left...
[13:25:51] now I see `tools-sgewebgrid-generic-0903` talking to NFS at about 2MB/s
[13:26:55] `python core/pwb.py reflinks.py -start:! -always -ignorepdf` is writing heavily there
[13:28:56] arturo: want me to start maps again, to check if we see loadavg go up again on 1004 ?
[13:29:07] thedj: sur
[13:29:09] e
[13:29:59] started
[13:33:41] I think i can see the 1min loadavg starting to climb again...
[13:36:51] ok, lemme restart the layer renderer, which was also running when i stopped everything.. Maybe that makes the load jump to those insane proportions of half an hour ago again..
[13:37:42] started
[13:40:21] arturo: yup, looks like that makes it spike
[13:41:12] gonna keep it running a little bit longer, just to get a nice graph for documentation purposes.
[13:41:13] * arturo nods
[13:45:30] stopped the layer renderer again (request renderer still running)
[13:45:59] I don't think your tool is doing anything wrong, thedj
[13:46:35] "wrong" and "influences something" are two different things of course ;)
[13:47:09] we could stop every single tool in toolforge and the NFS server would still be weird :-P
[13:47:19] haha
[13:49:33] arturo: btw.. the labstore1005 RW await panel in grafana is accidentally named labstore1004 RW await
[13:50:27] fixed thedj
[13:51:26] hmm, lemme try to ionice this render_old script...
[13:52:00] it's probably walking the entire directory tree of the tiles to check the timestamps...
[13:56:00] k running with an ionice of -c 3 -n 19 now.. let's see what happens
[13:56:20] !log tools T218649 rebooting tools-sgecron-01
[13:56:23] arturo: Failed to log message to wiki. Somebody should check the error logs.
[13:56:23] T218649: Grid jobs don't run as of Tue, March 19, 2019 - https://phabricator.wikimedia.org/T218649
[13:57:25] euh it started rising again at 13:49.. that sure wasn't maps...
[13:58:32] when brooke is awake she will probably have better ideas of what's going on
[13:58:41] load is at 50 again...
[13:59:51] k. i'll stop the renderers for safety. will keep the apache host serving stale tiles.
[14:00:11] can ping me later if you need me.
[14:00:32] renderers stopped again.
[14:02:01] !log wikilabels stopping the web service https://phabricator.wikimedia.org/T217922
[14:02:03] halfak: Failed to log message to wiki. Somebody should check the error logs.
[14:02:59] Arg.
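The "ionice of -c 3 -n 19" run above amounts to putting the renderer in the idle I/O scheduling class, so it only gets disk time when nothing else wants it. A minimal sketch, assuming a hypothetical render_old.sh wrapper for the script being discussed; note that the idle class (-c 3) ignores the priority level, so chaining a separate nice is one way to also drop CPU priority:

    # Idle I/O class: the renderer only gets disk time when no other process needs it.
    # nice -n 19 additionally lowers its CPU priority to the minimum.
    ionice -c 3 nice -n 19 ./render_old.sh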
[14:05:00] halfak: stashbot hasn't been working for a while now
[14:05:37] That's not true
[14:05:45] It's logging to some sources, but not to the wikitech SALs
[14:06:19] that's what i meant, Reedy
[15:52:52] bstorm_: want me to restart maps nfs usage ?
[15:53:48] Uh...what's the context :)
[15:54:05] https://phabricator.wikimedia.org/T218649#5036108
[15:54:24] maps mounts a different server in general
[15:54:42] unless you are talking about toolforge...since I know maps bleeds into there too
[15:54:57] since you were actively ticking off potential problems, i figured i could restart that thing just so you can make sure it didn't have something to do with it.
[15:55:20] bstorm_: yeah, we were more thinking of something like cross-contamination in terms of hardware resourcing or network congestion.
[15:55:21] Ah ok :) What was going on there? Was it tools crons or...
[15:55:39] no, it's fully in a separate project
[15:55:50] Ok thanks. It should be ok. I've collected a lot of data. That should be traffic on virt interfaces, but the NFS server used is separate
[15:56:27] but since it uses nfs aggressively, and we did have an interesting drop in load when i disabled it, we figured we'd play it safe
[15:57:08] Yeah, overall that should only affect labstore1003, but who knows? I suspect load is not the problem at this point.
[15:57:51] k. putting 'full' load on it again as i had earlier today.
[15:58:11] I could run some bot scripts and use qstat/jsub fine this time :)
[15:58:35] However OAuth for wikitech looks broken -- I think there's a task. Maybe I'll temporarily disable the bot there
[15:58:55] hauskatze: might have something to do with the gerrit outage...
[15:59:17] I have some very vague theories...and it still could be linked to that filesystem problem yesterday, slowly building problems on a client until it just can't function.
[15:59:55] hauskatze: yeah, OAuth to wikitech is messed up still. I hope we will figure out a solution today, but no guarantees
[16:00:04] OAuth is a work in progress ...yeah, that
[16:00:34] bd808: no problems - MABot just fixes redirects on wikitech and we hadn't had much to fix there lately :)
[16:01:17] we found a new dependency cycle in AuthManager that is an interestingly complex case to resolve in the early session initialization phase
[16:01:48] if there is an urgent problem, we could come up with a horrible workaround (maybe disable $wgBlockDisablesLogin for API requests with an OAuth header, and then manually redo it in a hook that runs after Setup.php has finished, like the API module disable hook)
[16:02:45] it's hard to fix the issue in general in core, not so hard to fix our specific instance of it by messing around in wiki config
[16:02:46] bstorm_: i'll note that loadavg is rising again ;)
[16:03:52] not dangerously yet though, but still
[16:05:28] On which server?
[16:05:41] 1004
[16:05:50] It's low
[16:06:10] That server has a problem in the kernel with DRBD that artificially inflates load numbers in general.
[16:06:12] It's awful
[16:06:27] That and it has 64 cores...so "high load" is 50+ at least
[16:06:33] it's at 10 at this second
[16:06:39] so...🤷🏻
[16:06:46] The load spikes a lot
[16:06:57] most of that is DRBD and NFS interaction
[16:07:04] k
[16:07:21] That kept me busy for months last year. I'm at the point where I am just hoping to upgrade it away soon.
[16:07:53] We used to keep load under 20. That changed after a kernel upgrade...
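A small sketch of the kind of checks behind the load discussion above, assuming admin access to labstore1004; /proc/drbd is the status file for DRBD 8.x, which may or may not match what these servers actually run:

    uptime            # 1/5/15-minute load averages
    cat /proc/drbd    # DRBD connection and sync state (DRBD 8.x)
    iostat -x 5 2     # per-device utilization and await, two 5-second samples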
[16:08:04] Now I'm happy when it is under 100
[16:08:22] a load of 10 means it's quietly thinking to itself
[16:08:26] 😉
[16:08:41] bstorm_: Thanks for your work btw. you've been very visible and communicative on various issues i recently just happened to stumble into. That's appreciated.
[16:09:02] :)
[16:14:04] thedj: there's a brief network blip coming on maps nfs in a few moments. It should barely be picked up by the servers. If you are watching that right now, you might notice. Otherwise, it's not something anyone is likely to notice.
[16:17:21] that's done
[17:39:15] !log git upgrading icinga2 icinga2-bin icinga2-common icinga2-doc icinga2-ido-mysql on gerrit-mysql.git via apt full-update
[17:39:16] Zppix: Failed to log message to wiki. Somebody should check the error logs.
[17:44:16] !log wmflabsdotorg Deleted orphaned DNS records codesearch-sourcegraph1, migrat4-testone, migrat4-testtwo, sourcegraph1, sourcegraph11 T218633
[17:44:20] Krenair: Failed to log message to wiki. Somebody should check the error logs.
[17:44:21] T218633: Sort out orphaned proxy entries - https://phabricator.wikimedia.org/T218633
[17:46:49] log project-proxy Deleted orphaned DNS records ase.wikipedia, math.testme, mathoid.testme, mws.testme, openid-wiki.instance-proxy, structured.wikiquote T218633
[17:46:53] !log project-proxy Deleted orphaned DNS records ase.wikipedia, math.testme, mathoid.testme, mws.testme, openid-wiki.instance-proxy, structured.wikiquote T218633
[17:46:55] Krenair: Failed to log message to wiki. Somebody should check the error logs.
[18:00:24] !log wmflabsdotorg Updated m and www records to CNAME to proxy-eqiad1 instead of proxy-eqiad (which was an A record to a now unused IP), found in review at T218633
[18:00:29] Krenair: Failed to log message to wiki. Somebody should check the error logs.
[18:00:30] T218633: Sort out orphaned proxy entries - https://phabricator.wikimedia.org/T218633
[18:00:43] !log wmflabsdotorg Removed old proxy-eqiad A record pointing at no-longer-used IP T218633
[18:00:45] Krenair: Failed to log message to wiki. Somebody should check the error logs.
[18:17:31] are there any known issues with logging into horizon?
[18:18:12] yup
[18:19:37] urandom: https://lists.wikimedia.org/pipermail/cloud-announce/2019-March/000146.html
[18:20:07] existing sessions are still valid, which is fortunate because Horizon sessions are famously long-lived 😶
[18:20:27] thanks
[19:03:30] !log admin-monitoring deleted archived metrics from toolsbeta project
[19:03:31] gtirloni: Failed to log message to wiki. Somebody should check the error logs.
[19:05:24] !log admin-monitoring deleted iostat metrics from tools-workers servers (polluted by docker devicemapper data)
[19:05:24] gtirloni: Failed to log message to wiki. Somebody should check the error logs.
[19:30:35] !log git install h2database on gerrit-test3 (to investigate web_sessions)
[19:30:42] paladox: Failed to log message to wiki. Somebody should check the error logs.
[19:31:30] !log git nvm it's not an actual package (only source)
[19:31:30] paladox: Failed to log message to wiki. Somebody should check the error logs.
[20:11:48] !log tools.gerrit-reviewer-bot GRB has been converted to Stretch (queue festive music)
[20:11:49] valhallasw`cloud: Failed to log message to wiki. Somebody should check the error logs.
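One way to sanity-check the wmflabsdotorg record changes logged above from any machine, using plain dig; the expected answers in the comments are illustrative rather than taken from the log:

    dig +short www.wmflabs.org CNAME   # should now answer with the proxy-eqiad1 name
    dig +short wmflabs.org A           # should answer with the eqiad1-r proxy IP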
[20:23:52] And that's all my tools converted *performs minor victory dance*
[20:24:28] nice valhallasw`cloud :)
[20:26:08] * bd808 gives valhallasw`cloud a stroopwafel
[20:44:08] !logs maps redirect [abc].tiles.wmflabs.org to tiles.wmflabs.org
[20:45:35] !log maps redirect [abc].tiles.wmflabs.org to tiles.wmflabs.org
[20:45:36] thedj: Failed to log message to wiki. Somebody should check the error logs.
[22:34:47] I'm trying to re-start my tools in the new Stretch environment. I was able to stop the mediaviews-api webservice, but I am not able to start it up again due to "Could not find a public_html folder or a .lighttpd.conf file in your tool home." I feel like I'm missing a step because surely this worked before.
[22:35:33] what do you have ;)
[22:36:19] It's a kubernetes-backed webservice.
[22:37:16] I'm trying to follow https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_a_grid_engine_webservice , which may be a problem because I'm converting from Kubernetes (Ubuntu) to Kubernetes (Debian), with no Grid Engine involvement
[22:38:07] nah, should work
[22:38:23] do you have your public_html dir though ?
[22:39:00] This tool never had a dir called "public_html". It has a different directory structure based on what is recommended on Wikitech for Python webservices.
[22:39:24] So the code lives in ./www/python/src
[22:39:27] ah it's python
[22:39:52] hare: webservice --backend=kubernetes python start
[22:39:59] hare: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Python_(uWSGI)
[22:40:12] Ahhhh. That's what the [type] thing was.
[22:40:33] (Was I supposed to know better? Sincere question.)
[22:41:37] hare: yes, in theory you should know how your webservice works ;)
[22:42:03] !log tools.projanalysis Stopped webservice at request of owner
[22:42:04] That's fair. Do keep in mind I'm of the "sets up service once and leaves it alone for the next 5 years" persona.
[22:42:04] bd808: Failed to log message to wiki. Somebody should check the error logs.
[22:42:05] can/should you store type in .webservicerc btw ?
[22:42:11] thank you bryan!
[22:43:41] thedj: it's not a horrible idea if you use --backend=kubernetes a lot
[22:44:16] i meant python vs php
[22:45:11] * thedj doesn't see a page describing what .webservicerc actually supports...
[22:45:23] thedj: ah. right now you can't do that, but I would like to change that at some point
[22:46:37] thedj: the only doc for it is -- https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Configuring_a_default_backend
[22:46:49] and today that's all it can do
[22:57:58] Anyways, I have now switched over all my tools. It was actually quite easy. Yay!
[22:58:17] * hare braces for all his jobs failing; pre-emptively emails Earwig
[23:58:42] !log codesearch rolling restart of everything, issues probably due to Gerrit downtime (T218706)
[23:58:45] legoktm: Failed to log message to wiki. Somebody should check the error logs.
[23:58:45] T218706: Deployed repos search returns no results - https://phabricator.wikimedia.org/T218706
[23:58:51] uh bd808 ^
[23:59:05] legoktm: known
[23:59:09] ack
[23:59:40] it's fallout from auth changes on wikitech. Should have it fixed by EOD tomorrow (I hope)
[23:59:56] makes sense
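To tie together the webservice exchange above: a minimal sketch of the restart sequence for a Python (uWSGI) tool on the Kubernetes backend, following what bd808 describes; mediaviews-api stands in for whichever tool is being moved, and the stop step assumes the old webservice is still registered:

    become mediaviews-api                          # switch to the tool account on the bastion
    webservice --backend=kubernetes python stop    # stop the currently registered webservice
    webservice --backend=kubernetes python start   # start the python type, serving code from ~/www/python/src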