mojira.dev
BDS-17527

Multiple server crashes due to memory leak when loading chunks

Server is running mostly default survival settings except for a simulation distance of 6 and a chunk render distance of 32. It had been stable for the past year without issue until around 1 month ago. At first we believed the crashes were caused by entering portals too quickly (going from the nether to the overworld, then from the overworld to the end portal quickly), since some of the crashes were occurring under those conditions, but we have also been getting crashes throughout normal gameplay, when performing normal actions such as placing blocks or even just walking around. The crashes occur in multiple areas: overworld, nether, and end. There doesn't seem to be a specific chunk or group of chunks causing this, since the same areas may be fine one moment and then crash at another. As of now the server will crash at least 4-5 times a day if people are active. Checking memory usage showed steady usage of around 700MB out of the 3GB allocated, so there doesn't seem to be any issue there.

Made the bug public and removed the world file since I confirmed the memory leak occurs with a fresh world. Also adding some extra info from my comments below:

After further testing, this seems to be a memory leak related to chunk loading. If you check the memory while standing still, not loading chunks, it stays stable. When you load chunks, whether it's via crossing portals, flying, etc., the memory use increases and does not come back down, which it should not be doing. Running farms while standing still did not affect the memory usage for us; passive mob farms, redstone mechanisms, etc. did not contribute to the memory leak. Here's a log of our memory use up until the server crashed again.

Crash reports are fairly limited in information, but some of them are attached below.

Linked issues

Attachments

Comments 55

Hi

Were there any changes made to server.properties? Were any files from the root BDS folder deleted? Have you tried reinstalling BDS from a new download?
Can you please look at BDS-17453 and tell us if it relates to your issue?

This ticket will automatically reopen when you reply.

I don't think this is related to BDS-17453 since there is a permissions.json file present. Either way, the server crashes are not on startup; they happen after some time playing, which can vary dramatically. Not sure if that's relevant to the script-watchdog setting.
Attached here is the server.properties file. Not sure what would cause an issue there; I thought it might be outdated since I do not see a script-watchdog setting.

[media]

Sorry for the duplicate comment; I can't seem to add another attachment by editing existing comments. Here's a screenshot of the server directory. If anything, I'm seeing more files than are normally present, so I'll back up the server files, download a fresh install, and try again.

[media]

After further testing, this seems to be a memory leak related to chunk loading. If you check the memory while standing still, not loading chunks, it stays stable. When you load chunks, whether it's via crossing portals, flying, etc., the memory use increases and does not come back down. Running farms while standing still did not affect the memory usage for us; passive mob farms, redstone mechanisms, etc. did not contribute to the memory leak. Here's a log of our memory use up until the server crashed again.

[media]

 

Edit: Tested with a fresh world on the server, issue still persists.

My ticket was merged here, so I'm adding my comments here as requested:

There appears to be a significant memory leak in BDS for Linux. The bedrock_server process continues to grow its memory utilization whenever any activity occurs, and appears to never release that memory. Utilization continues to increase until all available RAM on the host is exhausted, at which point either swap kicks in (resulting in host paging/thrashing) or the process is killed by the kernel's OOM killer, which of course releases all the memory but terminates the BDS service. Either situation results in all players being forcibly disconnected, and could result in database corruption.

The problem appears to be related to loading chunks. It is exacerbated when teleporting. Teleporting to a distant location can trigger significant memory allocations (on the order of 10MB per second) making this problem easier to duplicate. It also exposes an additional aspect of this bug, which I will describe below.

Steps to duplicate (a memory-polling sketch follows this list):
1. Fresh install a current Ubuntu LTS version on a bare metal or virtual server.
2. Download BDS 1.19.21.01. No mods, blank world. Start the process running.
3. Observe utilization of about 300MB initially.
4. Connect to game. Note that RAM increases slightly (that's expected of course.)
5. Teleport to a distant location (e.g. 2000, 80, 2000).
6. Note that RAM increases by 10-20MB within seconds.
7. Teleport back to origin. Wait.
8. Note that RAM is not released.
9. Teleport to the same distant location.
10. Observe that RAM again increases by 10-20 MB within seconds.
11. Repeat steps 7-10. Observe that RAM continues to increase.
12. Disconnect from game. Wait for hours. Observe that allocated RAM is never released.
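For anyone wanting to log the numbers in steps 3-12, here is a minimal sketch of a stand-alone poller, assuming a Linux host where /proc is available. The file name rss_watch.cpp, the one-second interval, and the way the PID is passed are all just illustrative choices, not anything BDS ships with.

    // rss_watch.cpp - sketch: print the resident set size (VmRSS) of a given
    // process once per second so that growth during the teleport test above
    // can be logged. Assumes a Linux /proc filesystem.
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    int main(int argc, char* argv[]) {
        if (argc != 2) {
            std::cerr << "usage: rss_watch <pid>\n";
            return 1;
        }
        const std::string statusPath = std::string("/proc/") + argv[1] + "/status";
        for (;;) {
            std::ifstream status(statusPath);
            if (!status) {
                std::cerr << "process " << argv[1] << " is gone\n";
                return 0;
            }
            std::string line;
            while (std::getline(status, line)) {
                // The line looks like "VmRSS:   723456 kB".
                if (line.rfind("VmRSS:", 0) == 0) {
                    std::cout << line << std::endl;
                    break;
                }
            }
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }

Compile with something like g++ -std=c++17 rss_watch.cpp -o rss_watch and run it as ./rss_watch $(pidof bedrock_server); if the leak described above is present, the printed VmRSS values climb with each teleport and never come back down afterwards.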

I used an Azure server with 2 CPUs, 4GB of RAM, and 32GB of disk, but this has also been tested on larger server configurations with the same result. I used Ubuntu 20 LTS as directed, but I also tested this under OpenSuse 15.3, with the same result.

Following the above steps, I was, within the space of 10 minutes, able to more than double the process' RAM allocation to above 700MB. Clearly I could have continued teleporting, back and forth, until I brought the server down from memory exhaustion.

This highlights a number of points:

1. As C++ has no automated garbage collection mechanism, save for very limited scope-exit recoveries, it is necessary for the program to track and release its own memory. This clearly is not happening. An idle server with no players on it should detect this condition and release unused memory. A chunk which is no longer in use after a period of time should be written back to disk and similarly released from memory. Neither of these things is happening.

2. If you're building an in-memory copy of loaded portions of the database (as is clearly the case here), an in-core index or similar data structure should be used to track which chunks are already loaded and point functions back to them. This clearly is also not happening. Only two chunks (or areas) were being visited in my test: 0,0,0 (the spawn point, or near to it) and 2000,80,2000. Yet each visit to those same chunks, in either direction, caused additional RAM to be allocated by the process; each teleport required, in my case, an additional 10-20MB of RAM to complete. This suggests that multiple copies of the same chunk(s) were being maintained in RAM (clearly without the process realizing it), which amplifies the leak: not only is RAM not being released, but data is being duplicated in RAM, compounding the growth. This may be related to why memory is not being freed: if the process isn't tracking which memory it has allocated and loses the pointer to the allocated memory block(s), it CANNOT release them. That type of thing seems to be happening here.
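As a purely conceptual sketch of the kind of index point 2 describes (not BDS's actual internals; ChunkKey, Chunk, and ChunkCache are made-up names), something along these lines hands back the already-resident copy of a chunk on repeat visits and frees the memory once the chunk is dropped from the index:

    // Conceptual sketch only, not BDS internals: an in-core index keyed by
    // chunk coordinates. Repeat visits to the same chunk return the copy that
    // is already resident instead of allocating a new one, and dropping a
    // chunk from the index releases its memory as soon as no other owner
    // still holds the shared_ptr.
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <memory>
    #include <unordered_map>

    struct ChunkKey {
        std::int32_t x, z;
        std::int32_t dimension;
        bool operator==(const ChunkKey& o) const {
            return x == o.x && z == o.z && dimension == o.dimension;
        }
    };

    struct ChunkKeyHash {
        std::size_t operator()(const ChunkKey& k) const {
            const std::int64_t xz = (std::int64_t(k.x) << 32) ^ std::uint32_t(k.z);
            return std::hash<std::int64_t>{}(xz) ^ (std::size_t(k.dimension) << 1);
        }
    };

    struct Chunk { /* block data, entities, ... */ };

    class ChunkCache {
    public:
        // Return the already-loaded chunk if present; otherwise load it once.
        std::shared_ptr<Chunk> acquire(const ChunkKey& key) {
            auto it = loaded_.find(key);
            if (it != loaded_.end())
                return it->second;                  // no duplicate allocation
            auto chunk = std::make_shared<Chunk>(); // load from disk here
            loaded_.emplace(key, chunk);
            return chunk;
        }

        // When a chunk falls out of range, drop it from the index; the memory
        // is released once nothing else still holds the pointer.
        void release(const ChunkKey& key) { loaded_.erase(key); }

    private:
        std::unordered_map<ChunkKey, std::shared_ptr<Chunk>, ChunkKeyHash> loaded_;
    };

With a structure like this in place, teleporting back and forth between the same two areas would reuse the cached entries rather than allocating fresh copies each time, and an unloaded chunk's memory could actually be returned because the owning pointer is never lost.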

3. This obviously exposes the potential for a denial-of-service-style attack against a BDS. If a world has a malicious operator, or has set up (for example) command blocks that let regular users teleport, then repeated teleporting by players, whether maliciously in quick succession or simply over time as the server process continues to run, will speed up memory consumption and hasten the crash of the server process itself. Again, this applies to any in-server action: the more activity, the faster the RAM exhaustion appears to occur.

Note here that this bug is NOT about teleporting itself: memory usage increases whenever new chunks are loaded via ANY method, and that memory appears to never be released. Teleporting simply speeds things up and makes the problem more visible. Even with teleporting disabled, BDS on Linux slowly grows in RAM size and never releases any of it, eventually exhausting the server's available resources and causing a crash of some kind.

I've been off the test instance for an hour as I post this bug, and I firewalled it off so nobody could get in. It's still at 705.64MB - after being unused by anyone for an hour - and will continue that way until it's reset.

45 more comments

Just chiming in to say I'm experiencing this too. The only difference is that for me, the server hangs instead of crashing while a process called "kswapd" climbs to 100% CPU usage and stays there. Sometimes it resolves itself after a few minutes; sometimes it doesn't and requires me to restart my cloud VPS. I think this is because the VPSs are provisioned for me with swap enabled, so I get a hang instead of a crash.

From the looks of the comments here, other people have more experience with debugging steps, providing logs, etc. than I do, so I'll hold off on adding mine until asked. I'll subscribe to this issue for updates.

Thanks!

jtp10181: the Bedrock team does not use the "Assignee" field on this bug tracker. You can see that the report is being tracked internally by Mojang when it has a number in the ADO field.

@GoldenHelmet, fair enough. It would be nice to get some sort of an update, though. Is it actively being worked on? Can they reproduce it? Do they need anything from the users? Even a response acknowledging the problem and saying they cannot figure out what is causing it would be better than totally ignoring everyone. I imagine there are thousands of people experiencing this bug, but they either have no idea why their server is crashing or just haven't bothered to search around and find this report. I found it with a search while trying to figure out if it was something I did wrong and could fix.

Pretty sad this has been on here for just about a year now with only two responses from "Mojang" in that entire time.

Looks like this issue has finally been fixed in the latest preview, according to the changelog! The next stable release should bring the fix to everyone. “Only” took a year, but at least they fixed it. https://feedback.minecraft.net/hc/en-us/articles/18619357250701-Minecraft-Beta-Preview-1-20-30-22

I feel like they just ignored it until people recently started to pester them about it more. It seems like they were very confused about how to replicate the issue up until at least the last Mojang post on 6/16/23. Not sure how, though, since all you had to do was load a server and play on it.

Either way, glad a fix is finally coming. Will be interesting to see how it runs after the fix. I had to bump up the RAM allocation on my VM just to deal with this issue. Will have to keep an eye on the release notes to watch for the update.

migrated

(Unassigned)

Community Consensus

crash, server
