[MC-162522] Setting render distance to values >8 chunks freezes the game for >1 second (minutes for higher values)

The bug

This already happened in earlier versions, but ONLY when VBOs were disabled (which only happened with some GPU drivers, for example the Linux Intel GPU drivers). That option was removed in one of the recent versions, so this shouldn't be an issue anymore.

I can reproduce it reliably every time now immediately after entering the world, but in the initial few tests, it appeared to happen only after a while of flying around the world (I don't know what actually triggered it).

This was not an issue in 19w37a; it appears to have been introduced in 19w38a.

It seems to coincide with MC not displaying the GPU vendor and OpenGL version properly (screenshots showing this are attached; it does show up correctly in the crash report output).

Running visualvm sampler shows that all of the time is spent in RenderSystem.glGenBuffers and subsequently GL15C.nglGenBuffers (sampler snapshot attached). As far as my understanding goes, this is very likely caused by requesting gl buffers separately for each render section, instead of all at once.

Update: opengl version and vendor not being shown is a completely separate thing that doesn't affect anything. Appearing in the same snapshot was a coincidence.

The reason

After a bit of testing in a mod development environment with the latest snapshot, I was able to find a fix/workaround in the code in GlStateManager. It's not exactly the best solution, but it works really well in practice and requires very few code changes:

private static final int GENBUFFERS_BATCH = 2048;
private static IntBuffer BUFFERS;

public static int genBuffers() {
    RenderSystem.assertThread(RenderSystem::isOnRenderThreadOrInit);
    if (BUFFERS == null || !BUFFERS.hasRemaining()) {
        if (BUFFERS == null) {
            BUFFERS = BufferUtils.createIntBuffer(GENBUFFERS_BATCH);
        }
        BUFFERS.rewind();
        GL15.glGenBuffers(BUFFERS);
        BUFFERS.rewind();
    }
    return BUFFERS.get();
}

This confirms that the issue is the huge number of glGenBuffers calls, but it's still unclear to me why that is a performance issue to begin with, especially since glDeleteBuffers is not an issue at all.
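To illustrate why batching reduces the number of driver calls, here is a minimal, self-contained sketch of the same idea without LWJGL. The class `BatchedIdAllocator` and the `IdSource` interface are hypothetical stand-ins (they are not Minecraft or LWJGL names); `IdSource.fill` plays the role of glGenBuffers and is assumed to fill the remaining slots without moving the buffer position.

```java
import java.nio.IntBuffer;

// Hypothetical model of the batching workaround; not Minecraft or LWJGL code.
final class BatchedIdAllocator {
    /** Stand-in for the driver call: fills all remaining slots with fresh IDs. */
    interface IdSource {
        void fill(IntBuffer out);
    }

    private final IntBuffer ids;
    private final IdSource source;
    int driverCalls = 0; // how many times the (simulated) driver was asked for IDs

    BatchedIdAllocator(int batchSize, IdSource source) {
        this.ids = IntBuffer.allocate(batchSize);
        this.ids.position(batchSize); // start "empty" so the first request refills
        this.source = source;
    }

    int next() {
        if (!ids.hasRemaining()) {
            ids.clear();       // position = 0, limit = capacity
            source.fill(ids);  // one driver call yields a whole batch of IDs
            driverCalls++;
        }
        return ids.get();
    }

    public static void main(String[] args) {
        int[] counter = {1};
        BatchedIdAllocator alloc = new BatchedIdAllocator(2048, out -> {
            // Absolute puts, so the buffer position is left untouched.
            for (int i = out.position(); i < out.limit(); i++) {
                out.put(i, counter[0]++);
            }
        });
        for (int i = 0; i < 5000; i++) {
            alloc.next();
        }
        System.out.println("driver calls: " + alloc.driverCalls);
    }
}
```

With a batch size of 2048, handing out 5000 IDs costs 3 simulated driver calls instead of 5000. Note that this only hides the per-call cost; it does not explain why each call is slow.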

~~A good solution would be to allocate and delete all the chunk buffers in one glGenBuffers call but that doesn't work nicely with the way this code is written.~~

~~I also verified that this code didn't change in any significant way even since 1.12.2 so I really have no idea what has changed here that suddenly makes this code path so much slower.~~

After more debugging, this is the real reason (full explanation of how I found it in comments):

Minecraft is binding already deleted VertexBuffers, which it internally assigns an ID of -1 to. This ends up changing mesa driver internal state in a way that always triggers a very slow code path for glGenBuffers.

Attachments

forced_crash_19w37a.txt

forced_crash_19w40a.txt

sampler_snapshot.nps

Comments 12

FaRo1 2019-10-13T13:19:35Z

I don't need to fly around to reproduce this, just opening a world (only tested in my test world so far) and increasing the render distance reproduces it. Especially pressing F3+F causes so much lag that I have absolutely no chance of getting from 2 to 32, it always freezes before.
I have a Debian 9.11 laptop. 1.14.4 shows the GPU info as "Mesa DRI Intel Kabylake GT2, 3.0 Mesa 13.0.6", 19w41a shows it as "GLU.getRenderer", "GLU.getOpenGLVersion".

RedCMD 2019-10-18T23:20:46Z

I run Debian Linux on my chromebook
Whenever I change render distance I get a massive freeze of between a sec and 20sec (never a minute?)
But it doesn't always happen, but when it does, it always does it until I restart
Can anyone confirm that the freezing only happens when there is a large amount of block entities within render distance? (maybe only needs heaps when first joining the world and rest of time afterwards is laggy?)
I currently switch between different versions constantly 1.12.2 and 1.14.4 - both had the issue, tho not any more?
Maybe it's because I run Carpet Client or Tweakeroo? - They have a fix for Block Entity Unloading/Loading lag
Forge has the BE fix too

FaRo1 2019-10-18T23:34:34Z

I just reproduced it in a fresh default world in 19w42a, so masses of tile entities are not the cause. And yes, it's minutes, not seconds. It seems to actually recover if I let it go on for long enough, I just usually don't do that. It seems to be only client lag, the server runs fine in the meantime.

Barteks2x 2019-10-21T12:51:40Z

This has nothing to do with block entities. This happened with 1.12.2 when VBOs are disabled. Now it happens almost always. I haven't analyzed the code yet, but it's most likely due to the intel drivers on linux taking much more time on creating buffers (or at least MC triggering some slower code path there?).

[Mod]Les3awe 2019-11-11T05:50:46Z

Fixed some performance issues on 19w45b.
Please check if it still affects 19w45b.

Barteks2x 2019-11-11T17:31:12Z

The issue still persists with 19w45b. VisualVM sampler still shows the same parts of code involved. But now it correctly shows GPU information on F3 screen.

This time again it didn't occur on the first attempt so initially I almost thought it's fixed. The first 1 or 2 times I set render distance, it works just fine even when setting it to values as high as 22, but after that it still takes well over a second to set it to 8 or 10, and whole minutes to set it above 16 chunks.

Aside from the Intel GPU on Linux, I was also able to reproduce this issue with the Linux nouveau drivers on an NVIDIA GPU (running on an NVIDIA GeForce GT 740M). On the same computer and OS, the issue doesn't occur when using the official NVIDIA drivers.

Barteks2x 2019-11-11T19:21:45Z

private static final int GENBUFFERS_BATCH = 2048;
private static IntBuffer BUFFERS;

public static int genBuffers() {
    RenderSystem.assertThread(RenderSystem::isOnRenderThreadOrInit);
    if (BUFFERS == null || !BUFFERS.hasRemaining()) {
        if (BUFFERS == null) {
            BUFFERS = BufferUtils.createIntBuffer(GENBUFFERS_BATCH);
        }
        BUFFERS.rewind();
        GL15.glGenBuffers(BUFFERS);
        BUFFERS.rewind();
    }
    return BUFFERS.get();
}

A good solution would be to allocate and delete all the chunk buffers in one glGenBuffers call but that doesn't work nicely with the way this code is written.

I also verified that this code didn't change in any significant way even since 1.12.2 so I really have no idea what has changed here that suddenly makes this code path so much slower.

FaRo1 2019-11-11T21:34:20Z

I can definitely confirm this happening and I've created a bunch of screenshots to show how the lag scales. The screenshots for render distances 31 and 32 were made much later, simply because the lag spikes took too long and I had to leave home in the meantime. 😃
@unknown Please use the preview feature more, every edit sends a mail to every watcher.

Barteks2x 2019-11-11T23:37:20Z

I think I figured out what the REAL issue is, and it's completely unlike what I thought it would be. And of course not without first getting very confused by a false lead. A tl;dr answer of what is actually wrong is at the bottom; what follows is the full story of how I discovered it.

First I modified the 1.12.2 and latest snapshot code to track the currently used GL buffer IDs. I noticed that in 1.12.2 the last ID is always a very large number, while the latest snapshot has just 1, 2, 3 and 4. This turned out to be completely irrelevant to the issue, but it put me on the right track.

I started looking at mesa source code, since this issue seems to exclusively affect mesa-based GPU drivers.

Grepping the code for glGenBuffers, I quickly found the implementation in src/mesa/main/bufferobj.c. From there we can follow the code into the _mesa_HashFindFreeKeyBlock() call in src/mesa/main/hash.c. Here we can see 2 code paths: one in if (maxKey - numKeys > table->MaxKey) and another where this condition is false. Without fully understanding what this line did, my initial assumption was that the fast path is taken when the max allocated ID is already the maximum, and the slow path otherwise. That turned out to be completely wrong.

Mesa is actually going to keep giving out sequential IDs for as long as possible, and unless Minecraft somehow managed to exhaust all 32-bit integers, this should never reach the slow path. After a quick search through the code, we can find that the only other place where mesa writes to MaxKey is in _mesa_HashInsert_unlocked, which is called, among other places, from glBindBuffer.
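The two code paths described above can be sketched with a simplified Java model. This is an illustration written from the description in this comment, not mesa's actual code: the class `FreeKeyModel`, its fields, the value chosen for `MAX_KEY`, and the linear scan used for the slow path are all assumptions made for the sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of a key table with a fast and a slow allocation path.
final class FreeKeyModel {
    // Assumed upper bound for keys (2^32 - 2); longs are used to model unsigned 32-bit values.
    static final long MAX_KEY = 0xFFFFFFFFL - 1;

    private final Set<Long> used = new HashSet<>();
    long tableMaxKey = 0; // highest key ever inserted
    long probes = 0;      // lookups performed on the slow path

    // Models an insert (for example caused by binding an unknown name).
    void insert(long key) {
        used.add(key);
        if (key > tableMaxKey) {
            tableMaxKey = key;
        }
    }

    long findFreeKeyBlock(long numKeys) {
        if (MAX_KEY - numKeys > tableMaxKey) {
            return tableMaxKey + 1; // fast path: hand out the next sequential key
        }
        // Slow path (assumed): scan from key 1 for numKeys consecutive free keys.
        long freeCount = 0;
        long freeStart = 1;
        for (long key = 1; key != MAX_KEY; key++) {
            probes++;
            if (used.contains(key)) {
                freeCount = 0;
                freeStart = key + 1;
            } else if (++freeCount == numKeys) {
                return freeStart;
            }
        }
        return 0; // no free block found
    }

    long genBuffer() {
        long id = findFreeKeyBlock(1);
        insert(id);
        return id;
    }

    public static void main(String[] args) {
        FreeKeyModel table = new FreeKeyModel();
        for (int i = 0; i < 1000; i++) {
            table.genBuffer(); // fast path every time
        }
        System.out.println("probes before large key: " + table.probes);
        table.insert(0xFFFFFFFFL); // a single insert of a huge key raises tableMaxKey
        table.genBuffer();         // now every allocation scans past all used keys
        System.out.println("probes after large key: " + table.probes);
    }
}
```

In this model, a single insert of a key near the 32-bit maximum permanently disables the fast path, and each later allocation has to walk past every key already in use, so allocating many buffers one by one scales roughly quadratically.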

In order to see where this is called from, I modified this mesa code to check for a value close to the max 32-bit unsigned int and cause a segfault (because that generates a useful hs_err file with a Java stacktrace in it). With that change, I quickly crashed the JVM and got my answer: when binding a VertexBuffer during VBO upload, Minecraft ends up binding already-deleted VertexBuffers. This gives mesa an ID of -1, which ends up being converted to 4294967295. This doesn't break anything, but has the effect of always triggering the slow path in _mesa_HashFindFreeKeyBlock, resulting in extremely slow changes of render distance.
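The conversion of -1 to 4294967295 can be checked in plain Java: a signed 32-bit -1 has all bits set, which is the largest value when the same bits are read as an unsigned 32-bit integer. The class and method names below are hypothetical and only serve the illustration.

```java
// Illustration only: how a Java int of -1 looks when read as an unsigned 32-bit value.
final class DeletedBufferId {
    static final int DELETED = -1; // sentinel ID for deleted buffers, per this report

    // Reinterprets the 32 bits of a Java int as an unsigned value.
    static long asUnsigned(int id) {
        return Integer.toUnsignedLong(id);
    }

    public static void main(String[] args) {
        System.out.println(asUnsigned(DELETED));               // 4294967295
        System.out.println(Integer.toHexString(DELETED));      // ffffffff
    }
}
```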

TL;DR:

Minecraft is binding already deleted VertexBuffers, which it internally assigns an ID of -1. This ends up changing mesa driver internal state in a way that always triggers a very slow code path for glGenBuffers.

Barteks2x 2019-11-12T00:25:34Z

I also now verified that editing the code to bind 0 if the ID is negative solves this issue, but then creates gl errors.
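A minimal sketch of that kind of guard is shown below. The helper `bindableId` is a hypothetical name, not Minecraft's code, and the actual bind call is left out so the example runs without OpenGL. As noted above, mapping to 0 avoids the huge key but is only a workaround, since later draw calls then run without a valid buffer bound.

```java
// Hypothetical guard: never pass a negative (deleted) buffer ID on to the driver.
final class SafeBind {
    // Maps the "deleted" sentinel (any negative ID) to 0, which unbinds the target.
    static int bindableId(int id) {
        return id < 0 ? 0 : id;
    }

    public static void main(String[] args) {
        int[] requested = {3, -1, 7};
        for (int id : requested) {
            // In real code the result would be passed to the buffer bind call.
            System.out.println(id + " -> " + bindableId(id));
        }
    }
}
```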

I haven't looked deeper into the code to find exactly what change makes it attempt to bind and use deleted buffers, but it might be a result of removing VBOs as a separate option, since I also found that in older versions this was actually the reason why changing render distance with VBOs disabled was slow. In older versions, when VBOs were enabled, MC didn't attempt to use deleted buffers, but when they were disabled, it did attempt to use already deleted display lists (ID -1). Fixing it in 1.12.2 (it's probably exactly the same in 1.14.4) for display lists the same way (using value 0 instead when it's -1) has the same effect: a few gl errors, but changing render distance is fast.

Sollace 2021-10-26T19:20:06Z

@Bartosz Skrzypczah

Minecraft is binding already deleted VertexBuffers, which it internally assigns an ID of -1. This ends up changing mesa driver internal state in a way that always triggers a very slow code path for glGenBuffers.

I happened to land on this issue after searching for an issue I'm having that's somewhat similar, so I thought I would ask: Is it possible for this to crash GL11's own rendering pipeline?

I've been running the game on OpenJDK16, and ever since the switch away from Java8 I have been struggling to debug this occasional error where the game would have a REALLY intense lag spike, and then all rendering would freeze whilst sound and gameplay continue unabated.

Barteks2x 2021-10-27T04:40:35Z

First, Minecraft no longer uses GL 1.1, and the issue is long fixed afaik. So no, this is not a possible cause. And the non-VBO rendering mode is no longer even available.