mojira.dev
MC-119971

Various duplications, deletions, and data corruption at chunk boundaries, caused by loading outdated chunks — includes duping and deletion of entities/mobs, items in hoppers, and blocks moved by pistons, among other problems

In the process of testing the fix for MC-79154, I uncovered another potentially serious but unrelated Minecraft bug. I found it by looking at MCP for 1.11.2, although I'M SURE it affects 1.12.x also.

Here's what I saw in my logs:

// On tick 5034266, a chunk unload happens, causing a hopper I was monitoring to get serialized as NBT data and queued for I/O:

Hopper save on tick 5034266 ser=29333 oser=7
  Position at co{x=-1, y=56, z=2}
    data: {TransferCooldown:3,x:-1,y:56,z:2,Items:[],id:"minecraft:hopper",serial_num:7L,Lock:""}

// Two game ticks later, the same chunk is reloaded, but we get completely different data:

Hopper load on tick 5034268 ser=29341 oser=7
  Position at co{x=-1, y=56, z=2}
    data: {TransferCooldown:7,x:-1,y:56,z:2,Items:[0:{Slot:0b,id:"minecraft:redstone",Count:1b,Damage:0s}],id:"minecraft:hopper",serial_num:7L,Lock:""}

// Searching backwards in the log, we find that this is what was written out the previous time this chunk was unloaded:

Hopper save on tick 5034170 ser=29323 oser=7
  Position at co{x=-1, y=56, z=2}
    data: {TransferCooldown:7,x:-1,y:56,z:2,Items:[0:{Slot:0b,id:"minecraft:redstone",Count:1b,Damage:0s}],id:"minecraft:hopper",serial_num:7L,Lock:""}

(In case you're wondering why the saves are only 96 ticks apart, I think it's because the one on tick 5034170 happened when I logged out or restarted the server. I forget exactly, but it appears I triggered an off-schedule auto-save. This is actually relevant to Rich Crosby's fix for MC-22147. Briefly, I thought it might be moot, since auto-saves are 45 seconds apart. What are the chances that the I/O system would be backlogged by 45 seconds, triggering that bug? But anything that causes two auto-saves to occur close to each other CAN trigger that bug.)

Anyhow, so, how could we be loading an old version of this data? Well, here's what I found:

There is a time between when AnvilChunkLoader.writeNextIO dequeues a chunk from chunksToRemove and when the I/O write completes when AnvilChunkLoader.loadChunk could come along looking for the same chunk. If that happens, AnvilChunkLoader.loadChunk will load a stale version of the chunk.

The key methods on RegionFile are synchronized, so I'm pretty sure it can't read corrupt data. Sector allocation and low-level file access are all synchronized. The problem is what happens in the period of time when CompressedStreamTools and the various NBTTag objects are are writing out the NBT data to the ChunkBuffer (derived from ByteArrayOutputStream), including the overhead of doing the compression.

I don't have an "easy one-line" fix for this.

Let's say that AnvilChunkLoader.writeNextIO were to pick a chunk out of the chunksToRemove but not actually remove it form that container during I/O. That would be great for the case when the same chunk is reloaded. But what if another save for the same chunk comes along during that time? Then the newer one would replace the older one in chunksToRemove while the older one is being saved, and then when the write was done, the newer one would get deleted from chunksToRemove.

That condition could be eliminated (fixing Crosby's observation in the process) if an auto-save can't even be started if AnvilChunkLoader's I/O queue is not empty.


How to reproduce: Point two hoppers into each other on a chunk border, throw one item in, create lag.

Linked issues

Attachments

Comments 18

Timothy Miller

I'm testing a fix for this. There's a chance that fixing this may help with mobs and other entities (aside from block entities) being duped or deleted at chunk boundaries.

Timothy Miller

I've uploaded the source code that fixes this for me. I have a setup that will duplicate and delete items in hoppers and blocks being pushed by pistons, when using an unmodified server. MC-79154 fixes duping within chunks, while MC-119971 fixes problems at chunk boundaries. Together, all of these problems appear to go away.

Most of the code changes are comments. What I did was make the synchronization explicit for maps that keep track of pending chunks. One map is for those queued to be written, and another is for what's being currently written. Transitions between them, chunks entering and leaving the system, and peeking into them for reloading all have to be synchronized. Only by eliminating the race conditions can we be sure there will be no data loss.

I'll get some friends to test this further, but I think I've managed to find proper fixes to some bugs that have annoyed technical Minecraft users for a long time. What is necessary to get the developers to notice these fixes and implement them?

Timothy Miller

I've been told that Microsoft policy disallows developers from looking at code offered by community members. That seems like a crippling policy, and there surely has to be a way around it. I can sign a release if that helps, or I can add a much more detailed description. Heck, for that matter, I'd be happy to arrange a phone (or equivalent) call with a developer to talk them through it. I'm no noob when it comes to signing NDAs and other protection and indemnification contracts as an independent contractor. 🙂

It wasn't easy to identify this bug in the first place, and the fix had to be implemented very carefully. And although I can't say that I have covered every possible scenario, I have tested the heck out of the code I've attached. Although my approach is surely not the only valid solution, reinventing the wheel risks missing important motivations for the design choices I had to make. For instance, I believe it is important to ensure atomicity in changing the state of a chunk being queued for unloading, dequeued for writing, retired, and picked out of the unloading and pending write queues when a block needs to be reloaded. Although carefully ordering certain container accesses might allow us to not use synchronized methods for these state transitions, such approaches are harder to prove.

Timothy Miller

I changed the title.

"AnvilChunkLoader: loadChunk will read an old version of chunk if writeNextIO is simultaneously writing out the same chunk" is the cause, but the new title is more descriptive and better informs as to the consequences of the bug.

Timothy Miller

Hi, everyone.

With help from Pokechu22 and NarcolepticFrog, I made a world that is empty except for a test environment for this bug. It demonstrates both item duping and block duping (or deletion). Near -24 / 0, there are some hoppers and a machine that pushes blocks. The view distance away (at around 150 / 0), there is a minecart rail. Ride in that cart for a few hours and then go check the machine and hoppers for duping. I also included the server.properties just in case.

This is rigged so that one of the chunks containing the setup gets loaded and unloaded over and over again. The other chunk stays loaded.

This was tested in 1.11.2, vanilla. It should behave the same in 1.12.1.

We tested it for about 5 minutes and got item duping, so I went ahead and uploaded the world. I'll edit this comment when I see block anomalies.

8 more comments
Timothy Miller

I have made a much simplified version of the fix for this. It's a lot more readable and shows better just how simple this fix is.

[media]

Is there a good way to reproduce this bug? Is it the old "place two hoppers into each other on a chunk border"?

Is this still an issue in 1.13?

Timothy Miller

To reproduce this, there is "hopper test world.zip", although it could be older than what I remember working with. Around 0/0, there's a machine and some hoppers. At around x=+160, there'a minecart. Get in the minecart and AFK. In minutes to hours, you'll have duped blocks in the machine and duped items in the hoppers.

Timothy Miller

I'm attaching a screenshot of my testing of this bug fix in 18w30b. In my fix to 1.12.2, this never happened. But in 1.13, one of the pistons gets permanently stuck in an extended state. I only AFK'd for a short time, and nothing got duped in the hoppers. So what we're seeing here may be some other bug entirely.

[media]

Update: It takes maybe a minute in the minecart, loading and unloading chunk with the slime blocks in it, for the piston to get stuck like this. This isn't a self-triggering flying machine. The redstone lines are being pulsed. It's just that the extended piston is ignoring that and staying extended. I created a new ticket for this: MC-134979

Timothy Miller

Fry

Unconfirmed

Minecraft 1.12.1, Minecraft 1.12.2, Minecraft 1.13

Minecraft 18w30a

Retrieved