MC-119971

Various duplications, deletions, and data corruption at chunk boundaries, caused by loading outdated chunks — includes duping and deletion of entities/mobs, items in hoppers, and blocks moved by pistons, among other problems

In the process of testing the fix for MC-79154, I uncovered another potentially serious but unrelated Minecraft bug. I found it by looking at MCP for 1.11.2, although I'm sure it affects 1.12.x also.

Here's what I saw in my logs:

// On tick 5034266, a chunk unload happens, causing a hopper I was monitoring to get serialized as NBT data and queued for I/O:

Hopper save on tick 5034266 ser=29333 oser=7
  Position at co{x=-1, y=56, z=2}
    data: {TransferCooldown:3,x:-1,y:56,z:2,Items:[],id:"minecraft:hopper",serial_num:7L,Lock:""}

// Two game ticks later, the same chunk is reloaded, but we get completely different data:

Hopper load on tick 5034268 ser=29341 oser=7
  Position at co{x=-1, y=56, z=2}
    data: {TransferCooldown:7,x:-1,y:56,z:2,Items:[0:{Slot:0b,id:"minecraft:redstone",Count:1b,Damage:0s}],id:"minecraft:hopper",serial_num:7L,Lock:""}

// Searching backwards in the log, we find that this is what was written out the previous time this chunk was unloaded:

Hopper save on tick 5034170 ser=29323 oser=7
  Position at co{x=-1, y=56, z=2}
    data: {TransferCooldown:7,x:-1,y:56,z:2,Items:[0:{Slot:0b,id:"minecraft:redstone",Count:1b,Damage:0s}],id:"minecraft:hopper",serial_num:7L,Lock:""}

(In case you're wondering why the saves are only 96 ticks apart, I think it's because the one on tick 5034170 happened when I logged out or restarted the server. I forget exactly, but it appears I triggered an off-schedule auto-save. This is actually relevant to Rich Crosby's fix for MC-22147. Briefly, I had thought that bug might be moot, since auto-saves are 45 seconds apart: what are the chances that the I/O system would be backlogged by 45 seconds? But anything that causes two auto-saves to occur close together can trigger it.)

Anyhow, so, how could we be loading an old version of this data? Well, here's what I found:

There is a window between when AnvilChunkLoader.writeNextIO dequeues a chunk from chunksToRemove and when the I/O write completes, during which AnvilChunkLoader.loadChunk can come along looking for the same chunk. If that happens, AnvilChunkLoader.loadChunk will read a stale version of the chunk from disk.

The key methods on RegionFile are synchronized, so I'm pretty sure it can't read corrupt data. Sector allocation and low-level file access are all synchronized. The problem is what happens during the period when CompressedStreamTools and the various NBTTag objects are writing out the NBT data to the ChunkBuffer (derived from ByteArrayOutputStream), including the overhead of doing the compression.
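To make the window concrete, here is a heavily simplified, hypothetical sketch of the two code paths involved. This is not the actual MCP decompile; the names only mirror the real ones, and String stands in for the serialized chunk NBT.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical model of the race; String stands in for the chunk's NBT data.
    public class ChunkSaveRace {
        // Stand-in for AnvilChunkLoader.chunksToRemove (saves queued for the I/O thread).
        private final Map<Long, String> chunksToRemove = new ConcurrentHashMap<>();

        // I/O thread: roughly what writeNextIO does. The entry is removed first,
        // then the slow compression and RegionFile write happen afterwards.
        void writeNextIO(long pos) throws InterruptedException {
            String data = chunksToRemove.remove(pos);   // entry vanishes here...
            if (data == null) {
                return;
            }
            Thread.sleep(50);                           // ...while compression/write is still running
            // RegionFile write would finish here; until then, the file holds the PREVIOUS save.
        }

        // Server thread: roughly what loadChunk does. It checks the pending queue,
        // and otherwise falls back to the region file on disk.
        String loadChunk(long pos) {
            String pending = chunksToRemove.get(pos);
            if (pending != null) {
                return pending;                         // fresh data, all good
            }
            return readFromDisk(pos);                   // during the window above: stale data
        }

        private String readFromDisk(long pos) {
            return "previous save for chunk " + pos;    // placeholder for the RegionFile read
        }
    }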

I don't have an "easy one-line" fix for this.

Let's say that AnvilChunkLoader.writeNextIO were to pick a chunk out of chunksToRemove but not actually remove it from that container during the I/O. That would be great for the case when the same chunk is reloaded. But what if another save for the same chunk comes along during that time? Then the newer one would replace the older one in chunksToRemove while the older one is being saved, and when the write was done, the newer one would get deleted from chunksToRemove.

That condition could be eliminated (fixing Crosby's observation in the process) if an auto-save could not even be started while AnvilChunkLoader's I/O queue is non-empty.
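A minimal sketch of that guard, assuming the server had some way to ask the chunk loader how many saves are still queued (getPendingSaveCount and saveAllChunks are invented names; vanilla exposes no such accessor):

    // Hypothetical guard; not vanilla code. The interface methods are invented names.
    public class AutoSaveGuard {
        interface ChunkSaver {
            int getPendingSaveCount();   // chunks still queued for the write I/O thread
            void saveAllChunks();        // the normal auto-save path
        }

        private final ChunkSaver saver;

        public AutoSaveGuard(ChunkSaver saver) {
            this.saver = saver;
        }

        // Called once per auto-save interval on the server thread.
        public void tryAutoSave() {
            if (saver.getPendingSaveCount() > 0) {
                return;                  // the previous save hasn't drained yet; skip this cycle
            }
            saver.saveAllChunks();
        }
    }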


How to reproduce: Point two hoppers into each other on a chunk border, throw one item in, create lag.

Comments

migrated

I'm testing a fix for this. There's a chance that fixing this may help with mobs and other entities (aside from block entities) being duped or deleted at chunk boundaries.

migrated

I've uploaded the source code that fixes this for me. I have a setup that, on an unmodified server, will duplicate and delete items in hoppers and blocks being pushed by pistons. The fix for MC-79154 addresses duping within chunks, while the fix for this issue (MC-119971) addresses problems at chunk boundaries. With both applied, all of these problems appear to go away.

Most of the code changes are comments. What I did was make the synchronization explicit for the maps that keep track of pending chunks: one map for chunks queued to be written, and another for what's currently being written. Transitions between them, chunks entering and leaving the system, and peeking into them for reloading all have to be synchronized. Only by eliminating the race conditions can we be sure there will be no data loss.
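To show the shape of it, here is an illustrative sketch, not the attached patch itself: the class and field names are made up, and String stands in for the chunk data.

    import java.util.AbstractMap;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Illustrative sketch of the two maps and their synchronized transitions.
    public class SynchronizedChunkQueue {
        private final Object lock = new Object();
        private final Map<Long, String> pendingSaves = new HashMap<>();      // queued for writing
        private final Map<Long, String> currentlyWriting = new HashMap<>();  // at most one entry

        // Server thread: a chunk is unloading; the newest data always wins in the queue.
        public void queueSave(long pos, String data) {
            synchronized (lock) {
                pendingSaves.put(pos, data);
            }
        }

        // Server thread: loadChunk peeks at both maps before touching the region file.
        public String peekPending(long pos) {
            synchronized (lock) {
                String data = pendingSaves.get(pos);
                return (data != null) ? data : currentlyWriting.get(pos);    // null -> read from disk
            }
        }

        // I/O thread: atomically move one chunk from "queued" to "being written".
        public Map.Entry<Long, String> takeNextForWrite() {
            synchronized (lock) {
                Iterator<Map.Entry<Long, String>> it = pendingSaves.entrySet().iterator();
                if (!it.hasNext()) {
                    return null;
                }
                Map.Entry<Long, String> next = it.next();
                Long pos = next.getKey();
                String data = next.getValue();
                it.remove();
                currentlyWriting.put(pos, data);
                return new AbstractMap.SimpleEntry<>(pos, data);
            }
        }

        // I/O thread: the write itself happens OUTSIDE the lock; afterwards retire the entry.
        public void finishWrite(long pos) {
            synchronized (lock) {
                currentlyWriting.remove(pos);
            }
        }
    }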

I'll get some friends to test this further, but I think I've managed to find proper fixes to some bugs that have annoyed technical Minecraft users for a long time. What is necessary to get the developers to notice these fixes and implement them?

migrated

I've been told that Microsoft policy disallows developers from looking at code offered by community members. That seems like a crippling policy, and there surely has to be a way around it. I can sign a release if that helps, or I can add a much more detailed description. Heck, for that matter, I'd be happy to arrange a phone (or equivalent) call with a developer to talk them through it. I'm no noob when it comes to signing NDAs and other protection and indemnification contracts as an independent contractor. 🙂

It wasn't easy to identify this bug in the first place, and the fix had to be implemented very carefully. And although I can't say that I have covered every possible scenario, I have tested the heck out of the code I've attached. Although my approach is surely not the only valid solution, reinventing the wheel risks missing important motivations for the design choices I had to make. For instance, I believe it is important to ensure atomicity when changing the state of a chunk as it is queued for unloading, dequeued for writing, retired, or picked out of the unloading and pending-write queues when a chunk needs to be reloaded. Although carefully ordering certain container accesses might allow us to avoid synchronized methods for these state transitions, such approaches are harder to prove correct.

migrated

I changed the title.

"AnvilChunkLoader: loadChunk will read an old version of chunk if writeNextIO is simultaneously writing out the same chunk" is the cause, but the new title is more descriptive and better informs as to the consequences of the bug.

migrated

Hi, everyone.

With help from Pokechu22 and NarcolepticFrog, I made a world that is empty except for a test environment for this bug. It demonstrates both item duping and block duping (or deletion). Near -24 / 0, there are some hoppers and a machine that pushes blocks. A view distance away (at around 150 / 0), there is a rail with a minecart. Ride in that cart for a few hours and then go check the machine and hoppers for duping. I also included the server.properties just in case.

This is rigged so that one of the chunks containing the setup gets loaded and unloaded over and over again. The other chunk stays loaded.

This was tested in 1.11.2, vanilla. It should behave the same in 1.12.1.

We tested it for about 5 minutes and got item duping, so I went ahead and uploaded the world. I'll edit this comment when I see block anomalies.

migrated

Y'all are going to love these screenshots. The blocks duped at the chunk boundary, and one of the hoppers is completely filled with items.

migrated

I have attached a zip version of the files, since it is not easy to work with TBZ2 on Windows.
Tested in 1.12.2-pre2 for an hour, got a piston duplicated and both hopper pairs lost their item.

migrated

Thanks for the help with that, atomizer!

BTW, I have been thinking about ways to avoid explicit synchronization.

One idea that I considered was to have two lists, one for "pending saves" and the other for "currently writing." (The latter list would have one or zero items, but it's important that the key and value be updated atomically.) What if we were to copy a (reference to the) selected pending chunk into the writing list before deleting it from the pending list? Well, there is still a race condition: the pending list could have this entry updated after it's copied to the writing list but before it's deleted from the pending list. Either the iterator is invalidated, which ultimately means that the updated version gets thrown away, or the chunk data gets modified while it's being processed. Either way is bad. We really need the transition from one list to the other to be atomic, which requires explicit synchronization.
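A small sketch of that unsafe ordering, with the race window marked (hypothetical code; the field names are invented):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of the UNSAFE copy-then-delete ordering, with no shared lock.
    public class UnsafeTwoListMove {
        private final Map<Long, String> pendingSaves = new ConcurrentHashMap<>();
        private final Map<Long, String> currentlyWriting = new ConcurrentHashMap<>();

        // I/O thread:
        void takeForWrite(long pos) {
            String data = pendingSaves.get(pos);
            if (data == null) {
                return;
            }
            currentlyWriting.put(pos, data);
            // <-- race window: the server thread can put a NEWER version of this chunk
            //     into pendingSaves right here...
            pendingSaves.remove(pos);
            // ...and this remove() then throws that newer version away: silent data loss.
        }
    }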

For similar reasons, we can't just leave the chunk in the pending list (perhaps marked as being written) because it could get updated while it's being processed. If the old chunk is just replaced with a new reference (which I think it would be), then we get data loss when the entry is deleted; otherwise the data changes while it's being consumed.

There are other functions I've provided that also have to be synchronized for similar reasons. Basically, two containers are being updated by one thread, so any update from another thread in between them will cause data loss or data corruption.

Maybe someone else can come up with a smarter solution, but I see no way to avoid explicit synchronization here.

migrated

I just thought about this very briefly (for a few minutes), i.e. how I would implement the chunk saving, before fully reading your comments, and I came up with basically what you have in your latest comment.

So in short:

  • There would be a Map of queued chunks to be unloaded and saved

  • The write I/O thread would have a separate Map or List, or just a single Pair of ChunkPos -> Chunk, for the chunk it's currently writing to disk (probably just zero or one entries at a time)

  • When a chunk needs to get loaded, it would first be checked in the queued chunks map, as those are always the most recently updated version of the chunk, waiting to be written to disk

  • If the chunk isn't in the queue map, then it would be checked in the map or whatever the write I/O thread uses, ie. what is currently being written, as that's the "second most recent place" a chunk can be in

  • If it isn't in there either, then it can be loaded from disk via RegionFileCache, as it would have been completely written to disk already

Now, the critical part is to always use a common synchronization/lock object for three things: when the queue map is accessed from the server thread to see if a chunk is there; when the write I/O thread takes a chunk from the queue map and moves it to the write map before starting to write it to disk; and when the write I/O thread finishes writing the chunk data and removes the chunk from the write map. That lock shouldn't cause much of a slowdown, as the synchronized section would only cover the time it takes to check and grab the chunk from the queue map in the server thread, or the time it takes to move the chunk from the queue map to the write map in the write thread. The write thread's actual write operation would obviously be outside that synchronized block.

So basically when a Chunk gets unloaded, a reference to it (or its data) would move like so: queue -> writemap -> file

So when unloading a chunk, it would be safe to freely overwrite a chunk reference in the queue map, as there is never a newer version of the chunk anywhere else, and the write thread grabbing chunks from that queue map would be synchronized. Even if the write thread is currently writing a chunk and it gets modified and put into the queue again, that's no problem; it will just get written again later. And if a chunk needs to reload before getting written, the latest version of it would be waiting in that queue map. If it's not there yet, then check the write map in case it's currently being written, and if it's not there either, then it can be read from the file again.
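Sketched as code (reusing the hypothetical SynchronizedChunkQueue sketch from the earlier comment; writeToRegionFile and readFromRegionFile are invented placeholders), the two threads would look roughly like this:

    import java.util.Map;

    // Rough flow only; builds on the hypothetical SynchronizedChunkQueue sketch above.
    public class ChunkIoFlow {
        private final SynchronizedChunkQueue queue = new SynchronizedChunkQueue();

        // Write I/O thread: one iteration of its loop.
        void writeNextIO() {
            Map.Entry<Long, String> next = queue.takeNextForWrite();  // queue map -> write map (locked)
            if (next == null) {
                return;
            }
            writeToRegionFile(next.getKey(), next.getValue());        // the slow part, outside the lock
            queue.finishWrite(next.getKey());                         // write map -> done (locked)
        }

        // Server thread: look in the queue map, then the write map, then the region file.
        String loadChunk(long pos) {
            String pending = queue.peekPending(pos);                  // checks both maps under the lock
            if (pending != null) {
                return pending;
            }
            return readFromRegionFile(pos);                           // fully on disk by this point
        }

        private void writeToRegionFile(long pos, String data) { /* RegionFile/RegionFileCache write */ }

        private String readFromRegionFile(long pos) { return null; } // placeholder for the file read
    }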

Anyway, this is my current thinking of this issue, after a very short amount of time thinking about it.

migrated

Hi, Matti,

Reading over what you said, I cannot find anything to disagree with. It sounds just like my reasoning for my implementation.

Anyhow, now we should wait for the devs to get involved and not give them too many comments to read. It may be useful to mention this bug on social media, but very politely, because I hear that the devs are planning to fix MC-79154, which I am super excited about. We need to let people know that the devs have been listening: bugs like MC-119971 are game-breaking, while things like random ticks on liquids were merely inconvenient, yet they were kind enough to implement a change due to user request.

Thanks.

migrated

I have attached AnvilChunkLoader.zip, which contains the unmodified MCP code, the updated code, and a unified diff.

EDIT: There are some minor differences between the old AnvilChunkLoader.java that I had uploaded before and the ".fixed" one in the zip file. But it's just differences in print statements that you don't want to copy.

TheTamedWolf

I think this is happening for me now in my world, going from 1.12.2 to 18w10b. Constant crashes when creating new sections of the world and reloading old chunks, even when changing something and then doing a save/quit. Reopening the world, I see that everything I did was futile. Nothing kept. World unplayable.

migrated

@@unknown: Please create a new ticket for your issue.

migrated

I have made a much simplified version of the fix for this. It's a lot more readable and shows better just how simple this fix is.

[media]

FaRo1

Is there a good way to reproduce this bug? Is it the old "place two hoppers into each other on a chunk border"?

migrated

Is this still an issue in 1.13?

migrated

To reproduce this, there is "hopper test world.zip", although it could be older than what I remember working with. Around 0/0, there's a machine and some hoppers. At around x=+160, there's a minecart. Get in the minecart and AFK. Within minutes to hours, you'll have duped blocks in the machine and duped items in the hoppers.

migrated

I'm attaching a screenshot of my testing of this bug fix in 18w30b. In my fix to 1.12.2, this never happened. But in 1.13, one of the pistons gets permanently stuck in an extended state. I only AFK'd for a short time, and nothing got duped in the hoppers. So what we're seeing here may be some other bug entirely.

[media]

Update: It takes maybe a minute in the minecart, loading and unloading the chunk with the slime blocks in it, for the piston to get stuck like this. This isn't a self-triggering flying machine; the redstone lines are being pulsed. It's just that the extended piston is ignoring that and staying extended. I created a new ticket for this: MC-134979

migrated

Fry

Unconfirmed

Minecraft 1.12.1, Minecraft 1.12.2, Minecraft 1.13

Minecraft 18w30a
