[MC-22147] Entity duplication bug

The bug

Definitely the way how entities are being saved is not stable and need to be checked. Yesterday I complained that my horse with armor and saddle disappeared with no reason after I have re-logged my SSP world. Today I enter my brewing room I found my only one librarian villager was duplicated. Both librarians and with the same offerings as of my last trade.

It's impossible for a villager to have entered that room (underground, by iron door); the closest village is 800 m away. I got my librarian by curing a zombie villager and trapping him in that room.

Sorry, this bug is irreproducible. It happened randomly.

Code analysis

See this comment by @unknown

Linked issues

is duplicated by 3

MC-58317 Villager Cloning Resolved

MC-66463 Cloned villager pops up out of nowhere Resolved

MC-109475 Entity Duplication Glitch Resolved

relates to 1

MC-119971 Various duplications, deletions, and data corruption at chunk boundaries, caused by loading outdated chunks — includes duping and deletion of entities/mobs, items in hoppers, and blocks moved by pistons, among other problems Resolved

Attachments

AnvilChunkLoader.java

Comments 30

Ezekiel 2013-07-10T01:59:09Z

Is this still a concern in the current Minecraft version? If so, please update the affected versions in order to best aid Mojang ensuring bugs are still valid in the latest releases/pre-releases.

Ezer'Arch 2013-07-10T16:54:12Z

I'm now playing 1.6.2 and that bug never happened again. In fact, this kind of bug has taken place 3 times since I purchased the game two years ago, it happened when I re-logged after a crash. But this very one, I got no crash and it happened just after I had updated the game to 1.6.1, so I thought it was worth reporting.

Arjan de Vries 2014-08-12T12:55:25Z

I experienced this bug a few days ago on the 14w32d snapshot. My farmer got cloned, same trades unlocked.

Pongo Sapiens 2014-12-22T16:28:46Z

I've just had this happen in 1.8.2-pre1. I have a set of villager tended farms, completely sealed, and I now have two identical farmers (same trades, all unlocked) in one of them.

ednoria 2015-05-29T02:41:46Z

This just happened to me in 1.8.6, same as Pongo Sapiens reports. I have a sealed villager-tended farm, and the farmer/shepherd has been duplicated (same trades).

20 more comments

Arjan de Vries 2015-08-08T23:01:15Z

Still experiencing this. It's rare and seemingly random. Breeding never happens where my villagers are, but sometimes one of them just literally clones itself.

I had it happen with Golems as well, but I'm not too familiar with the spawning rules of them.

Samuel Shank 2015-10-22T14:16:31Z

15w43a

Nathan Konnor Doxey 2016-08-21T02:33:31Z

just happened in my iron titan in 1.10.2

Ch3zter 2016-10-11T12:10:51Z

this happend to my farmer villager yester but i found some more details for this bug. the farmer villager was locked in a wheat farm deep in my base far away from any other village.

the server had been on the whole day, he was still alone half an hour before i found him with his clone, the server did not restart in this time.

today i put the server on and go to check on my now 2 villager to find only 1 again. the duplicated villager somehow was removed again on server startup, i am 100% certain no zombies could get to it.

the trades where identical, it was possible to trade with it and having different trades locked out than the first villager, it is not a ghost villager

CDES5 2016-12-08T17:37:58Z

Affects Version 1.11 from MC-109475

Rich Crosby 2017-03-01T19:06:10Z

I have found a bug in AnvilChunkLoader.java while browsing the MCP decompiled source code that can lead to duplicate/missing entities. Here's how it can happen:

1. Current chunk data gets added to the queue to be unloaded.
2. Chunk gets reloaded for whatever reason and entities enter/leave the chunk.
3. Queued chunk data starts to get written out in the separate IO thread.
4. While the queued chunk data is in the process of being written, the chunk is queued to be unloaded again.

The problem is the following function gets called for unloading the chunk the second time (from MCP source code):

protected void addChunkToPending(ChunkPos pos, NBTTagCompound compound) {
if(!this.pendingAnvilChunksCoordinates.contains(pos)) {
this.chunksToRemove.put(pos, compound);
}

ThreadedFileIOBase.getThreadedIOInstance().queueIO(this);
}

Since the same ChunkPos is already in the process of being written, it will not add the new chunk data to the queue. The result is that the stale data is written to the region file while the new data is just discarded. This means that entities that left the chunk will be duplicated in the region file (still in the stale chunk data as well as added to its new chunk) and that entities that entered the chunk will be lost (not in the stale chunk data and removed from its previous chunk).

Rich Crosby 2017-03-02T13:14:53Z

Also, I am getting this bug on 1.11.2. I run a multiplayer server and like to make large villages, so this bug happens every few days. I can see the following sorts of warning messages in the server logs:

[20:55:39] [Server thread/WARN]: Keeping entity minecraft:villager that already exists with UUID 7789a225-8dfb-4c8b-9ff7-d09220e2357c
[20:55:39] [Server thread/WARN]: Keeping entity minecraft:villager that already exists with UUID 520cf954-4346-4bd4-9d86-c22f4af6dc81
[20:55:39] [Server thread/WARN]: Keeping entity minecraft:villager that already exists with UUID b9003671-7440-47ff-ad3f-c845a30bcac5

Using NBTExplorer on the region files I can see that some villagers that I have given a name with a Name Tag are disappearing as well.

Rich Crosby 2017-03-06T14:17:06Z

These are the patched 1.11.2 server files that fix the problem for me. ConcurrentHashMap.put() and ConcurrentHashMap.remove() are fully synchronized between multiple threads. There is no reason to try to wrap the operations in an additional layer of synchronization. Also, the most recent data should never be discarded. If any data is to be discarded, it should be the old data.

Timothy Miller 2017-06-15T02:24:56Z

I wonder two things:

(1) Does this affect things not related to crossing chunk boundaries? None of the duping bugs I have encountered lately involve chunk borders. (Edit: None of the ITEM duping bugs I've encountered recently involve chunk borders. However, the BLOCK duplicating bugs very well might.)

(2) If the bug was so easy to fix in a modded server, why has Mojang not, say, asked for permission to copy their code?

Of all the numerous bugs I have dealt with in Minecraft, these chunk loading and duping issues are by far the worst ones. So many problems caused bad handling of chunk saves, and these bugs have been left unaddressed for YEARS.

And where do the priorities come from? Take piston translocation, for instance. Sure, it was a bug. But it wasn't really bothering most people, and indeed many users were making productive use out of it. So what goes through people's heads that they prioritize fixing an innocuous bug over fixing bugs that cause untold headaches? How do we get attention paid to the bugs that MATTER?

marcono1234 2017-06-15T11:40:43Z

@unknown could you please upload the fixed source files (decompiled using MCP) as well or provide code snippets containing the fixes to make them easier to implement.

Timothy Miller 2017-06-15T15:52:25Z

It's only been three months since Rich Crosby posted his last comment, but since his offer of code for a fix was ignored for that whole time, don't be surprised if it takes a long time for him to respond. That does not absolve Mojang of fixing a long-standing problem that has caused such wide-spread frustration for so many people.

Also, have you looked at the class files he provided? Although they are not source code, they may nevertheless contain the fixes he's referring to. If we can have moderators telling us how we can "just" make our own resource packs to fix other long-standing bugs, then it should also be easy for THEM to look into Java class files.

Rich Crosby 2017-06-16T03:46:14Z

I've uploaded src/minecraft_server/net/minecraft/world/chunk/storage/AnvilChunkLoader.java
The fix is removing some code which I have shown commented out with the annotation MC-22147 FIX. As I said before, ConcurrentHashMap put() and remove() are fully synchronized between threads and don't need a separate attempt at synchronization with a wrapper. The code using pendingAnvilChunkCoordinates makes it so that new requests to save a chunk are thrown away if the chunk is currently in the process of being written out to disk (with old data) which is incorrect behavior. The new data should not be discarded - it needs to be added to the queue even if old data is in the process of being written to disk.

Timothy Miller 2017-06-16T18:49:36Z

@TastyHaggis,

Out of curiosity, do you believe this to be a total fix? Obviously the current situation is severe, and your fix will ensure that what gets saved is the most up-to-date. But is there any chance remaining that anything could get duped or lost across a chunk boundary, or are we 100% guaranteed that when an entity or block is being moved from one chunk to another, we're guaranteed of there being no race conditions around the block deletes and placements vs. chunk saves?

If a chunk is marked for saving, but then it needs an update again, does it get pulled from the save list? Or does it just get enqueued a second time when it is once again marked for unloading?

Timothy Miller 2017-06-19T17:34:31Z

I had a conversation via reddit with someone that you all have heard of and who is familiar with the game code, and he cautions that this proposed fix may be an oversimplification. Before I go into that, while I stand by my assertion that this deserves attention, I have made some statements in various places that I'm sure are rude, and I apologize for that.

I'm going to put my money where my mouth is and look at the game code myself. I have downloaded MCP, and I'm going to start digging in probably later this week. No promises, since I have a ton of day job work to do.

So first of all, I had inferred from Rich Crosby's comments that the race condition is as follows:

A chunk is queued for unloading. Let's call it chunk X, version A.
Before it's actually saved, an event occurs that causes it to be requested for reload.
More entity processing happens, resulting in what we'll call chunk X, version B.
At some point, the chunk is selected for unloading again, also before it's been saved to disk.
What gets saved to disk is version A.

Now, this could be all just me being stupid and it could be no way implied by what Rich said. But the bottom line, there's no way that it is version A that gets saved. That would require that the whole chunk got copied when it was queued, and that would be really stupid. Completely wasted overhead.

What is more likely is that the chunk gets modified while it is in the I/O queue, and sometimes those modifications are happening in one thread while it's being written to disk in another thread. In this case, we lose atomicity, and this could very well explain how hoppers entirely within chunks could duplicate items. Hopper M is holding item J then gets written out, then the item is transferred to hopper N, after which hopper N gets written out. Voilà, you have a duped item.

It's definitely a problem that when the chunk gets unloaded again, the save is discarded. However, we are cautioned that what appears to be redundant mutex is likely a bad fix to prevent the same chunk from getting saved twice by two different I/O threads. This may have been implemented to work around even worse chunk corruption problems.

A couple of the things I'm going to be on the lookout for:

How are chunks looked up when they are to be processed? Is the reference to a chunk object always found via a hash table? Or is it possible for any other object to hold a reference to the chunk and therefore be totally unaware if it's been queued for unloading? If chunks are always looked up the same way, we can make queueing for chunk unload fully atomic by removing the chunk from the main hash.
If a chunk is found to not exist, does the code check to make sure it's not currently queued for unload? If not, then this "version A" idea might have some validity... the version loaded could be really old. Otherwise, if the code does fetch it from the I/O queue, why is it also not removed from the I/O queue atomically at that time?

There may be some missing locks around the I/O queue. This isn't just about thread safety but about ensuring that changes to the disposition of chunks are handled atomically. But let's say that a lock was put on a chunk while it's in the middle of being written out. That could take a while. If the server then decided to load that chunk while it's being unloaded, if it tried to take that lock, it could result in significant lag with the main thread being blocked by the I/O. So just thinking out of my butt, one solution I would consider would be to have three states: Loaded, queued-for-unloading, unloading, unload-aborted, and (implicitly) not-loaded. Transitioning from one state to another could be done without too much overhead using full synchronization. So if one is just queued, then it's no biggie to just remove it from the pending queue and move on. If it's currently being written, then what we'd want to do is safely transition it to the unload-aborted state but let the I/O finish. Once the I/O is finished then the I/O thread would check the state. If it's still "unloading" then the chunk is dropped. If, however, it's aborted, then the I/O thread would know from this that it had already been "reloaded" and would transition the state to "loaded" instead.

I'm going to stop speculating now and plan to come back with something more concrete once I know how the code REALLY works.

Thanks for your patience.

Rich Crosby 2017-06-20T03:35:22Z

There is only a single I/O thread used to write out chunks which is accessed through the singleton class src/minecraft_server/net/minecraft/world/storage/ThreadedFileIOBase.java

When a chunk is queued for saving, a snapshot in the form of an NBTTagCompound containing the chunk data is created and then placed in the ConcurrentHashMap chunksToRemove with the chunkPos as the key. This is what is later saved to disk by the I/O thread.

Here are the possible cases when a chunk is queued for saving (data A), is modified, and then is requeued for saving (data B):
1. Data B is requested to be queued before data A has started writing to disk. In this case data B replaces data A in the queue.
2. Data B is requested to be queued while data A is being written to disk. In this case data B is not put on the queue and is discarded. However, the server continues as if data B has been queued (clears dirty flags, etc.)
3. Data B is requested to be queued after data A has finished writing to disk. In this case data B is added to the queue.

Clearly, case 2 is doing the wrong thing. I do not see any reason why you would want to throw away data B just because data A is in the process of being written to disk especially since right before that and right after that, data B would instead be added to the queue.

Timothy Miller 2017-06-20T14:20:06Z

Based on what you're saying, to save chunk data, it can't just be saved trivially in its in-memory format. It has to be serialized, so they serialize it on the main thread, which is effectively a snapshot, and that snapshot is saved later. Fixing (2) sounds like it really should fix all of the block and entity duping that occurs at chunk boundaries.

But IS the serialization done on the main thread? I'm still trying to get an idea of why two hoppers could dupe an item, even if they're both in the same chunk. Usually, this sort of thing happens due to multiple threads accessing the same data structure without proper synchronization.

Timothy Miller 2017-06-23T23:53:35Z

I have spent some time looking at the code, and I can see no potential problems with the fix proposed by Rich Crosby. I have started on a detailed code analysis, but I got really swamped. Besides, there's no reason for the devs to wait around on me to provide this code analysis before they start making arrangement to fix this.

When are we going to get a value for "Fix Version/s"?

MalbaCato 2017-06-24T19:25:59Z

@unknown About "Fix Version/s", it is updated only when the version (in which the bug is fixed) is finished and ready to release as a public snapshot.
The only way for us to know ahead of time is the "Assignee" field. That is the way I understand the info in the top.

Timothy Miller 2017-06-29T00:51:41Z

I have spent some time already digging through the MCP code in order to verify Rich Crosby's fix. It looks good, so far. Unfortunately, some serious real life things are going to make it uncertain how much time I can spend playing Minecraft in the near future, let alone spend significant lengths of time digging through the code. The code is actually really nice, well organized, and not at all hard to follow. So when I get the odd free moment, I may be able to spend some more time on it, but I can't make any solid commitments. In the mean time, I felt light it might be a good idea to share what I have done so far. Others are free to send me their own observations that I can add to my notes.

Here is the work-in-progress analysis:
https://docs.google.com/document/d/10nzvbYlJWs5fABtFSA_SZYnIlkyNcDl_BajZa-gM_DQ/edit?usp=sharing

Here's a reddit post where you can add comments:
https://www.reddit.com/r/Minecraft/comments/6k4uva/ongoing_mcp_code_analysis_related_to_mc22147_the/

Thank you all for your patience.

Timothy Miller 2017-07-15T23:49:29Z

Short version: I can confirm the correctness of Rich Crosby's fix exactly as he wrote it, and I suggest implementing it in the next 1.12 point release.

If that's all you do, the world of technical minecrafters will thank you endlessly.

—

I have written my recommendations on the fix proposed for this bug in the previously mentioned Google Doc, starting on page 11:
https://docs.google.com/document/d/10nzvbYlJWs5fABtFSA_SZYnIlkyNcDl_BajZa-gM_DQ/edit?usp=sharing

In summary, I have determined that Rich Crosby's fix will (a) to the job correctly, (b) have no performance impact on the main game thread, and (c) have negligible impact on I/O performance.

—

I have also found that a performance improvement can be made on the main game thread, which you may want to consider. I'll upload that code here. But I recommend considering that for a later improvement.

EDIT: I sure hope me uploading AnvilChunkLoader.java didn't overwrite Rich's version. They have the same name. If so, let me know, and I can reupload it. Also, in my version I too marked the code changes with MC-22147.

Timothy Miller 2017-07-20T16:48:52Z

Does anyone have some suggestions for reasonably efficient ways to reproduce this bug?

Gnembon and MethodZz are helping out, but the tests they're doing are unexpected to me, and they haven't given me much detail. They're still running into some problems, and I'm starting to suspect that their tests are affected more by tick delivery to lazy chunks than to data loss on unloading. I really need to be able to test and debug this myself on my own server.

What would be great is a world download and some procedures that have a tendency to cause breakages.

Any thoughts on this?

Timothy Miller 2017-08-04T21:32:57Z

Some interesting new observations. I'm not confident enough to say anything definitive at this time, but here's what I think right now from my testing and code analysis:

The symptoms of MC-22147 are actually caused by the same block entity bug I described in MC-79154.
Rich Crosby's fix addresses a real bug, but that bug is not what causes MC-22147. However, a good fix for MC-119971 should fix Crosby's bug also.

Update: A while ago, I had put a debug message in AnvilChunkLoader that would get printed out if a chunk was being saved but got thrown away because of pendingAnvilChunksCoordinates. I just saw it print out for the first time after having done all kinda of extensive testing. Although rare, data loss CAN happen. The fix I'm working on for MC-119971 does indeed address this.

Timothy Miller 2017-08-05T02:52:41Z

I need to make a correction. MC-22147 is about entities like mobs and stuff. MC-79154 is about BLOCK entities, which is a whole other thing. None of the suggestions made pertaining to MC-79154 have anything to do with entities. For instance, we can still get mobs to dupe at chunk boundaries.

lord.quadrato 2018-06-21T20:43:10Z

Does this still happen in the 1.13 prereleases, currently 1.13-pre3?

Timothy Miller 2018-07-25T17:42:21Z

This is likely caused by MC-119971 and/or MC-108469. The former is marked as fixed for an up-coming release, but Xcom did provide a fix for MC-108469. For the bugs fixed by the EigenCraft group, Grum has tended to assign them to Mr. Herlitz who then assigns them to one of the junior devs; you might want to do that.