It turns out that MC-214808, MC-216148 and MC-214793 are all caused by the same code change introduced in the 1.17 snapshots. In the following I will describe this change and explain how it leads to each of these bugs. I am posting this as a separate report since it affects multiple other reports and hence doesn't really fit as a comment on any single one of them.
World generation/loading stores one Future for each generation stage of a chunk in a corresponding ChunkHolder. These Futures are created once something requests the chunk in the respective stage, under the condition that the chunk ticket level is sufficiently large to issue generation into that stage. What changed in the 1.17 snapshots is the handling of demotion of chunk ticket levels.
In 1.16, demoting the chunk ticket level completed all pending Futures above the new level with a special UNLOADED_CHUNK marker, while leaving already finished Futures untouched.
net.minecraft.server.world.ChunkHolder.java
protected void tick(ThreadedAnvilChunkStorage chunkStorage) {
    ...
    Either<Chunk, ChunkHolder.Unloaded> either = Either.right(new ChunkHolder.Unloaded() {...});
    for (int i = bl2 ? chunkStatus2.getIndex() + 1 : 0; i <= chunkStatus.getIndex(); ++i) {
        completableFuture = (CompletableFuture)this.futuresByStatus.get(i);
        if (completableFuture != null) {
            completableFuture.complete(either);
        } else {
            this.futuresByStatus.set(i, CompletableFuture.completedFuture(either));
        }
    }
    ...
}
This has two important consequences:
1. Once a generation stage finishes, the Future stays completed with the correct result until the chunk is unloaded.
2. A Future that is still pending at some point can only be aborted (completed with UNLOADED_CHUNK, which does not abort the actual generation step) if the chunk ticket level drops too low; otherwise it will complete with a valid chunk. This requires a small argument, since the Future can be aborted not only directly by the above code, but also indirectly if any of its dependencies is aborted (completed with UNLOADED_CHUNK). However, the latter cannot happen if the ticket level stays high enough (after observing the pending Future), by construction of the ticket system.
In the 1.17 snapshots, this handling of demotion fundamentally changed. Instead of directly aborting the pending Futures above the new level, all Futures, including already finished ones, are replaced with a new one that completes with an UNLOADED_CHUNK once the original Future completes.
net.minecraft.server.world.ChunkHolder.java
protected void tick(ThreadedAnvilChunkStorage chunkStorage) {
    ...
    Either<Chunk, ChunkHolder.Unloaded> either = Either.right(new ChunkHolder.Unloaded() {...});
    for (int i = bl2 ? chunkStatus2.getIndex() + 1 : 0; i <= chunkStatus.getIndex(); ++i) {
        completableFuture = (CompletableFuture)this.futuresByStatus.get(i);
        if (completableFuture != null) {
            this.futuresByStatus.set(i, completableFuture.thenApply(__ -> either));
        } else {
            this.futuresByStatus.set(i, CompletableFuture.completedFuture(either));
        }
    }
    ...
}
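The difference between the two demotion strategies can be reproduced with plain CompletableFutures, entirely outside of Minecraft. The following is a minimal standalone sketch (class, method names and string values are invented for illustration, not actual game code):

```java
import java.util.concurrent.CompletableFuture;

public class DemotionSketch {
    static final String CHUNK = "CHUNK";
    static final String UNLOADED = "UNLOADED_CHUNK";

    // 1.16 demotion: complete(...) aborts a pending future immediately,
    // but is a no-op on a future that already holds a result.
    static String demote116(CompletableFuture<String> stage) {
        stage.complete(UNLOADED);
        return stage.join();
    }

    // 1.17 snapshot demotion: the stored future is replaced by a derived
    // one that discards the original result, even for a finished stage.
    static CompletableFuture<String> demote117(CompletableFuture<String> stage) {
        return stage.thenApply(result -> UNLOADED);
    }

    public static void main(String[] args) {
        CompletableFuture<String> finished = CompletableFuture.completedFuture(CHUNK);
        System.out.println(demote116(finished));        // prints CHUNK: result preserved
        System.out.println(demote117(finished).join()); // prints UNLOADED_CHUNK
    }
}
```

Note how the 1.16 variant preserves the finished stage while the 1.17 variant clobbers it; this is exactly the loss of property 1 discussed below.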
As a result, both of the above properties are no longer valid:
1. An already finished stage might be replaced with an UNLOADED_CHUNK_FUTURE.
2. If the chunk ticket level drops below, say, t, then the corresponding Future is not directly aborted, but instead replaced with a Future that will be lazily aborted. If the ticket level is now raised above t again while the original Future is still running, then no new Future will be issued for this generation stage.
net.minecraft.server.world.ChunkHolder.java
public CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> getChunkAt(ChunkStatus targetStatus, ThreadedAnvilChunkStorage chunkStorage) {
    int i = targetStatus.getIndex();
    CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> completableFuture = (CompletableFuture)this.futuresByStatus.get(i);
    if (completableFuture != null) {
        Either<Chunk, ChunkHolder.Unloaded> either = (Either)completableFuture.getNow((Object)null);
        if (either == null || either.left().isPresent()) {
            return completableFuture;
        }
    }

    if (getTargetStatusForLevel(this.level).isAtLeast(targetStatus)) {
        CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> completableFuture2 = chunkStorage.getChunk(this, targetStatus);
        this.combineSavingFuture(completableFuture2, "schedule " + targetStatus);
        this.futuresByStatus.set(i, completableFuture2);
        return completableFuture2;
    } else {
        return completableFuture == null ? UNLOADED_CHUNK_FUTURE : completableFuture;
    }
}
This was fine in 1.16 since the Future could only be pending if it was not already (indirectly) aborted (which is exactly point 2 above). However, in 1.17 the Future is already indirectly aborted (it is replaced by a lazily aborted one), so it will always produce an UNLOADED_CHUNK even if the ticket level does not fall again. This breaks property 2.
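The failure mode of the getChunkAt check above can again be sketched with plain CompletableFutures (a simplified standalone model, not the actual implementation — getOrSchedule stands in for the pending-future check in getChunkAt):

```java
import java.util.concurrent.CompletableFuture;

public class StaleFutureSketch {
    // Simplified version of the getChunkAt check: a stored future that is
    // still pending (getNow(null) == null) is reused instead of scheduling
    // a new generation task.
    static CompletableFuture<String> getOrSchedule(CompletableFuture<String> stored) {
        if (stored != null && stored.getNow(null) == null) {
            return stored; // looks like "still generating", so it is trusted
        }
        return CompletableFuture.completedFuture("FRESHLY_SCHEDULED");
    }

    static String demo() {
        // The generation step is still running...
        CompletableFuture<String> original = new CompletableFuture<>();
        // ...but a demotion already replaced the stored future with a
        // lazily aborted one (the 1.17 snapshot behavior):
        CompletableFuture<String> stored = original.thenApply(r -> "UNLOADED_CHUNK");

        CompletableFuture<String> reused = getOrSchedule(stored);
        original.complete("CHUNK"); // the generation step itself succeeds...
        return reused.join();       // ...yet the caller receives UNLOADED_CHUNK
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints UNLOADED_CHUNK
    }
}
```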
It turns out that this code was changed in 21w06a, which is precisely the version where all 3 bugs were reported. (I actually checked the precise version retrospectively after figuring out the cause for these issues).
So, how does violation of these 2 properties cause the 3 linked bugs?
1. Ticking-Futures might never complete
Every time the ticket level is raised high enough, a chunk is promoted to BORDER, TICKING and ENTITY_TICKING state respectively. This will create additional Futures that depend on the generation Futures in some small neighborhood of the chunk. Upon completion, these will execute the respective tasks for promoting the chunk to the respective state, like registering tick schedulers, marking the chunk tickable/entity-tickable and sending the chunk data to players.
Assuming property 2 above, these Futures will always complete with a valid chunk at some point, unless the ticket level drops too low. In the latter case, the Futures get recreated once the chunk is promoted again, so this is not an issue.
Now, in the 1.17 snapshots property 2 is violated and hence these special Futures might never complete (with a valid chunk), since they might depend on lazily aborted Futures upon creation. Hence the respective promotion tasks might never execute. In particular, the ticking-Future, which is responsible for sending the chunk data to players, might not execute, hence causing MC-214808.
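A rough standalone model of how such a promotion future can get stuck follows. The combination logic here (allOf over a neighborhood list) is invented for illustration and does not mirror the actual implementation:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PromotionSketch {
    // Sketch of a ticking-style future: the promotion task (e.g. sending
    // chunk data to players) only runs once every generation future in the
    // neighborhood has produced a valid chunk.
    static CompletableFuture<String> makeTickingFuture(
            List<CompletableFuture<String>> neighborhood, Runnable promotionTask) {
        return CompletableFuture.allOf(neighborhood.toArray(new CompletableFuture[0]))
                .thenApply(v -> {
                    if (neighborhood.stream().allMatch(f -> f.join().equals("CHUNK"))) {
                        promotionTask.run();
                        return "TICKING";
                    }
                    return "UNLOADED_CHUNK";
                });
    }

    static boolean demo() {
        // One neighbor's generation future is lazily aborted but still
        // pending, so the combined future can never fire the task.
        CompletableFuture<String> stuckNeighbor = new CompletableFuture<>();
        CompletableFuture<String> ticking = makeTickingFuture(
                List.of(CompletableFuture.completedFuture("CHUNK"), stuckNeighbor),
                () -> System.out.println("chunk data sent to players"));
        return ticking.isDone();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints false: the promotion task never ran
    }
}
```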
Note that all 3 Futures can fail independently. For example, I did observe chunks where the ticking-Future completed but the entity-ticking-Future did not, causing the chunk to load but entities to get stuck in these chunks. Also note that this issue can similarly lead to the "Chunk not there when requested: " error when the ServerChunkManager queries a lazily aborted Future for the FULL stage. I actually observed this crash a few times during debugging.
In order to provoke this issue, one can try the following steps. Due to the random nature of these bugs, several attempts might be needed. Run

/tp ~1000 ~ ~
/tp ~-1000 ~ ~
/tp ~1000 ~ ~

in quick succession, e.g., with one tick in-between, or while pausing the server thread through the debugger. The first teleport will create the ChunkHolders at the target location and create the generation Futures up to the FULL stage. The second tp will then lazily abort these Futures again, and the third and final tp will create the ticking-Futures on the lazily aborted but still pending generation Futures, hence causing MC-214808.
I also observed this issue with only a single tp, although this was less reproducible. This might seem strange at first, since the ticket levels should change monotonically in this case and not experience the jitter required for the above explanation. I think this is caused by an interaction of the POST_TELEPORT chunk ticket and the player ticket throttling. The POST_TELEPORT ticket is created right after teleporting, but expires before the player tickets are added because of the throttling, hence causing the required jitter.
2. Futures might be erased completely
If the chunk ticket level drops low enough, the corresponding ChunkHolder is scheduled for unloading. It can still be revoked up to the point where it is actually saved to NBT and all the unloading tasks are done. This can be a few ticks after the actual scheduling, depending on server load. By property 1, once the corresponding chunk has passed the first generation/loading stage, all Futures will keep the reference to this chunk, so that it can indeed be revoked from the ChunkHolder.
However, in 1.17, when the ChunkHolder is scheduled for unloading, all generation Futures are replaced with (lazily) aborted ones. If this ChunkHolder later gets revoked (before it could properly save and unload), all generation Futures are recreated and the very first stage then reloads the chunk from disk again. Hence, the chunk object gets replaced with a completely new version that is reloaded from disk, erasing any progress since the last save. This is exactly MC-216148.
Note that the unloading tasks did not run in this scenario, so any externally stored data is still present. In particular, the chunk is still marked as loaded and will hence skip the block-entity loading step.
net.minecraft.server.world.ThreadedAnvilChunkStorage.java
private CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> convertToFullChunk(ChunkHolder chunkHolder) {
    ...
    worldChunk2.loadToWorld();
    if (this.loadedChunks.add(chunkPos.toLong())) {
        worldChunk2.setLoadedToWorld(true);
        worldChunk2.updateAllBlockEntities();
    }
    ...
}
As a consequence, block-entity tickers will not be recreated and still reference the old copies, hence block-entities will not be ticked. This is indeed a side effect noted in the bug report.
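The loadedChunks check can be modeled in isolation; the following sketch (a hypothetical reduction, not the real class) shows why the second, revoked-and-regenerated conversion skips block-entity initialization:

```java
import java.util.HashSet;
import java.util.Set;

public class ReloadSketch {
    // Mirrors the loadedChunks check above: block entities are only wired
    // up the first time a chunk position is marked as loaded to the world.
    private final Set<Long> loadedChunks = new HashSet<>();

    // Returns true if block entities were (re)initialized by this conversion.
    boolean convertToFull(long pos) {
        return loadedChunks.add(pos);
    }

    static boolean[] demo() {
        ReloadSketch storage = new ReloadSketch();
        boolean first = storage.convertToFull(42L); // initial load: tickers created
        // ...the holder is scheduled for unloading, its futures are erased,
        // then it is revoked and the chunk is reloaded from disk as a brand
        // new object, while loadedChunks still contains the position...
        boolean second = storage.convertToFull(42L); // skipped: no new tickers
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0] + " " + r[1]); // prints true false
    }
}
```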
In order to provoke this issue, run for example

/tp ~1000 ~ ~
/tp ~-1000 ~ ~

in quick succession, so that the ChunkHolders do not save and unload in-between. This can be achieved by pausing the server thread, for example. The first tp will erase all generation Futures and the second tp will then regenerate them and reload the chunk from disk.
In the report, it is noted that this seems to happen near nether portals. I'm not really sure how this is correlated. Most certainly, the relevant feature of nether portals is the PORTAL tickets created upon using them. However, I don't know how they enter the equation. They might play a similar role to the POST_TELEPORT ticket of the previous section, or it might be something else.
3. Chunks might drop below FEATURES stage during initial lighting
While MC-214793 is actually caused by another concurrency bug, namely MC-224894, it might still be worthwhile to explain why this issue only showed up recently, even though the other bug is way older than 1.16.
Chunks below the FEATURES stage, i.e., chunks for which the FEATURES generation Future does return UNLOADED_CHUNK, are considered opaque by the lighting engine. Due to MC-224894, chunks are not kept in this required stage during initial lighting and hence might become opaque to the lighting engine.
However, due to property 1, in 1.16 the chunk always stayed in FEATURES
stage once it was scheduled for initial lighting (and even was kept from unloading due to the pending light task). What could have happened even in 1.16 is that neighbor chunks might be unloaded between starting the initial lighting and finishing, which would cause glitches at the border. However, these neighbor chunks were kept loaded by the light ticket that is only removed (wrongly) at the start of the initial lighting, so the time frame was usually way too small, given that the actual unloading usually requires some extra ticks.
On the other hand, in 1.17 the chunk will immediately lose its FEATURES status once the light ticket is removed (and the player is sufficiently far away), hence triggering the issue with a relatively small amount of work required by the server thread, which thus gives a large enough time frame for the issue to happen.
Furthermore, note that in 1.16 the light generation stage can be aborted before execution if any pending dependency was aborted due to ticket levels dropping too low. In 1.17 this is no longer the case, since generation Futures don't get aborted directly. Hence, 1.17 processes initial lighting more often even if the player is already far away, compared to 1.16, increasing the chance for glitches at the chunk borders.
MC-224894 already describes how to trigger this lighting issue alone. However, MC-214793 mentions corrupted features in conjunction with this. I am not sure what is happening here, but I think this is a combination of the pure lighting issue and the previous section (progress being reverted). For example, one might argue that chunks which do make it to the LIGHT stage are kept from unloading a bit longer than their neighbors (which already finish after the FEATURES stage), due to the additional Future that is waited upon by the unloading code. Hence, these ChunkHolders might still be alive while their neighbors are already saved and unloaded, and are hence susceptible to the previous issue, which then regenerates these chunks completely, erasing all features that were leaking in from neighbors. However, as the light data is stored externally in the lighting engine, it will stay broken and not be regenerated. The neighbors were already saved to disk and will hence not need to regenerate, so they keep their features, causing corruption at the boundary.
Anyway, this is just a wild guess and probably not the whole story. But I think it's not too important to know precisely what's going on here. Debugging this is a real nightmare due to the rather random nature of everything involved and light tickets interacting with the whole argument.
Conclusion
The actual bug should be reasonably easy to understand and fix, e.g., by reverting to the 1.16 behavior. On the other hand, there might be good reasons for this change, so some more sophisticated solution might be necessary.
In any case, I hope these explanations have made clear the connections to the other 3 bugs and improved understanding of the general concurrency issues.
Best,
PhiPro
Comments


Fixed in 21w18a?

Whoops, of course 1.16.5 is not affected. Guess all my other reports affected older versions as well, so I clicked it kinda automatically.

The implemented fix for this issue did not revert to the 1.16 code, but instead just dropped the abort part altogether:
net.minecraft.server.world.ChunkHolder.java
protected void tick(ThreadedAnvilChunkStorage chunkStorage) {
    ...
    Either<Chunk, ChunkHolder.Unloaded> either = Either.right(new ChunkHolder.Unloaded() {...});
    for (int i = bl2 ? chunkStatus2.getIndex() + 1 : 0; i <= chunkStatus.getIndex(); ++i) {
        completableFuture = (CompletableFuture)this.futuresByStatus.get(i);
        if (completableFuture == null) {
            this.futuresByStatus.set(i, CompletableFuture.completedFuture(either));
        }
    }
    ...
}
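The remaining hole — a pending future that is already doomed through a dependency — can be demonstrated with two chained CompletableFutures (again a standalone sketch with invented names, not game code):

```java
import java.util.concurrent.CompletableFuture;

public class IndirectAbortSketch {
    static String demo() {
        // A dependency, standing in for a getRegion(...) call on the
        // chunk's neighborhood.
        CompletableFuture<String> region = new CompletableFuture<>();
        // The generation future builds on it; with the 21w18a fix, nothing
        // ever completes it directly anymore.
        CompletableFuture<String> generation = region.thenApply(
                chunks -> chunks.equals("UNLOADED") ? "UNLOADED_CHUNK" : "CHUNK");

        // The server thread observes a pending future and (now wrongly)
        // concludes it will eventually hold a valid chunk:
        assert generation.getNow(null) == null;

        // But the ticket level dropped at some point in the past, so the
        // dependency resolves to UNLOADED, e.g. on another thread:
        region.complete("UNLOADED");
        return generation.join();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints UNLOADED_CHUNK
    }
}
```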
Unfortunately, this breaks property 2 above. (Guess I was a bit too sloppy when sketching the argument for why it holds in 1.16, and I honestly didn't spot the issue either when briefly looking at the implemented solution.)
The problem is that there are still some mechanisms that can indirectly abort the Futures, namely ThreadedAnvilChunkStorage.getRegion(...), which returns an UNLOADED_CHUNK if any of the requested chunks has a ticket level below EMPTY (note that the chunk does not need to be completely unloaded for the check to fail, as the code only checks the live chunk holder map, but does not attempt to revoke any chunk), or ThreadedAnvilChunkStorage.convertToFullChunk(...), which does abort if the chunk has a ticket level below FULL.
This can lead to generation Futures that are still pending (for example when looking at them for creating the ticking-Future) but are already indirectly aborted through their dependencies (because the ticket level dropped too low at some point in the past), while this information might not yet have been propagated, e.g., because it needs to pass through other threads. This can hence break property 2.
Note that in 1.16 this does not happen, since Futures are directly aborted when the ticket level drops too low. So one can conclude from observing a still pending generation Future that the ticket level did not drop (below the respective status) since creation of the Future, and then conclude from this that the Future is also not indirectly aborted. In 21w18a this argument does not work, since Futures are not directly aborted and hence do not allow any conclusion about past ticket levels.
As a consequence, symptom 1 (ticking-Futures might never complete) can still occur (whereas the other 2 symptoms were caused by failure of property 1, which is indeed fixed). This was indeed reported in MC-224986.
In the following I will present some very artificial steps to reproduce the issue deterministically. (Unfortunately, these steps have some chance of failure, since there is no way to prohibit the server thread from looking at chunks around the player (even in spectator mode), so these steps will unintentionally pause the server thread even when only explicitly pausing the worldgen thread. This leads to a timing issue which can cause the steps to fail. Nevertheless, they worked quite reliably for me.)
1. Create a new world (these steps will only work when generating new chunks, not when reloading them).
2. We need a bunch of breakpoints in order to force some bad timing between the worldgen and server threads. It is important that the breakpoints are configured to only pause the hitting thread, and not all threads. Concretely, we will place these in the following method:
net.minecraft.server.world.ThreadedAnvilChunkStorage.java
private CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> upgradeChunk(ChunkHolder holder, ChunkStatus requiredStatus) {
    ChunkPos chunkPos = holder.getPos();
    CompletableFuture<Either<List<Chunk>, ChunkHolder.Unloaded>> completableFuture = this.getRegion(chunkPos, requiredStatus.getTaskMargin(), (i) -> {
        return this.getRequiredStatusForGeneration(requiredStatus, i);
    });
    ...
    Executor executor = (runnable) -> this.worldGenExecutor.send(ChunkTaskPrioritySystem.createMessage(holder, runnable));
    return completableFuture.thenComposeAsync((either) -> {
        return (CompletableFuture)either.map((list) -> {
            try {
                CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> completableFuture = requiredStatus.runGenerationTask(executor, this.world, this.chunkGenerator, this.structureManager, this.serverLightingProvider, (chunk) -> this.convertToFullChunk(holder), list);
                this.worldGenerationProgressListener.setChunkStatus(chunkPos, requiredStatus);
                return completableFuture;
            } catch (Exception var9) {
                ...
            }
        }, (unloaded) -> {
            this.releaseLightTicket(chunkPos);
            return CompletableFuture.completedFuture(Either.right(unloaded));
        });
    }, executor);
}
3. After generating the world, place a breakpoint on the requiredStatus.runGenerationTask(...) line. This will eventually pause the worldgen thread and prohibit chunks from finishing generation.
4. /tp ~1000 ~ ~. This should now trigger the breakpoint on the worldgen thread. Chunks are now frozen in some early generation stage.
5. /tp ~-1000 ~ ~ while the worldgen thread is still paused. Leaving the region will eventually cause the ThreadedAnvilChunkStorage.getRegion(...) calls to produce UNLOADED_CHUNKs.
6. Next, move the breakpoint to the this.getRegion(...) line (removing the old one) and continue execution. This should trigger the breakpoint on the server thread. Chunks can now finish their current execution step and are then all waiting on getRegion(...) for their next step.
7. Move the breakpoint a last time to the this.releaseLightTicket(...) line (again removing the old one) and continue execution. The breakpoint will now trigger again on the worldgen thread. (This step may fail if one has bad luck, see the remark in the beginning. Just repeat the whole process in this case.) All the pending getRegion(...) calls will now produce UNLOADED_CHUNK results, but this information is not propagated due to the worldgen thread being paused.
8. /tp ~1000 ~ ~. This will now recreate the ticking-Futures in the target region, using the still pending, but already indirectly aborted generation Futures.
9. Finally, remove the breakpoint and continue execution. All the Futures, including the ticking-Futures, will now be completed with an UNLOADED_CHUNK (verify this for example by looking at the chunk holder map), hence triggering the chunk loading issue (symptom 1).
This problem shows that it is necessary to keep track of the history of the ticket level, either by directly storing that information in the Future by aborting it like 1.16 does, or by using some other external tracking, which then needs to extend still pending Futures with some retry mechanism upon increasing ticket level (or something in that direction).
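The external-tracking variant could look roughly like the following. This is purely a hypothetical sketch of the suggested direction; withRetry, the level check and the reschedule supplier are all invented names, not proposals for actual method signatures:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class RetrySketch {
    // Hypothetical helper: if the stored future surfaces an abort marker
    // but the ticket level has recovered in the meantime, reschedule
    // generation instead of handing the marker to the caller.
    static CompletableFuture<String> withRetry(
            CompletableFuture<String> stored,
            BooleanSupplier levelStillSufficient,
            Supplier<CompletableFuture<String>> reschedule) {
        return stored.thenCompose(result -> {
            if (result.equals("UNLOADED_CHUNK") && levelStillSufficient.getAsBoolean()) {
                return reschedule.get(); // ticket level recovered: retry
            }
            return CompletableFuture.completedFuture(result);
        });
    }

    public static void main(String[] args) {
        CompletableFuture<String> aborted =
                CompletableFuture.completedFuture("UNLOADED_CHUNK");
        String result = withRetry(aborted, () -> true,
                () -> CompletableFuture.completedFuture("CHUNK")).join();
        System.out.println(result); // prints CHUNK
    }
}
```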
Best,
PhiPro

This issue actually still exists in 1.17-pre3 to some extent. It took me quite a while to find out what actually changed in pre3 at all, but as far as I can tell, the relevant change is that the getRegion(...) calls are now done eagerly instead of lazily, or more precisely, ThreadedAnvilChunkStorage.getChunk(...) executes immediately instead of waiting for the previous worldgen stage to finish. As a consequence, they should indeed no longer be able to produce UNLOADED_CHUNK results, by construction of the ticket system, hence solving the example in my previous comment.
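The eager-versus-lazy distinction can be illustrated with a toy ticket level (all names and numbers here are made up; in the real ticket system, smaller numbers mean higher levels, which the sketch mimics):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class EagerVsLazySketch {
    // Smaller numbers mean higher (more important) ticket levels; 33 is
    // treated as "high enough" here, purely for illustration.
    static final AtomicInteger ticketLevel = new AtomicInteger(33);

    // Stand-in for the getRegion(...)/getChunk(...) dependency lookup.
    static String checkRegion() {
        return ticketLevel.get() <= 33 ? "CHUNKS" : "UNLOADED";
    }

    // Lazy (before 1.17-pre3): the lookup runs only after the previous
    // stage finishes, so it can observe a ticket level dropped in-between.
    static String lazyDemo() {
        ticketLevel.set(33);
        CompletableFuture<String> prev = new CompletableFuture<>();
        CompletableFuture<String> next = prev.thenApply(r -> checkRegion());
        ticketLevel.set(45); // level drops before the previous stage is done
        prev.complete("done");
        return next.join();
    }

    // Eager (1.17-pre3): the lookup runs at scheduling time, so a later
    // drop can no longer turn this stage into an UNLOADED result.
    static String eagerDemo() {
        ticketLevel.set(33);
        CompletableFuture<String> prev = new CompletableFuture<>();
        String region = checkRegion();
        CompletableFuture<String> next = prev.thenApply(r -> region);
        ticketLevel.set(45);
        prev.complete("done");
        return next.join();
    }

    public static void main(String[] args) {
        System.out.println(lazyDemo());  // prints UNLOADED
        System.out.println(eagerDemo()); // prints CHUNKS
    }
}
```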
As a side remark: this change might have some negative impact on performance, as it gets increasingly difficult to abort already scheduled worldgen steps. Lazy evaluation of the getRegion(...) calls (or rather of the getChunk(...) code) was at least able to abort all but the lowest pending worldgen stages (which still misses a bunch of steps that could possibly be aborted (MC-183841)), whereas with eager evaluation no stages at all are aborted. I haven't actually looked into actual numbers to tell whether or not this is relevant. Might be a good idea to investigate.
Anyway, back to the original issue. There is still one place that can produce UNLOADED_CHUNK results when generating new chunks, namely ThreadedAnvilChunkStorage.convertToFullChunk(...). The following steps show how this can be employed to prevent a single chunk from loading to the client:
1. Create a new world (these steps will only work when generating new chunks, not when reloading them).
2. Set a breakpoint on the return completableFuture; line inside ThreadedAnvilChunkStorage.upgradeChunk(...) (see the code snippet in the previous comment). Configure this breakpoint to only pause the hitting thread and only trigger on the condition requiredStatus == ChunkStatus.FULL.
3. Run the following two commands in chained command blocks, so that they execute in the same tick:
/tp @p ~1000 ~ ~
/tp @p ~ ~ ~
The first of these commands will trigger chunk loading through the PLAYER (and POST_TELEPORT) ticket. The second will teleport the player back before it tries to access the chunk, which would result in the server thread getting paused (see the timing issue of the previous comment).
4. This will now trigger the breakpoint on the worldgen thread. The returned completableFuture corresponds to the FULL stage due to the breakpoint condition. Since the server thread was not paused, the corresponding call to convertToFullChunk(...) could already finish and produce an UNLOADED_CHUNK result, as the player was already teleported away, so the ticket level of the chunk is below FULL. This can be verified by inspecting the returned completableFuture. (This step might fail due to the POST_TELEPORT ticket, which can keep the ticket level of some of the chunks at FULL. In this case simply continue until hitting a breakpoint where the completableFuture contains an UNLOADED_CHUNK.)
5. /tp @p ~1000 ~ ~. The UNLOADED_CHUNK result is not propagated since it is still stuck in the paused worldgen thread. Teleporting back to the area creates the ticking-Future, which then depends on the still pending Future for the FULL stage and will hence eventually produce an UNLOADED_CHUNK. As in the previous examples, this will prevent the chunk from loading to the client.
6. Remove the breakpoint and continue execution. The chunk where we forced the UNLOADED_CHUNK result will not load to the client, but all the others do.
Best,
PhiPro


Can confirm in 1.19.3 and 23w03a
How is 1.16.5 affected? All the related issues are new to 1.17.