This issue actually still exists in 1.17-pre3 to some extent. It took me quite a while to find out what actually changed in pre3 at all, but as far as I can tell, the relevant change is that the getRegion(...) calls are now done eagerly instead of lazily, or more precisely, ThreadedAnvilChunkStorage.getChunk(...) executes immediately instead of waiting for the previous worldgen stage to finish. As a consequence, they should indeed no longer be able to produce UNLOADED_CHUNK results by construction of the ticket system, hence solving the example in my previous comment.
As a side remark: This change might have some negative impact on performance, as it gets increasingly difficult to abort already scheduled worldgen steps. Lazy evaluation of the getRegion(...) calls (or rather of the getChunk(...) code) was at least able to abort all but the lowest pending worldgen stages (which still misses a bunch of steps that could possibly be aborted (MC-183841)), whereas with eager evaluation no stages at all are aborted. I haven't looked at actual numbers to tell whether or not this is relevant. Might be a good idea to investigate.
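To illustrate the scheduling difference, here is a minimal, generic CompletableFuture sketch (made-up names and a made-up abort flag, not the actual Minecraft code): with lazy composition the follow-up stage is only handed to the executor once its prerequisite has finished, so a check in between can still drop it, whereas with eager scheduling both tasks sit on the queue right away and there is no natural point left at which the second one could be skipped.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Generic sketch (made-up names, not the actual Minecraft code) contrasting the
// two scheduling strategies.
public class LazyVsEagerSketch {
    public static void main(String[] args) {
        ExecutorService worldgen = Executors.newSingleThreadExecutor();
        AtomicBoolean aborted = new AtomicBoolean(true); // e.g. the ticket was removed in the meantime

        // Lazy: "stage B" is never submitted, because the abort flag is checked
        // after "stage A" completed and before scheduling the next stage.
        CompletableFuture<Void> lazy = CompletableFuture
                .runAsync(() -> System.out.println("stage A"), worldgen)
                .thenComposeAsync(ignored -> aborted.get()
                                ? CompletableFuture.completedFuture((Void) null) // aborted, stage B skipped
                                : CompletableFuture.runAsync(() -> System.out.println("stage B"), worldgen),
                        worldgen);
        lazy.join();

        // Eager: both stages are put onto the queue immediately; once queued,
        // "stage B" can no longer be dropped this way.
        worldgen.execute(() -> System.out.println("stage A"));
        worldgen.execute(() -> System.out.println("stage B"));

        worldgen.shutdown();
    }
}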
Anyway, back to the original issue. There is still one place that can produce UNLOADED_CHUNK results when generating new chunks, namely ThreadedAnvilChunkStorage.convertToFullChunk(...). The following steps show how this can be employed to prevent a single chunk from loading to the client:
Create a new world (these steps will only work when generating new chunks, not when reloading them)
Set a breakpoint in the return completableFuture; line inside ThreadedAnvilChunkStorage.upgradeChunk(...) (see the code snippet in the previous comment). Configure this breakpoint to only pause the hitting thread and only trigger it on the condition requiredStatus == ChunkStatus.FULL.
Run the following two commands in chained command blocks, so that they execute in the same tick
/tp @p ~1000 ~ ~
/tp @p ~ ~ ~
The first of these commands will trigger chunk loading through the PLAYER (and POST_TELEPORT) ticket. The second will teleport the player back before it tries to access the chunk, which would result in the server thread getting paused (see the timing issue of the previous comment).
This will now trigger the breakpoint on the worldgen thread. The returned completableFuture corresponds to the FULL stage due to the breakpoint condition. Since the server thread was not paused, the corresponding call to convertToFullChunk(...) could already finish and produce an UNLOADED_CHUNK result, as the player was already teleported away, so the ticket level of the chunk is below FULL. This can be verified by inspecting the returned completableFuture. (This step might fail due to the POST_TELEPORT ticket, which can keep the ticket level of some of the chunks at FULL. In this case simply continue until hitting a breakpoint where the completableFuture contains an UNLOADED_CHUNK.)
/tp @p ~1000 ~ ~
The UNLOADED_CHUNK result is not propagated since it is still stuck in the paused worldgen thread. Teleporting back to the area creates the ticking-Future, which then depends on the still pending Future for the FULL stage and will hence eventually produce an UNLOADED_CHUNK. As in the previous examples, this will prevent the chunk from loading to the client.
Remove the breakpoint and continue execution. The chunk where we forced the UNLOADED_CHUNK result will not load to the client, but all the others do.
Best,
PhiPro
Well, removal of the PLAYER tickets involves some threading, so it's likely that they actually don't get removed entirely before saving (so this effect is probably slightly random). So I would guess that this is actually caused by the PLAYER tickets that are not cleared before saving. Anyway, the general issue stays the same: chunks not getting saved due to not being unloaded because of some tickets.
I'll maybe add this case to the original report later (it's probably the most dominant case).
@@unknown I added a third example to the original report which should explain your case. As I said, the list is certainly not exhaustive.
I guess you did your tests with a render distance of 2? Just out of interest, does it also happen on higher render distances, e.g., 20? (away from spawn to exclude problem 1)
All these instances seem to be near worldspawn, so this might be caused by MC-224729 (see section "1. Spawn chunks"). Can you please answer the following questions to check this claim:
Was your render distance below 11?
Did you save and quit the world once before observing the glitch or was it observed immediately?
See this comment on MC-224893.
Thanks. It's consistent with my explanation then (which only really works for newly generated chunks).
The implemented fix for this issue did not revert to the 1.16 code, but instead just dropped the abortion part altogether:
net.minecraft.server.world.ChunkHolder.java
protected void tick(ThreadedAnvilChunkStorage chunkStorage) {
    ...
    Either<Chunk, ChunkHolder.Unloaded> either = Either.right(new ChunkHolder.Unloaded() {...});

    // Only missing futures are filled with an Unloaded result; already existing
    // (possibly still pending) futures are no longer aborted.
    for (int i = bl2 ? chunkStatus2.getIndex() + 1 : 0; i <= chunkStatus.getIndex(); ++i) {
        completableFuture = (CompletableFuture)this.futuresByStatus.get(i);
        if (completableFuture == null) {
            this.futuresByStatus.set(i, CompletableFuture.completedFuture(either));
        }
    }
    ...
}
Unfortunately, this breaks property 2 above. (Guess I was a bit too sloppy when sketching the argument why it holds for 1.16 and I honestly didn't spot the issue either when briefly looking at the implemented solution).
The problem is that there are still some mechanisms that can indirectly abort the Futures, namely ThreadedAnvilChunkStorage.getRegion(...), which returns an UNLOADED_CHUNK if any of the requested chunks has a ticket level below EMPTY (note that the chunk does not need to be completely unloaded for the check to fail, as the code only checks the live chunk holder map, but does not attempt to revoke any chunk), or ThreadedAnvilChunkStorage.convertToFullChunk(...), which aborts if the chunk has a ticket level below FULL.
This can lead to generation Futures that are still pending (for example when looking at them for creating the ticking-Future) but are already indirectly aborted through their dependencies (because the ticket level dropped too low at some point in the past), and this information might not yet have been propagated, e.g., because it needs to pass through other threads. This can hence break property 2.
Note that in 1.16 this does not happen, since Futures are directly aborted when the ticket level drops too low, so one can conclude from observing a still pending generation Future that the ticket level did not drop (below the respective status) since creation of the Future, and then conclude from this that the Future is also not indirectly aborted. In 21w18a this argument does not work, since Futures are not directly aborted and hence do not allow any conclusion about past ticket levels.
As a consequence, symptom 1 (ticking-Futures might never complete) can still occur (whereas the other 2 symptoms were caused by failure of property 1, which is indeed fixed). This was indeed reported in MC-224986.
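The following is a minimal, self-contained Java sketch of this "pending but already indirectly aborted" situation (generic CompletableFuture code with made-up names, not the actual game code): the dependency already carries the aborted result, but the dependent future still reports itself as pending, because its completion has to pass through another (here: blocked) thread first.

import java.util.concurrent.*;

// Generic sketch (made-up names, not Minecraft code): a dependency that is
// already "aborted" while the dependent future still reports isDone() == false,
// because the completion has to pass through another executor first.
public class IndirectAbortSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService workerThread = Executors.newSingleThreadExecutor();
        CountDownLatch workerPaused = new CountDownLatch(1);
        // Simulate a paused worldgen thread: its single thread is busy and
        // cannot process the completion of the dependency yet.
        workerThread.execute(() -> {
            try { workerPaused.await(); } catch (InterruptedException ignored) {}
        });

        CompletableFuture<String> dependency = new CompletableFuture<>();
        CompletableFuture<String> dependent =
                dependency.thenApplyAsync(result -> result, workerThread);

        // The dependency is aborted ("UNLOADED_CHUNK" in the real code), ...
        dependency.complete("UNLOADED_CHUNK");
        // ... but the dependent future is still pending, because the worker
        // thread has not propagated the result yet.
        System.out.println(dependent.isDone()); // false

        workerPaused.countDown();            // "unpause the worldgen thread"
        System.out.println(dependent.get()); // UNLOADED_CHUNK
        workerThread.shutdown();
    }
}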
In the following I will present some very artificial steps to reproduce the issue deterministically. (Unfortunately, these steps do have some chance for failure since there is no way to prohibit the server thread from looking at chunks around the player (even with spectator mode), so that these steps will unintentionally pause the server thread even when only explicitly pausing the worldgen thread. This leads to some timing issue which can cause these steps to fail. Nevertheless, they worked quite reliably for me.)
Create a new world (these steps will only work when generating new chunks, not when reloading them)
We need a bunch of breakpoints in order to force some bad timing between worldgen and server thread. It is important that the breakpoints are configured to only pause the hitting thread, and not all threads. Concretely, we will place these in the following method:
net.minecraft.server.world.ThreadedAnvilChunkStorage.java
private CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> upgradeChunk(ChunkHolder holder, ChunkStatus requiredStatus) {
    ChunkPos chunkPos = holder.getPos();
    CompletableFuture<Either<List<Chunk>, ChunkHolder.Unloaded>> completableFuture = this.getRegion(chunkPos, requiredStatus.getTaskMargin(), (i) -> {
        return this.getRequiredStatusForGeneration(requiredStatus, i);
    });
    ...
    Executor executor = (runnable) -> this.worldGenExecutor.send(ChunkTaskPrioritySystem.createMessage(holder, runnable));

    return completableFuture.thenComposeAsync((either) -> {
        return (CompletableFuture)either.map((list) -> {
            try {
                CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> completableFuture = requiredStatus.runGenerationTask(executor, this.world, this.chunkGenerator, this.structureManager, this.serverLightingProvider, (chunk) -> this.convertToFullChunk(holder), list);
                this.worldGenerationProgressListener.setChunkStatus(chunkPos, requiredStatus);
                return completableFuture;
            } catch (Exception var9) {
                ...
            }
        }, (unloaded) -> {
            this.releaseLightTicket(chunkPos);
            return CompletableFuture.completedFuture(Either.right(unloaded));
        });
    }, executor);
}
After generating the world, place a breakpoint in the requiredStatus.runGenerationTask(...) line. This will eventually pause the worldgen thread and prohibit chunks from finishing generation.
/tp ~1000 ~ ~
This should now trigger the breakpoint on the worldgen thread. Chunks are now frozen in some early generation stage.
/tp ~-1000 ~ ~
Run this while the worldgen thread is still paused. Leaving the region will eventually cause the ThreadedAnvilChunkStorage.getRegion(...) calls to produce UNLOADED_CHUNKs.
Next, move the breakpoint to the this.getRegion(...) line (removing the old one) and continue execution. This should trigger the breakpoint on the server thread. Chunks can now finish their current execution step and are now all waiting for getRegion(...) for their next step.
Move the breakpoint a last time to the this.releaseLightTicket(...) line (again removing the old one) and continue execution. The breakpoint will now trigger again on the worldgen thread (this step may fail if one has bad luck, see the remark in the beginning; just repeat the whole process in this case). All the pending getRegion(...) calls will now produce UNLOADED_CHUNK results, but this information is not propagated due to the worldgen thread being paused.
/tp ~1000 ~ ~
This will now recreate the ticking-Futures in the target region, using the still pending, but already indirectly aborted, generation Futures.
Finally, remove the breakpoint and continue execution. All the Futures, including the ticking-Futures, will now be completed with an UNLOADED_CHUNK (verify this for example by looking at the chunk holder map), hence triggering the chunk loading issue (symptom 1).
This problem shows that it is necessary to keep track of the history of the ticket level, either by directly storing that information in the Future by aborting it, like 1.16 does, or by using some other external tracking, which then needs to extend still pending Futures with some retry mechanism upon increasing ticket level (or something in that direction).
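Purely as an illustration of the second direction (external tracking plus retry), here is a rough, generic sketch; all names are made up and this is neither Minecraft nor mod code, just the shape of the idea.

import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Rough sketch: remember whether the ticket level ever dropped below the
// required status since the generation future was created; if so, do not reuse
// the possibly indirectly aborted future, but schedule the work anew once the
// level is sufficient again.
class TrackedGenerationFuture<T> {
    private CompletableFuture<T> future;
    private boolean droppedBelowRequiredStatus = false;

    TrackedGenerationFuture(CompletableFuture<T> future) {
        this.future = future;
    }

    // Called by the ticket level tracking whenever the level falls below the
    // level required for the status this future belongs to.
    void onLevelDroppedBelowRequiredStatus() {
        droppedBelowRequiredStatus = true;
    }

    // Called when something (e.g. the ticking future) wants to depend on the
    // generation result; 'reschedule' restarts the generation task.
    CompletableFuture<T> getOrRetry(boolean levelCurrentlySufficient,
                                    Supplier<CompletableFuture<T>> reschedule) {
        if (!droppedBelowRequiredStatus) {
            return future; // level never dropped: the pending future is trustworthy
        }
        if (levelCurrentlySufficient) {
            future = reschedule.get(); // possibly aborted in the past: retry
            droppedBelowRequiredStatus = false;
            return future;
        }
        return CompletableFuture.completedFuture(null); // stands in for UNLOADED_CHUNK
    }
}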
Best,
PhiPro
Whoops, ofc 1.16.5 is not affected. Guess all my other reports affected older versions as well, so I clicked it kinda automatically.
Also a question from my side: Do you know if these chunks were previously generated and just didn't reload? Or were they freshly explored?
Not sure if this is caused by MC-224729. Can you please answer the following questions:
Was this world generated with 21w18a or is this an old world?
Did you log out of the world between world creation and observing this glitch?
Was the picture taken anywhere near the spawn chunks? (11 chunks away from worldspawn, to be more precise)
The symptoms are exactly those described by MC-162253, namely lag spikes when crossing chunk borders in the direction of worldspawn (or chunks loaded by any other means), together with large height differences near those chunks required to trigger the lighting issue. The boundaries where the lag spikes are observed should be render distance away from any chunks in the spawn area (that have sufficient height differences to their neighbors).
You can try increasing the render distance and see if the problematic boundaries move further away from the spawn chunks. In that case this would most certainly be a duplicate of MC-162253.
I posted the second comment as MC-199966.
I would prefer to leave the first comment as is. It is more of a theoretical input and the glitches caused by it would be rather sporadic, so a reproducible setup is very hard to find. So I think for the moment it fits best here, as it falls under the general issue that the pre_light (or features) and light stages aren't properly distinguished.
In case this report gets marked as fixed without fixing the comment, I will repost it as a separate report.
There is another related issue regarding chunks before the light stage. Upon loading from disk, lightmaps are only loaded for chunks that were already lighted and discarded otherwise, although they are always saved to disk.
net.minecraft.world.ChunkSerializer.java
public static ProtoChunk deserialize(...) {
    ...
    boolean bl = compoundTag.getBoolean("isLightOn");
    ...
    if (bl) {
        if (compoundTag2.contains("BlockLight", 7)) {
            lightingProvider.enqueueSectionData(LightType.BLOCK, ChunkSectionPos.from(pos, k), new ChunkNibbleArray(compoundTag2.getByteArray("BlockLight")), true);
        }

        if (bl2 && compoundTag2.contains("SkyLight", 7)) {
            lightingProvider.enqueueSectionData(LightType.SKY, ChunkSectionPos.from(pos, k), new ChunkNibbleArray(compoundTag2.getByteArray("SkyLight")), true);
        }
    }
    ...
}
This basically erases any light propagations to chunks in pre_light stage when unloading them before the light stage and hence causes lighting glitches.
The Vanilla code currently uses this mechanism to erase cached data, which only removes the isLightOn field but not the lightmaps themselves.
It turns out that the workaround for the easy part 1 of this issue that I described in this comment does not work. This is because Vanilla schedules light checks to the POST_UPDATE phase, so scheduling lightmap removal to the POST_UPDATE phase as well does not guarantee that all pending light updates have been carried out before the lightmaps are removed. Also, light checks cannot be moved to the PRE_LIGHT phase, as they would otherwise be carried out before the relevant lightmaps are created, which would again cause issues.
Fortunately, this part of the issue is automatically solved by my proposed solution for MC-196725. Note that the argument given there at the end regarding the interaction of skylight optimization and unloaded chunks also shows that my proposed solution mitigates this easy part of the issue here.
One important point in the argument was that non-optimizable subchunks are always locally correct. This does not hold in the current Vanilla code, precisely because of the issue discussed here, i.e., non-trivial lightmaps getting removed, hence changing values without informing neighbors. This is fixed by my proposed solution for MC-196725 which only removes lightmaps once they are trivial.
Also note that it is not an issue that the distance tracking of non-empty subchunks gets updated before all pending light updates are carried out. The argument only requires that if in the end state there is a non-air block then the surrounding subchunks must not be optimizable. Hence it is sufficient to mark a subchunk as empty after all blocks have been removed, but possibly before the corresponding light updates have been fully processed. Which is what the Vanilla code currently does.
Hence no extra treatment for the easy part of this issue is required once implementing my solution for MC-196725.
In fact, the argument given in MC-196725 can be greatly simplified for what we need here, since we do not need to deal with unloaded chunks for this matter. Hence the region that was constructed in the argument is simply (-1 -1 -1)..(1 1 1) in this case, so the only interesting parts of the boundary are the top faces. And for those we didn't even use that the subchunks at y=-1 were empty, hence the more efficient specification of the skylight optimization, as briefly mentioned at the end of MC-196725, would be sufficient here.
On a further note, the skylight optimization code currently queries in each step whether the currently processed neighbor position has an associated lightmap.
net.minecraft.world.chunk.light.ChunkSkyLightProvider.java
while (true) {
    long u = BlockPos.add(id, direction.getOffsetX(), -t, direction.getOffsetZ());
    long v = ChunkSectionPos.fromBlockPos(u);
    if (l == v) {
        this.propagateLevel(id, u, level, decrease);
        break;
    }

    if (((SkyLightStorage)this.lightStorage).hasSection(v)) {
        this.propagateLevel(id, u, level, decrease);
    }

    ++t;
    if (t > o * 16) {
        break;
    }
}
If the answer is negative, then the result won't even be cached by the currently employed cache in ChunkToNibbleArrayMap, resulting in 16 HashMap lookups per subchunk.
This can be optimized by first iterating over the subchunks and checking only once if the neighbor has an associated lightmap, and then iterating over the 16 blocks of the subchunk.
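As a generic illustration of this restructuring (made-up names, not the actual Minecraft code), the per-section lookup can be hoisted out of the per-block loop roughly like this:

import java.util.HashSet;
import java.util.Set;

// Generic sketch: query the per-section condition once per subchunk and only
// then visit the 16 block positions inside it, instead of performing one
// lookup per block position as the current code does.
class SkylightColumnSketch {
    private final Set<Long> sectionsWithLightmap = new HashSet<>();

    void propagateDownColumn(long topSection, int sectionCount) {
        for (int i = 0; i < sectionCount; ++i) {
            long section = topSection - i; // stand-in for "the section i below"
            if (!sectionsWithLightmap.contains(section)) {
                continue; // one lookup per section instead of 16
            }
            for (int y = 15; y >= 0; --y) {
                propagateInto(section, y); // visit the 16 blocks of the subchunk
            }
        }
    }

    private void propagateInto(long section, int y) {
        // stand-in for this.propagateLevel(...) on the corresponding block position
    }
}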
Looks like I somehow managed to overlook some important piece of code... The ChunkTaskPrioritySystem actually does wait for tasks to complete on the dedicated queues before scheduling new ones, instead of scheduling everything at once. So the issue is invalid. Sorry about that.
Nevertheless, I propose the solution explained in the report as a simpler, more efficient and less obscure implementation. So I leave the report open for now.
@@unknown Yes, I am aware of this. I need to get in contact with her and see what we can work out.
On another note, please post any issues regarding my bugfix mod on the Github bugtracker, if possible, in order to keep the discussion here on-topic.
Info to everyone using my mod as a temporary workaround: There is a (vanilla) bug, see MC-196614, that caused basically all underground subchunks to become fully bright up to and including version mc-170010-v1.2.1. This issue is solved in https://github.com/PhiPro95/mc-fixes/tree/mc-196725.
A complete bundle of all required mods, including the fix for this issue, can be found here.
In order to repair already affected chunks you need to erase cached data, see MC-142134 for further instructions.
Sorry for the inconvenience.
Best,
PhiPro
I have implemented my proposed solution at https://github.com/PhiPro95/mc-fixes/tree/mc-196725.
There is one important remark I want to put here. There are two places in the code that manually change light values: initial lighting, which sets the topmost sections up to the first non-empty section to a skylight value of 15, and unloading chunks, which sets all light to 0. It is important that one first removes all pending light updates for the affected sections before changing any light value. Otherwise the light propagator will lose track of the pending updates and not remove them properly later on, leaving behind some orphaned updates causing major issues.
For the current vanilla code this simply means to remove, individually for each section, the pending updates before changing that section (which is indeed done in the vanilla code), because light updates can only be present for sections that have an associated lightmap, so changing the lightmap of one section does not interfere with any other section that contains scheduled light updates. With my proposed solution this is no longer the case: Light updates can also be scheduled for sections without an associated lightmap, as long as there are nearby blocks, in case of skylight. Changing light values in one section can now lead to a change of light values in other sections, by the way sections without lightmaps are handled, and these other sections may contain pending light updates.
Because of this one needs to be careful to first remove all affected pending light updates before changing any lightmaps or changing the skylight source values.
Also note that upon removing a trivial lightmap one must not remove pending light updates if there are still nearby blocks, i.e., as long as the skylight optimization is still not applicable for that section, as otherwise of course some light updates get lost. But conversely, if skylight optimization does become applicable after removing a trivial lightmap, i.e., there are no nearby blocks, then one should remove pending updates. Because in this case the code will no longer create a lightmap when the inherited light value would change by a light update above, as opposed to the case for non-skylight-optimizable sections. Hence we would have changes in light values that are not directly set by the level propagator, causing the aforementioned issues. So one needs to remove all pending updates once a section becomes skylight optimizable and prevent any further propagations into such a section (which is already done in the vanilla code).
Similarly, if the distance tracking from non-air blocks changes and a section gets far enough away from blocks to become eligible for skylight optimization, then pending light updates must be removed if and only if the section does not have an associated lightmap, i.e., if it indeed becomes skylight optimizable.
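As a minimal, generic illustration of the ordering constraint from the remark above (made-up types, not the actual mod code): drop every pending update that targets one of the affected sections first, and only then overwrite the stored light values.

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Generic sketch: before manually overwriting light values of a set of
// sections, first remove every pending update that touches one of them, so the
// level propagator never encounters a value it did not set itself.
class ManualLightChangeSketch {
    record PendingUpdate(long sectionPos, long blockPos) {}

    private final Set<PendingUpdate> pendingUpdates = new HashSet<>();

    void setInitialSkylight(Set<Long> affectedSections) {
        // Step 1: remove all pending updates that touch the affected sections.
        for (Iterator<PendingUpdate> it = pendingUpdates.iterator(); it.hasNext(); ) {
            if (affectedSections.contains(it.next().sectionPos())) {
                it.remove();
            }
        }
        // Step 2: only now overwrite the stored light values (e.g. fill the
        // topmost sections with skylight 15, or zero everything on unload).
        for (long section : affectedSections) {
            fillSection(section, 15);
        }
    }

    private void fillSection(long sectionPos, int lightValue) {
        // stand-in for writing the lightmap of the section
    }
}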
@@unknown There is already an easy hackfix by SuperCoder79, simply removing the isAccessible filters: https://github.com/SuperCoder7979/chunksavingfix-fabric
Note though that a proper fix should redesign some aspects of the saving code, in particular because there are some other race conditions (caused by removing chunks from the chunkHolder map before saving them, so that the save(...) method does not necessarily wait for those chunks while they are neither in the chunkHolder map nor in the unloadQueue) which aren't covered by this fix.
Nevertheless, such a redesign is probably out of scope for mods and should be done by Mojang. The simple fix is enough for the deterministic part of the issue and should be good enough as a hackfix for the moment.