In the method which logs "ThreadedAnvilChunkStorage ({}): All chunks are saved", a single element is popped from a Map. The problem is the gap between checking that the Map has an element and actually popping it: if /save-all flush is executed, another thread can pop and remove that element in the meantime, leaving the Map empty and crashing the save thread.
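To make the window concrete, here is a minimal, self-contained sketch of the same check-then-pop pattern (class, field, and key names are mine, not Minecraft's). Run with two threads, it can also reach the "all saved" branch on both threads, which matches the duplicate messages described further down; and it eventually dies with the same java.util.NoSuchElementException as the stacktrace below:

import java.util.concurrent.ConcurrentHashMap;

// Minimal stand-in for the save queue; all names here are illustrative,
// not Minecraft's actual (obfuscated) ones.
public class CheckThenPopRace {
    static final ConcurrentHashMap<Long, byte[]> chunksToSave = new ConcurrentHashMap<>();

    // The racy pattern: check isEmpty(), then pop via a fresh iterator.
    static void writeNextChunk(String name) {
        if (chunksToSave.isEmpty()) {
            // Both threads can reach this branch, hence the occasional
            // duplicate "All chunks are saved" messages shown further down.
            System.out.println(name + ": All chunks are saved");
            return;
        }
        // RACE: the other thread can remove the last entry right here,
        // so next() below can throw java.util.NoSuchElementException.
        Long pos = chunksToSave.keySet().iterator().next();
        chunksToSave.remove(pos);
        // ... the real code would write the chunk to disk here ...
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable saver = () -> {
            String name = Thread.currentThread().getName();
            for (long i = 0; i < 10_000_000L; i++) {
                chunksToSave.put(i, new byte[0]);
                writeNextChunk(name);
            }
        };
        Thread fileIo = new Thread(saver, "File IO Thread");
        Thread serverThread = new Thread(saver, "Server thread");
        fileIo.start();
        serverThread.start();
        fileIo.join();
        serverThread.join();
        // The race window is tiny, so this may need several runs before one
        // thread dies with the NoSuchElementException from the report.
    }
}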
Everything from this point down is from research by @unknown.
Actually encountering this on a real server is very rare (it took me 600 tries with a bot, and I've reproduced it randomly a few other times across many runs of /save-all flush, all to test this bug), but it does eventually happen. See the two attached crash reports, which, oddly, were generated at roughly the same time; I've never seen the watchdog do that with other crashes. I'm pretty sure that the 20.10.34 report has the actual relevant stacktrace, though. Also, here's the very end of the log:
[20:10:25] [Server thread/INFO]: ThreadedAnvilChunkStorage (world): All chunks are saved
[20:10:25] [Server thread/INFO]: ThreadedAnvilChunkStorage (DIM-1): All chunks are saved
[20:10:25] [Server thread/INFO]: ThreadedAnvilChunkStorage (DIM1): All chunks are saved
[20:10:25] [Server thread/INFO]: [TestBot: Saved the world]
[20:10:26] [Server thread/INFO]: [TestBot: 16641 blocks filled]
[20:10:27] [Server thread/INFO]: ThreadedAnvilChunkStorage (world): All chunks are saved
[20:10:27] [Server thread/INFO]: ThreadedAnvilChunkStorage (DIM-1): All chunks are saved
[20:10:27] [Server thread/INFO]: ThreadedAnvilChunkStorage (DIM1): All chunks are saved
[20:10:27] [Server thread/INFO]: [TestBot: Saved the world]
[20:10:28] [Server thread/INFO]: [TestBot: 16641 blocks filled]
[20:10:30] [Server thread/INFO]: ThreadedAnvilChunkStorage (world): All chunks are saved
[20:10:30] [Server thread/INFO]: ThreadedAnvilChunkStorage (DIM-1): All chunks are saved
[20:10:30] [Server thread/INFO]: ThreadedAnvilChunkStorage (DIM1): All chunks are saved
[20:10:30] [Server thread/INFO]: [TestBot: Saved the world]
[20:10:31] [Server thread/INFO]: [TestBot: 16641 blocks filled]
[20:10:31] [Server Watchdog/FATAL]: A single server tick took 60.00 seconds (should be max 0.05)
[20:10:31] [Server Watchdog/FATAL]: Considering it to be crashed, server will forcibly shutdown.
[20:10:32] [Server Watchdog/ERROR]: This crash report has been saved to: C:\pokechu22\testserver_1.11\.\crash-reports\crash-2016-09-08_20.10.32-server.txt
[20:10:32] [Server Shutdown Thread/INFO]: Stopping server
Also of note: occasionally /save-all flush results in multiple "All chunks are saved" messages. It's random and somewhat rare, but I think it happens more often when chunks are being loaded/unloaded than when standing still. These duplicate messages are a second indicator of this concurrency issue. When it happens, it looks like this:
[19:18:05] [Server thread/INFO]: [TestBot: Saved the world]
[19:18:07] [File IO Thread/INFO]: ThreadedAnvilChunkStorage (world): All chunks are saved
[19:18:07] [Server thread/INFO]: ThreadedAnvilChunkStorage (world): All chunks are saved
[19:18:07] [Server thread/INFO]: ThreadedAnvilChunkStorage (DIM-1): All chunks are saved
[19:18:07] [Server thread/INFO]: ThreadedAnvilChunkStorage (DIM1): All chunks are saved
This issue can also result in /save-all flush producing "An unknown error occurred while attempting to perform this command". While the stacktrace is not normally logged in that case, a 1.10.2 server modified to print the stacktrace (the only modification) produced this:
java.util.NoSuchElementException
at java.util.concurrent.ConcurrentHashMap$KeyIterator.next(ConcurrentHashMap.java:3416)
at atj.c(SourceFile:127)
at atj.b(SourceFile:163)
at lr.c(SourceFile:178)
at ls.q(SourceFile:903)
at ay.a(SourceFile:47)
at k.a(k.java:89)
at k.a(k.java:74)
at me.d(SourceFile:825)
at me.a(SourceFile:812)
at im.a(SourceFile:37)
at im.a(SourceFile:9)
at fl$1.run(SourceFile:13)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at h.a(SourceFile:45)
at net.minecraft.server.MinecraftServer.D(SourceFile:598)
at ld.D(SourceFile:338)
at net.minecraft.server.MinecraftServer.C(SourceFile:554)
at net.minecraft.server.MinecraftServer.run(SourceFile:458)
at java.lang.Thread.run(Thread.java:745)
My bot basically just repeated these commands, with the /fill commands there to create more dirty chunks (which might increase the chance of it happening, but also might not):
/fill ~-64 ~-1 ~-64 ~64 ~-1 ~64 minecraft:stained_hardened_clay color=red
/save-all flush
/fill ~-64 ~-1 ~-64 ~64 ~-1 ~64 minecraft:stained_hardened_clay color=blue
/save-all flush
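In case it helps anyone else reproduce this, something like the following console-feeding loop would do the same job (the jar name, the 2-second delay, and the execute @p wrapper to anchor the relative coordinates are assumptions of this sketch; my actual bot ran the commands in-game):

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class SaveAllFlushBot {
    public static void main(String[] args) throws Exception {
        // Launch the dedicated server; the jar name is an assumption.
        ProcessBuilder pb = new ProcessBuilder("java", "-jar", "minecraft_server.jar", "nogui");
        pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process server = pb.start();
        BufferedWriter console = new BufferedWriter(
                new OutputStreamWriter(server.getOutputStream(), StandardCharsets.UTF_8));

        // Console commands take no leading slash; "execute @p ~ ~ ~" runs the
        // fill at the nearest player, who must be online for this to work.
        String[] commands = {
            "execute @p ~ ~ ~ fill ~-64 ~-1 ~-64 ~64 ~-1 ~64 minecraft:stained_hardened_clay color=red",
            "save-all flush",
            "execute @p ~ ~ ~ fill ~-64 ~-1 ~-64 ~64 ~-1 ~64 minecraft:stained_hardened_clay color=blue",
            "save-all flush",
        };

        while (server.isAlive()) {
            for (String cmd : commands) {
                console.write(cmd);
                console.newLine();
                console.flush();
                Thread.sleep(2000); // roughly the cadence visible in the log above
            }
        }
    }
}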
Comments


Confirmed for 16w50a in terms of an error occurring while processing the command (after 24-ish hours of waiting with a bot). I have not yet reproduced a crash. However, since both are caused by this issue (and in my experience, the error while processing the command is the more common of the two), I'm marking it as affected.

Confirmed in 17w06a after a bit over 5 days of non-stop running via a bot. I got "An unknown error occurred while attempting to perform this command"; I still haven't reproduced the crash.

See https://hub.spigotmc.org/jira/browse/SPIGOT-1097
worldserver.flushSave(); is called by CommandSaveAll and results in a synchronous call to the mentioned method.
This can result in the creation of two iterators, which are not thread-safe even when using a ConcurrentHashMap, and can result in at least a NoSuchElementException on .next().
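For reference, here is a minimal sketch of one way to close the window (illustrative names only, not the actual Minecraft or Spigot code): take a single weakly-consistent iterator, let hasNext() gate next(), and use remove()'s return value to decide which thread actually owns the entry.

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AtomicPop {
    private final ConcurrentHashMap<Long, byte[]> chunksToSave = new ConcurrentHashMap<>();

    boolean writeNextChunk() {
        // One iterator, used by one thread: after hasNext() returns true,
        // next() cannot throw, because ConcurrentHashMap iterators pre-fetch.
        Iterator<Map.Entry<Long, byte[]>> it = chunksToSave.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, byte[]> entry = it.next();
            byte[] data = chunksToSave.remove(entry.getKey());
            if (data != null) {
                // This thread won the race for the entry; write it to disk here.
                return true;
            }
            // Another thread removed it first; move on to the next entry.
        }
        return false; // nothing left to save
    }
}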

Note: while in the past this has been really rare, in the past few snapshots I've gotten it within 5 minutes each time. I might just be getting lucky, though.