mojira.dev
MC-71334

Terrible 1.8 server performance even with the watchdog disabled

First off, I should say the 1.7.10 server rant GREAT on my machine. From the moment I updated to the 1.8 server performance has been TERRIBLE. If I leave the tick watchdog at the default of 60 seconds I can't even reliably run the server with nobody connected. That is, if I start the server and nobody even tries to join the watchdog will kill it a few minutes later.

So, now I'm setting set max-tick-time to -1 and the server runs but it can't keep up with normal gameplay. Blocks reaper after being broken and basic things like opening a door or a gate take minutes.

The console logs show the "Can't keep up!" warnings:

[05:53:02] [User Authenticator #1/INFO]: UUID of player bmenrigh is fd931d94-ee87-4b2d-b9c6-71ebe050f387
[05:53:03] [Server thread/INFO]: bmenrigh/24.6.173.186:46162 logged in with entity id 1658 at (-249649.545
29494024, 102.0, 344.93664003493956)
[05:53:03] [Server thread/INFO]: bmenrigh joined the game
[05:55:10] [Server thread/WARN]: bmenrigh moved wrongly!
[05:55:41] [Server thread/WARN]: Can't keep up! Did the system time change, or is the server overloaded? Running 161930ms behind, skipping 3238 tick(s)
[05:55:47] [Server thread/WARN]: Can't keep up! Did the system time change, or is the server overloaded? Running 6331ms behind, skipping 126 tick(s)
[05:56:52] [Server thread/WARN]: Can't keep up! Did the system time change, or is the server overloaded? Running 52925ms behind, skipping 1058 tick(s)
[05:57:05] [Server thread/WARN]: Can't keep up! Did the system time change, or is the server overloaded? Running 13730ms behind, skipping 274 tick(s)
[05:57:17] [Server thread/WARN]: Can't keep up! Did the system time change, or is the server overloaded? Running 8908ms behind, skipping 178 tick(s)
[05:57:52] [Server thread/WARN]: Can't keep up! Did the system time change, or is the server overloaded? Running 22231ms behind, skipping 444 tick(s)

I thought this might be related to the server regenerating ocean to check if it needs to add ocean monuments but there is no ocean near enough for those chunks to get loaded. I also tried flying around the area my player is in to give the server a chance to load and possibly update / convert any old chunks. None of that has had any effect on performance.

I also thought the issue could be related to CPU time spent compressing network data so I've played around with setting the compression threshold from -1 to 255 to 1024 to 4096 to 16384 and none of those values have helped or had any effect on performance.

I'd love to provide more details about where the server is spending all of its time but I think the F3 debug screen is all client-side performance. For what it's worth, the 1.8 client runs great on my machine. FPS is great and the client never lags.

If there is some debugging or tracing mode I can put the server in I'd be happy to do so to gather useful data.

Attachments

Comments 6

I was wondering if the performance change was due to some syscall latency like a change in disk reads or something. I used strace to track system calls and track the time spent in them with the following command:

strace -tt -T -ff -o traces/trace_start java -Xms1280M -Xmx1280M -jar minecraft_server.jar nogui

I then wrote a script to process the results. There is no meaningful difference in the syscall activity between 1.7.10 and 1.8. Here is 1.7.10:

$ cat ~/backup_1.7.10_world/current_world/traces/* | ~/current_world/sort_strace.pl | head -n 10
func: futex; total time: 2135.280352 (82.832%); count: 37192 (6.159%); median time: 0.000081; average time: 0.05741236
func: read; total time: 228.799741 (8.876%); count: 236153 (39.105%); median time: 0.000020; average time: 0.00096886
func: epoll_wait; total time: 181.589652 (7.044%); count: 184 (0.030%); median time: 1.001637; average time: 0.98690028
func: clock_gettime; total time: 23.074748 (0.895%); count: 265244 (43.922%); median time: 0.000018; average time: 0.00008699
func: gettimeofday; total time: 4.432616 (0.172%); count: 18010 (2.982%); median time: 0.000019; average time: 0.00024612
func: lseek; total time: 1.448669 (0.056%); count: 10587 (1.753%); median time: 0.000019; average time: 0.00013683
func: sched_yield; total time: 1.058822 (0.041%); count: 19273 (3.191%); median time: 0.000017; average time: 0.00005494
func: write; total time: 0.418092 (0.016%); count: 8584 (1.421%); median time: 0.000022; average time: 0.00004871
func: open; total time: 0.389138 (0.015%); count: 1260 (0.209%); median time: 0.000025; average time: 0.00030884
func: mprotect; total time: 0.345825 (0.013%); count: 3010 (0.498%); median time: 0.000026; average time: 0.00011489

And here is the syscall activity for 1.8:
$ cat traces/* | ~/current_world/sort_strace.pl | head -n 10
func: futex; total time: 2794.826033 (84.472%); count: 51580 (7.825%); median time: 0.000079; average time: 0.05418430
func: read; total time: 266.182423 (8.045%); count: 232272 (35.237%); median time: 0.000020; average time: 0.00114599
func: epoll_wait; total time: 206.683876 (6.247%); count: 211 (0.032%); median time: 1.002230; average time: 0.97954444
func: clock_gettime; total time: 28.869215 (0.873%); count: 302261 (45.855%); median time: 0.000018; average time: 0.00009551
func: gettimeofday; total time: 4.980315 (0.151%); count: 17446 (2.647%); median time: 0.000019; average time: 0.00028547
func: madvise; total time: 2.402947 (0.073%); count: 1430 (0.217%); median time: 0.000052; average time: 0.00168038
func: sched_yield; total time: 1.280591 (0.039%); count: 30288 (4.595%); median time: 0.000017; average time: 0.00004228
func: open; total time: 1.029462 (0.031%); count: 973 (0.148%); median time: 0.000028; average time: 0.00105803
func: lseek; total time: 0.730414 (0.022%); count: 10620 (1.611%); median time: 0.000018; average time: 0.00006878
func: mprotect; total time: 0.554859 (0.017%); count: 2083 (0.316%); median time: 0.000025; average time: 0.00026637

I ran each server until the world was fully loaded and the CPU usage dropped down to idle levels.

So the performance issue doesn't appear to be the result of disk reads or other OS-dependent activities.

EDIT: Yuck, the formatting got destroyed when I pasted it into Jira. Sorry. I will attach a screenshot.

Using Server start script with nogui fixed this for me.

I always launch with nogui. My script runs:

java -Xms1280M -Xmx1280M -jar minecraft_server.jar nogui

I decided to explore garbage collector options and I've found that using the G1 garbage collector (using the flag -XX:+UseG1GC) makes the problem significantly better.

I think garbage collection is the source of the terrible performance and very long lags. What's probably happening is that the server now requires more CPU and memory than it used to require. When a garbage collection happens it takes so long that the server gets behind. When the garbage collection finishes the server detects that it's behind and tries to "catch up" by skipping ticks. This in turn probably causes another garbage collection and so the server gets stuck in an vicious loop where garbage collections cause the server to do work that causes another garbage collection and so forth. Instead of getting useful work done it spend almost all of its time in garbage collection cycles.

The G1 garbage collector is designed to spread the garbage collection work out and not cause long delays which seems to prevent the vicious garbage collection cycle from killing performance.

With G1 now my server just runs slowly with lag at all times. It is now playable, but just barely.

I doubt it's just one thing that needs fixing. I suspect a lot of effort needs to go into using less memory and probably also less CPU so that garbage collection isn't needed so often and when it does happen, there is enough spare CPU left for the garbage collection to not get the server too behind. Right now I think the server can't even keep up without garbage collection so any time spent doing garbage collection means the server gets behind.

Is this still happening in the latest version (15w43c)?

Ticket resolved as incomplete, because no answer in a reasonable amount of time (1+ year), if it still happens, please update the ticket.

Brandon Enright

(Unassigned)

Unconfirmed

Minecraft 1.8

Retrieved