Stockfish Benchmark Tool

Now that I have my Stockfish Raspberry Pi cluster up and running, it is time to figure out which configuration makes it run most efficiently.

To do this I used Stockfish’s built-in benchmarking tool. You can run it from the command line by just invoking:

stockfish bench

This will run a series of about 50 chess positions and record how long it took, how many nodes were searched, and from those two numbers calculate the number of nodes/second. It is good to call a fresh instance of stockfish bench for each test because it may cache some results and be faster on subsequent runs.

Calling it without any parameters uses defaults that I wanted to check and tune. Running it with no arguments is the same as running it with these arguments:

stockfish bench 16 4 13 default depth mixed

The parameters are:

  • 16 is the size of the hash
  • 4 is the number of threads
  • 13 is the depth to search
  • default means to use the default list of FEN positions
  • depth means to search until it reaches the defined depth
  • mixed is the evaluation type and can be classical, NNUE, or mixed.

You can also leave off trailing parameters and the defaults will be used for them. The ones you do supply must appear in this specific order. So if you only wanted to change the hash and number of threads, you could do:

stockfish bench 64 3
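To make the positional defaults concrete, here is a small sketch of a wrapper that fills in whatever you leave off. The function name bench_cmd is hypothetical, and the default values are simply the ones listed above; they may differ in other Stockfish versions. It only prints the command rather than running it.

```shell
#!/bin/bash

# Hypothetical helper: build a bench command line, falling back to the
# defaults described in this post for any argument that is left off.
bench_cmd() {
    local hash=${1:-16}       # hash size
    local threads=${2:-4}     # number of threads
    local depth=${3:-13}      # search depth
    echo "stockfish bench $hash $threads $depth default depth mixed"
}

bench_cmd          # all defaults
bench_cmd 64 3     # override only hash and threads
```

The second call prints the fully expanded form of stockfish bench 64 3.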

If you really want to get geeky, here is the code that controls the benchmark tool.

Finding The Best Options

I started by just repeatedly calling stockfish bench ... with different values. That was painful. Instead, I wrote a script that steps through the values for me and writes the results to a file I could look at when it was all done.

Here is my (sorry it is a little ugly) script:

#!/bin/bash

REPEAT=3
DEPTH=13

for NODES in {2..4}
do
    for HASH in 16 32 64 128 256 512
    do
        for THREADS in {3..12}
        do
            RESULTS=""
            SUM=0
            for (( r=1; r<=$REPEAT; r++ ))
            do
                # write to a temp file because of the weird way MPI streams output
                # there is probably a cleaner way to pipe the output to the right place
                /home/matt/openmpi/install/bin/mpirun --hostfile /home/matt/cluster_hosts -map-by node --merge-stderr-to-stdout -n $NODES /home/matt/bin/stockfish bench $HASH $THREADS $DEPTH default depth mixed 2> stockfish-output.temp

                # Find the result
                NPS=$(grep "Nodes/second" stockfish-output.temp | cut -d':' -f 2 | cut -d ' ' -f 2)

                # remove the temp file
                rm stockfish-output.temp

                # Store the sum to create an average later
                SUM=$((SUM + NPS))

                if [[ $r -gt 1 ]]
                then
                    RESULTS="$RESULTS,$NPS"
                else
                    RESULTS="$NPS"
                fi
            done

            # Calculate the Average
            AVERAGE=$((SUM/REPEAT))

            # Output the results
            echo "$NODES,$HASH,$THREADS,$RESULTS,$AVERAGE" | tee -a output.csv
        done
    done
done

If you are going to try to run this, you’ll have to change some parameters, especially the directories where commands are located. I found using full paths to be the most reliable way to run using MPI across the cluster.
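The part of the script most likely to need adjusting is the line that pulls the number out of the bench output. As a sketch, the same extraction can be done with a single awk call. The function name parse_nps is my own, and it assumes the summary line looks like "Nodes/second    : 2067017", as in the output at the end of this post.

```shell
#!/bin/bash

# Hypothetical helper: read bench output on stdin and print only the
# number from the "Nodes/second" summary line.
parse_nps() {
    awk -F':' '/^Nodes\/second/ {
        gsub(/[^0-9]/, "", $2)   # keep only the digits after the colon
        print $2
    }'
}

# Example using the summary format shown in this post
printf '%s\n' \
    "===========================" \
    "Total time (ms) : 91154" \
    "Nodes searched  : 188416888" \
    "Nodes/second    : 2067017" | parse_nps
```

If more than one process prints a summary, this prints one number per line, so you would need to decide which one to keep before doing arithmetic on it.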

This runs each test 3 times and outputs the result to a CSV file. On my cluster, it took a long time, but since it was just running in the background, it didn’t matter much to me how long it took. As a tip, you can launch the program like this so if you disconnect it will just keep running:

nohup ./performance-tests.sh > /dev/null 2>&1 &
echo $! > save_pid.txt

The second line is useful if you want to kill the process at some point.
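For completeness, here is a minimal sketch of stopping the run using that saved PID. The function name stop_run is hypothetical, and it assumes the PID file was written as shown above. A bench run that is already in progress may keep going until it finishes, since only the script itself is signalled.

```shell
#!/bin/bash

# Hypothetical helper: kill the process whose PID was saved to a file,
# then remove the file so a stale PID is not reused later.
stop_run() {
    local pid_file=${1:-save_pid.txt}
    if [ ! -f "$pid_file" ]; then
        echo "No PID file found: $pid_file" >&2
        return 1
    fi
    kill "$(cat "$pid_file")" && rm -f "$pid_file"
}
```

Calling stop_run with no arguments looks for save_pid.txt in the current directory.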

Refining Results

To get more accurate results, there are a couple of additional things you can do. Once you know a general range of values, narrow the available options, like the number of threads, for future runs.

I would also raise REPEAT=3, which controls how many times each test runs, to something like REPEAT=5 or even REPEAT=10. This should give a better average result.

Finally, I would search deeper. Using a depth of 13 is quick and a good guide, but a greater depth will be more accurate: each run takes longer, so the average nodes/second varies less between runs. You can change that by setting DEPTH at the top of the script, for example DEPTH=15, which runs bench $HASH $THREADS 15 default depth mixed. Every increase will make it take much longer.
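One way to see whether more repeats or a deeper search is actually reducing the variability is to look at the spread between runs in each CSV row. This is only a sketch: the function name spread_pct is my own, and it assumes the row layout written by the script above (nodes, hash, threads, the individual results, then the average in the last column).

```shell
#!/bin/bash

# Hypothetical helper: for each CSV row, print nodes,hash,threads and the
# spread between the fastest and slowest run as a whole-number percentage
# of the average.
spread_pct() {
    awk -F, '$NF + 0 > 0 {
        min = max = $4 + 0
        for (i = 4; i < NF; i++) {        # columns 4..NF-1 hold the runs
            if ($i + 0 < min) min = $i + 0
            if ($i + 0 > max) max = $i + 0
        }
        printf "%s,%s,%s,%d\n", $1, $2, $3, int((max - min) * 100 / $NF)
    }' "$@"
}

# Usage: spread_pct output.csv
```

Rows with a large percentage are the ones where the average is least trustworthy.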

Tuning Results

This will be different for every system. The number of threads is likely limited by the number of CPUs. The usual recommendation is to use one fewer thread than the number of CPUs, leaving one core free to handle networking and other processes. The hash size is likely dependent on the amount of memory available.
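As a starting point, that rule of thumb can be sketched as follows. The function name recommended_threads is hypothetical, and it assumes the nproc command from GNU coreutils is available when no CPU count is passed in.

```shell
#!/bin/bash

# Hypothetical helper: apply the "one fewer thread than CPUs" rule of
# thumb, never going below one thread.
recommended_threads() {
    local cpus=${1:-$(nproc)}    # default to the CPU count of this machine
    local threads=$(( cpus - 1 ))
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}

recommended_threads 4    # a 4-core machine
```

This is only the conventional starting value; the measurements are what decide the final setting.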

For my cluster, here is what I found to be the fastest performing setup:

  • Nodes=4 - This makes sense, though performance was not that much different with only 3 nodes. I think there will be a point where adding more and more nodes, which I’m not planning to do, becomes less effective. I did try this with more nodes than I have Raspberry Pis. In that case it starts multiple stockfish processes on each Pi. This was not as effective as a single process with more threads.
  • Hash=64 - This did not seem to have a very big impact on performance. When using the same number of threads, my tests were all within 1-2% of each other. Using 64 was the fastest, but not by much.
  • Threads=23 - Everything I read said to use one fewer thread than the number of processors you have. However, the best performance was when every process I had running was using 23 threads. This was almost 30% faster than using just 3 threads. There were other settings around 23 that were all within 5% of the same time.
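To pick out the fastest combination without reading the whole CSV by eye, something like the following sketch can help. The function name best_row is my own, and it assumes the average is the last column of each row, as in the script above.

```shell
#!/bin/bash

# Hypothetical helper: print the CSV row with the highest value in its
# last column (the average nodes/second).
best_row() {
    awk -F, '$NF + 0 > best { best = $NF + 0; row = $0 } END { print row }' "$@"
}

# Usage: best_row output.csv
```

Because it looks at the last column rather than a fixed column number, it keeps working if you change how many times each test is repeated.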

Performance Results

With these parameters, I get about 2 MN/s (2,000,000 nodes searched per second):

===========================
Total time (ms) : 91154
Nodes searched  : 188416888
Nodes/second    : 2067017
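As a quick sanity check, the last line can be reproduced from the two lines above it: nodes searched divided by the time in seconds. This sketch uses integer arithmetic, and the function name nodes_per_second is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: nodes/second from a node count and a time in
# milliseconds, using integer division.
nodes_per_second() {
    local nodes=$1
    local time_ms=$2
    echo $(( nodes * 1000 / time_ms ))
}

nodes_per_second 188416888 91154    # prints 2067017
```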