Technical Explanation of Leela Chess Zero

Andy Olsen edited this page Jan 2, 2019 · 14 revisions

Lc0 methods and input/output terms

A nice pictorial overview

Basics

Like all other chess (or Go) engines, Leela maintains a tree of potential future moves and game states. Each potential game state is called a node in the tree, and each node has a list of child nodes corresponding to the potential moves (edges) from that board position. When a node is first created, it is given an estimated win value for that position, as well as a list of potential continuing moves to consider (called the policy for that position). Traditional chess engines rely on win-valuation and policy-generation systems finely crafted by humans; Leela instead uses a neural network, trained without human knowledge, for both win valuation and policy generation. Then, by one search method or another, the engine expands the tree to get a better understanding of the root node, the current on-the-board position.
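The per-node bookkeeping described above can be sketched as follows (a minimal Python illustration with hypothetical field names, not Lc0's actual C++ data structures):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    value: float   # NN's estimated win value for this position, in [-1, 1]
    policy: dict   # move -> prior probability, from the NN policy head
    children: dict = field(default_factory=dict)  # move -> child Node, added by later playouts
    visits: int = 0  # number of playouts that have traversed this node
```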

Leela's search methods

In the tree search algorithm used by Leela, we evaluate new nodes by doing what's called a playout: start from the root node (the current, actually-on-the-board position), pick a move to explore, and repeat down the tree until we reach a game position that has not been examined yet (or a position that ends the game, called a terminal node). We expand the tree with that new position (assuming it is non-terminal) and use the neural network to create a first estimate of the win value for the position, as well as the policy for continuing moves. In Leela, a policy for a node is a list of possible moves, each with a probability: the chance that a player following the policy would make that move. After the newly expanded node is added to the tree, its value estimate is used to update its parent node's value, and the parent's visit count is incremented; that parent in turn updates its own parent's value and visit count, and so on back up to the root node. One playout adds one new node to the tree (except when it ends at a terminal node), and every node already in the tree was first created by a previous playout.
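The back-propagation step at the end of a playout can be sketched like this (a simplified Python illustration, not Lc0's actual implementation; the sign flip reflects the usual convention that values are from the side to move's perspective):

```python
class Stats:
    """Minimal stand-in for the per-node statistics the backup touches."""
    def __init__(self):
        self.visits = 0
        self.q = 0.0  # running average of leaf values seen below this node

def backup(path, leaf_value):
    """Walk from the new leaf back to the root, updating visit counts and
    running-average values. `path` is the list of nodes visited, root first."""
    v = leaf_value
    for node in reversed(path):
        node.visits += 1
        node.q += (v - node.q) / node.visits  # incremental mean update
        v = -v  # a good position for one side is a bad one for the other
```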

When a move is actually played on the board, the node for the chosen move becomes the new root of the tree and its subtree is kept (tree reuse). The old root and its other children are erased.

This is the same search specified by the AGZ paper, PUCT (Predictor + Upper Confidence Bound tree search). Many people call this MCTS (Monte-Carlo Tree Search), because it is very similar to the search algorithm Go programs started using in 2006. But the PUCT used in AGZ and Lc0 does not do game rollouts (sampling playouts to a terminal game state). Other search algorithms are under discussion on the Leela Zero (Go) GitHub, but there is not yet any real consensus that something else is demonstrably better than PUCT. This is an active research topic in the overlap of the AI and game-theory fields.
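The PUCT selection rule used during each playout can be sketched as follows (a simplified Python illustration; `c_puct` is a hypothetical exploration constant, and Lc0's actual formula has additional tuned parameters):

```python
import math

def puct_select(parent_visits, children, c_puct=1.5):
    """Pick the child maximizing Q + U, where U rewards moves with a high
    policy prior (P) that have not been visited (N) much yet."""
    def score(child):
        u = c_puct * child["p"] * math.sqrt(parent_visits) / (1 + child["n"])
        return child["q"] + u
    return max(children, key=score)
```

As a move accumulates visits its U term shrinks toward zero, so the search gradually shifts from exploring the policy's suggestions to exploiting the moves with the best average value Q.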

Glossary

  • 20x256: Shorthand for the size of the NN. 20 residual blocks and 256 filters.
  • Batch Size: How many positions the GPU can train on simultaneously. Set as large as your GPU can handle.
  • Bestmove: Picks the move with the most visits. The randomize and tempdecay options allow picking less-visited moves for variety.
  • Block: See Residual Block.
  • CNN: Convolutional Neural Network. The largest part of Lc0's NN uses this architecture, which convolves (slides) several 3x3xN filters across the entire board.
  • Depth: Average playout depth. See also Seldepth.
  • FC Layer: Fully Connected layer. Unlike filters which only look at a 3x3 area of the board, the FC layer looks at the entire board at once. This is used in the policy and value heads.
  • Filters: A 3x3xN pattern. The N means it is looking at several input features. A "20x256" network has 256 3x3x256 filters in each layer (input layer is slightly different).
  • Learning Rate: How fast the neural net weights are adjusted. Too high and you don't learn anything, or worse. Too low and your progress is too slow.
  • MSE loss: Mean Squared Error of the value (V) NN output. One of the terms the NN training process tries to minimize. The NN improves in a feedback loop by trying to predict who wins each self-play game. The mean squared error of the predictions is the "loss". See also policy loss and reg term.
  • NNCache: Stores NN evaluations, which can be reused if the same position is reached by transposition.
  • Nodes: A potential game position in the tree of future gameplay. The root node of the tree is the current position.
  • NPS: Nodes per second, including NNCache hits and terminal node hits, but excluding nodes carried over from tree reuse. In other words, total playouts generated per second.
  • Overfitting: If the network trains on the same positions too much or with too low learning rate, it may memorize those positions and not generalize well to other similar positions.
  • P: The neural network's raw policy output (the probability that this is the best move).
  • Playouts: As defined by the Monte Carlo Tree Search, starting from the root node: 1) pick a move to explore according to the policy and confidence of the choices (PUCT selection algorithm); 2) travel to the resulting game position node; 3) if this child node is already explored at least once, repeat steps 1) and 2), otherwise; 4) evaluate this node via the neural network, creating value and policy estimates for this position, and use this new value estimate to update all parent nodes' values. After 4), the playout is complete.
  • Policy loss: One of the terms the NN training process tries to minimize. The NN improves in a feedback loop by trying to predict what the self-play games say the policy should be. The amount the NN differs from the self-play game outputs is the "loss". See also mse loss and reg term.
  • PV: Principal Variation. The sequence of moves obtained by choosing, at each level, the child with the most visits (i.e. the most playout traversals).
  • Visits: Total number of playouts that have traversed this node. This is equal to the total size of the tree below this node (unless playouts have hit a terminal node, in which case visits = nodes + playouts_that_hit_terminal_node).
  • Q: Average expected value of all playouts for this move, a value ranging from -1 to 1. See also V.
  • Reg term: The L2-norm (roughly, the size of the weights). The feedback training of the NN tries to minimize this along with policy loss and MSE loss. This term improves the generalization of the NN and helps prevent overfitting.
  • Residual Block: A residual block is a popular NN architecture that combines two CNN layers with a skip connection. A "20x256" network has 20 residual blocks. (Note that this is slightly different from the Deepmind paper where a "20x256" network has 19 residual blocks).
  • Sampling Ratio: How many times each position from self-play games is used for training. Too high and your net may overfit. Too low and your progress is too slow.
  • Score cp: Traditional centipawn value where 100 is a one pawn advantage. See FAQ#how-does-leela-chess-zero-calculate-the-cp-eval.
  • Squeeze Excite: SE is an extension to the Residual Block architecture that Deepmind used. It adds a global pooling layer (the squeeze part) and then transmits that global information back to all parts of the board (the excite part).
  • Seldepth: Maximum depth reached by any playout. See also Depth.
  • Terminal node: A node that has a game result (checkmate or draw).
  • Train/Test Sets: Best practice is to split your data into two sets, train and test sets. You use the train set to improve the NN weights. You use the test set to validate the NN is able to generalize what it learned to positions it has never trained on. This way you can detect if your NN is overfitting.
  • UCB: Upper Confidence Bound, aka U. This is the part of the PUCT formula that encourages exploring moves that have not been searched much yet. See also the other half, Q.
  • V: Expected Value output of the NN, ranging from -1 to 1. See also Q.
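The residual block entry above can be illustrated with a toy sketch of the skip connection (plain Python, with the convolutions passed in as placeholder functions; batch normalization and the real tensor shapes are omitted):

```python
def relu(v):
    """Element-wise ReLU over a flat list of activations."""
    return [max(x, 0.0) for x in v]

def residual_block(x, conv1, conv2):
    # out = ReLU(x + conv2(ReLU(conv1(x)))); the "x +" is the skip connection
    inner = conv2(relu(conv1(x)))
    return relu([a + b for a, b in zip(x, inner)])
```

The skip connection means each block only has to learn a correction to its input, which is what makes deep stacks of blocks (20 in a "20x256" net) trainable.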

Example --verbose-move-stats output

            Move            Visits          Policy      Avg. Value      UCB             Q+U          Raw NN Value
info string g2g4  (378 ) N:      11 (+ 2) (P:  2.81%) (Q: -0.08786) (U: 0.19640) (Q+U:  0.10853) (V: -0.0652)
info string f2f3  (346 ) N:      12 (+ 2) (P:  2.92%) (Q: -0.08100) (U: 0.19088) (Q+U:  0.10988) (V: -0.0759)
info string b1a3  (34  ) N:      14 (+ 4) (P:  3.39%) (Q: -0.06031) (U: 0.17498) (Q+U:  0.11467) (V: -0.0385)
info string g1h3  (161 ) N:      18 (+ 3) (P:  3.55%) (Q: -0.05069) (U: 0.15819) (Q+U:  0.10750) (V: -0.0315)
info string a2a4  (207 ) N:      22 (+ 5) (P:  3.89%) (Q: -0.02759) (U: 0.13604) (Q+U:  0.10845) (V: -0.0084)
info string h2h4  (403 ) N:      24 (+ 6) (P:  3.70%) (Q: -0.01613) (U: 0.11689) (Q+U:  0.10077) (V: -0.0109)
info string b2b3  (230 ) N:      26 (+ 7) (P:  4.00%) (Q: -0.00999) (U: 0.11520) (Q+U:  0.10521) (V: -0.0041)
info string f2f4  (351 ) N:      27 (+ 5) (P:  4.02%) (Q: -0.00992) (U: 0.11946) (Q+U:  0.10954) (V: -0.0014)
info string b2b4  (234 ) N:      29 (+ 5) (P:  3.97%) (Q: -0.00536) (U: 0.11123) (Q+U:  0.10587) (V: -0.0090)
info string a2a3  (204 ) N:      32 (+ 9) (P:  4.38%) (Q:  0.00459) (U: 0.10220) (Q+U:  0.10679) (V:  0.0026)
info string h2h3  (400 ) N:      34 (+ 9) (P:  4.74%) (Q:  0.00160) (U: 0.10550) (Q+U:  0.10711) (V: -0.0021)
info string d2d3  (288 ) N:      37 (+11) (P:  5.35%) (Q:  0.00055) (U: 0.10692) (Q+U:  0.10746) (V: -0.0001)
info string c2c3  (259 ) N:      40 (+ 8) (P:  4.89%) (Q:  0.00996) (U: 0.09779) (Q+U:  0.10775) (V:  0.0089)
info string b1c3  (36  ) N:      44 (+18) (P:  5.69%) (Q:  0.02348) (U: 0.08859) (Q+U:  0.11207) (V:  0.0180)
info string g2g3  (374 ) N:      47 (+ 8) (P:  5.40%) (Q:  0.00971) (U: 0.09454) (Q+U:  0.10425) (V:  0.0132)
info string e2e3  (317 ) N:      48 (+11) (P:  5.65%) (Q:  0.01524) (U: 0.09223) (Q+U:  0.10746) (V:  0.0231)
info string g1f3  (159 ) N:      73 (+17) (P:  7.08%) (Q:  0.02949) (U: 0.07629) (Q+U:  0.10578) (V:  0.0343)
info string d2d4  (293 ) N:      81 (+36) (P:  7.94%) (Q:  0.04077) (U: 0.06593) (Q+U:  0.10670) (V:  0.0471)
info string c2c4  (264 ) N:      84 (+15) (P:  6.63%) (Q:  0.04203) (U: 0.06503) (Q+U:  0.10706) (V:  0.0542)
info string e2e4  (322 ) N:     128 (+63) (P: 10.01%) (Q:  0.05778) (U: 0.05111) (Q+U:  0.10889) (V:  0.0556)

(Note the list is in reverse visit order, since the Arena GUI flips it again.)
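A quick check of the columns above (values copied from the e2e4 and c2c4 rows): the printed Q+U is just the sum of the Q and U columns, and bestmove would pick the move with the most visits.

```python
rows = {
    "e2e4": {"n": 128, "q": 0.05778, "u": 0.05111, "q_plus_u": 0.10889},
    "c2c4": {"n": 84,  "q": 0.04203, "u": 0.06503, "q_plus_u": 0.10706},
}
# Q+U column is literally Q + U...
for r in rows.values():
    assert abs(r["q"] + r["u"] - r["q_plus_u"]) < 1e-5
# ...and bestmove would choose e2e4, the most-visited move.
assert max(rows, key=lambda m: rows[m]["n"]) == "e2e4"
```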

Command line flags and UCI options

Most options have both a command-line flag and a UCI option; the list below gives both. Most options should be left at their default values. Here are some common ones you might change:

--weights, --threads, --backend, --backend-opts, --verbose-move-stats

TBD: If you have multiple GPUs, you can use these with e.g. --backend=?? --backend-opts=??

How does Lc0 calculate the cp eval?

Lc0 uses an average expected score Q in the range [-1, 1]. This expected score is converted to a traditional centipawn (cp) eval using this formula: cp = 290.680623072 * tan(1.548090806 * Q).

(Q+1)/2   Eval (pawns)
0.50        0.00
0.55        0.45
0.60        0.93
0.65        1.46
0.70        2.07
0.75        2.84
0.80        3.89
0.85        5.49
0.90        8.42
0.95       16.20
1.00      128.00
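The conversion and the table values can be checked with a few lines of Python (the function name here is ours, not Lc0's):

```python
import math

def q_to_cp(q):
    """Convert Lc0's expected score Q in [-1, 1] to centipawns,
    using the formula above."""
    return 290.680623072 * math.tan(1.548090806 * q)

# (Q+1)/2 = 0.60 means Q = 0.20; the table lists the eval in pawns (cp / 100):
print(round(q_to_cp(0.20) / 100, 2))  # 0.93
```

Note that the constants are tuned so that Q = ±1 maps to ±128 pawns (12800 cp), matching the last row of the table.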