
Train a weight's weight to predict the value of the next KataGo weight #944

Open
zliu1022 opened this issue May 24, 2024 · 3 comments


zliu1022 commented May 24, 2024

Current factors affecting the strength of a Go engine: search and weight values

Factors affecting playing strength in Go:

  • The engine's search algorithm
  • The structure and specific values of the network weights (referred to below as "Go weights")

This issue mainly discusses how to use another set of weights (referred to as "prediction weights") to predict the values of the next, stronger Go weights.

How can a set of Go weight values be found faster?

Currently, the main task in Go AI development is to find a set of weight values that corresponds to stronger play.
I once considered how to judge strength from the numerical values of two weight files alone. A simple approach is to use information entropy as a discriminator.
After trying it, however, I found that playing strength does not correspond directly to the information entropy of the weight files.
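For reference, the entropy test can be reproduced in a few lines (a generic sketch over flat weight arrays; the file layout and histogram bin count are my own assumptions, not KataGo specifics):

```python
import numpy as np

def weight_entropy(weights: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (in bits) of a histogram over the flat weight values."""
    counts, _ = np.histogram(weights.ravel(), bins=bins)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

# Two random "weight files" drawn from the same distribution have nearly
# identical entropy, even though nothing about playing strength relates them.
rng = np.random.default_rng(0)
w_a = rng.normal(0.0, 0.1, size=100_000)
w_b = rng.normal(0.0, 0.1, size=100_000)
print(weight_entropy(w_a), weight_entropy(w_b))
```

This illustrates why entropy alone is a weak discriminator: it only sees the distribution of values, not their arrangement.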

KataGo's weight sequences form numerical vectors, and their direction corresponds to increasing strength

However, it can be assumed that the many weight sequences KataGo has already produced constitute numerical vectors, and that in weight space these vectors trace out the direction of increasing playing strength.
So I wonder whether it is possible to train a prediction weight on these known Go weight sequences to predict the values of the next Go weights along the direction of increasing strength.
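The simplest form of this idea is linear extrapolation in weight space (a toy sketch with made-up three-dimensional "checkpoints", not real KataGo weights):

```python
import numpy as np

# Hypothetical: each row is one flattened checkpoint, ordered weak -> strong.
checkpoints = np.array([
    [0.10, 0.20, 0.30],
    [0.15, 0.22, 0.28],
    [0.21, 0.25, 0.27],
])

# Average recent step = estimated "direction of strength enhancement".
direction = np.diff(checkpoints, axis=0).mean(axis=0)

# Predicted next checkpoint: one more step along that direction.
predicted_next = checkpoints[-1] + direction
print(predicted_next)  # [0.265 0.275 0.255]
```

A trained prediction network would replace the averaging step with a learned function of the past checkpoints.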

Directly train prediction weights to predict stronger Go weights

Since Go weight sequences of different structures already exist, the structural parameters of the Go weights (b and c, i.e. the block and channel counts) can be used as training inputs for the prediction weights. The trained prediction weights could then predict where the values of the next, stronger Go weights might lie for different values of b and c.

Currently, new Go weights must be obtained through:

  • self-play
  • training
  • and competition

which consumes a lot of computing power.

If the prediction-weight method is feasible, only two steps are needed (training the prediction weights and competition), which eliminates the self-play step.


Ishinoshita commented May 29, 2024

Hi Zliu. I am not at all a DL expert, rather a newbie, but I can share the following thoughts:

Something similar to your idea already exists: SGD with momentum, which takes a step in weight space based partly on the gradient from the last batch of training data and partly on an exponentially decaying moving average of past gradients (the momentum). Stochastic Weight Averaging is also, in a way, predicting a better point in weight space from the recent trajectory. But these techniques still rely on a past trajectory that is itself driven by a gradient signal coming from training data.
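For concreteness, the two techniques mentioned can be sketched in a few lines (generic textbook forms, not KataGo's actual training code):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """SGD with momentum: blend the current gradient with an exponentially
    decaying average of past gradients, then step along the blend."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def stochastic_weight_average(checkpoints):
    """SWA: average the weights of several recent checkpoints."""
    return np.mean(np.stack(checkpoints), axis=0)

# One momentum step from w=1 with gradient 2 and zero initial velocity.
w, v = momentum_step(np.array([1.0]), np.array([2.0]), np.zeros(1))
print(w, v)  # [0.8] [2.]
```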

If one were to train a model (NN-based or not) to predict the next weights from past weights, and then iterate that operator over its own predicted weights rather than over weights computed from a gradient signal, what guarantee would we have that the trajectory corresponds to an increase in Go strength? If you remove the grounding in the game of Go by removing the step of training on Go data, then after a thousand iterations, what would make the produced weights good for playing Go rather than for playing chess or classifying cats and dogs?

To take an analogy: suppose you drive a thousand miles and train a model to predict the steering-wheel angle from past steering-wheel angles. Then you enter a new town (never seen before), take your hands off the wheel, and let the model adjust the steering angle based on its own predictions. How long would you stay on the road? ...

The landscape of the loss over the weight space is basically unknown in advance (otherwise, one would jump directly to a minimum). And the loss landscape exists only in relation to a loss computed with respect to a training dataset (or training window).

You can compute the gradient (and the Hessian) of the loss to know the landscape around you, but that does not tell you what it will look like after you take 10 or 50 steps. Assuming that you can successfully navigate the (N+1)-th dimension of the loss landscape (that is, reach a global minimum in the loss direction) by operating only in the N dimensions of the weight space (without looking at the loss) amounts to assuming that, for the game of Go and KataGo's NN architecture, the loss landscape has regularities that a model could discover and exploit beyond the very short term. Maybe there are such regularities, but I have never read about any, in Go or in any other domain. Due to the non-linearities and the huge number of parameters, the loss landscapes of large neural networks are infamous for being tortuous. A NN might discover and predict the local shape of the loss landscape around recent weights well, but that would probably not generalize (think of the analogy with weather prediction, based on highly non-linear models of fluid circulation: it is accurate on a scale of a few days, but its accuracy drops quickly because of the diverging nature of the underlying equations).

Additionally, your model, even if helpful for taking steps in weight space in combination with classical training, would have the drawback of taking very high-dimensional data as input (the number of parameters of b28, for instance, is in the range of tens of millions). It would be very costly to train. This contradicts the fact that it should a priori have relatively few parameters, because there are only a few thousand network generations in a run like kata1 to serve as training examples.

zliu1022 (Author) commented

Hi, Ishinoshita

Thank you for your reply and very detailed analogy.
I'm not an expert either, that's why I'm so bold in asking questions ^_^
Your thoughts are very helpful. I am also thinking further, trying to separate different situations in more detail to see if I can analyze them one by one.

  1. For the many weights that have already been trained, including the b6 series, b18 series, etc., ordered by Elo: is there a "consistency pattern" among their numerical files?
    1. By "consistency pattern" I mean, for example: use a neural network to map each Go weight file, from weak to strong, to a single value. That value should increase monotonically, and the Elo differences between these Go weights, from weak to strong, should be consistent with the differences in that value.
    2. This could perhaps be used to predict the next weight's Elo.
  2. Is this pattern the "directionality of strength enhancement"?
  3. Can this "directionality of strength enhancement" be learned by other weights?
  4. After learning this pattern, is it possible to predict the next set of weights that both plays normally and performs more strongly?

The first one might exist.
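Point 1 could be tested cheaply before training anything large. A hypothetical sketch with synthetic data (the checkpoints, drift, and Elo values are all invented; real KataGo weight files would replace them): summarize each weight file with a few statistics, fit a least-squares map to Elo, and check how well the fitted score tracks the known ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in data: 8 "checkpoints" ordered weak -> strong, each a
# flat array of 1000 values with a small strength-correlated drift.
n_ckpt, dim = 8, 1000
drift = rng.normal(0.0, 1.0, size=dim)
checkpoints = np.stack([
    rng.normal(0.0, 0.1, size=dim) + 0.02 * i * drift for i in range(n_ckpt)
])
elo = np.linspace(1000, 3000, n_ckpt)

# Cheap summary statistics per weight file.
features = np.stack([
    checkpoints.mean(axis=1),
    checkpoints.std(axis=1),
    np.abs(checkpoints).mean(axis=1),
], axis=1)

# Least-squares linear map from summary features to a "strength score".
X = np.hstack([features, np.ones((n_ckpt, 1))])
coef, *_ = np.linalg.lstsq(X, elo, rcond=None)
scores = X @ coef

# If a consistency pattern exists, the score should track the Elo order.
print(np.corrcoef(scores, elo)[0, 1])
```

A real test would hold out the strongest checkpoints and check whether the fitted score still ranks them correctly.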


Ishinoshita commented May 30, 2024

IMHO, your new formulation does not change anything about the delusive prospect of learning to navigate an N-dimensional weight space to optimize the loss (one additional dimension) of a given neural network with respect to a training dataset, without ever looking at that additional dimension (i.e., without using training data).

In essence, an N-dimensional weight space is essentially isotropic: there is no preferential direction. If there is any "consistency pattern" in the series of points visited by training in weight space, it is a local, short-range one. At a given training step, the gradient of the loss moves the weights in a certain direction. Locally, the direction of the weight trajectory can be inferred (a linear approximation using a moving average of the last k gradients, for instance). As said, this is already used in standard SGD with momentum. However, that direction makes sense (optimizes something) only in the (N+1)-th dimension, and it is only a local approximation of the optimal trajectory. You can blindly take some steps in that direction and may continue to reduce the loss for a short while, but then the loss will increase again because, from where you now stand, the optimal direction has changed. If you also model the second- and third-order curvature of your past trajectory, you might improve a bit longer, but in the long run you will go "off road". Assuming the contrary amounts to assuming that the loss landscape (of this particular NN for this particular problem) has some strong regularity that would make it possible to predict an optimal trajectory and, iteratively, an optimal weight. There is no such proof for KataGo, for Go in general, or for NNs in general (quite the opposite opinion prevails), and you provide no argument in that direction.
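The "off road" effect is easy to demonstrate even on a simple curved loss (a toy illustration unrelated to Go): descend the gradient for a while, then keep repeating the last step blindly.

```python
import numpy as np

def loss(w):
    # an elongated bowl: steep in w[1], shallow in w[0]
    return w[0] ** 2 + 10 * w[1] ** 2

def grad(w):
    return np.array([2 * w[0], 20 * w[1]])

lr = 0.04
w = np.array([3.0, 2.0])
for _ in range(20):           # phase 1: real gradient descent
    step = -lr * grad(w)
    w = w + step

loss_before = loss(w)
for _ in range(50):           # phase 2: repeat the last step blindly
    w = w + step
loss_after = loss(w)

# The blind steps improve the loss briefly, then overshoot the minimum
# and make it worse again (loss_after > loss_before).
print(loss_before, loss_after)
```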

To use another analogy: you are walking on a path in the forest. You know that you can close your eyes for a few seconds and keep walking in the same direction and stay on the path, but you also know that you cannot do that for very long. You move in a 3D space, but your sight gives you an additional piece of information (a 4th dimension): the distance between your position and the center of the path. Your walking algorithm keeps minimizing that distance, keeping you safely at the center of the path. When you close your eyes, or look at the sky or the birds and forget to watch the path for a moment, you lose that 4th-dimension information. You will instinctively "prolong" your past trajectory, and most of the time that will keep you on track for a few seconds, but you already know what happens if you keep looking at the birds ... Assuming otherwise would amount to assuming that the path you are on is either straight or of fixed curvature, so that you could determine its direction and curvature by looking at just a fraction of it, then blindly prolong it and stay on track indefinitely.
