Train a weight's weight, to predict the value of the next Katago weight #944
Hi Zliu. I am not at all a DL expert, rather a newbie, but I can share the following thoughts:

Something doing a similar thing to your idea already exists: SGD with momentum, which takes a step in weight space based partly on the gradient over the last batch of training data and partly on an exponentially decaying moving average of past gradients (the momentum term). Stochastic Weight Averaging is also, in a sense, predicting a better point in weight space from the recent trajectory. But these techniques still rely on a past trajectory that is itself driven by a gradient signal coming from training data.

If one were to train a model (NN-based or not) to predict the next weights from past weights, and then iterate that operator over its own predicted weights rather than over weights computed from a gradient signal, what guarantee would we have that the trajectory corresponds to an increase in go strength? If you remove the grounding in the game of go by removing the step of training on go data, then after a thousand iterations, what would make the produced weights good at playing go rather than at playing chess or classifying cats and dogs?

To take an analogy: suppose you drive a thousand miles and train a model to predict the steering wheel angle from past steering wheel angles. Then you enter a new town (never seen before), take your hands off the wheel, and let the model adjust the steering angle based on its own predictions. How long would you stay on the road?

The landscape of the loss over weight space is basically unknown in advance (otherwise, one would just jump directly to a minimum). And the loss landscape exists only in relation to a loss with respect to a training dataset (or training window). You can compute the gradient (and the Hessian) of the loss to know the landscape around you, but that does not tell you what it will look like after you take 10 or 50 more steps.
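The momentum mechanism mentioned above can be sketched in a few lines (a toy illustration, not KataGo's actual optimizer; the quadratic loss and hyperparameters are invented for the example):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # The update blends the current gradient with an exponentially
    # decaying moving average of past gradients (the momentum term),
    # so each step depends partly on the recent trajectory.
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Toy quadratic loss f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, 2 * w, v)
# w has been driven close to the minimum at the origin
```

Note that even here, every step still consumes a fresh gradient: the moving average only smooths the gradient signal, it never replaces it.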
Assuming that you can successfully navigate the (N+1)-th dimension of the loss landscape (that is, reach a global minimum in the loss direction) by operating only in the N dimensions of weight space (without looking at the loss) amounts to assuming that, in the game of go and with KataGo's NN architecture, the loss landscape has some regularities that a model could discover and exploit beyond the very short term. Maybe there are such regularities, but I have never read about any, in go or in any other domain. Due to non-linearities and the high number of parameters, large neural network loss landscapes are infamous for being very tortuous. A NN might discover and predict the local shape of the loss landscape around recent weights well, but that would probably not generalize (think of the analogy with weather forecasts, based on highly non-linear models of fluid circulation: they give good predictions on a scale of a few days, but their accuracy quickly drops because of the diverging nature of the underlying equations).

Additionally, your model, even if helpful for taking steps in weight space in combination with classical training, would have the drawback of taking very high-dimensional data as input (the number of parameters of b28, for instance, is in the range of tens of millions). It would be very costly to train. That is in contradiction with the fact that it should a priori have relatively few parameters, because a run like kata1 provides quite few network generations (a few thousand?) as training examples.
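The cost objection can be made concrete with back-of-the-envelope arithmetic (the figures below are assumed orders of magnitude, not exact KataGo numbers):

```python
# Rough cost estimate with illustrative numbers:
# a b28-scale net has on the order of tens of millions of parameters,
# while a run like kata1 saves only a few thousand network generations.
n_params = 30_000_000        # assumed order of magnitude for one net
n_generations = 3_000        # assumed number of saved nets in a run

# Even the simplest linear map from one weight vector to the next
# needs n_params * n_params coefficients.
linear_predictor_params = n_params ** 2
print(f"{linear_predictor_params:.1e} coefficients "
      f"vs {n_generations} training examples")
```

A predictor with ~10^15 coefficients fit on a few thousand examples is hopelessly under-determined, whatever its architecture.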
Hi Ishinoshita, thank you for your reply and very detailed analogies.
The first one might exist.
IMHO, your new formulation does not change anything about the delusive prospect of learning to navigate an N-dimensional weight space to optimize the loss (one additional dimension) of a given neural network with respect to a training dataset, without looking at that additional dimension (i.e., without using training data).

In essence, an N-dimensional weight space is kind of isotropic: there is no preferential direction. If there is any 'consistency pattern' in the series of points visited by training in weight space, it is a local, short-range one. At a given step of training, the gradient of the loss moves the weights in a certain direction. Locally, the direction of the weight trajectory can be inferred (a linear approximation using a moving average of the last k gradients, for instance). As said, this is already used in standard SGD with momentum. However, that direction makes sense (optimizes something) only in the N+1 dimensions, and it is only a local approximation of the optimal trajectory. You can blindly take some steps in that direction and may continue to decrease the loss for a short while, but then your loss will increase again because, from where you stand now, the optimal direction has changed. If you also model the second- and third-order curvature of your past trajectory, you might improve for a bit longer, but in the long run you will go "off road". Assuming the contrary would be assuming that the loss landscape (of that particular NN for that particular problem) has some strong regularity that would make it possible to predict an optimal trajectory and, iterating, an optimal weight. There is no such proof for KataGo, for go in general, or for NNs in general (quite the opposite opinion prevails), and you provide no argument in that sense.

To use an analogy again: you are walking on a path in the forest. You know that you can close your eyes for a few seconds and continue walking in the same direction and stay on the path, but you also know that you cannot do that for very long. You move in a 3D space, but your sight gives you an additional piece of information (a 4th dimension): the distance between your position and the center of the path. Your walking algorithm keeps trying to minimize that distance, keeping you safely at the center of the path. When you close your eyes, or look at the sky or the birds and forget to look at the path for a moment, you lose that 4th-dimension information. You will instinctively "prolong" your past trajectory, and most of the time that will keep you on track for a few seconds. But you already know what will happen if you keep looking at the birds... Assuming otherwise would amount to assuming that the path you are on is either straight or of fixed curvature, so that you could determine its direction and curvature by looking at just a fraction of it, and then blindly prolong that direction/curvature and stay on track indefinitely.
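The "eyes closed" argument can be demonstrated on a toy loss: run gradient descent for a while, then freeze the last step and blindly repeat it, and the loss climbs back up. This is a minimal sketch on an invented 2-D quadratic, not any real training setup:

```python
import numpy as np

# Invented 2-D quadratic loss standing in for a curved loss landscape.
A = np.diag([1.0, 10.0])
loss = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w = np.array([1.0, 1.0])
lr = 0.05
step = np.zeros(2)

# Phase 1, "eyes open": ordinary gradient descent, looking at the loss.
for _ in range(20):
    step = -lr * grad(w)
    w = w + step
loss_at_switch = loss(w)

# Phase 2, "eyes closed": blindly prolong the last step, no new gradients.
for _ in range(50):
    w = w + step
loss_after_blind = loss(w)
# Prolonging the trajectory overshoots the minimum, so the loss rises again.
```

Even on this benign convex bowl, where extrapolation is as safe as it ever gets, the frozen direction walks past the minimum and uphill on the other side.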
Current factors affecting Go playing strength: engine search and weight values.
This issue mainly discusses how to use another set of weights (referred to as "prediction weights") to predict the values of the next, stronger Go weights.
How to find a set of Go weight values faster?
Currently, the main task of a Go AI architecture is to find a set of weight values that corresponds to stronger Go playing strength.
I once thought about how to judge relative strength from the values of two weight files alone. A simple method is to use information entropy as the discriminator.
But after trying it, I found that Go playing strength does not directly correspond to the information entropy of weight files.
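For reference, one plausible reading of "information entropy of a weight file" is the Shannon entropy of a histogram of its values. The helper `weight_entropy` below is invented for illustration, not the actual experiment:

```python
import numpy as np

def weight_entropy(weights, bins=256):
    # Shannon entropy (in bits) of the empirical distribution of
    # weight values, estimated from a fixed-bin histogram.
    hist, _ = np.histogram(weights, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                       # ignore empty bins
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
flat = rng.uniform(-1, 1, 10_000)      # near-uniform values: high entropy
peaked = rng.normal(0, 0.01, 10_000)   # tightly clustered: lower entropy
```

Two weight files can differ greatly in this number while playing equally well (or badly): the histogram discards all the structure that actually determines strength.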
Katago's weight sequences constitute numerical vectors whose direction points toward stronger play
However, it can be assumed that the many weight sequences obtained by Katago already constitute numerical vectors, and that these vectors, in weight space, represent the direction of increasing Go playing strength.
So, I wonder whether it is possible to train a prediction weight on these known Go weight sequences, to predict the values of the next Go weights along the direction of increasing strength.
Directly train prediction weights to predict stronger Go weights
Since Go weight sequences already exist for different structures, the structural parameters of the Go weights (b, c) can be used as training inputs when training the prediction weights. That way, the trained prediction weights can predict where the values of the next, stronger Go weights might lie for different b, c.
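A minimal sketch of what "prediction weights" could look like, assuming a toy checkpoint sequence in place of real KataGo nets (the autoregressive form, the toy dynamics, and all sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for a sequence of saved network checkpoints: weights
# decaying toward an optimum, as in the late phase of a training run.
# (Real KataGo nets have tens of millions of parameters; here, 100.)
checkpoints = [rng.normal(size=100)]
for _ in range(30):
    checkpoints.append(checkpoints[-1] * 0.9)  # invented training dynamics
W = np.stack(checkpoints)                      # shape (31, 100)

# "Prediction weights" in their simplest form: a second-order
# autoregressive model w_{t+1} ~ a*w_t + b*w_{t-1}, fit by least
# squares over all parameters and all time steps.
X = np.stack([W[1:-1].ravel(), W[:-2].ravel()], axis=1)
y = W[2:].ravel()
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

predicted_next = a * W[-1] + b * W[-2]  # guess for the next checkpoint
```

On this toy sequence the fit recovers the decay dynamics exactly; whether anything comparable holds for real checkpoint sequences, at real dimensionality, is exactly the open question of this issue.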
As the current Go weights need to be obtained through three steps (self-play, training, and competition),
this consumes a lot of computing power.
If the method of predicting weights is feasible, only two steps are needed (training the prediction weights, and competition), which eliminates the "self-play" step.