Faster evaluation metrics (baked into the library?) #2986
Comments
That's an interesting paper! It'll take me a while to digest... some of their notation choices are confusing, & on a 1st scan it's not completely clear to me how their choice of epochs and/or 'convergence' was driven. But it kind of looks to me that what they did, and what you may be trying to reproduce, may involve early-stopping based on when some external measure of quality stagnates.

If you're finding your single-threaded mid-training evaluation to be too time-consuming, my main thought would be: can a much smaller random sample still provide the same directional guidance?

More generally, I'm a bit suspicious of mid-training evaluations. Until the model has 'converged' according to its own internal optimization targets, any indications of its progress are highly tentative. Ideally: (1) training would always run until the model is 'converged' according to its own internal loss-minimization; (2) the model would only be tested for downstream purposes when in this settled stage.

Unfortunately, because Gensim's implementation has incomplete/flaky/buggy loss-tracking (see #2617 & others), historically true convergence hasn't really been tracked/assured - people just try to do "enough" epochs that things seem to settle and work well. When loss-tracking is fixed, people will have a better idea of whether they've trained 'enough', and more efficient optimization than fixed linear learning-rate decay over a prechosen set of epochs may also become possible & popular.
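(For what it's worth, the loss reporting that does exist today can be read from a callback roughly like this; a minimal sketch with the #2617 caveats in mind, using Gensim 4.x argument names and a placeholder corpus:)

```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LossLogger(CallbackAny2Vec):
    """Print the per-epoch change in Gensim's running loss tally.

    get_latest_training_loss() has the known quirks discussed in #2617,
    so treat these numbers as a rough stagnation signal, not exact losses.
    """
    def __init__(self):
        self.epoch = 0
        self.previous = 0.0

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()
        print(f"epoch {self.epoch}: loss delta {cumulative - self.previous:.1f}")
        self.previous = cumulative
        self.epoch += 1

# compute_loss=True is required for get_latest_training_loss() to report anything;
# `my_sentences` is a placeholder for any iterable of token lists.
model = Word2Vec(sentences=my_sentences, compute_loss=True,
                 callbacks=[LossLogger()], epochs=10)
```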
Ultimately, yes. If we're a few epochs in, we'd like to know whether our current set of model hyper-parameters is worth exploring further, based on the model's performance on the validation (or some other holdout) set. For this issue I was just asking how to (possibly) efficiently compute these metrics as a callback, or whether this has been requested before.
Good idea! I also thought this would work, but "use less data for evaluation" isn't the most satisfying answer :) But yes, we could definitely sample some fraction of the validation set on each epoch and evaluate our model on that.
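Something like this per-epoch subsampling is what I have in mind; a rough sketch, where `validation_seqs` and the per-sequence `metric_fn` are placeholders for our own data and metric:

```python
import random

def sampled_metric(model, validation_seqs, metric_fn, fraction=0.05, seed=None):
    """Evaluate metric_fn on a random fraction of the validation sequences.

    A small sample usually gives the same directional signal much faster;
    fix `seed` to score the same subset on every epoch.
    """
    rng = random.Random(seed)
    sample = rng.sample(validation_seqs, max(1, int(len(validation_seqs) * fraction)))
    scores = [metric_fn(model, seq) for seq in sample]
    return sum(scores) / len(scores)
```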
That's interesting; can you elaborate? As Gensim's end users, don't we effectively have the final say over when the model has reached its optimization target? It may not have minimized its loss within, say, 10 epochs, but that's the risk we take by not setting the number of epochs higher.
Yeah, that's precisely what we were trying to do. I think that answers it, though! Thanks so much for your thoughts on this issue. I'll try sampling the validation set, and hopefully that's a good enough approximation of the actual validation performance. I'd love to see some way to "expose" user-defined functions/callbacks to the highly-tuned and efficient Gensim training loop, though 👍
Because the model, in early training, might be very far from its ultimate 'settled' state, I suspect making such a decision on the hyperparameters could be premature. (I see that the paper you've linked notes in footnote 3: "we found that models often appear to converge and performance even decreases for several epochs before breaking through to new highs".) The single best-grounded time to evaluate a model is after its internal loss has stagnated - only then has it plausibly reached a point where it can't improve on one training example without performing worse on others.
Yes, by definition simple SGD with linear learning-rate decay & fixed epochs reaches a predetermined stopping point, so sure, the user caps the effort devoted to optimization. But the model may not have truly converged by then – it hasn't reached its actual target, the best performance possible with its current architecture/state-budget. That means any measured performance on a downstream evaluation is somewhat arbitrarily influenced by the interaction between data/parameters/user-patience and whether that happens to land on a lucky stopping-moment. For example, overparameterized models prone to extreme overfitting will often peak on some external evaluation during early training; but seeing that, and then trying to 'stop at just the right moment of undertraining', is a clumsy/unrigorous way to optimize, somewhat short-circuiting the point of SGD. A project taking the care to do a broad search of metaparameters probably doesn't want to take that shortcut.
That was another portion of the paper I found confusing.

Also, as Gensim doesn't yet have any official/reliable facility for early-stopping or "run to convergence" (other than by trial-and-error), when they claim to have done this in places it's unclear what strategy they used. They seem to have used stagnation on their external evaluation metric as the stopping signal.
All the code is there to modify arbitrarily, but note that the training speed comes primarily from (1) using bulk vector calculations from scipy/BLAS whenever possible - avoiding Python loops & one-at-a-time calcs; (2) putting those intense-but-narrowly-focused large-batch calcs inside Cython code that's relinquished the usual Python "GIL", to truly utilize multiple CPU cores for long stretches of time.

Your code could choose to do those too - but such optimizations are tricky enough that trying to mix your ops into the existing Gensim code-blocks would more likely hurt than help. (And such optimizations could easily throw a monkey-wrench into the highly-functional style you're using.) OTOH, if you simply manage to perform some optimization that does fewer/larger bulk array operations, you might get most of the benefit without having to think about Cython/GIL. (The bulk array ops themselves use multiple cores when they can, fanning out large-array-to-large-array calcs across threads.)

I also now notice the per-sequence calculation you're doing. If you're sure that's the calculation your downstream analysis needs, it could likely be optimized - probably, first, by working in larger batches at a time, so that one bulk call replaces many per-sequence calls.
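For a concrete illustration of the "fewer, larger array operations" idea, here's a rough sketch (not Gensim's own API; it assumes Gensim 4.x attribute names like `wv.vectors` and `wv.key_to_index`, raw dot-product ranking rather than cosine, and a list of token sequences):

```python
import numpy as np

def batched_hit_ratio_at_k(model, sequences, k=1):
    """Score many sequences with a few large matrix ops instead of a Python loop per sequence.

    Each sequence's context vector is the mean of its first n-1 token vectors;
    candidates are ranked by dot product against the whole vocabulary at once.
    (Normalize the rows of `vectors` first if you want cosine-style ranking.)
    """
    wv = model.wv
    vectors = wv.vectors                      # (vocab_size, dim)
    contexts, targets = [], []
    for seq in sequences:
        ids = [wv.key_to_index[t] for t in seq if t in wv.key_to_index]
        if len(ids) < 2:
            continue
        contexts.append(vectors[ids[:-1]].mean(axis=0))
        targets.append(ids[-1])
    contexts = np.vstack(contexts)            # (batch, dim)
    targets = np.asarray(targets)
    scores = contexts @ vectors.T             # one big BLAS call: (batch, vocab_size)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]   # top-k per row, unordered
    hits = (topk == targets[:, None]).any(axis=1)
    return hits.mean()
```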
First of all, thanks for a helpful discussion and comments. I am curious: is there an easy way to launch a grid search on your custom task (within Gensim itself, or outside)? Thanks,
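Something like the loop below, outside Gensim itself, is what I have in mind; a rough sketch where `train_sequences`, `valid_sequences`, and `evaluate_on_task()` stand in for the actual data and custom metric:

```python
from itertools import product
from gensim.models import Word2Vec

# Hyper-parameter grid; names follow Gensim 4.x Word2Vec arguments.
param_grid = {
    "vector_size": [64, 128],
    "window": [3, 5],
    "negative": [5, 15],
}

results = []
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    model = Word2Vec(sentences=train_sequences, epochs=10, workers=4, **params)
    score = evaluate_on_task(model, valid_sequences)  # custom metric, e.g. Hit Ratio @ K
    results.append((score, params))

best_score, best_params = max(results, key=lambda r: r[0])
```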
Closing as "not a bug" here, but feel free to continue the discussion at the Gensim mailing list. And of course, for complex optimizations / commercial uses, please consider becoming a Gensim sponsor :)
Before getting into the issue, I'd like to thank you all for maintaining this library! It's been great so far, and I really appreciate the thorough documentation.
Problem description
I'm trying to train Word2Vec embedding vectors on my own dataset. Things have been going well so far, but as I've started to add in certain features to the training loop, it's become more and more difficult to continue.
Our scenario is that we'd like to adapt Twitter's recent paper (and its reference on Word2Vec for recommendation systems) for our own use-case. Put simply, I have three files (`train.jsonl`, `valid.jsonl`, `test.jsonl`) with samples of our full training dataset (~275k, 110k, and 110k examples, respectively).

Using `gensim`, I can successfully train a Word2Vec model for many epochs and get a proper output. Since Gensim doesn't come out-of-the-box with certain callbacks and metrics, I've rolled my own and applied them successfully—not a problem.

The problem comes when one of those callbacks is a metric that has to do some inference work on many sequences. For example, in the latter paper linked above, the authors describe a `Hit Ratio @ K` metric, which is doing next-token prediction on a sequence of `n` tokens: the context consists of tokens `0, ..., n-1` and the token to be predicted is `n`. I've implemented it below; I wanted to track Hit Ratio @ 1 on both the training and validation sets after each epoch, so I made a callback that can do that for any of my general metric functions:
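(The original code blocks weren't captured in this copy; the sketch below is only a rough reconstruction of what they do, assuming Gensim 4.x attribute names and that the metric ranks candidates with `model.wv.most_similar`.)

```python
from gensim.models.callbacks import CallbackAny2Vec

def hit_ratio_at_k(model, seq, k=1):
    """Predict the last token of `seq` from the preceding tokens; 1.0 on a top-k hit."""
    context, target = seq[:-1], seq[-1]
    context = [t for t in context if t in model.wv.key_to_index]
    if not context or target not in model.wv.key_to_index:
        return 0.0
    predictions = model.wv.most_similar(positive=context, topn=k)
    return 1.0 if any(word == target for word, _ in predictions) else 0.0

class MetricCallback(CallbackAny2Vec):
    """Run an arbitrary per-sequence metric over a dataset at the end of each epoch."""
    def __init__(self, name, func, sequences):
        self.name, self.func, self.sequences = name, func, sequences

    def on_epoch_end(self, model):
        scores = [self.func(model, seq) for seq in self.sequences]
        print(f"{self.name}: {sum(scores) / len(scores):.4f}")
```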
This works just fine, except that you quickly run into performance problems: Gensim's training loop is parallelized and fast, but (understandably) callbacks are called within a single process.
To try and mitigate this, I tried using Python's multiprocessing (via the `multiprocessing.dummy` and `concurrent.futures` packages) to make parallel calls to `self.func(model, seq)`. This helps when the data loader is small (a sample of ~3-5k sequences), but when passing the full train/validation data loader, performance isn't so great. For reference, `hit_ratio_at_k` (= `self.func`) on a single process can do about 30 iterations per second.

I suppose I'd want to know if you've dealt with this issue before. Ideally, I'd love to have a Gensim-approved way of doing inference on many documents/word sequences.
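(For concreteness, the parallel attempt looks roughly like the sketch below; this mirrors the approach described above rather than a recommended fix:)

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def parallel_metric(model, sequences, metric_fn, workers=8):
    """Fan per-sequence metric calls out across a thread pool.

    Because the per-sequence work is mostly pure Python, the GIL limits the
    speed-up; this reflects the multiprocessing.dummy / concurrent.futures
    experiment described above.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(partial(metric_fn, model), sequences))
    return sum(scores) / len(scores)
```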
Steps/code/corpus to reproduce
This isn't a bug, but the relevant code blocks are above. Happy to provide any other code that would help clarify. I thought of potentially "freezing" the model's `KeyedVectors` instance (just for evaluation) to see if there's a significant speed-up, but I'm not sure what side effects I might be incurring (if any) by doing so.
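What I mean by "freezing" is roughly the sketch below: evaluate against a detached copy of the vectors instead of the live model (assuming Gensim 4.x, where `fill_norms()` pre-computes the unit-length vectors that similarity queries would otherwise build lazily):

```python
import copy
from gensim.models import KeyedVectors  # the type of model.wv

def frozen_vectors(model) -> KeyedVectors:
    """Return a detached copy of the model's word vectors for evaluation only.

    Evaluating against the copy guarantees the callback can't mutate training
    state; fill_norms() pre-computes the normalized vectors up front.
    """
    kv = copy.deepcopy(model.wv)
    kv.fill_norms()
    return kv
```

A metric could then be pointed at the returned `KeyedVectors` inside `on_epoch_end` instead of at the live model.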
Versions

Here's what I'm working with:
Thank you!