Faster evaluation metrics (baked into the library?) #2986

dataframing · 2020-10-19T21:53:32Z

Before getting into the issue, I'd like to thank you all for maintaining this library! It's been great so far, and I really appreciate the thorough documentation.

Problem description

I'm trying to train Word2Vec embedding vectors on my own dataset. Things have been going well so far, but as I've started to add in certain features to the training loop, it's become more and more difficult to continue.

Our scenario is that we'd like to adapt Twitter's recent paper (and its reference on Word2Vec for recommendation systems) for our own use-case. Put simply, I have three files (train.jsonl, valid.jsonl, test.jsonl) with samples of our full training dataset (~275k, 110k, and 110k examples, respectively).

Using gensim, I can successfully train a Word2Vec model for many epochs and get a proper output. Since Gensim doesn't come out-of-the-box with certain callbacks and metrics, I've rolled my own and applied them successfully—not a problem.

The problem comes when one of those callbacks is a metric that has to do some inference work on many sequences. For example, in the latter paper linked above, the authors describe a Hit Ratio @ K metric, which is doing next-token-prediction on a sequence of n tokens: the context consists of tokens 0, ..., n-1 and the token to be predicted is n. I've implemented it below:

from math import inf
from typing import List
from typing import Optional
from typing import Sequence
from typing import Set
from typing import Tuple

import gensim

def hit_ratio_at_k(
        model: gensim.models.Word2Vec,
        sequence: Sequence[str],
        k: int,
        verbose: bool = False
) -> float:
    if verbose:
        logger.debug(f"Called with `k={k}` on `sequence={sequence}`")

    # Exit early: if we don't have enough data to make separate context from target.
    if len(sequence) < 2:
        return inf

    # Isolate tokens `t_0, ..., t_{n-1}` as context, and `t_n` as target.
    context: List[str] = [*sequence[:-1]]
    target: str = sequence[-1]

    # Get our top `k` predictions for the next token.
    # Note: `gensim` returns None if all context words are OOV.
    preds: Optional[List[Tuple[str, float]]] = model.predict_output_word(
        context_words_list=context, topn=k
    )
    if not preds:
        return inf

    # If we have valid predictions, isolate the unordered tokens.
    pred_tokens: Set[str] = {word for word, _ in preds}

    # Hit Ratio is 1 if the target appears in the list of `k` predicted items, else 0.
    hit_ratio: float = 1.0 if target in pred_tokens else 0.0
    return hit_ratio

I wanted to track Hit Ratio @ 1 on both the training and validation sets after each epoch, so I made a callback that can do that for any of my general metric functions:

from dataclasses import dataclass
from dataclasses import field
from math import isfinite
from typing import Callable
from typing import List
from typing import Sequence
from typing import Tuple

import gensim
import tqdm
from gensim.models.callbacks import CallbackAny2Vec

from myrepo.models.data import DataLoader

logger = ...


@dataclass
class MetricTracker(CallbackAny2Vec):
    dl: DataLoader
    name: str
    func: Callable[[gensim.models.Word2Vec, Sequence[str]], float]
    value: List[float] = field(default_factory=list, init=False)

    def on_epoch_end(self, model: gensim.models.Word2Vec) -> None:
        logger.debug(f"Computing {self.name} via `{self.func.__name__}`")

        # Compute our evaluation metric on each sequence in our data loader.
        per_line_values: Tuple[float, ...] = tuple(
            self.func(model, seq) for seq in tqdm.tqdm(self.dl, total=len(self.dl))
        )

        # Adjust for any invalid values returned by our evaluation metric.
        valid_line_values: Tuple[float, ...] = tuple(
            value for value in per_line_values if isfinite(value)
        )
        avg_value: float = sum(valid_line_values) / len(valid_line_values)
        self.value.append(avg_value)
        logger.debug(f"Finished computing `{self.name}`: {avg_value:.3f}")

This works just fine, except that you quickly run into performance problems: Gensim's training loop is parallelized and fast, but (understandably) callbacks are called within a single process.

To try and mitigate this, I tried using Python's multi-processing (via multiprocessing.dummy and concurrent.futures packages) to make parallel calls to self.func(model, seq). This helps when the data loader is small (a sample of ~3-5k sequences), but when passing the full train/validation data loader, performance isn't so great. For reference, hit_ratio_at_k (=self.func) on a single process can do about 30 iterations per second.

I suppose I'd want to know if you've dealt with this issue before. Ideally, I'd love to have a Gensim-approved way of doing inference on many documents/word sequences.

Steps/code/corpus to reproduce

This isn't a bug, but the relevant code blocks are above. Happy to provide any other code that would help clarify. I thought of potentially "freezing" the model's KeyedVectors instance (just for evaluation) to see if there's a significant speed-up, but I'm not sure what side effects I might be incurring (if any) by doing so.

Versions

Here's what I'm working with:

Linux-4.15.0-65-generic-x86_64-with-debian-9.13
Python 3.7.9 (default, Oct 13 2020, 21:28:14) 
[GCC 6.3.0 20170516]
Bits 64
NumPy 1.19.2
SciPy 1.5.3
gensim 3.8.3
FAST_VERSION 1

Thank you!

The text was updated successfully, but these errors were encountered:

gojomo · 2020-10-20T00:11:31Z

That's an interesting paper! (Great to see that a simple change, making ns_exponent adjustable as per user-contribution in #2090/#2093, can help create giant improvements in downstream tasks.)

It'll take me a while to digest their paper... some of their notation choices are confusing, & on a 1st scan it's not completely clear to me how their choice of epochs and/or 'convergence' was driven. But it kind of looks to me that what they did, and what you may be trying to reproduce, may involve early-stopping based on when some external measure of quality (HR@10) stops improving. Is that true?

If you're finding your single-threaded mid-training evaluation to be too time-consuming, my main thought would be: can a much smaller random sample still provide the same directional guidance?

More generally, I'm a bit suspicious of mid-training evaluations. Until the model has 'converged' according to its own internal optimization targets, any indications of its progress are highly tentative. Ideally: (1) training would always run until the model is 'converged' according to its own internal loss-minimization; (2) the model would only be tested for downstream purposes when in this settled stage. Unfortunately, because Gensim's implementation has incomplete/flaky/buggy loss-tracking (see #2617 & others), historically true convergence hasn't really been tracked/assured - people just try to do "enough" epochs that things seems to settle and work well.

When loss-tracking is fixed, people will have a better idea of whether they've trained 'enough', and more efficient optimization than fixed linear learning-rate decay over a prechosen set of epochs may also become possible & popular.

dataframing · 2020-10-20T03:51:32Z

But it kind of looks to me that what they did, and what you may be trying to reproduce, may involve early-stopping based on when some external measure of quality (HR@10) stops improving. Is that true?

Ultimately, yes. If we're a few epochs in, we'd like to know whether our current set of model hyper-parameters are worth exploring further based on the model's performance on the validation (or some other holdout) set. For this issue I was just asking about how to (possibly) efficiently compute these metrics as a callback, or if this has been requested before.

If you're finding your single-threaded mid-training evaluation to be too time-consuming, my main thought would be: can a much smaller random sample still provide the same directional guidance?

Good idea! I also thought this would work, but "use less data for evaluation" isn't the most satisfying answer :) But yes, we could definitely sample some fraction of the validation set on each epoch and evaluate our model on that.

Until the model has 'converged' according to its own internal optimization targets, any indications of its progress are highly tentative.

That's interesting; can you elaborate? As Gensim's end user, don't we effectively have the final say for when the model has reached its optimization target? It may not have minimized its loss within, say, 10 epochs, but that's the risk we take by not setting iter to a very large value, no? I'd also think it's the risk we take when considering early stopping. That is, if we're training two Word2Vec model configurations A and B to 50 epochs – to then decide which configuration we should train to 200 epochs – we assume the risk of choosing the wrong configuration based on limited/truncated convergence data.

Ideally: (1) training would always run until the model is 'converged' according to its own internal loss-minimization; (2) the model would only be tested for downstream purposes when in this settled stage. Unfortunately, because Gensim's implementation has incomplete/flaky/buggy loss-tracking (see #2617 & others), historically true convergence hasn't really been tracked/assured - people just try to do "enough" epochs that things seems to settle and work well.

Yeah, that's precisely what we were trying to do. The iter epochs parameter does seem like an important hyper-parameter to get right. I think this is what the paper means by "fixed budget": given all other hyper-parameters, which one(s) do best within the limited number of epochs.

I think that answers it, though! Thanks so much for your thoughts on this issue. I'll try sampling the validation set and hopefully that's a good enough approximation of the actual validation performance. I'd love to see some way to "expose" user-defined functions/callbacks to the highly-tuned and efficient Gensim training loop, though 👍

gojomo · 2020-10-20T18:44:03Z

But it kind of looks to me that what they did, and what you may be trying to reproduce, may involve early-stopping based on when some external measure of quality (HR@10) stops improving. Is that true?

Ultimately, yes. If we're a few epochs in, we'd like to know whether our current set of model hyper-parameters are worth exploring further based on the model's performance on the validation (or some other holdout) set. For this issue I was just asking about how to (possibly) efficiently compute these metrics as a callback, or if this has been requested before.

Because the model, in early training, might be very far from its ultimate 'settled' state, I suspect making such a decision on the hyperparameters could be premature. (I see that the paper you've linked has as footnote 3: "we found that models often appear to converge and performance even decreases for several epochs before breaking through to new highs".) The single best-grounded time to evaluate a model is after its internal loss has stagnated - only then has it plausibly reached a point where it can't improve on one training example without performing worse on others.

Until the model has 'converged' according to its own internal optimization targets, any indications of its progress are highly tentative.

That's interesting; can you elaborate? As Gensim's end user, don't we effectively have the final say for when the model has reached its optimization target? It may not have minimized its loss within, say, 10 epochs, but that's the risk we take by not setting iter to a very large value, no? I'd also think it's the risk we take when considering early stopping. That is, if we're training two Word2Vec model configurations A and B to 50 epochs – to then decide which configuration we should train to 200 epochs – we assume the risk of choosing the wrong configuration based on limited/truncated convergence data.

Yes, definitionally simple SGD with linear learning-rate decay & fixed epochs reaches a predetermined stopping point, so sure, the user caps the effort devoted to optimization. But the model may not have truly converged by then – it hasn't reached its actual target, the best performance possible with its current architecture/state-budget. That means any measured performance on a downstream evaluation is somewhat arbitrarily influenced by the interaction between (data/parameters/user-patience) and whether that happens onto a lucky stopping-moment.

For example, overparameterized models prone to extreme overfitting will often peak on some external-evaluation during early-training, but seeing that, then trying to 'stop at just the right moment of undertraining' is a clumsy/unrigorous way to optimize, somewhat short-circuiting the point of SGD. A project taking the care to do a broad search of metaparameters probably doesn't want to take that shortcut.

Ideally: (1) training would always run until the model is 'converged' according to its own internal loss-minimization; (2) the model would only be tested for downstream purposes when in this settled stage. Unfortunately, because Gensim's implementation has incomplete/flaky/buggy loss-tracking (see #2617 & others), historically true convergence hasn't really been tracked/assured - people just try to do "enough" epochs that things seems to settle and work well.

Yeah, that's precisely what we were trying to do. The iter epochs parameter does seem like an important hyper-parameter to get right. I think this is what the paper means by "fixed budget": given all other hyper-parameters, which one(s) do best within the limited number of epochs.

That was another portion of the paper I found confusing. They seem to use lowercase n to mean epochs in some cases (like the inexplicable use of L for window and α for ns_exponent), but then have incredibly low n values in some table rows. presumably to cap runtimes. But such minimal passes are especially likely to be far from model convergence, and thus especially likely to mislead about how a model run with more opportunity-to-converge (via either more epochs or more data) would perform. (It's possible the poor full-dataset optimization results in 'retweets' in Figure 3 is due to this effect, though it's unclear whether Figure 3 is showing the SG or CBOW results.)

Also, as Gensim doesn't yet have any official/reliable facility for early-stopping or "run to convergence" (other than by trial-and-error), when they claim to have done this in places, it's unclear what strategy they used. They seem to have used stagnation on their external HR@10 measure, rather than internal model loss, which if I understand correctly isn't the normal meaning of 'convergence' in an SGD model. And, did they run for, say, 200 epochs, but then stop at 100 when their external 'convergence' test passes? (Thus, leaving internal learning-rate alpha only half-decayed?) Or try many alternate epochs values, and find that 100 seemed to show 'convergence' in its final epochs, settling on that? (Thus, allowing learning-rate to fully decay?) Or, something else custom?

I think that answers it, though! Thanks so much for your thoughts on this issue. I'll try sampling the validation set and hopefully that's a good enough approximation of the actual validation performance. I'd love to see some way to "expose" user-defined functions/callbacks to the highly-tuned and efficient Gensim training loop, though 👍

All the code is there to modify arbitrarily, but note that the training speed comes primarily from (1) using bulk vector calculations from scipy/BLAS whenever possible - avoiding python-loops & one-at-a-time calcs; (2) putting those intense-but-narrowly-focused large-batch calcs inside Cython code that's relinquished the usual Python "GIL", to truly utilize multiple CPU cores for long stretches of time.

Your code could choose to do those too - but such optimizations are tricky enough that trying to mix your ops into the existing Gensim code-blocks would more likely hurt than help. (And, such optimizations could easily throw a monkey-wrench into the highly-functional style you're using.) OTOH, if you simply manage to perform some optimization that does fewer/larger-bulk array operations, you might get most of the benefit without having to think about Cython/GIL. (The bulk array ops themselves use multiple cores when they can, fanning out larger-array-to-large-array calcs across threads.)

I also now notice that you're using predict_output_word() - a somewhat peculiar operation. It doesn't really match either what the model was doing during training – it uses CBOW-style input even if the model is SG, it doesn't mimic the window-weighting that's implicit in usual training, & it necessarily performs an evaluation at every predictable-word output node (unlike the sparse calculations in training). And, this reuse of the model's internal weights to predict words is not a very common way of using word2vec models - last I looked, most word2vec-implementations don't even offer such an API call.

If you're sure that's the calculation your downstream analysis needs, it could likely be optimized. Probably, first, by working in larger batches at a time (so one call to .dot() is supporting many predictions). But also, it's possible the Twitter evaluation was just on cosine-similarities between word-vectors, rather than forward-propagated predictions through the trained model. So you might want to check the performance-bottlenecks & optimization possibilities (which may be similar) in that alternate evaluation, instead.

AdityaSoni19031997 · 2021-04-19T15:27:28Z

First of all, thanks for a helpful discussion and comments;

I am curious, If there's an easy way to launch a Grid-Search on your custom task? [within Gensim itself? or outside?]
Secondly, let's say that your # sequences of tokens length is 50% of the time less than the context_word value; what does wor2vec does in such scenarios?

Thanks,
Aditya.

piskvorky · 2021-04-19T15:49:39Z

Closing as "not a bug" here, but feel free to continue the discussion at the Gensim mailing list.

And of course, for complex optimizations / commercial uses, please consider becoming a Gensim sponsor :)

gojomo mentioned this issue Oct 20, 2020

Support multiple most_similar() queries in one call #2987

Open

piskvorky closed this as completed Apr 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster evaluation metrics (baked into the library?) #2986

Faster evaluation metrics (baked into the library?) #2986

dataframing commented Oct 19, 2020

gojomo commented Oct 20, 2020

dataframing commented Oct 20, 2020

gojomo commented Oct 20, 2020

AdityaSoni19031997 commented Apr 19, 2021

piskvorky commented Apr 19, 2021

Faster evaluation metrics (baked into the library?) #2986

Faster evaluation metrics (baked into the library?) #2986

Comments

dataframing commented Oct 19, 2020

Problem description

Steps/code/corpus to reproduce

Versions

gojomo commented Oct 20, 2020

dataframing commented Oct 20, 2020

gojomo commented Oct 20, 2020

AdityaSoni19031997 commented Apr 19, 2021

piskvorky commented Apr 19, 2021