Initial commit to get BERT + run_glue.py on TPU #1
Conversation
parser.add_argument('--use_tpu', action='store_true', help='Whether to use TPUs.')
parser.add_argument('--num_cores', default=8, type=int, help='Number of TPU cores to use.')
parser.add_argument('--metrics_debug', action='store_true', help='Whether to print debug metrics.')
Are these the only TPU-specific args?
parser.add_argument('--seed', type=int, default=42,
                    help="random seed for initialization")

parser.add_argument('--fp16', action='store_true',
Does enabling this break TPU? It did for fairseq.
  logger.info("  Batch size = %d", args.eval_batch_size)
  eval_loss = 0.0
  nb_eval_steps = 0
  preds = None
  out_label_ids = None
- for batch in tqdm(eval_dataloader, desc="Evaluating"):
+ for batch in tqdm(eval_dataloader, desc="Evaluating", disable=args.use_tpu):
Why disable this?
@@ -505,7 +436,7 @@ def main():

  # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
- if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0) and not args.tpu:
+ if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
Where do these get set correctly for our MP purposes?
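On TPU, the per-process rank typically comes from torch_xla's multiprocessing layer rather than `torch.distributed`, so a save guard usually checks the XLA ordinal instead of `local_rank`. A hedged sketch of what such a guard could look like (the helper name is made up for illustration, and it assumes torch_xla is installed when `use_tpu` is set):

```python
def should_save_checkpoint(use_tpu, local_rank=-1):
    # Hypothetical helper: returns True only on the process that should write
    # the checkpoint, so multiple workers don't clobber the same files.
    if use_tpu:
        import torch_xla.core.xla_model as xm
        # xm.is_master_ordinal() is True on exactly one process.
        return xm.is_master_ordinal()
    if local_rank == -1:
        # Single-process run: no torch.distributed involved.
        return True
    import torch.distributed as dist
    return dist.get_rank() == 0
```

With a guard like this, the `local_rank == -1 or get_rank() == 0` condition in the diff above would not need a special `not args.tpu` carve-out.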
def main_cli():
    args = get_args()
    if args.use_tpu:
Having to pass --use_tpu every time feels annoying, but it's no big deal.
Yep, it indeed is, so I'll create a separate runner as discussed offline.
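One way to avoid passing --use_tpu on every invocation is to auto-detect whether torch_xla is importable and treat the flag as an override. A minimal sketch, not part of this PR (both helper names are hypothetical, and a real check might also inspect the XRT/PJRT environment variables):

```python
import importlib.util

def tpu_available():
    # Hypothetical helper: treat the presence of torch_xla as "TPU usable".
    return importlib.util.find_spec("torch_xla") is not None

def resolve_use_tpu(flag_value):
    # The explicit --use_tpu flag wins; otherwise fall back to detection.
    if flag_value:
        return True
    return tpu_available()
```

With something like this, the flag only needs to be passed to force TPU mode, not on every run.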
  if args.max_steps > 0 and global_step > args.max_steps:
      train_iterator.close()
      break

  if args.local_rank in [-1, 0]:
      tb_writer.close()

- return global_step, tr_loss / global_step
+ return global_step, loss.item()
Is this equivalent?
Sort of; it's just not a real average, only the last batch's loss.
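The difference is easy to see in isolation: `tr_loss / global_step` is a running mean over all optimization steps, while `loss.item()` is just the final batch's loss. A small self-contained sketch (plain Python, no training loop):

```python
def mean_loss(losses):
    # What the original code returned: total loss / number of steps.
    tr_loss = sum(losses)
    global_step = len(losses)
    return tr_loss / global_step

def last_loss(losses):
    # What the TPU branch returns: only the final batch's loss.
    return losses[-1]

per_step_losses = [4.0, 2.0, 1.0, 1.0]
print(mean_loss(per_step_losses))  # 2.0
print(last_loss(per_step_losses))  # 1.0
```

For logging purposes the last-batch value is noisier but avoids pulling a running scalar off the TPU every step.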
  if args.fp16:
      torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
  else:
      torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

  optimizer.step()
  if args.use_tpu:
+     xm.optimizer_step(optimizer, barrier=True)
Why do we need a barrier here? Isn't it in ParallelLoader already?
Good point! We don't need the barrier here. It's an artifact I forgot to clean up from testing on a single core.
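For reference, the barrier-free pattern looks roughly like the following. This is a sketch under the assumption that torch_xla is installed and that batches come through `ParallelLoader`, which already inserts the per-step synchronization, so `xm.optimizer_step(optimizer)` can be called without `barrier=True`:

```python
def tpu_train_step(model, train_loader, optimizer, max_grad_norm):
    # Sketch only: imports are local so this file stays importable
    # on machines without torch_xla.
    import torch
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    device = xm.xla_device()
    para_loader = pl.ParallelLoader(train_loader, [device])
    for batch in para_loader.per_device_loader(device):
        loss = model(**batch)[0]
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        # No barrier=True needed: ParallelLoader handles the step boundary.
        xm.optimizer_step(optimizer)
        optimizer.zero_grad()
```

`barrier=True` is only needed when stepping outside a ParallelLoader loop, e.g. single-core experiments that feed tensors to the device manually.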
Thanks for the review @taylanbil! Based on offline conversation with the Google and HuggingFace teams, we will close this PR in favor of preparing a separate runner.
Verified performance numbers look at least comparable on a chip-to-chip basis (TPUv3 vs. V100) for the MRPC dataset, with essentially the same accuracy and F1 test metrics. The runner script works for both GPU and TPU.