
Have a bunch of benchmarks #177

Open · madig opened this issue Sep 9, 2021 · 12 comments

madig (Collaborator) commented Sep 9, 2021

Every now and then I profile UFO loading (and writing). I think it would be nice to have a bunch of benchmarks (https://doc.rust-lang.org/cargo/commands/cargo-bench.html) ready to run. They could also serve as entry points for profilers.

I'd say we could import a copy of Noto Sans and maybe even a custom to-UFO translation of Noto Sans CJK to make norad work up a sweat. The CJK part still needs figuring out, until those sources are opened.

Scenarios:

  • Serial loading of UFOs, no parallel loading of glifs (without rayon feature)
  • Serial loading of UFOs, parallel loading of glifs (with rayon feature)
  • Parallel loading of UFOs, no parallel loading of glifs
  • Parallel loading of UFOs, parallel loading of glifs
  • Loading a single UFO without any font or glyph libs
  • Loading a single UFO where every glyph has a lib with lotsa stuff and the font lib is big (to hammer plist)

Could also include the line-ending benches in #172 (comment).
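
A minimal sketch of what one of these benches could look like with criterion (assuming a `benches/load.rs` target registered in Cargo.toml with `harness = false`; `norad::Font::load` is the current loading entry point, older norad versions call the type `Ufo`, and the UFO path is a placeholder):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Serial load of a single UFO. Run once without and once with
// `--features rayon` to compare the glif-parallelism scenarios above.
fn load_single_ufo(c: &mut Criterion) {
    // Placeholder path; point this at whatever test UFO the data repo ships.
    c.bench_function("load MutatorSansLightWide", |b| {
        b.iter(|| norad::Font::load("testdata/MutatorSansLightWide.ufo").unwrap())
    });
}

criterion_group!(benches, load_single_ufo);
criterion_main!(benches);
```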

chrissimpkins (Collaborator) commented:

I really like this idea. Are the GHA runners reliable enough environments to run these benchmarks? If not, how would you standardize the execution and reporting?

madig (Collaborator, Author) commented Sep 9, 2021

I wasn't actually thinking about GHA; I'm not sure how suitable CI infrastructure is for reliable benchmarking. This is more aimed at easily running benches on various machines to compare e.g. platform differences. Benchmarking on commits does sound enticing, though...

chrissimpkins (Collaborator) commented:

https://fast.vlang.io/ appears to use a free instance on AWS, possibly related to #175 too

cmyr (Member) commented Sep 10, 2021

I wouldn't benchmark on CI infrastructure, and generally wouldn't want to benchmark on a virtual machine. I do think benchmarks are important, although I would prefer criterion to the built-in cargo bench.

madig (Collaborator, Author) commented Sep 12, 2021

Another idea: look for quadratic runtime by taking a massive CJK UFO, loading incrementally more of it, and seeing whether the timings form a line or an upward curve. The same goes for comparison and other things you can do to objects.
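
One way to sketch that incremental check with criterion's parameterized benchmarks (the per-size UFO paths here are hypothetical; the point is to plot time against glyph count and look for super-linear growth):

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn glyph_count_scaling(c: &mut Criterion) {
    let mut group = c.benchmark_group("glyph-count-scaling");
    // Hypothetical UFOs prepared ahead of time with increasing glyph counts.
    for n in [10_000u64, 20_000, 40_000, 60_000] {
        let path = format!("testdata/amalgam-{}.ufo", n);
        group.bench_with_input(BenchmarkId::from_parameter(n), &path, |b, path| {
            b.iter(|| norad::Font::load(path).unwrap())
        });
    }
    group.finish();
}

criterion_group!(benches, glyph_count_scaling);
criterion_main!(benches);
```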

chrissimpkins (Collaborator) commented:

> having a massive CJK UFO

I looked into Noto CJK sources. They are not available and won't be in the near term.

madig (Collaborator, Author) commented Oct 5, 2021

I made a 60k-glyph amalgamation of Noto at https://github.com/madig/noto-amalgamated. It's just the Regular for now; maybe I should do an amalgamation for all Designspace extremes? I need to think about what and how I want to benchmark.

BTW: I profiled the amalgamation script and was amazed to find that ~2 of the 9-10 minutes of runtime are spent in ufoLib.filenames.userNameToFileName. What the hell.

madig (Collaborator, Author) commented Oct 10, 2021

Looking at this 🤔 So, criterion is built such that if you want to compare rayon to no rayon, you run cargo criterion --features rayon instead of changing the benchmarks. This leaves the question of what to benchmark and how.

I currently have Mutator Sans as a small UFO collection, a recent Noto Sans as a medium-size UFO collection (but with 15 masters), and one huge Noto Amalgamated. I know that plist loading influences parsing time: 15-25% of Mutator Sans glyphs have a lib, almost all glyphs in Noto Sans do, and 75% in Noto Amalgamated do. I had the idea of measuring with and without plists and such, but maybe I should keep it realistic and take the three UFO families as they are for now, until I have a clearer idea of what I want to benchmark and why.

So, maybe I'll make a new data repo with Mutator Sans, Noto Sans, and Noto Amalgamated (maybe amalgamated across all points in the Designspace), hook that in as a git submodule, and test serial loading in each group plus parallel loading (launching one thread per UFO, as sketched below). Then I can bench with --features rayon and without?

Edit: I just saw that amalgamating Noto by style name gives me a nice progression of glyph counts. I can bench that.
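
The "one thread per UFO" part could look roughly like this sketch with plain std threads (nothing norad-specific beyond `Font::load`; error handling elided):

```rust
use std::path::PathBuf;
use std::thread;

// Spawn one loader thread per UFO, then join them all and collect the fonts.
fn load_all_parallel(paths: Vec<PathBuf>) -> Vec<norad::Font> {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|path| thread::spawn(move || norad::Font::load(path).unwrap()))
        .collect();
    handles.into_iter().map(|handle| handle.join().unwrap()).collect()
}
```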

madig (Collaborator, Author) commented Oct 10, 2021

Interestingly, there does seem to be some quadratic behavior going on without rayon? The X-axis is the number of glyphs (amalgamated Noto has a nice glyph-count progression), the Y-axis is load time in seconds. Not loading glyph libs halves loading time, but the graph keeps the slope. Or am I reading the graph wrong?

[graph: load time in seconds vs. number of glyphs]

cmyr (Member) commented Oct 12, 2021

I don't think the graph is especially clear: it isn't far from being a straight line, and there's always the possibility of measurement noise.

chrissimpkins (Collaborator) commented:

I wandered across this project from the Criterion developer; it claims to support benchmark tests on CI infrastructure:

https://github.com/bheisler/iai

  • Precision: High-precision measurements allow you to reliably detect very small optimizations to your code
  • Consistency: Iai can take accurate measurements even in virtualized CI environments
  • Performance: Since Iai only executes a benchmark once, it is typically faster to run than statistical benchmarks
  • Profiling: Iai generates a Cachegrind profile of your code while benchmarking, so you can use Cachegrind-compatible tools to analyze the results in detail

Valgrind-based, Linux only IIUC.
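
Following the pattern from the iai README, a norad bench could be as small as this sketch (the UFO path is a placeholder; each benchmark function runs exactly once under Cachegrind):

```rust
use iai::black_box;

// Loading a single UFO; the path is a placeholder test fixture.
fn load_mutator_sans() -> norad::Font {
    norad::Font::load(black_box("testdata/MutatorSansLightWide.ufo")).unwrap()
}

iai::main!(load_mutator_sans);
```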

chrissimpkins (Collaborator) commented Dec 22, 2021

Can confirm that iai works on the GH Actions Ubuntu runner CI with an apt install of valgrind, and the data appear to be relatively stable across runs. Cannot confirm accuracy, nor whether the data are useful for performance-improvement work (yet)... :)
