Poor performance of get_subjects() even when using indexing without metadata #940

Closed
wasciutto opened this issue Feb 5, 2023 · 10 comments · Fixed by #942

Comments

@wasciutto
Contributor

I am puzzled by how slow get_subjects() is for me. It takes about 10 seconds to retrieve the subject IDs for a 67-subject BIDS directory. This is after some improvement from using a saved index file and not indexing metadata - two of the most common performance suggestions I see here.

I don't understand why gathering the subject IDs in such a small database wouldn't be almost instantaneous. Meanwhile, when I query for a single subject, the result is as quick as you'd expect.

Here is how I fetch my layout, using version 0.15.5:

from bids import BIDSLayout
from bids.layout import BIDSLayoutIndexer

bids_dir = "path/to/my/bids_dir"
database_path = "path/to/my/index"
refresh = False  # set to True to rebuild the saved index

indexer = BIDSLayoutIndexer(index_metadata=False)

layout = BIDSLayout(
    bids_dir,
    indexer=indexer,
    database_path=database_path,
    reset_database=refresh,
)
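
A minimal way to reproduce that timing (not the original measurement code; layout is the object built above):

import time

start = time.perf_counter()
subjects = layout.get_subjects()
print(f"{len(subjects)} subjects in {time.perf_counter() - start:.1f}s")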

Is there some way I can configure the indexer to improve my performance here, or some other solution?

Thanks!

@effigies
Collaborator

effigies commented Feb 6, 2023

We just merged a bug fix to the indexer (#936). Would you be willing to install master and see if the issue remains? The bug caused the indexer to traverse directories that should have been ignored.

pip install git+https://github.com/bids-standard/pybids.git

@wasciutto
Contributor Author

I just tried and unfortunately get_subjects() is still taking the same amount of time.

@adelavega
Collaborator

Can you elaborate on what you mean by "querying for a single subject"?

I do recall this being rather slow at times, so I'll take a look. My guess is that it has to do with taking the "unique" set of subjects, and that it's running too many preliminary queries to do so.

@adelavega
Collaborator

adelavega commented Feb 7, 2023

So on a random dataset I had, I couldn't quite replicate your issue, but I do see that get_subjects takes two orders of magnitude longer than get:

In [7]: %timeit layout.get(subject='1')
5.02 ms ± 439 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit layout.get_subjects()
660 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The reason for this is that get_subjects is not a special function, but rather a shortcut for the more general form: layout.get(return_type='id', target='subject')

I'll see if there are any quick fixes for this, but if not, there is a major refactor coming up that will speed up pybids significantly: #863
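
To make that equivalence concrete (a minimal sketch, not from the thread; it assumes layout is an indexed BIDSLayout):

# get_subjects() is a convenience wrapper around the general query form,
# so both calls return the same set of subject labels.
subjects_a = layout.get_subjects()
subjects_b = layout.get(return_type='id', target='subject')
assert sorted(subjects_a) == sorted(subjects_b)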

@wasciutto
Contributor Author

wasciutto commented Feb 7, 2023

Yup, so by querying a single subject I mean the same thing as in your timing example: using layout.get(subject='123').

As far as replicating my issue, I believe your test showing a two-order-of-magnitude difference between get_subjects() and get(subject='123') captures exactly what I'm dealing with.

I appreciate you looking into this for me!

@wasciutto
Contributor Author

Maybe there is some way for me to index on 'id'?

@adelavega
Collaborator

Looks like it's slow because of this line:

ent_iter = (x.get_entities(metadata=metadata) for x in results)

Essentially, it's asking for the metadata of every indexed file in order to later filter down to only those files that have the target (in this case, subject) in their metadata.

.get_entities is > 2 orders of magnitude slower than just checking .entities. I'm trying to figure out if just using the attribute would be reliable enough here.
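
A minimal sketch of that comparison (not from the thread; it assumes layout is an indexed BIDSLayout and grabs an arbitrary file from it):

import timeit

f = layout.get()[0]  # any indexed BIDSFile

# Method call: rebuilds the entity dictionary on each invocation.
t_method = timeit.timeit(lambda: f.get_entities(metadata=False), number=1000)

# Attribute access: reads the already-indexed entity mapping.
t_attr = timeit.timeit(lambda: f.entities, number=1000)

print(f"get_entities(): {t_method:.3f}s, .entities: {t_attr:.3f}s (1000 calls each)")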

@adelavega
Collaborator

BTW, you are totally right that this can be instantaneous.

If you run this query, you can get it in 2ms:

from bids.layout import models
layout.session.query(models.Tag).filter_by(entity_name='subject').distinct().with_entities(models.Tag._value).all()

The problem is that layout.get does this in a very inefficient way in order to piggyback on earlier logic that supports filtering files by other entities. That is, layout.get_subjects can return only the subjects that match a certain task, for example, and supporting that is what complicates the logic and in the end makes it slower.

I'm about to submit a patch that speeds it up at least 20x by getting each file's entities with a faster method.
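
In the meantime, the direct query above can be wrapped as a workaround (a hypothetical helper, not part of the pybids API; the name fast_get_entity_values is illustrative):

from bids.layout import models

def fast_get_entity_values(layout, entity_name='subject'):
    """Return distinct values of an entity by querying the Tag table directly."""
    rows = (
        layout.session.query(models.Tag)
        .filter_by(entity_name=entity_name)
        .distinct()
        .with_entities(models.Tag._value)
        .all()
    )
    return sorted(value for (value,) in rows)

# e.g. fast_get_entity_values(layout) -> ['01', '02', ...]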

@wasciutto
Contributor Author

This is excellent, thanks so much for diving into this, explaining the situation, and providing a patch.

I have been working to transition a mid-size application away from interacting with BIDS directories through the file system and toward using pybids exclusively. However, for the use case of getting all subject IDs (a very frequent one), a simple file traversal has been much faster than pybids. Speeding up this crucial operation will smooth out the user experience where we already use it, and make me much more enthusiastic about adopting pybids where we haven't yet.
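
For context, the file-system traversal being compared is essentially just listing sub-* directories at the dataset root (a minimal sketch, assuming the standard BIDS sub-<label> naming; not our actual application code):

from pathlib import Path

def subjects_from_filesystem(bids_dir):
    """List subject labels by globbing sub-* directories at the dataset root."""
    return sorted(
        p.name[len("sub-"):]
        for p in Path(bids_dir).glob("sub-*")
        if p.is_dir()
    )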

@adelavega
Collaborator

I'm glad. Like I said though, we are working on a major refactor that should make an even bigger difference, so stay tuned.
