Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reported very slow performance compared to DuckDB in ibis-project #8492

Closed
alamb opened this issue Dec 11, 2023 · 15 comments · Fixed by #8631
Closed

Reported very slow performance compared to DuckDB in ibis-project #8492

alamb opened this issue Dec 11, 2023 · 15 comments · Fixed by #8631
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

Describe the bug

As reported by @cpcloud in ibis-project/ibis#7703

The most relevant portions:

DataFusion
DataFusion never ran out of memory and had a memory profile similar to DuckDB:
single digit GBs peak memory.

However, it was still extremely slow compared to DuckDB, about 9-10 minutes to
run the whole workload.

Similarly to Polars I compared both the Ibis implementation and a hand-written
SQL version (built from the generated Ibis code). Both had the same performance

I also looked at perf top while the DataFusion workload was running and saw this:

289368886-a33bf90f-8877-43b0-8a4d-328f9721f264

To Reproduce

TBD (first thing would be to get a datafusion only reproducer)

Looks like the query, from ibis-project/ibis#7703 is

SELECT
  month,
  ext,
  COUNT(DISTINCT project_name) AS project_count
FROM (
  SELECT
    project_name,
    DATE_TRUNC('month', uploaded_on) AS month,
    NULLIF(
      REPLACE(
        REPLACE(
          REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2],
                'cxx|cpp|cc|c|hpp|h',
                'C/C++',
                'g'
              ),
              '^f.*$',
              'Fortran',
              'g'
            ),
            'rs',
            'Rust'
          ),
          'go',
          'Go'
        ),
        'asm',
        'Assembly'
      ),
      ''
    ) AS ext
  FROM pypi
  WHERE COALESCE(
      ARRAY_LENGTH(
        REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
      ) > 0,
      FALSE
    )
    AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE)
    AND NOT STRPOS(path, '/site-packages/') > 0
)
WHERE ext IS NOT NULL
GROUP BY month, ext
ORDER BY month DESC, project_count DESC

Expected behavior

No response

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Dec 11, 2023

I wonder if the slowness in regexp / time spent compiling stuff is related to not being able to pre-compile the argument and instead re-creating the regular expression for each batch.

@thinkharderdev mentioned something similar for #8051

@alamb alamb added the help wanted Extra attention is needed label Dec 11, 2023
@cpcloud
Copy link
Contributor

cpcloud commented Dec 11, 2023

Thanks for creating the issue!

The example can be simplified a bit.

It should be sufficient to see the performance difference with DuckDB by:

  1. Using any single file from the dataset (they're all big enough to take some noticeable amount of time)
  2. Removing the aggregation and using only the inner SELECT.

@Dandandan
Copy link
Contributor

A similar problem (recompiling the regex again and again) I found some time ago in the clickbench benchmark as well (query 28):
https://github.com/JayjeetAtGithub/datafusion-duckdb-benchmark/blob/main/clickbench/queries-datafusion.sql#L29

@zhangxffff
Copy link
Contributor

test with simple regexp_match query with index-0.parquet
array-datafusion took 47.847 seconds.

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_MATCH(path, '\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


DataFusion CLI v33.0.0
+----------+
| COUNT(*) |
+----------+
| 5834398  |
+----------+
1 row in set. Query took 47.847 seconds.

../datafusion-cli/target/release/datafusion-cli -f d0.sql  225.03s user 5.77s system 477% cpu 48.353 total

duckdb took 3.029 seconds

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      5834398 │
└──────────────┘
../../duckdb/build/release/duckdb -s "`cat d1.sql`"  15.70s user 1.78s system 576% cpu 3.029 total

@Weijun-H
Copy link
Member

test with simple regexp_match query with index-0.parquet array-datafusion took 47.847 seconds.

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_MATCH(path, '\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


DataFusion CLI v33.0.0
+----------+
| COUNT(*) |
+----------+
| 5834398  |
+----------+
1 row in set. Query took 47.847 seconds.

../datafusion-cli/target/release/datafusion-cli -f d0.sql  225.03s user 5.77s system 477% cpu 48.353 total

duckdb took 3.029 seconds

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      5834398 │
└──────────────┘
../../duckdb/build/release/duckdb -s "`cat d1.sql`"  15.70s user 1.78s system 576% cpu 3.029 total

Maybe related to this #8524

@cpcloud
Copy link
Contributor

cpcloud commented Dec 14, 2023

Seems unlikely to be related to #8524. The issue is present for a single file.

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2023

I had some ideas on how to speed up regular expression evaluation here: #8051 (comment)

@comphead
Copy link
Contributor

df-distinct-graph
Attaching Flamegraph for the query

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;

@alamb
Copy link
Contributor Author

alamb commented Dec 15, 2023

The blog post is now published: https://ibis-project.org/posts/pydata-performance-part2/

@alamb
Copy link
Contributor Author

alamb commented Dec 15, 2023

I looked into the regular expression matching code in DataFusion -- there is a lot of room for improvement:

It translates each argument into an array (even when the argument is a constant). Thus DataFusion is effectively compiling the regular expression for each row (not even each batch) which is unsurprisingly quite expensive

This is very fixable but the way the functions are wired in will take some finagling I think

@alamb
Copy link
Contributor Author

alamb commented Dec 23, 2023

@viirya found some very low hanging fruit: #8538 #8631

@cpcloud
Copy link
Contributor

cpcloud commented Dec 23, 2023

Happy to try the query again once the next release is out!

@viirya
Copy link
Member

viirya commented Dec 23, 2023

It is #8631 actually. 😄

@my-vegetable-has-exploded
Copy link
Contributor

my-vegetable-has-exploded commented Mar 23, 2024

df-distinct-graph Attaching Flamegraph for the query

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;

Hi, could you share the method to draw the flamegraph?

There is a large "unknown" in my flame graph, which prevents me from tracking the function call chain.

图片

Here is my shell to generate the flamegraph using datafusion-cli.

#!/bin/bash
PERF_DATA_FILE="$1"
../../datafusion-cli/target/debug/datafusion-cli -f repro-range-query.sql &
# datafusion-cli -f repro-range-query.sql &
PID=$!
# use perf to collect data
sudo /usr/bin/perf_5.15 record --call-graph=dwarf -e cpu-clock -F 100 -p $PID -- sleep 30
sudo /usr/bin/perf_5.15 script -i perf.data >"perf.unfold"
FLAMEGRAPH_DIR="/home/deepin/sdk/FlameGraph"
"$FLAMEGRAPH_DIR/stackcollapse-perf.pl" "perf.unfold" >"perf.folded"

"$FLAMEGRAPH_DIR/flamegraph.pl" "perf.folded" >"$1.svg"

rm "perf.unfold" "perf.folded"
echo "Flame graph SVG created: $1.svg"

Thanks very much. @comphead

@alamb
Copy link
Contributor Author

alamb commented Mar 23, 2024

Hi, could you share the method to draw the flamegraph?

@comphead recently added this to the contributor guide: https://arrow.apache.org/datafusion/library-user-guide/profiling.html#building-a-flamegraph

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants