[Website]: DataFusion 26-34 blog post #457

Merged
merged 19 commits into from
Jan 19, 2024

Conversation

alamb
Contributor

@alamb alamb commented Jan 7, 2024

Closes apache/datafusion#6780

This blog post describes DataFusion over the last 6 months, DataFusion 26 to 34.

If anyone has time to pitch in and look up links or help with the language, that would be most appreciated.

@alamb alamb marked this pull request as ready for review January 14, 2024 13:44
@alamb
Contributor Author

alamb commented Jan 14, 2024

I hope to publish this post at the end of the week -- around Jan 19

Member

@andygrove andygrove left a comment

This looks excellent. Thank you @alamb

_posts/2024-01-25-datafusion-34.0.0.md
x: [[1,2,3]]
```

## Improved `STRUCT` and `ARRAY` support
Contributor Author

@jayzhan211 are there any other improvements you think we should call out about struct/array support over the last 6 months?

Nope. I think all of them are new features compared to 6 months ago.


This year some major initiatives contributors plan to focus on are:

1. *Modularity*: Make Datafusion even more modular, such as [unifying how
Contributor Author

@ozankabak are there any plans you and your team may have that you want to share publicly?

We don't have anything to share just yet on features we will contribute in 2024 (but there will be many!). We will probably have something to publish in a month or two.

Contributor

I think this plot is hard to read, mostly due to the wildly different numbers and because "execution time" often is an aggregate. For these kinds of plots, I see two possible ways to improve them:

  • log axis: use a log axis for time, because improvements are often not linear deltas but factors and a log-space would account for that nicely. That would also make the wildly different numbers easier to read. Drawback: people don't read log space very well.
  • relative factor: Only draw bars for DF v34 as a factor relative to v25 (which would be <1.0x in most cases) and on top of the bar (or at the base) print the seconds it took for v25. This tells the story of the change but also gives readers a baseline.

Contributor Author

@alamb alamb Jan 16, 2024

I used this new chart. Here is what the page looks like rendered now:

[Image: Screenshot 2024-01-16 at 4 18 09 PM]

Contributor

That's better. I think that the execution time (blue) shouldn't be a line plot because that type implies a connection between the neighboring points (like a time series where subsequent entries are indeed related). The linear interpolation between the measurements makes this even more misleading / "weird".

Contributor Author

I couldn't figure out how to make Google Sheets do what I wanted, so I eventually just made two charts: one shows the overall magnitude and one shows the relative improvement.

[Image: Screenshot 2024-01-19 at 7 06 09 AM]

I am sure we could do better if we spent more time on this, but I think it is good enough for now.

Comment on lines 310 to 311
2. *Community Growth*: Graduate to our own top level Apache project, and
subsequently add more committers and PMC members to help the project grow.

We plan to write show-and-tell blog posts and videos that explain how one can use DataFusion in real-world use cases. We will try to partner with members of the community to create toy examples relating to their use cases and try to come up with demo scripts that offer guidance to others on how they can use DataFusion in similar contexts.

Maybe it could be a good idea to mention these upcoming show-and-tells as a near-future community growth effort.

Contributor Author

Yes, this is great. I will include it

Contributor Author

Included in 95158bb

@alamb alamb changed the title from "[Website]: DataFusion 26-34 blog" to "[Website]: DataFusion 26-34 blog post" Jan 16, 2024

You can do this using [`CREATE EXTERNAL TABLE` statement] for example:

```sql
Contributor Author

BTW, does anyone have an example of writing to remote object storage (e.g. S3) handy that they could share so I can include it here?

@devinjdangelo do you have this set up?

COPY table to 's3://my_bucket/my_prefix' should work in datafusion-cli so long as the credentials are set up. I'll verify real quick...

Ok, confirmed with a caveat. Using COPY directly to an object store from datafusion-cli does not work, but INSERT into an external table does. We probably need to add special logic to datafusion-cli to make COPY to an object store work directly. That would be a neat feature to add.

For now this works:

export AWS_SECRET_ACCESS_KEY=...
export AWS_ACCESS_KEY_ID=...
export AWS_DEFAULT_REGION=...
datafusion-cli
❯ create external table remote_table2(a int, b int) stored as parquet location 's3://dfcli-test-bucket2/';
0 rows in set. Query took 0.001 seconds.

❯ insert into remote_table2 values (1,2);
+-------+
| count |
+-------+
| 1     |
+-------+
1 row in set. Query took 0.272 seconds.
❯ select * from remote_table2;
+---+---+
| a | b |
+---+---+
| 1 | 2 |
+---+---+
1 row in set. Query took 0.131 seconds.

I see the parquet file in my s3 bucket as well.
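
For anyone who wants to script the workaround above rather than type it interactively, here is a minimal sketch. It is only an illustration: `require_env`, `session_sql`, and `run_s3_demo` are hypothetical helper names, and it assumes `datafusion-cli` is on the `PATH` and accepts `-f` to execute a SQL file. The statements are the ones from the session above.

```shell
# Hypothetical helper: succeed only if every named variable is set and non-empty.
# The names are passed through eval, so call it with trusted variable names only.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Hypothetical helper: print the three statements from the session above
# for a given table name ($1) and object store location ($2).
session_sql() {
  table=$1
  location=$2
  cat <<SQL
CREATE EXTERNAL TABLE ${table}(a INT, b INT) STORED AS PARQUET LOCATION '${location}';
INSERT INTO ${table} VALUES (1, 2);
SELECT * FROM ${table};
SQL
}

# Hypothetical driver: check the credentials, write the statements to a
# temporary file, and hand that file to datafusion-cli (assumed: -f flag).
# The AWS_* variables must also be exported, as in the session above.
run_s3_demo() {
  require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION || return 1
  sql_file=$(mktemp) || return 1
  session_sql "$1" "$2" > "$sql_file"
  datafusion-cli -f "$sql_file"
  status=$?
  rm -f "$sql_file"
  return "$status"
}
```

With the functions loaded, `run_s3_demo remote_table2 's3://dfcli-test-bucket2/'` should replay the session. Checking the credentials up front is just a convenience, so that a missing variable is reported before any request reaches the object store.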

Contributor Author

Thanks @devinjdangelo -- I'll file a ticket about this in DataFusion later today

#8907 filed to make the COPY example work...

Contributor Author

> Ok, confirmed with a caveat. Using COPY directly to an object store from datafusion-cli does not work, but INSERT into an external table does. We probably need to add special logic to datafusion-cli to make COPY to an object store work directly. That would be a neat feature to add.

Thank you 🙏

@alamb
Contributor Author

alamb commented Jan 19, 2024

Thank you everyone who helped with this project

@alamb alamb merged commit b49af25 into apache:main Jan 19, 2024
1 check passed
@alamb alamb deleted the alamb/df_blog_34 branch January 19, 2024 12:15
@alamb
Contributor Author

alamb commented Jan 19, 2024

The blog post is now published! https://arrow.apache.org/blog/2024/01/19/datafusion-34.0.0/

Development

Successfully merging this pull request may close these issues.

Blog post with DataFusion Jun - Sep 2023
9 participants