From bad3507505b9c12a3f42ac1ca3ef3f252e962733 Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon, 22 Apr 2024 14:16:15 +0100 Subject: [PATCH 1/9] create dask migration guide --- .../migration_guides/coming_from_dask.rst | 110 ++++++++++++++++++ 1 file changed, 110 insertions(+) create mode 100644 docs/source/migration_guides/coming_from_dask.rst diff --git a/docs/source/migration_guides/coming_from_dask.rst b/docs/source/migration_guides/coming_from_dask.rst new file mode 100644 index 0000000000..05bddbcfd1 --- /dev/null +++ b/docs/source/migration_guides/coming_from_dask.rst @@ -0,0 +1,110 @@ +# Dask Migration Guide +This migration guide explains the most important points that anyone familiar with Dask should know when trying out Daft or migrating Dask workloads to Daft. The guide includes an overview of technical,conceptual and syntax differences between the two libraries that you should be aware of. Understanding these differences will help you evaluate your choice of tooling and ease your migration from Dask to Daft. + +## When should I use Daft? +Dask and Daft are DataFrame frameworks built for distributed computing. Both libraries enable you to process large, tabular datasets in parallel, either locally or on remote instances on-prem or in the cloud. 
+ +If you are currently using Dask, you may want to consider migrating to Daft if you: + +- Are working with multimodal data types, such as nested JSON, tensors, Images, URLs, etc., +- Need faster computations through query planning and optimization, +- Are executing machine learning workloads at scale, +- Want to benefit from native Rust concurrency + +You may want to stick with using Dask if you: + +- Want to only write pandas-like syntax, +- Need to parallelize array-based workloads or arbitrary Python code that does not involve DataFrames (with Dask Array, Dask Delayed and/or Dask Futures) + +The following sections explain conceptual and technical differences between Dask and Daft. Whenever relevant, code snippets are provided to illustrate differences in syntax. + +## Daft does not use an index +Dask aims for as much feature-parity with pandas as possible, including maintaining the presence of an Index in the DataFrame. But keeping an Index is difficult when moving to a distributed computing environment. Dask doesn’t support row-based positional indexing (with .iloc) because it does not track the length of its partitions. It also does not support pandas MultiIndex. The argument for keeping the Index is that it makes some operations against the sorted index column very fast. In reality, resetting the Index forces a data shuffle and is an expensive operation. + +Daft drops the need for an Index to make queries more readable and consistent. How you write a query should not change because of the state of an index or a reset_index call. In our opinion, eliminating the index makes things simpler, more explicit, more readable and therefore less error-prone. Daft achieves this by using the [Expressions API](). 
+ +In Dask you would index your DataFrame to return row `b` as follows: + +``` +ddf.loc[[“b”]] +``` + +In Daft, you would accomplish the same by using a `col` Expression to refer to the column that contains `b`: + +``` +df.where(daft.col(“alpha”)==”b”) +``` + +More about Expressions in the sections below. + +## Daft does not try to copy the pandas syntax +Dask is built as a parallelizable version of pandas and Dask partitions are in fact pandas DataFrames. When you call a Dask function you are often applying a pandas function on each partition. This makes Dask relatively easy to learn for people familiar with pandas, but it also causes complications when pandas logic (built for sequential processing) does not translate well to a distributed context. When reading the documentation, Dask users will often encounter this phrase `“This docstring was copied from pandas.core… Some inconsistencies with the Dask version may exist.”` It is often unclear what these inconsistencies are and how they might affect performance. + +Daft does not try to copy the pandas syntax. Instead, we believe that efficiency is best achieved by defining logic specifically for the unique challenges of distributed computing. This means that we trade a slightly higher learning curve for pandas users against improved performance and more clarity for the developer experience. + +## Daft eliminates manual repartitioning of data +In distributed settings, your data will always be partitioned for efficient parallel processing. How to partition this data is not straightforward and depends on factors like data types, query construction, and available cluster resources. While Dask often requires manual repartitioning for optimal performance, Daft abstracts this away from users so you don’t have to worry about it. + +Dask leaves repartitioning up to the user with guidelines on having partitions that are “not too large and not too many”. 
This is hard to interpret, especially given that the optimal partitioning strategy may be different for every query. Instead, Daft automatically controls your partitions in order to execute queries faster and more efficiently. As a side-effect, this means that while Dask supports partition indexing – i.e. “get me partition X” – Daft does not. + +## Daft performs Query Optimization for optimal performance +Daft is built with logical query optimization by default. This means that Daft will optimize your queries and skip any files or partitions that are not required for your query. This can give you significant performance gains, especially when working with file formats that support these kinds of optimized queries. + +Dask currently does not support full-featured query optimization. + +_Note: As of version 2024.3.0 Dask is slowly implementing query optimization as well. As far as we can tell this is still in early development and has some rough edges. For context: https://github.com/dask/dask/issues/10995_ + +## Daft uses Expressions and UDFs to perform computations in parallel +Dask provides a `map_partitions` method to map computations over the partitions in your DataFrame. Since Dask partitions are pandas DataFrames, you can pass pandas functions to `map_partitions`. You can also map arbitrary Python functions over Dask partitions using `map_partitions`. + +For example: + +``` +def my_function(**kwargs): + return … + +res = ddf.map_partitions(my_function, **kwargs) +``` + +Daft implements two APIs for mapping computations over the data in your DataFrame in parallel: [Expressions]() and [UDFs](). Expressions are most useful when you need to define computation over your columns. + +``` +# Add 1 to each element in column "A" +df = df.with_column("A_add_one", daft.col(“A”) + 1) +``` + +You can use User-Defined Functions (UDFs) to run computations over multiple rows or columns: + +``` +# apply a custom function “crop_image” to the image column +@daft.udf(...) 
+def crop_image(**kwargs): + … + return … + +df = df.with_column( + "cropped", + crop_image(daft.col(“image”), **kwargs), +) +``` + +## Daft is built for Machine Learning Workloads +Dask offers some distributed Machine Learning functionality through the [`dask-ml` library](https://ml.dask.org/). This library provides parallel implementations of a few common scikit-learn algorithms. Note that `dask-ml` is not a core Dask library and is not as actively maintained. It also does not offer support for deep-learning algorithms or neural networks. + +Daft is built as a DataFrame API for distributed Machine learning. You can use Daft UDFs to apply Machine Learning tasks to the data stored in your Daft DataFrame, including deep learning algorithms from libraries like PyTorch. See [our Quickstart](https://www.getdaft.io/projects/docs/en/latest/10-min.html) for a toy example. + +## Daft supports Multimodal Data Types +Dask supports the same data types as pandas. Daft is built to support many more data types, including Images, nested JSON, tensors, etc. See [the documentation](https://www.getdaft.io/projects/docs/en/latest/user_guide/daft_in_depth/datatypes.html) for a list of all supported data types. + +## Distributed Computing and Remote Clusters +Both Dask and Daft support distributed computing on remote clusters. In Dask, you create a Dask cluster either locally or remotely and perform computations in parallel there. Currently, Daft supports distributed cluster computing [with Ray](https://www.getdaft.io/projects/docs/en/latest/user_guide/poweruser/scaling-up.html). Support for running Daft computations on Dask clusters is on the roadmap. + +Cloud support for both Dask and Daft is the same. + +## SQL Support +Dask does not natively provide full support for running SQL queries. You can use pandas-like code to write SQL-equivalent queries, or use the external [`dask-sql` library](https://dask-sql.readthedocs.io/en/latest/). 
+ +Daft provides a read_sql method to read SQL queries into a DataFrame. Daft uses SQLGlot to build SQL queries, so it supports all databases that SQLGlot supports. Daft pushes down operations such as filtering, projections, and limits into the SQL query when possible. Full-featured support for SQL queries (as opposed to a DataFrame API) is in progress. + +## Daft combines Python with Rust and Pyarrow for optimal performance +Daft combines Python with Rust and Pyarrow for optimal performance. Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience. Read [more](faq/technical_architecture). From b7b05f57c21d612be93d2d4bee0a17662f7d2274 Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon, 22 Apr 2024 14:33:17 +0100 Subject: [PATCH 2/9] add to index/toc --- docs/source/index.rst | 1 + docs/source/migration_guides/index.rst | 8 ++++++++ 2 files changed, 9 insertions(+) create mode 100644 docs/source/migration_guides/index.rst diff --git a/docs/source/index.rst b/docs/source/index.rst index 004bfd429f..3a2d3eabb5 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -43,6 +43,7 @@ Frequently Asked Questions 10-min user_guide/index api_docs/index + migration_guides/index FAQs Release Notes diff --git a/docs/source/migration_guides/index.rst b/docs/source/migration_guides/index.rst new file mode 100644 index 0000000000..6c1b9d28e2 --- /dev/null +++ b/docs/source/migration_guides/index.rst @@ -0,0 +1,8 @@ +Migration Guides +================= + +.. 
toctree:: + :maxdepth: 2 + + Table of Contents + coming_from_dask From 737795f7a4b040def29223ee03763a765f871bd9 Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon, 22 Apr 2024 14:40:02 +0100 Subject: [PATCH 3/9] minor changes --- .../migration_guides/coming_from_dask.md | 122 ++++++++++++++++++ 1 file changed, 122 insertions(+) create mode 100644 docs/source/migration_guides/coming_from_dask.md diff --git a/docs/source/migration_guides/coming_from_dask.md b/docs/source/migration_guides/coming_from_dask.md new file mode 100644 index 0000000000..23845029f1 --- /dev/null +++ b/docs/source/migration_guides/coming_from_dask.md @@ -0,0 +1,122 @@ +# Dask Migration Guide + +This migration guide explains the most important points that anyone familiar with Dask should know when trying out Daft or migrating Dask workloads to Daft. The guide includes an overview of the technical, conceptual and syntax differences between the two libraries that you should be aware of. Understanding these differences will help you evaluate your choice of tooling and ease your migration from Dask to Daft. + +## When should I use Daft? + +Dask and Daft are DataFrame frameworks built for distributed computing. Both libraries enable you to process large, tabular datasets in parallel, either locally or on remote instances on-prem or in the cloud.
+ +If you are currently using Dask, you may want to consider migrating to Daft if you: + +- Are working with **multimodal data types**, such as nested JSON, tensors, images, URLs, etc. +- Need faster computations through **query planning and optimization** +- Are executing **machine learning workloads** at scale +- Want to benefit from **native Rust concurrency** + +You may want to stick with using Dask if you: + +- Only want to write **pandas-like syntax** +- Need to parallelize **array-based workloads** or arbitrary **Python code that does not involve DataFrames** (with Dask Array, Dask Delayed and/or Dask Futures) + +The following sections explain conceptual and technical differences between Dask and Daft. Whenever relevant, code snippets are provided to illustrate differences in syntax. + +## Daft does not use an index + +Dask aims for as much feature-parity with pandas as possible, including maintaining the presence of an Index in the DataFrame. But keeping an Index is difficult when moving to a distributed computing environment. Dask doesn’t support row-based positional indexing (with .iloc) because it does not track the length of its partitions. It also does not support pandas MultiIndex. The argument for keeping the Index is that it makes some operations against the sorted index column very fast. In reality, resetting the Index forces a data shuffle and is an expensive operation. + +Daft drops the need for an Index to make queries more readable and consistent. How you write a query should not change because of the state of an index or a reset_index call. In our opinion, eliminating the index makes things simpler, more explicit, more readable and therefore less error-prone. Daft achieves this by using the [Expressions API]().
+ +In Dask, you would index your DataFrame to return row `b` as follows: + +``` +ddf.loc[["b"]] +``` + +In Daft, you would accomplish the same by using a `col` Expression to refer to the column that contains `b`: + +``` +df.where(daft.col("alpha") == "b") +``` + +More about Expressions in the sections below. + +## Daft does not try to copy the pandas syntax + +Dask is built as a parallelizable version of pandas, and Dask partitions are in fact pandas DataFrames. When you call a Dask function, you are often applying a pandas function on each partition. This makes Dask relatively easy to learn for people familiar with pandas, but it also causes complications when pandas logic (built for sequential processing) does not translate well to a distributed context. When reading the documentation, Dask users will often encounter the phrase `"This docstring was copied from pandas.core... Some inconsistencies with the Dask version may exist."` It is often unclear what these inconsistencies are and how they might affect performance. + +Daft does not try to copy the pandas syntax. Instead, we believe that efficiency is best achieved by defining logic specifically for the unique challenges of distributed computing. This means that we trade a slightly higher learning curve for pandas users against improved performance and more clarity in the developer experience. + +## Daft eliminates manual repartitioning of data + +In distributed settings, your data will always be partitioned for efficient parallel processing. How to partition this data is not straightforward and depends on factors like data types, query construction, and available cluster resources. While Dask often requires manual repartitioning for optimal performance, Daft abstracts this away from users so you don’t have to worry about it. + +Dask leaves repartitioning up to the user with guidelines on having partitions that are “not too large and not too many”.
This is hard to interpret, especially given that the optimal partitioning strategy may be different for every query. Instead, Daft automatically controls your partitions in order to execute queries faster and more efficiently. As a side-effect, this means that Daft does not support partition indexing the way Dask does (i.e. “get me partition X”). If things are working well, you shouldn't need to index partitions like this. + +## Daft performs Query Optimization for optimal performance + +Daft is built with logical query optimization by default. This means that Daft will optimize your queries and skip any files or partitions that are not required for your query. This can give you significant performance gains, especially when working with file formats that support these kinds of optimized queries. + +Dask currently does not support full-featured query optimization. + +_Note: As of version 2024.3.0, Dask is slowly implementing query optimization as well. As far as we can tell, this is still in early development and has some rough edges. For context, see [the discussion](https://github.com/dask/dask/issues/10995) in the Dask repo._ + +## Daft uses Expressions and UDFs to perform computations in parallel + +Dask provides a `map_partitions` method to map computations over the partitions in your DataFrame. Since Dask partitions are pandas DataFrames, you can pass pandas functions to `map_partitions`. You can also map arbitrary Python functions over Dask partitions using `map_partitions`. + +For example: + +``` +def my_function(partition_df, **kwargs): + # each partition arrives as a pandas DataFrame + return ... + +res = ddf.map_partitions(my_function, **kwargs) +``` + +Daft implements two APIs for mapping computations over the data in your DataFrame in parallel: [Expressions]() and [UDFs](). Expressions are most useful when you need to define computation over your columns.
+ +``` +# Add 1 to each element in column "A" +df = df.with_column("A_add_one", daft.col("A") + 1) +``` + +You can use User-Defined Functions (UDFs) to run computations over multiple rows or columns: + +``` +# apply a custom function "crop_image" to the image column +@daft.udf(...) +def crop_image(**kwargs): + ... + return ... + +df = df.with_column( + "cropped", + crop_image(daft.col("image"), **kwargs), +) +``` + +## Daft is built for Machine Learning Workloads + +Dask offers some distributed Machine Learning functionality through the [`dask-ml` library](https://ml.dask.org/). This library provides parallel implementations of a few common scikit-learn algorithms. Note that `dask-ml` is not a core Dask library and is not as actively maintained. It also does not offer support for deep-learning algorithms or neural networks. + +Daft is built as a DataFrame API for distributed Machine Learning. You can use Daft UDFs to apply Machine Learning tasks to the data stored in your Daft DataFrame, including deep learning algorithms from libraries like PyTorch. See [our Quickstart](https://www.getdaft.io/projects/docs/en/latest/10-min.html) for a toy example. + +## Daft supports Multimodal Data Types + +Dask supports the same data types as pandas. Daft is built to support many more data types, including images, nested JSON, tensors, etc. See [the documentation](https://www.getdaft.io/projects/docs/en/latest/user_guide/daft_in_depth/datatypes.html) for a list of all supported data types. + +## Distributed Computing and Remote Clusters + +Both Dask and Daft support distributed computing on remote clusters. In Dask, you create a Dask cluster either locally or remotely and perform computations in parallel there. Currently, Daft supports distributed cluster computing [with Ray](https://www.getdaft.io/projects/docs/en/latest/user_guide/poweruser/scaling-up.html). Support for running Daft computations on Dask clusters is on the roadmap.
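Whichever scheduler runs the show — Dask today, or Daft on Ray — the execution model described above (a UDF mapped over partitions in parallel, results stitched back together) can be miniaturized with only the Python standard library. Everything here (the partition layout, the `clip_partition` function, the values) is invented purely for illustration and is not either library's actual machinery:

```python
from concurrent.futures import ThreadPoolExecutor

def clip_partition(partition, low=2, high=8):
    # Stand-in for a per-partition UDF: clamp every value into [low, high].
    return [min(max(x, low), high) for x in partition]

# One logical column, stored as partitions of unequal length,
# the way a distributed DataFrame would hold it.
partitions = [[1, 9, 4], [12, -3], [7, 0]]

# The scheduler's job, reduced to its essence: run the UDF on every
# partition in parallel, then stitch the results back together in order.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(clip_partition, partitions))

column = [x for part in results for x in part]
print(column)  # [2, 8, 4, 8, 2, 7, 2]
```

Note that `Executor.map` preserves input order, which is what makes the concatenation at the end safe — a real engine has to give the same guarantee when it reassembles partitions.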
+ +Cloud support for both Dask and Daft is the same. + +## SQL Support + +Dask does not natively provide full support for running SQL queries. You can use pandas-like code to write SQL-equivalent queries, or use the external [`dask-sql` library](https://dask-sql.readthedocs.io/en/latest/). + +Daft provides a read_sql method to read SQL queries into a DataFrame. Daft uses SQLGlot to build SQL queries, so it supports all databases that SQLGlot supports. Daft pushes down operations such as filtering, projections, and limits into the SQL query when possible. Full-featured support for SQL queries (as opposed to a DataFrame API) is in progress. + +## Daft combines Python with Rust and Pyarrow for optimal performance + +Daft combines Python with Rust and Pyarrow for optimal performance. Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience. Read [more](faq/technical_architecture). 
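The pushdown behaviour described in the SQL section above can be pictured as query rewriting: DataFrame operations become SQL clauses so the database does the filtering, projection, and limiting itself instead of shipping the whole table. A stand-alone sketch — the `push_down` helper, table, and columns are invented for illustration and are not Daft's actual implementation:

```python
def push_down(table, columns=None, predicate=None, limit=None):
    # Build one SQL string so the database performs the projection,
    # filter, and limit itself, instead of shipping the whole table.
    projection = ", ".join(columns) if columns else "*"
    sql = f"SELECT {projection} FROM {table}"
    if predicate is not None:
        sql += f" WHERE {predicate}"
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql

query = push_down(
    "events",
    columns=["user_id", "ts"],
    predicate="ts >= '2024-01-01'",
    limit=100,
)
print(query)  # SELECT user_id, ts FROM events WHERE ts >= '2024-01-01' LIMIT 100
```

The win is that only 100 matching rows of two columns cross the wire, rather than every row of every column.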
From 690e8ccd3f4036c7b0db220d2babe8931e5fc569 Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon, 22 Apr 2024 14:40:08 +0100 Subject: [PATCH 4/9] make md --- .../migration_guides/coming_from_dask.rst | 110 ------------------ 1 file changed, 110 deletions(-) delete mode 100644 docs/source/migration_guides/coming_from_dask.rst diff --git a/docs/source/migration_guides/coming_from_dask.rst b/docs/source/migration_guides/coming_from_dask.rst deleted file mode 100644 index 05bddbcfd1..0000000000 --- a/docs/source/migration_guides/coming_from_dask.rst +++ /dev/null @@ -1,110 +0,0 @@ -# Dask Migration Guide -This migration guide explains the most important points that anyone familiar with Dask should know when trying out Daft or migrating Dask workloads to Daft. The guide includes an overview of technical,conceptual and syntax differences between the two libraries that you should be aware of. Understanding these differences will help you evaluate your choice of tooling and ease your migration from Dask to Daft. - -## When should I use Daft? -Dask and Daft are DataFrame frameworks built for distributed computing. Both libraries enable you to process large, tabular datasets in parallel, either locally or on remote instances on-prem or in the cloud. 
- -If you are currently using Dask, you may want to consider migrating to Daft if you: - -- Are working with multimodal data types, such as nested JSON, tensors, Images, URLs, etc., -- Need faster computations through query planning and optimization, -- Are executing machine learning workloads at scale, -- Want to benefit from native Rust concurrency - -You may want to stick with using Dask if you: - -- Want to only write pandas-like syntax, -- Need to parallelize array-based workloads or arbitrary Python code that does not involve DataFrames (with Dask Array, Dask Delayed and/or Dask Futures) - -The following sections explain conceptual and technical differences between Dask and Daft. Whenever relevant, code snippets are provided to illustrate differences in syntax. - -## Daft does not use an index -Dask aims for as much feature-parity with pandas as possible, including maintaining the presence of an Index in the DataFrame. But keeping an Index is difficult when moving to a distributed computing environment. Dask doesn’t support row-based positional indexing (with .iloc) because it does not track the length of its partitions. It also does not support pandas MultiIndex. The argument for keeping the Index is that it makes some operations against the sorted index column very fast. In reality, resetting the Index forces a data shuffle and is an expensive operation. - -Daft drops the need for an Index to make queries more readable and consistent. How you write a query should not change because of the state of an index or a reset_index call. In our opinion, eliminating the index makes things simpler, more explicit, more readable and therefore less error-prone. Daft achieves this by using the [Expressions API](). 
- -In Dask you would index your DataFrame to return row `b` as follows: - -``` -ddf.loc[[“b”]] -``` - -In Daft, you would accomplish the same by using a `col` Expression to refer to the column that contains `b`: - -``` -df.where(daft.col(“alpha”)==”b”) -``` - -More about Expressions in the sections below. - -## Daft does not try to copy the pandas syntax -Dask is built as a parallelizable version of pandas and Dask partitions are in fact pandas DataFrames. When you call a Dask function you are often applying a pandas function on each partition. This makes Dask relatively easy to learn for people familiar with pandas, but it also causes complications when pandas logic (built for sequential processing) does not translate well to a distributed context. When reading the documentation, Dask users will often encounter this phrase `“This docstring was copied from pandas.core… Some inconsistencies with the Dask version may exist.”` It is often unclear what these inconsistencies are and how they might affect performance. - -Daft does not try to copy the pandas syntax. Instead, we believe that efficiency is best achieved by defining logic specifically for the unique challenges of distributed computing. This means that we trade a slightly higher learning curve for pandas users against improved performance and more clarity for the developer experience. - -## Daft eliminates manual repartitioning of data -In distributed settings, your data will always be partitioned for efficient parallel processing. How to partition this data is not straightforward and depends on factors like data types, query construction, and available cluster resources. While Dask often requires manual repartitioning for optimal performance, Daft abstracts this away from users so you don’t have to worry about it. - -Dask leaves repartitioning up to the user with guidelines on having partitions that are “not too large and not too many”. 
This is hard to interpret, especially given that the optimal partitioning strategy may be different for every query. Instead, Daft automatically controls your partitions in order to execute queries faster and more efficiently. As a side-effect, this means that while Dask supports partition indexing – i.e. “get me partition X” – Daft does not. - -## Daft performs Query Optimization for optimal performance -Daft is built with logical query optimization by default. This means that Daft will optimize your queries and skip any files or partitions that are not required for your query. This can give you significant performance gains, especially when working with file formats that support these kinds of optimized queries. - -Dask currently does not support full-featured query optimization. - -_Note: As of version 2024.3.0 Dask is slowly implementing query optimization as well. As far as we can tell this is still in early development and has some rough edges. For context: https://github.com/dask/dask/issues/10995_ - -## Daft uses Expressions and UDFs to perform computations in parallel -Dask provides a `map_partitions` method to map computations over the partitions in your DataFrame. Since Dask partitions are pandas DataFrames, you can pass pandas functions to `map_partitions`. You can also map arbitrary Python functions over Dask partitions using `map_partitions`. - -For example: - -``` -def my_function(**kwargs): - return … - -res = ddf.map_partitions(my_function, **kwargs) -``` - -Daft implements two APIs for mapping computations over the data in your DataFrame in parallel: [Expressions]() and [UDFs](). Expressions are most useful when you need to define computation over your columns. - -``` -# Add 1 to each element in column "A" -df = df.with_column("A_add_one", daft.col(“A”) + 1) -``` - -You can use User-Defined Functions (UDFs) to run computations over multiple rows or columns: - -``` -# apply a custom function “crop_image” to the image column -@daft.udf(...) 
-def crop_image(**kwargs): - … - return … - -df = df.with_column( - "cropped", - crop_image(daft.col(“image”), **kwargs), -) -``` - -## Daft is built for Machine Learning Workloads -Dask offers some distributed Machine Learning functionality through the [`dask-ml` library](https://ml.dask.org/). This library provides parallel implementations of a few common scikit-learn algorithms. Note that `dask-ml` is not a core Dask library and is not as actively maintained. It also does not offer support for deep-learning algorithms or neural networks. - -Daft is built as a DataFrame API for distributed Machine learning. You can use Daft UDFs to apply Machine Learning tasks to the data stored in your Daft DataFrame, including deep learning algorithms from libraries like PyTorch. See [our Quickstart](https://www.getdaft.io/projects/docs/en/latest/10-min.html) for a toy example. - -## Daft supports Multimodal Data Types -Dask supports the same data types as pandas. Daft is built to support many more data types, including Images, nested JSON, tensors, etc. See [the documentation](https://www.getdaft.io/projects/docs/en/latest/user_guide/daft_in_depth/datatypes.html) for a list of all supported data types. - -## Distributed Computing and Remote Clusters -Both Dask and Daft support distributed computing on remote clusters. In Dask, you create a Dask cluster either locally or remotely and perform computations in parallel there. Currently, Daft supports distributed cluster computing [with Ray](https://www.getdaft.io/projects/docs/en/latest/user_guide/poweruser/scaling-up.html). Support for running Daft computations on Dask clusters is on the roadmap. - -Cloud support for both Dask and Daft is the same. - -## SQL Support -Dask does not natively provide full support for running SQL queries. You can use pandas-like code to write SQL-equivalent queries, or use the external [`dask-sql` library](https://dask-sql.readthedocs.io/en/latest/). 
- -Daft provides a read_sql method to read SQL queries into a DataFrame. Daft uses SQLGlot to build SQL queries, so it supports all databases that SQLGlot supports. Daft pushes down operations such as filtering, projections, and limits into the SQL query when possible. Full-featured support for SQL queries (as opposed to a DataFrame API) is in progress. - -## Daft combines Python with Rust and Pyarrow for optimal performance -Daft combines Python with Rust and Pyarrow for optimal performance. Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience. Read [more](faq/technical_architecture). From 6ac098a71378e76ab7751776475beb88d79d78ab Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon, 22 Apr 2024 14:40:13 +0100 Subject: [PATCH 5/9] remove toc --- docs/source/migration_guides/index.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/migration_guides/index.rst b/docs/source/migration_guides/index.rst index 6c1b9d28e2..39d3bf77dc 100644 --- a/docs/source/migration_guides/index.rst +++ b/docs/source/migration_guides/index.rst @@ -4,5 +4,4 @@ Migration Guides .. 
toctree:: :maxdepth: 2 - Table of Contents coming_from_dask From 20398fc4dcdfb8168d6d6c639b5c1c1e8e8ee1fb Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon, 22 Apr 2024 14:41:14 +0100 Subject: [PATCH 6/9] update contrib instructions --- docs/CONTRIBUTING.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md index bbd0a5841e..a90a6b1ba5 100644 --- a/docs/CONTRIBUTING.md +++ b/docs/CONTRIBUTING.md @@ -1,7 +1,12 @@ -[WIP] Building docs locally -=========================== +# [WIP] Building docs locally 1. Go to the `docs/` folder 2. `make clean` 3. `make html` 4. `open build/html/index.html` + +To create a new directory level: + +- create a folder under `source` +- add an `index.rst`, follow template from other `index.rst` files +- add new folder name to `toctree` at the end of `source/index.rst` From 7d975a36fcd312ab1df0e55bcc8a2805ba5aa886 Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon, 22 Apr 2024 14:48:07 +0100 Subject: [PATCH 7/9] update internal links --- docs/source/migration_guides/coming_from_dask.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/migration_guides/coming_from_dask.md b/docs/source/migration_guides/coming_from_dask.md index 23845029f1..ceaec47771 100644 --- a/docs/source/migration_guides/coming_from_dask.md +++ b/docs/source/migration_guides/coming_from_dask.md @@ -24,7 +24,7 @@ The following sections explain conceptual and technical differences between Dask Dask aims for as much feature-parity with pandas as possible, including maintaining the presence of an Index in the DataFrame. But keeping an Index is difficult when moving to a distributed computing environment. Dask doesn’t support row-based positional indexing (with .iloc) because it does not track the length of its partitions. It also does not support pandas MultiIndex. 
The argument for keeping the Index is that it makes some operations against the sorted index column very fast. In reality, resetting the Index forces a data shuffle and is an expensive operation. -Daft drops the need for an Index to make queries more readable and consistent. How you write a query should not change because of the state of an index or a reset_index call. In our opinion, eliminating the index makes things simpler, more explicit, more readable and therefore less error-prone. Daft achieves this by using the [Expressions API](). +Daft drops the need for an Index to make queries more readable and consistent. How you write a query should not change because of the state of an index or a reset_index call. In our opinion, eliminating the index makes things simpler, more explicit, more readable and therefore less error-prone. Daft achieves this by using the [Expressions API](/user_guide/basic_concepts/expressions.rst). In Dask you would index your DataFrame to return row `b` as follows: @@ -73,7 +73,7 @@ def my_function(**kwargs): res = ddf.map_partitions(my_function, **kwargs) ``` -Daft implements two APIs for mapping computations over the data in your DataFrame in parallel: [Expressions]() and [UDFs](). Expressions are most useful when you need to define computation over your columns. +Daft implements two APIs for mapping computations over the data in your DataFrame in parallel: [Expressions](/user_guide/basic_concepts/expressions.rst) and [UDFs](/user_guide/daft_in_depth/udf.rst). Expressions are most useful when you need to define computation over your columns. ``` # Add 1 to each element in column "A" @@ -107,7 +107,7 @@ Dask supports the same data types as pandas. Daft is built to support many more ## Distributed Computing and Remote Clusters -Both Dask and Daft support distributed computing on remote clusters. In Dask, you create a Dask cluster either locally or remotely and perform computations in parallel there. 
Currently, Daft supports distributed cluster computing [with Ray](https://www.getdaft.io/projects/docs/en/latest/user_guide/poweruser/scaling-up.html). Support for running Daft computations on Dask clusters is on the roadmap. +Both Dask and Daft support distributed computing on remote clusters. In Dask, you create a Dask cluster either locally or remotely and perform computations in parallel there. Currently, Daft supports distributed cluster computing [with Ray](/user_guide/poweruser/scaling-up.html). Support for running Daft computations on Dask clusters is on the roadmap. Cloud support for both Dask and Daft is the same. @@ -119,4 +119,4 @@ Daft provides a read_sql method to read SQL queries into a DataFrame. Daft uses ## Daft combines Python with Rust and Pyarrow for optimal performance -Daft combines Python with Rust and Pyarrow for optimal performance. Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience. Read [more](faq/technical_architecture). +Daft combines Python with Rust and Pyarrow for optimal performance. Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience. Read [more](/faq/technical_architecture.rst). 
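The Expressions/UDF hunks in the patch above elide their code sample (the `# Add 1 to each element in column "A"` snippet). As a hedged editorial illustration — a toy pure-Python model, not Daft's or Dask's actual API (`add_column`, `table`, and `expr` are hypothetical names) — the column-expression idea is that a computation is declared over whole columns of a columnar table, rather than mapped over rows or partitions as with Dask's `map_partitions`:

```python
# Toy columnar table: a dict mapping column names to lists of values.
# This sketches the *idea* behind column expressions (compute over an
# entire column at once), not Daft's real implementation or API.

def add_column(table, name, expr):
    """Return a new table with `name` bound to `expr` evaluated over `table`."""
    out = dict(table)          # shallow copy: the input table is not mutated
    out[name] = expr(table)    # the expression sees whole columns, not rows
    return out

table = {"A": [1, 2, 3], "B": [10, 20, 30]}

# "Add 1 to each element in column A" expressed as a column-level operation
result = add_column(table, "A_plus_1", lambda t: [x + 1 for x in t["A"]])
print(result["A_plus_1"])  # [2, 3, 4]
```

Because the expression is a value describing a column computation, an engine like Daft can inspect and optimize it before execution — something opaque per-partition Python functions do not allow.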
From 10255a68c9d689f7eac87b9959b5deddd68a7fed Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Mon, 22 Apr 2024 14:50:50 +0100 Subject: [PATCH 8/9] fix typo --- docs/source/migration_guides/coming_from_dask.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/migration_guides/coming_from_dask.md b/docs/source/migration_guides/coming_from_dask.md index ceaec47771..253ce6fdb4 100644 --- a/docs/source/migration_guides/coming_from_dask.md +++ b/docs/source/migration_guides/coming_from_dask.md @@ -58,7 +58,7 @@ Daft is built with logical query optimization by default. This means that Daft w Dask currently does not support full-featured query optimization. -_Note: As of version 2024.3.0 Dask is slowly implementing query optimization as well. As far as we can tell this is still in early development and has some rough edges. For context see [the discussion](https://github.com/dask/dask/issues/10995_) in the Dask repo. +_Note: As of version 2024.3.0 Dask is slowly implementing query optimization as well. As far as we can tell this is still in early development and has some rough edges. 
For context see [the discussion](https://github.com/dask/dask/issues/10995) in the Dask repo._ ## Daft uses Expressions and UDFs to perform computations in parallel From 5d96d983adb6aecf992762e05a6ea22a078e9e5c Mon Sep 17 00:00:00 2001 From: Avril Aysha <68642378+avriiil@users.noreply.github.com> Date: Tue, 23 Apr 2024 11:40:06 +0100 Subject: [PATCH 9/9] incorporate changes --- docs/source/migration_guides/coming_from_dask.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/source/migration_guides/coming_from_dask.md b/docs/source/migration_guides/coming_from_dask.md index 253ce6fdb4..8e760134f7 100644 --- a/docs/source/migration_guides/coming_from_dask.md +++ b/docs/source/migration_guides/coming_from_dask.md @@ -11,6 +11,7 @@ If you are currently using Dask, you may want to consider migrating to Daft if y - Are working with **multimodal data types**, such as nested JSON, tensors, Images, URLs, etc., - Need faster computations through **query planning and optimization**, - Are executing **machine learning workloads** at scale, +- Need deep support for **data catalogs, predicate pushdowns and metadata pruning** from Iceberg, Delta, and Hudi, - Want to benefit from **native Rust concurrency** You may want to stick with using Dask if you: @@ -58,7 +59,7 @@ Daft is built with logical query optimization by default. This means that Daft w Dask currently does not support full-featured query optimization. -_Note: As of version 2024.3.0 Dask is slowly implementing query optimization as well. As far as we can tell this is still in early development and has some rough edges.
For context see [the discussion](https://github.com/dask/dask/issues/10995) in the Dask repo._ ## Daft uses Expressions and UDFs to perform computations in parallel @@ -119,4 +120,4 @@ Daft provides a read_sql method to read SQL queries into a DataFrame. Daft uses ## Daft combines Python with Rust and Pyarrow for optimal performance -Daft combines Python with Rust and Pyarrow for optimal performance. Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience. Read [more](/faq/technical_architecture.rst). +Daft combines Python with Rust and Pyarrow for optimal performance (see [benchmarks](/faq/benchmarks.rst)). Under the hood, Table and Series are implemented in Rust on top of the Apache Arrow specification (using the Rust arrow2 library). This architecture means that all the computationally expensive operations on Table and Series are performed in Rust, and can be heavily optimized for raw speed. Python is most useful as a user-facing API layer for ease of use and an interactive data science user experience. Read [more](/faq/technical_architecture.rst).
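The read_sql paragraph in the final hunk above says Daft pushes filters, projections, and limits down into the SQL query when possible. As a hedged illustration of what "pushdown" means — a hypothetical helper, not Daft's implementation (real engines derive this from a query plan and only push an operation down when the source can evaluate it) — the idea is to compose these operations into the SQL text so the database prunes the data instead of Python:

```python
def pushdown_sql(table, columns=None, where=None, limit=None):
    """Compose projection, filter, and limit pushdowns into a SQL string.

    Toy sketch: `table` is a table name, `columns` a list of column names,
    `where` a SQL predicate string, `limit` a row count.
    """
    projection = ", ".join(columns) if columns else "*"
    sql = f"SELECT {projection} FROM {table}"   # projection pushdown
    if where is not None:
        sql += f" WHERE {where}"                # filter pushdown
    if limit is not None:
        sql += f" LIMIT {limit}"                # limit pushdown
    return sql

print(pushdown_sql("trips", columns=["city", "fare"], where="fare > 10", limit=100))
# SELECT city, fare FROM trips WHERE fare > 10 LIMIT 100
```

With the operations folded into the query, only the matching rows and columns ever cross the wire — the database does the filtering, not the DataFrame layer.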