FEAT-modin-project#2013: Merge upstream.
Signed-off-by: Itamar Turner-Trauring <[email protected]>
itamarst committed Nov 24, 2020
2 parents a62c908 + f7f1f7a commit 22bdfe8
Showing 69 changed files with 2,138 additions and 807 deletions.
6 changes: 3 additions & 3 deletions .github/PULL_REQUEST_TEMPLATE.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
<!--
Thank you for your contribution!
Please review the contributing docs: https://modin.readthedocs.io/en/latest/contributing.html
Thank you for your contribution!
Please review the contributing docs: https://modin.readthedocs.io/en/latest/CONTRIBUTING.html
if you have questions about contributing.
-->

## What do these changes do?

<!-- Please give a short brief about these changes. -->

- [ ] commit message follows format outlined [here](https://modin.readthedocs.io/en/latest/developer/contributing.html)
- [ ] commit message follows format outlined [here](https://modin.readthedocs.io/en/latest/CONTRIBUTING.html)
- [ ] passes `flake8 modin`
- [ ] passes `black --check modin`
- [ ] signed commit with `git commit -s` <!-- you can amend your commit with a signature via `git commit --amend -s` -->
25 changes: 14 additions & 11 deletions .github/workflows/ci.yml
@@ -15,7 +15,7 @@ jobs:
node-version: "10.x"
- run: npm install --save-dev @commitlint/{config-conventional,cli} commitlint-plugin-jira-rules commitlint-config-jira
- name: Add dependencies for commitlint action
run: echo "::set-env name=NODE_PATH::$GITHUB_WORKSPACE/node_modules"
run: echo "NODE_PATH=$GITHUB_WORKSPACE/node_modules" >> $GITHUB_ENV
- run: git remote add upstream https://github.com/modin-project/modin.git
- run: git fetch upstream
- run: npx commitlint --from upstream/master --to HEAD --verbose
@@ -51,6 +51,7 @@ jobs:
- run: pydocstyle --convention=numpy --add-ignore=D101,D102 modin/pandas/series_utils.py
- run: pydocstyle --convention=numpy --add-ignore=D103 modin/pandas/general.py
- run: pydocstyle --convention=numpy modin/pandas/plotting.py modin/pandas/utils.py modin/pandas/iterator.py modin/pandas/indexing.py
- run: pydocstyle --convention=numpy --add-ignore=D100,D104 modin/engines/base/frame

lint-flake8:
name: lint (flake8)
@@ -107,7 +108,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.7-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -141,7 +142,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.6-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -170,7 +171,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.6-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -189,6 +190,8 @@ jobs:
run: python -m pytest modin/config/test
- shell: bash -l {0}
run: python -m pytest modin/test/test_envvar_catcher.py
- shell: bash -l {0}
run: python -m pytest modin/test/backends/pandas/test_internals.py

test-defaults:
needs: [lint-commit, lint-flake8, lint-black, test-api, test-headers]
@@ -209,7 +212,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.6-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -275,7 +278,7 @@ jobs:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.7-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- name: Setting up Modin environment
uses: goanpeca/setup-miniconda@v1.6.0
uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin_on_omnisci
python-version: 3.7.8
@@ -314,7 +317,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-${{ matrix.python-version }}-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -382,7 +385,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.7-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -420,7 +423,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.7-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -458,7 +461,7 @@ jobs:
with:
path: ~\AppData\Local\pip\Cache
key: ${{ runner.os }}-python-${{ matrix.python-version }}-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -543,7 +546,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-${{ matrix.python-version }}-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
45 changes: 40 additions & 5 deletions .github/workflows/push.yml
@@ -29,6 +29,41 @@ jobs:
architecture: "x64"
- run: pip install "ray>=1.0.0"

test-internals:
needs: prepare-cache
runs-on: ubuntu-latest
name: test-internals
steps:
- uses: actions/checkout@v2
with:
fetch-depth: 1
- name: Cache pip
uses: actions/cache@v1
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.6-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
python-version: 3.6
channel-priority: strict
use-only-tar-bz2: true # IMPORTANT: This needs to be set for caching to work properly!
- name: Conda environment
shell: bash -l {0}
run: |
conda info
conda list
- name: Internals tests
shell: bash -l {0}
run: python -m pytest modin/data_management/factories/test/test_dispatcher.py modin/experimental/cloud/test/test_cloud.py
- shell: bash -l {0}
run: python -m pytest modin/config/test
- shell: bash -l {0}
run: python -m pytest modin/test/test_envvar_catcher.py
- shell: bash -l {0}
run: python -m pytest modin/test/backends/pandas/test_internals.py

test-defaults:
needs: prepare-cache
runs-on: ubuntu-latest
@@ -48,7 +83,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.6-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -114,7 +149,7 @@ jobs:
path: ~/.cache/pip
key: ${{ runner.os }}-python-3.7-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- name: Setting up Modin environment
uses: goanpeca/setup-miniconda@v1.6.0
uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin_on_omnisci
python-version: 3.7.8
@@ -153,7 +188,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-${{ matrix.python-version }}-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -225,7 +260,7 @@ jobs:
with:
path: ~\AppData\Local\pip\Cache
key: ${{ runner.os }}-python-${{ matrix.python-version }}-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
@@ -310,7 +345,7 @@ jobs:
with:
path: ~/.cache/pip
key: ${{ runner.os }}-python-${{ matrix.python-version }}-pip-${{ github.run_id }}-${{ hashFiles('environment.yml') }}
- uses: goanpeca/setup-miniconda@v1.6.0
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin
environment-file: environment.yml
6 changes: 3 additions & 3 deletions README.md
@@ -6,7 +6,7 @@
<a href="https://codecov.io/gh/modin-project/modin"><img src="https://codecov.io/gh/modin-project/modin/branch/master/graph/badge.svg" align="center"/></a>
<a href="https://github.com/modin-project/modin/actions"><img src="https://github.com/modin-project/modin/workflows/master/badge.svg" align="center"></a>
<a href="https://modin.readthedocs.io/en/latest/?badge=latest"><img alt="" src="https://readthedocs.org/projects/modin/badge/?version=latest" align="center"></a>
<a href="https://pypi.org/project/modin/"><img alt="" src="https://img.shields.io/badge/pypi-0.8.1.1-blue.svg" align="center"></a>
<a href="https://pypi.org/project/modin/"><img alt="" src="https://img.shields.io/badge/pypi-0.8.2-blue.svg" align="center"></a>
</p>

<p align="center"><b>To use Modin, replace the pandas import:</b></p>
@@ -179,8 +179,8 @@ and improve:

![Architecture](docs/img/modin_architecture.png)

Visit the [Documentation](https://modin.readthedocs.io/en/latest/architecture.html) for
more information!
Visit the [Documentation](https://modin.readthedocs.io/en/latest/developer/architecture.html) for
more information, and check out [the difference between Modin and Dask!](https://github.com/modin-project/modin/tree/master/docs/modin_vs_dask.md)

**`modin.pandas` is currently under active development. Requests and contributions are welcome!**

2 changes: 1 addition & 1 deletion docs/UsingSQLonRay/index.rst
@@ -30,4 +30,4 @@ Modin has a query compiler that acts as an intermediate layer between the query
0 1 2.0 A String of information True
1 6 17.0 A String of different information False
.. _architecture: https://modin.readthedocs.io/en/latest/architecture.html
.. _architecture: https://modin.readthedocs.io/en/latest/developer/architecture.html
90 changes: 90 additions & 0 deletions docs/comparisons/dask.rst
@@ -0,0 +1,90 @@
Modin vs. Dask Dataframe
========================

Dask's Dataframe is effectively a meta-frame, partitioning and scheduling many smaller
``pandas.DataFrame`` objects. The Dask DataFrame does not implement the entire pandas
API, and it isn't trying to. See this explained in the `Dask DataFrame documentation`_.

**The TL;DR is that Modin's API is identical to pandas, whereas Dask's is not. Note: The
projects are fundamentally different in their aims, so a fair comparison is
challenging.**

API
---
The APIs of Modin and Dask differ in several ways, explained here.

Dask DataFrame
""""""""""""""

Dask is currently missing multiple APIs from pandas that Modin has implemented. Of note:
Dask does not implement ``iloc``, ``MultiIndex``, ``apply(axis=0)``, ``quantile``,
``median``, and more. Some of these APIs cannot be implemented efficiently or at all
given the architecture design tradeoffs made in Dask's implementation, and others simply
require engineering effort. ``iloc``, for example, can be implemented, but it would be
inefficient, and ``apply(axis=0)`` cannot be implemented at all in Dask's architecture.

Dask DataFrame's API is also different from the pandas API in that it is lazy and needs
``.compute()`` calls to materialize the DataFrame. This makes the API less convenient
but allows Dask to do certain query optimizations/rearrangement, which can give speedups
in certain situations. Several additional APIs exist in the Dask DataFrame API that
expose internal state about how the data is chunked and other data layout details, and
ways to manipulate that state.
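
As a rough illustration of the lazy-versus-eager distinction described above, the
following toy sketch records operations and runs them only when ``.compute()`` is
called. It is plain Python, not Dask's or Modin's actual implementation; ``LazyColumn``
and ``eager_map`` are hypothetical names introduced only for this example.

```python
class LazyColumn:
    """Toy lazy container: operations are recorded, not executed."""

    def __init__(self, values, pending=()):
        self._values = list(values)
        self._pending = tuple(pending)

    def map(self, fn):
        # Return a new object with fn appended to the plan; no work happens yet.
        return LazyColumn(self._values, self._pending + (fn,))

    def compute(self):
        # Materialize: apply every recorded step in a single pass over the data.
        result = []
        for value in self._values:
            for fn in self._pending:
                value = fn(value)
            result.append(value)
        return result


def eager_map(values, fn):
    # Eager counterpart: the result exists as soon as the call returns.
    return [fn(v) for v in values]


lazy = LazyColumn([1, 2, 3]).map(lambda v: v + 1).map(lambda v: v * 10)
eager = eager_map(eager_map([1, 2, 3], lambda v: v + 1), lambda v: v * 10)
print(lazy.compute() == eager)
```

Because the lazy version sees the whole plan before running it, it can fuse both steps
into one pass, which is the kind of rearrangement that deferred execution makes possible.
The cost is the extra explicit materialization call.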

Semantically, Dask sorts the ``index``, which does not allow for user-specified order.
In Dask's case, this was done for optimization purposes, to speed up other computations
which involve the row index.

Modin
"""""

Modin is targeted toward parallelizing the entire pandas API, without exception.
As the pandas API continues to evolve, so will Modin's pandas API. Modin is intended to
be used as a drop-in replacement for pandas, such that even if the API is not yet
parallelized, it still works by falling back to running pandas. A key benefit of being
a drop-in replacement is that it works with existing code and that a user who wishes to
go back to running pandas directly may do so at no cost. There's no lock-in: Modin
notebooks can be converted to and from pandas as the user prefers.
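
The fallback idea can be sketched with a toy wrapper that uses an optimized method when
one exists and otherwise defers to a reference implementation. This is only an
illustration under stated assumptions: ``ReferenceTable`` stands in for pandas,
``DropInTable`` for a drop-in replacement, and neither reflects Modin's real code.

```python
class ReferenceTable:
    """Stand-in for the single-threaded reference library (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def total(self):
        return sum(self.rows)

    def largest(self):
        return max(self.rows)


class DropInTable:
    """Hypothetical drop-in wrapper exposing the same method names."""

    def __init__(self, rows, n_chunks=2):
        self._rows = list(rows)
        self._n_chunks = n_chunks

    def total(self):
        # "Parallelized" path: reduce each chunk independently, then combine.
        size = max(1, -(-len(self._rows) // self._n_chunks))
        chunks = [self._rows[i:i + size] for i in range(0, len(self._rows), size)]
        return sum(sum(chunk) for chunk in chunks)

    def __getattr__(self, name):
        # Reached only when no optimized method exists: fall back to the
        # reference implementation so the call still works.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(ReferenceTable(self._rows), name)


table = DropInTable([3, 1, 4, 1, 5])
print(table.total(), table.largest())
```

Here ``total`` takes the chunked path while ``largest`` silently falls back, so calling
code does not need to know which operations have been optimized.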

In the long-term, Modin is planned to become a data science framework that supports all
popular APIs (SQL, pandas, etc.) with the same underlying execution.

Architecture
------------

The differences between Modin's and Dask's architectures are explained in this section.

Dask DataFrame
""""""""""""""

Dask DataFrame uses row-based partitioning, similar to Spark. This can be seen in their
`documentation`_. They also have a custom index object for indexing into the object,
which is not pandas compatible. Dask DataFrame seems to treat operations on the
DataFrame as MapReduce operations, which is a good paradigm for the subset of the pandas
API they have chosen to implement, but makes certain operations impossible. Dask
Dataframe is also lazy and places a lot of partitioning responsibility on the user.

Modin
"""""

Modin's partitioning is much more flexible, so the system can scale in both directions and
have finer-grained partitioning. This is explained at a high level in `Modin's
documentation`_. Because we have this finer-grained control over the partitioning, we
can support a number of operations that are very challenging in MapReduce systems (e.g.
transpose, median, quantile). This flexibility in partitioning also gives Modin
tremendous power to implement efficient straggler mitigation and improvements in
utilization over the entire cluster.
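
To make the contrast between row-based and two-dimensional partitioning concrete, here
is a minimal sketch over a list-of-lists table. The helper names (``row_partitions``,
``block_partitions``, ``column_median``) are hypothetical and the code assumes a
non-empty rectangular table; it is not how either library stores data.

```python
from statistics import median


def row_partitions(table, n_parts):
    # Row-based partitioning: each partition is a horizontal band of full rows.
    size = -(-len(table) // n_parts)  # ceiling division
    return [table[i:i + size] for i in range(0, len(table), size)]


def block_partitions(table, row_parts, col_parts):
    # Two-dimensional partitioning: split each row band into column blocks too.
    n_cols = len(table[0])
    col_size = -(-n_cols // col_parts)
    return [
        [[row[c:c + col_size] for row in band] for c in range(0, n_cols, col_size)]
        for band in row_partitions(table, row_parts)
    ]


def column_median(blocks, col_block, col_offset):
    # A column-wise operation only touches the blocks that hold that column.
    values = [row[col_offset] for band in blocks for row in band[col_block]]
    return median(values)


table = [[1, 10], [2, 20], [3, 30], [4, 40]]
blocks = block_partitions(table, row_parts=2, col_parts=2)
print(column_median(blocks, col_block=1, col_offset=0))
```

With only row bands, a column median has to read every partition in full; with blocks,
it can gather just the pieces that contain the requested column.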

Modin is also architected to run on a variety of systems. The goal here is that users
can take the same notebook to different clusters or different environments and it will
still just work: run on what you have! Modin does support running on Dask's compute
engine in addition to Ray. Because the architecture of Modin is extremely modular, we
are able to add different execution engines or compile to different memory formats.
Modin can run on a Dask cluster in the same way that Dask Dataframe can, but the two
will still be different in all of the ways described above.

Modin's implementation is grounded in theory, which is what enables us to implement the
entire pandas API.

.. _Dask DataFrame documentation: http://docs.dask.org/en/latest/dataframe.html#common-uses-and-anti-uses
.. _documentation: http://docs.dask.org/en/latest/dataframe.html#design
.. _Modin's documentation: https://modin.readthedocs.io/en/latest/developer/architecture.html
4 changes: 4 additions & 0 deletions docs/comparisons/index.rst
@@ -0,0 +1,4 @@
How is Modin unique?
====================

Coming Soon...
4 changes: 4 additions & 0 deletions docs/comparisons/pandas.rst
@@ -0,0 +1,4 @@
Modin vs. Pandas
================

Coming Soon...
4 changes: 4 additions & 0 deletions docs/comparisons/spark.rst
@@ -0,0 +1,4 @@
Modin vs. Koalas and Spark
==========================

Coming Soon...