Rewrite backend to be more modular, maintainable #70

devin-petersohn · 2018-08-14T02:54:50Z

This will touch essentially all of the code. We are currently heavily integrated with the internals of Ray. There are a number of places where we write and call custom ray.remote functions from dataframe.py. This PR is a rework of that approach to a more modular approach that will allow us to use the backend from a number of locations. We will not have to operate directly on a DataFrame for the different APIs, they can call the new DataManager.

Additionally, there is a new BlockPartitions abstract class that directly interacts with the data. It does not know about the Index or Columns, and operates as statelessly as possible. There is also a new RemotePartition object that will store the data. This object has knowledge of a way to interact with it.

I have currently created a RayDataManager, RayBlockPartitions, and RayRemotePartition classes for interacting on Ray. Currently this is the only supported backend, but it does make it much easier to integrate other backends. The modularity also lets new backends have custom implementations for certain methods that can speed up operations on that backend.

It's not done yet, but it would be great to get some feedback on the overall approach here. I will add a checklist of things to be done to this post shortly.

Pass unit tests
Create a factory for building the blocks for the correct execution engine
Clean up documentation

~~Add developer docs~~ (Happening in a subsequent PR)

Clean up dead code
passes git diff upstream/master -u -- "*.py" | flake8 --diff

williamma12 · 2018-08-25T23:55:22Z

The docs for the backend look good (ie I read docs and code and understand what's going on).

However, here are some things I noticed that (should) be a quick fix:

Assert statement for to_pandas in RayRemotePartition to ensure that the returned object is a dataframe
Assert statements for the kwargs used in the following from PandasDataManager
- concat
- join_data manager
- join_list_of_managers
- reset_index
- reduction functions (eg count, max , min , mean, prod, sum, any, all, idxmax, idxmin)
inter_manager_operations from PandasDataManager sets reindexed_other and reindexed_self to the same value
from_pandas in BlockPartitions should use _compute_num_partitions
get_indicies and apply_func_to_select_indices_along_full_axis in BlockPartitions should preprocess the lambda and function, respectively, before passing to to RemotePartitions object
quantiles functions in PandasDataManager should check if the value passed in is a list or value for each respective function

simon-mo · 2018-08-27T18:59:20Z

modin/data_management/data_manager.py

+    columns = property(_get_columns, _set_columns)
+    index = property(_get_index, _set_index)
+
+    # END Index and columns objects


Redundant line, collides with the line after compute_index

simon-mo · 2018-08-27T19:44:33Z

modin/data_management/data_manager.py

+
+    # END Index and columns objects
+
+    def compute_index(self, data_object):


Do we have to be python2 compatible? If not I would like to add type annotations:

def compute_index(self, data_object: BlockPartition) -> pandas.Index: ...

simon-mo · 2018-08-27T20:00:33Z

modin/data_management/partitioning/remote_partition.py

+        """
+        raise NotImplementedError("Must be implemented in child class")
+
+    def to_pandas(self):


Note:

Contrary what I said in the meeting offline, I'm fine with where to_pandas is right now because converting any sort of block to any representation is important. RemotePartition::to_pandas is a good interface so any future implementation should implement a to_pandas

modin/data_management/partitioning/remote_partition.py

modin/pandas/datetimes.py

weiwzhang · 2018-09-17T00:19:08Z

modin/data_management/data_manager.py

+        joined_axis = self._join_index_objects(axis, [other.columns if axis == 0
+            else other.index for other in others], join, sort=sort)
+
+        to_append = [other.reindex(axis ^ 1, joined_axis).data for other in others]


request documentation

Adding some partitioning updates Updating remote Continuing backend rewrite progress Adding idxmax/min Adding head/tail/repr for data_manager Fixing transpose Updating remote with bugfixes for transpose/repr Fixing a number of operations Fix quantile Adding more functionality, __getitem__ Fixing more tests Fixing drop, passing tests Add insert to new structure Cleaning up unneeded imports Updating remote Fix minor bug Updating remote Add sort_index Minor refactor of code. Cleaning some up Continuing logic migration Add some docs Adding more docs Add more method level documentation. Adding documentation and cleaning up Retructuring partitioning files for simplicity Adding more docs, cleaning up docs Fix performance bug, more cleanup More cleanup/renaming Adding factory skeleton Adding from_pandas code path and update constructor Removing debugging code Cleaning up dead code Added type checking and changed how variables were read in from kwargs (#1) Removing IndexMetadata and code_gen Adding preliminary apply method Updated sample and eval to the new backend (#2) * Added type checking and changed how variables were read in from kwargs * Updated sample to new architecture * Made test_sample more rigourous * Removed 'default=' from kwargs.get's * Updated eval to the new backend * Added two more tests for eval Finalizing apply and start agg Fixing some broken stuff with apply, update remote Starting dictionary apply and fillna Fixed dictionary apply and fillna Moves Inter DataFrame Operations Logic to Data Manager (#3) * Moving multi dataframe operation logic to data_manager * Remove Unused Functions from dataframe.py * removed unnecessary isScalar arg * minor code cleanup * changed _operator_handler name * removing hasattr from data_manager * cleaning up dataframe.py code for add function * changing name to _validate_other * cleaned up kwargs parsing in data_manager function * updated all inter df functions * commenting out old helper functions in dataframe.py * cleaned up unused code * fixed type error for functions using map_across_axis Updated info and memory_usage to new backend (#4) * Added type checking and changed how variables were read in from kwargs * Updated sample to new architecture * Made test_sample more rigourous * Removed 'default=' from kwargs.get's * Updated eval to the new backend * Added two more tests for eval * Updated memory_usage to new backend * Updated info and memory_usage to the new backend * Updated info and memory_usage to be standalone tests and updated the tests * Updated info to do only one pass * Updated info to do everything in one run with DataFrame * Update info to do everything in one run with Series * Updated info to do everything in one run with DataFrame * Updated to get everything working and moved appropriate parts to DataManager Adding first where implementation Adding sort_values and update implementations Cleaning up dead code Adding manual_shuffle abstraction Starting merge Add merge Cleaning up Add dtype (#6) * Added type checking and changed how variables were read in from kwargs * Updated sample to new architecture * Made test_sample more rigourous * Removed 'default=' from kwargs.get's * Updated eval to the new backend * Added two more tests for eval * Updated memory_usage to new backend * Updated info and memory_usage to the new backend * Updated info and memory_usage to be standalone tests and updated the tests * Updated info to do only one pass * Updated info to do everything in one run with DataFrame * Update info to do everything in one run with Series * Updated info to do everything in one run with DataFrame * Updated to get everything working and moved appropriate parts to DataManager * Removed extraneous print statement * Moved dtypes stuff to data manager * Fixed calculating dtypes to only doing a full_reduce instead of map_full_axis * Updated astype to new backend * Updated astype to new backend * Updated ftypes to new backend * Added dtypes argument to map_partitions * Fixing dtypes * Cleaning up dtype and merge issues Fix isin bug Cleaning up Cleaning up more unused code Updated iterables and to_datetime to new backend and improved astype runtime (#7) * Added type checking and changed how variables were read in from kwargs * Updated sample to new architecture * Made test_sample more rigourous * Removed 'default=' from kwargs.get's * Updated eval to the new backend * Added two more tests for eval * Updated memory_usage to new backend * Updated info and memory_usage to the new backend * Updated info and memory_usage to be standalone tests and updated the tests * Updated info to do only one pass * Updated info to do everything in one run with DataFrame * Update info to do everything in one run with Series * Updated info to do everything in one run with DataFrame * Updated to get everything working and moved appropriate parts to DataManager * Removed extraneous print statement * Moved dtypes stuff to data manager * Fixed calculating dtypes to only doing a full_reduce instead of map_full_axis * Updated astype to new backend * Updated astype to new backend * Updated ftypes to new backend * Added dtypes argument to map_partitions * Updated astype and added dtypes option to _from_old_block_partitions in RayPandasDataManager * Undid unnecessary change * Updated iterables to new backend * Updated to_datetime to new backend * Reverted some changes for PR * Replaced pd with pandas * Made additional changes mentioned in (#7) Cleaning up Cleaning up imports Fix minor bug from getting kwargs Concat now working with new architecture (#9) * concat now working with new architecture * fixing functionality for pandas Series * updated append_list_of_data_managers function for concat * minor stylistic fix * remove unused append_data_manager function * fixed join * removed axis arg from join function read_csv changes and improvements in performance (#10) * Test changes to io * Update io changes * Fix performance bug * Debugging performance * Debugging performance on large IO * Making some performance tuning changes * Cleaning up and adding performance improvements * Cleaning up * Addressing comments * Addressing comments Fix bug Formatting fix fillna bug updated rdiv, rpow, rsub methods (#12) * updated rdiv, rpow, rsub methods * spelled dataframe wrong Fixed eval and astype (#11) * Updated to_datetime docstring * Updated astype tests * Commented out loc and iloc tests * Updated eval * removed empty space and uncommented test_loc and test_iloc Passes test_mixed_dtype_dataframe and test_nan_dataframe (#15) * Fixed describe and quantiles and cleaned up code * Updated numeric functions and handles empty dataframes * Fixed dtypes and ftypes * Imported is_numeric_dtype from pandas * Cleaned up print statements in test_dataframe.py Cleaning up and enabling tests. Fix __repr__ Removing dead code Fix where bugy Fix append error checking Fix read_csv args bug Fix read_parquet Groupby implementation Adding groupby final fix Adding docs Fix for info (#16) * Quick fix for info * Removed extraneous print statement * Restructured to use count and memory_usage instead Minor optimization change get_dummies implementation (#19) * intial code for get_dummies * Starting help on get_dummies * Fix get_dummies * Removing dead code * bug fix for get_dummies Rewrite loc (#20) * Rewrite the rewrite Finish implement loc/iloc Remove debug lines, fix typo Removing unused imports * Removing dead code * Changing naming of clone * Formatting and removing dead code * Moving imports for matching pandas

…nd other numeric functions (#18) * Cleaned up code and added documentation * Worked on documentation * Added periods. * Changed all to follow the documentation * Added more method level documentation and fixed quantile with list input * Updated quantiles * Fixed dataframe to manager and reducesd descriptions to one line in data_manager.py documentation' * Fixed quantiles for sure this time * Changed numeric functions to be cleaner and work properly * Resolved comments

simon-mo

Feel free to merge anytime. Awesome work!!! It comes together nicely.

…king add doc

) [should upstream]

) [should upstream] [core]

) [should upstream] [service]

) [should upstream] [core]

) [should upstream] [service]

) [should upstream] [core]

) [should upstream] [service]

) [should upstream] [core]

) [should upstream] [service]

) [should upstream] [core]

) [should upstream] [service]

) [should upstream] [core]

devin-petersohn force-pushed the rewrite_backend branch 2 times, most recently from f0c80a6 to b07ec3c Compare August 24, 2018 20:38

simon-mo mentioned this pull request Aug 28, 2018

REFACTOR: Epic: Add mypy for all of modin #74

Open

williamma12 mentioned this pull request Aug 30, 2018

Added type and argument checks devin-petersohn/modin#1

Merged

6 tasks

devin-petersohn mentioned this pull request Aug 30, 2018

NotImplementedError: Groupby with lists of columns not yet supported. #75

Closed

simon-mo self-requested a review September 16, 2018 22:23

simon-mo reviewed Sep 16, 2018

View reviewed changes

modin/pandas/datetimes.py Show resolved Hide resolved

weiwzhang reviewed Sep 17, 2018

View reviewed changes

devin-petersohn force-pushed the rewrite_backend branch from fd24d0f to 6523a4a Compare September 17, 2018 20:43

devin-petersohn changed the title ~~[WIP] Rewrite backend to be more modular, maintainable~~ Rewrite backend to be more modular, maintainable Sep 17, 2018

devin-petersohn force-pushed the rewrite_backend branch from 6523a4a to ed93e6b Compare September 17, 2018 21:17

devin-petersohn force-pushed the rewrite_backend branch from ed93e6b to 36be7cb Compare September 17, 2018 21:18

devin-petersohn and others added 3 commits September 17, 2018 14:23

linting with yapf

8d46e83

Fix encoding (#21)

0302c95

Fixing num_threads

5e22bf5

devin-petersohn force-pushed the rewrite_backend branch from ba3dd32 to 5e22bf5 Compare September 23, 2018 05:36

devin-petersohn force-pushed the rewrite_backend branch from ac67bd7 to 9bcd0f6 Compare September 23, 2018 05:58

simon-mo added 9 commits September 23, 2018 13:05

Fix travis dependency, add strip-hints for py2 checker

861e7cd

Fix mac install miniconda issue

86b1351

Travis being flaky on mac

7e39ec6

Add space to bash commands

86f9594

Rework travis script

a929f37

Fix pip flags

261e43f

Remove -q flag

bfd2c7f

Fix python2 incompatibility

dc9a879

run strip-hints at py2.7

b204ab9

simon-mo and others added 3 commits September 24, 2018 06:15

Fix test_max in axis 1 for python2

6be2e4d

Fix lint

929eb2b

Fix lint

dd80f22

devin-petersohn mentioned this pull request Sep 24, 2018

PyTest Warning #79

Closed

Fix lint

5239cc3

simon-mo approved these changes Sep 24, 2018

View reviewed changes

devin-petersohn added 3 commits September 24, 2018 20:01

Addressing comments

738f490

Final formatting

3310ed9

Fixing refactor

00544a1

devin-petersohn merged commit cebb7a0 into modin-project:master Sep 25, 2018

This was referenced Sep 29, 2018

Repr for DataFrameView is broken #59

Closed

_IndexMetadata.copy() may block on OID-backed _index_cache #63

Closed

Fundamental memory leak in Modin #98

Closed

dchigarev pushed a commit to dchigarev/modin that referenced this pull request Aug 25, 2020

Merge pull request modin-project#70 from intel-go/ienkovich/arrow-mas…

e7d1261

…king add doc

mvashishtha pushed a commit to mvashishtha/modin that referenced this pull request Feb 22, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

8f9608d

) [should upstream]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Feb 27, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

ba9fac1

) [should upstream] [core]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Feb 27, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

4427466

) [should upstream] [service]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

bd142b1

) [should upstream] [core]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

457c902

) [should upstream] [service]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

95e4e2a

) [should upstream] [core]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

191a266

) [should upstream] [service]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

ad85e96

) [should upstream] [core]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

93cbcbf

) [should upstream] [service]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

d371541

) [should upstream] [core]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

3147ef2

) [should upstream] [service]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Apr 12, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

03ba995

) [should upstream] [core]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Apr 13, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

2c046d6

) [should upstream] [core]

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Apr 17, 2023

FEAT: Remove API layer default for groupby.get_group() (modin-project#70

f19e478

) [should upstream] [core]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite backend to be more modular, maintainable #70

Rewrite backend to be more modular, maintainable #70

devin-petersohn commented Aug 14, 2018 •

edited

Loading

williamma12 commented Aug 25, 2018 •

edited by devin-petersohn

Loading

simon-mo Aug 27, 2018

simon-mo Aug 27, 2018

simon-mo Aug 27, 2018

weiwzhang Sep 17, 2018

simon-mo left a comment


		# END Index and columns objects

		def compute_index(self, data_object):

Rewrite backend to be more modular, maintainable #70

Rewrite backend to be more modular, maintainable #70

Conversation

devin-petersohn commented Aug 14, 2018 • edited Loading

williamma12 commented Aug 25, 2018 • edited by devin-petersohn Loading

simon-mo Aug 27, 2018

Choose a reason for hiding this comment

simon-mo Aug 27, 2018

Choose a reason for hiding this comment

simon-mo Aug 27, 2018

Choose a reason for hiding this comment

weiwzhang Sep 17, 2018

Choose a reason for hiding this comment

simon-mo left a comment

Choose a reason for hiding this comment

devin-petersohn commented Aug 14, 2018 •

edited

Loading

williamma12 commented Aug 25, 2018 •

edited by devin-petersohn

Loading