Implement post task gc hook #47

simon-mo · 2018-07-19T18:12:39Z

What do these changes do?

Add a decorator that perform manual GC after the task. It can help containment memory hungry task or pandas operator that does not deallocation memory correctly.

(https://stackoverflow.com/questions/35782929/pandas-groupby-memory-deallocation/37790738)

Related issue number

passes git diff upstream/master -u -- "*.py" | flake8 --diff

pschafhalter · 2018-07-19T18:18:07Z

modin/pandas/utils.py

+    def wrapped(*args):
+        result = func(*args)
+
+        import gc


Why import here and not at the top of the file?

This will cause import gc to be run every time that a function with the decorator is run

Because the ray worker process might not correctly import gc. (Especially working with cloudpickle)

@kunalgosar The python process will only import it once, the rest import is noop.

I'm not sure if that's true, I think importing it at the beginning of the file would be fine. This seems to work well for the rest of the code?

pschafhalter · 2018-07-19T18:19:56Z

modin/pandas/utils.py

+        ```
+    Note:
+        - This will invoke the GC for the entire process. Expect
+          About 100ms latency. 


Is the latency still 100ms for gc'ing groupby on a smaller dataframe? Might be worth thinking about the speed-memory tradeoff.

Can we potentially be more intelligent about when to run rc? Perhaps time the function run and as long as it is 5x the expected gc runtime, we gc? This way we cap speed performance regression to 20%.

Yes we should be more intelligent. I can add a basic heuristic for now, per your suggestion.

However, considering the use case, we are not faster than pandas on small dataframe.

The memory issue can be a problem for us already if the dataframe is few hundred MBs. For now, this is really about just making sure we can even do groupby without crashing (due to OOM).

kunalgosar · 2018-07-19T19:53:07Z

modin/pandas/utils.py

+    def wrapped(*args):
+        result = func(*args)
+
+        import gc


This will cause import gc to be run every time that a function with the decorator is run

kunalgosar · 2018-07-19T19:54:35Z

modin/pandas/utils.py

+        ```
+    Note:
+        - This will invoke the GC for the entire process. Expect
+          About 100ms latency. 


Can we potentially be more intelligent about when to run rc? Perhaps time the function run and as long as it is 5x the expected gc runtime, we gc? This way we cap speed performance regression to 20%.

pschafhalter · 2018-07-19T20:32:59Z

modin/pandas/utils.py

+
+        duration_s = time.time() - start_time
+        duration_ms = duration_s * 1000
+        if duration_ms > 500:


Using size of the dataframe could be another good heuristic since the concern for gc is memory. However, I'm fine with using time for now. Thoughts?

I think time makes a bit more sense right now because of the relative overhead of GC. We may in the future try to do this another way, but for now this looks good.

devin-petersohn

Looks great! Are there any other operations you think might benefit from this?

devin-petersohn · 2018-07-19T21:05:34Z

modin/pandas/utils.py

+
+        duration_s = time.time() - start_time
+        duration_ms = duration_s * 1000
+        if duration_ms > 500:


I think time makes a bit more sense right now because of the relative overhead of GC. We may in the future try to do this another way, but for now this looks good.

kunalgosar · 2018-07-19T21:10:57Z

Given the runtime heuristic, I think it may even make sense to apply this decorator on almost all functions. Potentially, we can profile a bit to categorize which ops would benefit the most from this.

I would guess blocks to cols/rows might benefit from this.

devin-petersohn · 2018-07-19T21:21:42Z

@kunalgosar blocks to cols/rows might especially benefit from this. For now, since we have good numbers for groupby memory usage, we can use it here (it doesn't affect runtime all that much).

Thanks @simon-mo! Great patch!

add concat for multiple (>2) frames

* FEAT: Add partial support for groupby.sem() * Add sem changes to groupby

…a service Signed-off-by: Devin Petersohn <[email protected]> Fixes to pass CI + docs for io.py Update implementation Signed-off-by: Devin Petersohn <[email protected]> Fix some things Signed-off-by: Devin Petersohn <[email protected]> Lint fixes Fix put Signed-off-by: Devin Petersohn <[email protected]> Clean up and add new details Signed-off-by: Devin Petersohn <[email protected]> Use fsspec to get full path and allow URLs Signed-off-by: Devin Petersohn <[email protected]> Add lazy loc Signed-off-by: Devin Petersohn <[email protected]> fixes for tests porting more tests more fixes moar fixes Raise exception Signed-off-by: Devin Petersohn <[email protected]> Lint fixes Return Python as the default modin engine Handle indexing case for client qc Call fast path for __getitem__ if not lazy Remove user warning for Python-engine fall back Add init Signed-off-by: Devin Petersohn <[email protected]> Implement free as a no-op Signed-off-by: Devin Petersohn <[email protected]> Add support for replace - client side Fix a couple of issues with Client Signed-off-by: Devin Petersohn <[email protected]> Throw errors on to_pandas Signed-off-by: Devin Petersohn <[email protected]> Do not default to pandas for str_repeat Add support for 18 datetime functions/properties Fix columns caching when renaming columns Fix test_query: put backticks back for col names Add support for astype -- client side hard coded changes for functions Client support for str_(en/de)code, to_datetime Add all missing query compiler methods. Signed-off-by: mvashishtha <[email protected]> Fix getitem_column_array and take_2d. Signed-off-by: mvashishtha <[email protected]> Fix getitem_column_array and take_2d. Signed-off-by: mvashishtha <[email protected]> Fix again. Signed-off-by: mvashishtha <[email protected]> Fix more bugs. Signed-off-by: mvashishtha <[email protected]> More fixes. Signed-off-by: mvashishtha <[email protected]> Fix more bugs-- pushdown tests test_dates and test_pivot still broken due to service bugs. Signed-off-by: mvashishtha <[email protected]> Fix typo. Note drop() broken because service requires you to specify both argument and client QC at base of this PR uses default Nones. Signed-off-by: mvashishtha <[email protected]> Add query compiler class. Signed-off-by: mvashishtha <[email protected]> Testing a commit Initial changes for adding support for Expanding FEAT Support for rolling.sem FEAT support for Expanding sum, min, max, mean, var, std, count, sem Removing extratenous comment REFACTOR: Remove defaults to pandas at API layer and add some corresponding client QC methods. Signed-off-by: mvashishtha <[email protected]> Add more methods. Signed-off-by: mvashishtha <[email protected]> Fix expanding. Signed-off-by: mvashishtha <[email protected]> Add ewm. Signed-off-by: mvashishtha <[email protected]> Revert whitespace. Signed-off-by: mvashishtha <[email protected]> Fix to_numpy by making it like to_pandas. Signed-off-by: mvashishtha <[email protected]> Remove extra to_numpy. Signed-off-by: mvashishtha <[email protected]> Pass kwargs Signed-off-by: mvashishtha <[email protected]> Fix DataFrame import for isin. Signed-off-by: mvashishtha <[email protected]> Fix again. Signed-off-by: mvashishtha <[email protected]> Remove breakpoint Signed-off-by: mvashishtha <[email protected]> Tell if series. Signed-off-by: mvashishtha <[email protected]> Fix client qc. Signed-off-by: mvashishtha <[email protected]> Add self_is_series. Signed-off-by: mvashishtha <[email protected]> FIX: Set numeric_only to True in groupby quantile Add some comments Fix str_cat/fullmatch/removeprefix/removesuffix/translate/wrap (modin-project#44) * Fix str_cat/fullmatch/removeprefix/removesuffix/translate/wrap * Update modin/core/storage_formats/base/query_compiler.py Co-authored-by: Mahesh Vashishtha <[email protected]> * Update modin/pandas/series_utils.py Co-authored-by: Mahesh Vashishtha <[email protected]> * Update modin/core/storage_formats/base/query_compiler.py Co-authored-by: Mahesh Vashishtha <[email protected]> Co-authored-by: Mahesh Vashishtha <[email protected]> FEAT Support expanding.aggregate (modin-project#45) Fix at_time and between_time. (modin-project#43) Signed-off-by: mvashishtha <[email protected]> Signed-off-by: mvashishtha <[email protected]> Add QC method for groupby.sem (modin-project#47) * FEAT: Add partial support for groupby.sem() * Add sem changes to groupby Fix nlargest and nsmallest Series support (modin-project#46) * Fix nlargest and smallest support Signed-off-by: Naren Krishna <[email protected]> Remove client query compiler's columnarize. (modin-project#48) Signed-off-by: mvashishtha <[email protected]> Signed-off-by: mvashishtha <[email protected]> Fix info and set memory_usage=False. (modin-project#49) Signed-off-by: mvashishtha <[email protected]> Signed-off-by: mvashishtha <[email protected]> POND-815 fixes for 21 column dataset (modin-project#50) * POND-815 fixes for 21 column dataset * Update modin/pandas/base.py Co-authored-by: helmeleegy <[email protected]> --------- Co-authored-by: helmeleegy <[email protected]> Bring in upstream series binary operation fix 6d5545f… (modin-project#52) * Bring in upstream series binary operation fix 6d5545f. Signed-off-by: mvashishtha <[email protected]> * Update modin/pandas/series.py Co-authored-by: Karthik Velayutham <[email protected]> --------- Signed-off-by: mvashishtha <[email protected]> Co-authored-by: Karthik Velayutham <[email protected]> Support groupby first/last (modin-project#53) Signed-off-by: Naren Krishna <[email protected]> FEAT: Add initial partial support for groupby.cumcount() (modin-project#54) * FEAT: Add partial support for cumcount * Remove the set_index_name * Squeeze the result * Write cumcount name to None * Can't set dtype to int64 Fix resample sum, prod, size (modin-project#56) Signed-off-by: Naren Krishna <[email protected]> POND-184: fix describe and simplify query compiler interface (modin-project#55) * Fix describe Signed-off-by: mvashishtha <[email protected]> * Pass datetime_is_numeric. Signed-off-by: mvashishtha <[email protected]> --------- Signed-off-by: mvashishtha <[email protected]> Fix dt_day_of_week/day_of_year, str_cat/extract/partition/replace/rpartition (modin-project#51) * Fix dt_day_of_week/day_of_year, str_partition/replace/rpartition * Fix str_extract Revert "Fix dt_day_of_week/day_of_year, str_cat/extract/partition/replace/rpartition (modin-project#51)" (modin-project#58) This reverts commit f7a31ab. Revert "Revert "Fix dt_day_of_week/day_of_year, str_cat/extract/partition/replace/rpartition (modin-project#51)" (modin-project#58)" (modin-project#60) This reverts commit ad9231d. Add query compiler method for groupby.prod() (modin-project#57) Signed-off-by: Naren Krishna <[email protected]> FEAT: Add support for groupby.head and groupby.tail (modin-project#61) * FEAT: Add support for groupby.head and groupby.tail * Change _change_index FEAT: Add partial support for groupby.nth (modin-project#62) FIX: Push first and last down to query compiler. (modin-project#64) * FIX: Push first and last down to query compiler. Signed-off-by: mvashishtha <[email protected]> * Fix last. Signed-off-by: mvashishtha <[email protected]> --------- Signed-off-by: mvashishtha <[email protected]> FEAT: Add partial support for groupby.ngroup (modin-project#65) * FEAT: Add partial support for groupby.ngroup * Name of result should be none for now Add client support for SeriesGroupby unique, nsmallest, nlargest (modin-project#63) * Add client support for SeriesGroupby unique, nsmallest, nlargest Signed-off-by: Naren Krishna <[email protected]> --------- Signed-off-by: Naren Krishna <[email protected]> Push memory_usage entirely to query compiler [change is not to be upstreamed to Modin] (modin-project#66) * Fix dataframe memory usage. Signed-off-by: mvashishtha <[email protected]> * Fix series memory_usage() the same way. Signed-off-by: mvashishtha <[email protected]> --------- Signed-off-by: mvashishtha <[email protected]> FIX: allow updating backend query compilers in place. (modin-project#67) * FIX: Mutate client query compiler columns and index in the service. Motivation: Align axis update semantics across query compilers. In the base query compiler and even our service's query compiler, you can update the index and columns in place. However, the service gives no way to update axes of a query compiler. Right now, for inplace updates, service exposes an extra method rename(), and client query compiler uses this to get the id of a new compiler with updated axis, and then updates its id ID of the new query compiler. This change might be the first to make the service present a mutable interface for a backend query compiler. That seems safe to me, except I had to make copy() get a new query compiler copied from the old query compiler, because we can't let updates to the new query compiler change the original (or vice versa). Signed-off-by: mvashishtha <[email protected]> * Add a comment. Signed-off-by: mvashishtha <[email protected]> --------- Signed-off-by: mvashishtha <[email protected]> FEAT replace groupby.fillna with a simpler logic (modin-project#68) * FEAT Support expanding.aggregate * Replaced groupby.fillna logic with a simpler one * Fix in groupby.fillna. Work object was causing problems. * Only need to change _check_index_name to _check_index * Removed commented out code.

Implement post task gc hook

37c2905

pschafhalter reviewed Jul 19, 2018

View reviewed changes

kunalgosar requested changes Jul 19, 2018

View reviewed changes

simon-mo added 2 commits July 19, 2018 13:06

Address Comments

06e561a

Move gc import to the top

07350a2

pschafhalter reviewed Jul 19, 2018

View reviewed changes

kunalgosar added the Performance 🚀 Performance related issues and pull requests. label Jul 19, 2018

devin-petersohn approved these changes Jul 19, 2018

View reviewed changes

kunalgosar approved these changes Jul 19, 2018

View reviewed changes

pschafhalter approved these changes Jul 19, 2018

View reviewed changes

devin-petersohn merged commit d019142 into modin-project:master Jul 19, 2018

dchigarev pushed a commit to dchigarev/modin that referenced this pull request Aug 25, 2020

Merge pull request modin-project#47 from intel-go/ienkovich/concat-many

6c56f4e

add concat for multiple (>2) frames

naren-ponder pushed a commit to naren-ponder/modin that referenced this pull request Jan 25, 2023

Add QC method for groupby.sem (modin-project#47)

fa08b63

* FEAT: Add partial support for groupby.sem() * Add sem changes to groupby

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Feb 13, 2023

Add QC method for groupby.sem (modin-project#47) (core)

00f126b

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Feb 27, 2023

Add QC method for groupby.sem (modin-project#47) (service)

c4e6a7d

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Add QC method for groupby.sem (modin-project#47) (core)

ac3df3c

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Add QC method for groupby.sem (modin-project#47) (service)

9120db5

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Add QC method for groupby.sem (modin-project#47) (core)

f9d9b5f

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Add QC method for groupby.sem (modin-project#47) (service)

27cc7a5

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Add QC method for groupby.sem (modin-project#47) (service)

c11749f

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Add QC method for groupby.sem (modin-project#47) (core)

c11ff2a

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Add QC method for groupby.sem (modin-project#47) (service)

7751a20

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Apr 12, 2023

Add QC method for groupby.sem (modin-project#47) (core)

3fe97e1

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Apr 17, 2023

Add QC method for groupby.sem (modin-project#47) (core)

e738f7d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement post task gc hook #47

Implement post task gc hook #47

simon-mo commented Jul 19, 2018

pschafhalter Jul 19, 2018

kunalgosar Jul 19, 2018

simon-mo Jul 19, 2018

simon-mo Jul 19, 2018

kunalgosar Jul 19, 2018

pschafhalter Jul 19, 2018

kunalgosar Jul 19, 2018

simon-mo Jul 19, 2018

kunalgosar Jul 19, 2018

kunalgosar Jul 19, 2018

pschafhalter Jul 19, 2018

devin-petersohn Jul 19, 2018

devin-petersohn left a comment

devin-petersohn Jul 19, 2018

kunalgosar commented Jul 19, 2018

devin-petersohn commented Jul 19, 2018

Implement post task gc hook #47

Implement post task gc hook #47

Conversation

simon-mo commented Jul 19, 2018

What do these changes do?

Related issue number

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

devin-petersohn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kunalgosar commented Jul 19, 2018

devin-petersohn commented Jul 19, 2018