Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas insert should use index=False to prevent extra index field generation #666

Closed
ixcat opened this issue Oct 1, 2019 · 7 comments
Closed
Assignees

Comments

@ixcat
Copy link
Contributor

ixcat commented Oct 1, 2019

No description provided.

@ixcat
Copy link
Contributor Author

ixcat commented Oct 2, 2019

conversation snippet:

eywalker:spiral_calendar_pad:  we'd actually want to do to_records() after we reset index, I'd say
index=False simply drops the index field if I recall correctly and that we don't want
so something like rows.reset_index().to_records()
maybe there is also a smart option in to_records() to do this in one step
chris: would reset_index ‘do’ for user-created DataFrames? (or conversely, for DJ fetched ones?)
new messages
eywalker:spiral_calendar_pad:  I'd say so. Most people don't even have any index
and for ones returned by DJ fetch, the DataFrame has the primary keys as multi-index, and this certainly needs to go through reset_index()

@dimitri-yatsenko
Copy link
Member

dimitri-yatsenko commented Nov 19, 2019

Will you please describe the scenario in a bit more detail? When is the bothersome extra index field generated? And why is it a problem?

@ixcat
Copy link
Contributor Author

ixcat commented Nov 19, 2019

the below illustrates what this hopes to work but results in a KeyError with 'unknown column index', this should be fixed in some manner which does not break Dest.insert(Source.fetch(format='frame'))

I don't know 100% all of the situations where index is generated (not so familiar with Pandas here), it's a problem because we claim to support pandas insert and dealing with correct indexes will likely be a common scenario, it would be best if we could transparently strip any notion of index and still allow pandas insert.

#! /usr/bin/env python

import datajoint as dj
import numpy as np
import pandas as pd

from code import interact

dj.config['enable_python_native_blobs'] = True


schema = dj.schema('test_pd_insert')


@schema
class TestPdInsert(dj.Manual):
    definition = """
    desc:    char(16)
    ---
    data:    longblob
    """



def test_dataframe_index_strip_fails():

    try:
        TestPdInsert.insert(
            pd.DataFrame(data=[('error', np.zeros(10, dtype=np.int64))],
                         columns=['desc', 'data']))
    except KeyError as e:
        if '`index` is not in the table heading' in repr(e):
            print('fails as expected')
            pass


if __name__ == '__main__':
   test_dataframe_index_strip_fails()
   interact('test_dataframe_index_strip_fails', local=locals())

@eywalker eywalker added this to the Release 0.12.6 milestone Apr 14, 2020
@ixcat ixcat self-assigned this Apr 14, 2020
@ixcat
Copy link
Contributor Author

ixcat commented Apr 14, 2020

todo: reconfirm, report back

@ixcat
Copy link
Contributor Author

ixcat commented Apr 22, 2020

confirmed still valid. will adjust per comments above, ensure pandas test coverage for this case

@ixcat
Copy link
Contributor Author

ixcat commented Apr 30, 2020

Currently running with:

diff --git a/datajoint/table.py b/datajoint/table.py
index e9bacd4..be79832 100644
--- a/datajoint/table.py
+++ b/datajoint/table.py
@@ -193,7 +193,9 @@ class Table(QueryExpression):
         """
 
         if isinstance(rows, pandas.DataFrame):
-            rows = rows.to_records()
+            rows = rows.reset_index(
+                drop=isinstance(rows.index, pandas.RangeIndex)).to_records(
+                    index=False)
 
         # prohibit direct inserts into auto-populated tables
         if not allow_direct_insert and not getattr(self, '_allow_insert', True):  # allow_insert is only used in AutoPopulate

would be good to get more pandas user confirmation if this makes sense... also probably needs xcheck on composite keys..

for single-key example (see attached), user created frames without explicit index get range index, which then can be 'dropped' on reset index, allowing to_records to work with index false. dj created records will not have the range index, so it is reset and then dropped within the to_records call

see attached for test/interaction example (rename to .py. if needed)
issue666.txt

@eywalker eywalker self-assigned this May 8, 2020
ixcat added a commit to ixcat/datajoint-python that referenced this issue May 15, 2020
@ixcat ixcat changed the title [dev] pandas insert should use index=False to prevent extra index field generation pandas insert should use index=False to prevent extra index field generation May 15, 2020
eywalker added a commit that referenced this issue May 15, 2020
datajoint/table.py: smarter dataframe conversion (#666)
@dimitri-yatsenko
Copy link
Member

Fixed with #776

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants