Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add row_id pseudocolumn support #2251

Closed
wants to merge 4 commits into from

Conversation

xmnlab
Copy link
Contributor

@xmnlab xmnlab commented Jun 16, 2020

This PR aims to start a discussion about pseudocolumn support (https://en.wikipedia.org/wiki/Pseudocolumn).

Initially it adds:

  • initial support for pseudo columns support
  • support for row_id pseudocolum to omniscidb and sqlite backends.

example:

>>> con = ibis.sqlite.connect(**conf['sqlite'])
>>> t = con.table('functional_alltypes')
>>> rowid = t.row_id('rowid')  # it needs the name used for the backend as a parameter
>>> expr = t[row_id, t.index].head()
>>> print(expr.compile())
SELECT rowid - ? AS rowid, t0."index" 
FROM base.functional_alltypes AS t0
LIMIT ? OFFSET ?

limitations:

A pseudocolumn needs to be used by a table expression directly, if it is used by a selection it will raise NotImplementedErrorerror.

Resolves: #1462

@xmnlab xmnlab marked this pull request as ready for review June 20, 2020 21:40
@xmnlab
Copy link
Contributor Author

xmnlab commented Jun 20, 2020

this PR is ready for review. thanks!

Copy link
Contributor

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of questions...

Do you have any other pseudocolumn in mind that you want to implement? I'm wondering if having the PseudoColumn class is worth.

Why do we need the col_name parameter? I guess I'm missing something, but table.row_id().name('id') seems more natural/standard than table.row_id('id').

@xmnlab
Copy link
Contributor Author

xmnlab commented Jun 21, 2020

Do you have any other pseudocolumn in mind that you want to implement? I'm wondering if having the PseudoColumn class is worth.

for now it is just rowid, but as it has a different concept of the rest of the other operations, I think it would be better to define the rules using a base class.

Why do we need the col_name parameter? I guess I'm missing something, but table.row_id().name('id') seems more natural/standard than table.row_id('id').

the main goal is allowing the operation to be translated to something like: select row_id from table
without a alias (name operation).

normally, a common operation need alias (I didn't find a way to do it without that, any suggestion would be very welcome)

the ideal would be use just table.row_id() but I needed to keep consistence in the column name and the translation and to avoid conflicts with the name resolving in the expression execution so I needed to added it as a parameter.

I have tried a lot of different ways to do it ... but with no much success ... if there is a better way to implement that I would be glad to receive any suggestion

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give a concrete example of where this is useful?

@xmnlab
Copy link
Contributor Author

xmnlab commented Jun 29, 2020

@jreback here there is some use cases for oracle (that maybe could be implemented as row_id_hex on ibis).

in general, also for omniscidb, it can be used to improve performance. for example, in the link above, there is the following comment (https://stackoverflow.com/a/2701811/3715476):

ROWID is the physical location of a row. Consequently it is the fastest way of locating a row, faster even than a primary key lookup. So it can be useful in certain types of transaction where we select some rows, store their ROWIDs and then later on use the ROWIDs in where clauses for DML against those same rows.

@datapythonista datapythonista added backends - omnisci sqlite The SQLite backend feature Features or general enhancements expressions Issues or PRs related to the expression API labels Jul 2, 2020
@xmnlab
Copy link
Contributor Author

xmnlab commented Jul 26, 2020

an example from @jp-harvey

import pymapd
import pandas as pd
import os
from datetime import datetime, timedelta
import time

username = os.environ.get('omnisciusername', 'admin')
password = os.environ.get('omniscipassword', 'HyperInteractive')
dbname = os.environ.get('omniscidbname', 'omnisci')
port = os.environ.get('omnisciport', 6274)
protocol = os.environ.get('omnisciprotocol', 'binary')
omniscihost = os.environ.get('omniscihost', 'localhost')


newtablename = 'newtablenamehere'
lefttable = 'sometable'

batchsize = 10000   
offset = 0

con = pymapd.connect(user=username, password=password, dbname=dbname, host=omniscihost, port=port, protocol=protocol)
c = con.cursor()

countquery = 'SELECT COUNT(*) FROM {0};'.format(lefttable)
tablesize = list(c.execute(countquery))[0][0]

print('Record count of {0} is: {1}'.format(lefttable,tablesize))
print('Batch size is {0}'.format(batchsize))
print('Offset is {0}'.format(offset))

dropquery = 'DROP table IF EXISTS {0}'.format(newtablename)
# uncomment this if you are testing and want to drop the table before load
# c.execute(dropquery)
batchstarttime = datetime.now()
print('{0}: CTAS/ITAS started'.format(batchstarttime))
starttimestamp = time.time()

while offset < tablesize:
    joinquery = '''
    SELECT * FROM sometable WHERE rowid > {1} and rowid <= {0}
        '''.format(offset+batchsize,offset).replace("\n", " ")
    if offset == 0:
        query = 'CREATE TABLE {0} AS ({1})'.format(newtablename,joinquery)
    else:
        query = 'INSERT INTO {0} ({1})'.format(newtablename,joinquery)
    result = c.execute(query)

    offset += batchsize
    secondselapsed = time.time() - starttimestamp
    timeperrecord = secondselapsed / offset
    secondsremaining = (tablesize - offset) * timeperrecord
    projectedend = datetime.now() + timedelta(seconds=secondsremaining)
    print('{0}: {1} of {2} - {3}% (ETA {4})'.format(datetime.now(),offset,tablesize,int(offset/tablesize*100),projectedend))

@xmnlab
Copy link
Contributor Author

xmnlab commented Jul 27, 2020

@jreback any feedback about this? thanks

@kcpevey
Copy link

kcpevey commented Aug 7, 2020

@jreback @datapythonista can we get your feedback? We have a PR to add Ibis to Holoviews (holoviz/holoviews#4517) but this PR is required before we can do that. I'd love to get this merged in soon. :)

@datapythonista
Copy link
Contributor

@xmnlab can you rebase please? Also, I'd prefer that we don't divide this in two levels of abstraction at this point, and we leave the concept of pseudocolumn for later, if we ever add a new one. A comment in the code mentioning that if we add another pseudocolumn we should create a base class would be great.

@jreback you've got an example of this being used in https://github.com/holoviz/holoviews/pull/4517/files#diff-7839668a002e8874d14a2e3debc2f79cR135, I think you were adding for concrete examples.

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 13, 2020

@datapythonista thanks for the review.
I applied your suggestions. thanks!

Copy link
Contributor

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xmnlab, does this need to be exposed in the public documentation?

# pseudocolumn needs to be used by a table expression directly
# alltypes fixture from some backends maybe apply some operation on it
t = con.table('functional_alltypes')
backend_col_name = backend_pseudocolumn.get(backend.name, None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is the default, not needed.

Suggested change
backend_col_name = backend_pseudocolumn.get(backend.name, None)
backend_col_name = backend_pseudocolumn.get(backend.name)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. thanks for catching that. I am going to change that right now.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the usecase of this?

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 13, 2020

@jreback, as @kcpevey and @datapythonista mentioned before, there is a holoview PR (https://github.com/holoviz/holoviews/pull/4517/files#diff-7839668a002e8874d14a2e3debc2f79cR135) that depends on this feature.
let me know if you want more information. thanks

@jreback
Copy link
Contributor

jreback commented Aug 13, 2020

@xmnlab i mean in words what the purpose is

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 14, 2020

quoting from https://oracle-base.com/articles/misc/rowids-for-plsql-performance

Using a ROWID is the quickest way to access a row of data. If you are planning to retrieve some data, process it, then subsequently update the row in the same transaction, you can improve performance by using the ROWID.

ROWIDs are the fastest way to access a row of data, but if you can do an operation in a single DML statement, that is faster than selecting the data first, then supplying the ROWID to the DML statement.

If rows are moved, the ROWID will change. Rows can move due to maintenance operations like shrinks and table moves. As a result, storing ROWIDs for long periods of time is a bad idea. They should only be used in a single transaction, preferably as part of a SELECT ... FOR UPDATE, where the row is locked, preventing row movement.

the OmniSciDB rowid is similar and in other words, it would be useful to resolve specific problems when the performance is critical.

in one of the examples above (https://github.com/ibis-project/ibis/pull/2251#issuecomment-663923706), rowid was used to create chunks of data and it is faster than the other similar approaches.

not sure if I answered your question, but I will be happy to provide more information, just let me know what point you want more details.

@dharhas
Copy link

dharhas commented Aug 17, 2020

@datapythonista @jreback

This issue is blocker on having Holoviews integration with Ibis. It seems like we've tried to explain a few times the reasoning behind why we would like this feature in this thread, but we seem to be talking past each other. What do we need to to to get unblocked? Is there a specific concern about pseudocolumn support that needs to be addressed or do we need to have a higher bandwidth meeting about it?

Connecting Ibis and Holoviews opens up some great capabilities to build interactive dashboards and pipelines backed by databases through Ibis and we are eager to get unblocked and move this forward. Please let me know what we can do to address y'alls concerns and expedite the process of merging this pr.

@jreback
Copy link
Contributor

jreback commented Aug 17, 2020

@datapythonista @jreback

This issue is blocker on having Holoviews integration with Ibis. It seems like we've tried to explain a few times the reasoning behind why we would like this feature in this thread, but we seem to be talking past each other. What do we need to to to get unblocked? Is there a specific concern about pseudocolumn support that needs to be addressed or do we need to have a higher bandwidth meeting about it?

Connecting Ibis and Holoviews opens up some great capabilities to build interactive dashboards and pipelines backed by databases through Ibis and we are eager to get unblocked and move this forward. Please let me know what we can do to address y'alls concerns and expedite the process of merging this pr.

@dharhas I dont' think I am talking past you at all. I have not heard:

  • why you actuallly need this, when we haven't needed this 'feature' anywhere else
  • what alternative you tried and decided, oh I need row_id support

The reason I concerned is that this is a very sql specific feature, though its actually very easy to emulate in pandas. So again I'll ask the question. What exactly are you doing that this is necessary.

Pointing to code and saying, we are using it is not enough.

@datapythonista
Copy link
Contributor

@jreback my understanding from the holoviews PR is that the goal is to join two datasets by position. What in pandas would be:

>>> import pandas

>>> fruits = pandas.DataFrame({'name': ['Orange', 'Melon', 'Banana']})
>>> colors = pandas.DataFrame({'color_name': ['orange', 'green', 'yellow']})

>>> pandas.concat([fruits, colors], axis='columns')
     name color_name
0  Orange     orange
1   Melon      green
2  Banana     yellow

I think to do that in SQL, the equivalent would be:

sqlite> CREATE TABLE fruits (name TEXT);
sqlite> INSERT INTO fruits VALUES ('Orange');
sqlite> INSERT INTO fruits VALUES ('Melon');
sqlite> INSERT INTO fruits VALUES ('Banana');
sqlite> SELECT * FROM fruits;
Orange
Melon
Banana

sqlite> CREATE TABLE colors (color_name TEXT);
sqlite> INSERT INTO colors VALUES ('orange');
sqlite> INSERT INTO colors VALUES ('green');
sqlite> INSERT INTO colors VALUES ('yellow');
sqlite> SELECT * FROM colors;
orange
green
yellow

sqlite> SELECT name, color_name FROM fruits INNER JOIN colors ON colors.rowid = fruits.rowid;
Orange|orange
Melon|green
Banana|yellow

I'm personally not aware of any alternative to achieve the same in SQL without rowid. I don't think there is an equivalent to UNION to concatenate horizontally.

rowid would also work as an equivalent of pandas .iloc[N] (something like WHERE rowid = N). For example, if we know a table has garbage on the first 20 rows, but and we are interested in the next 100 (rows from 20 to 120) we could use WHERE rowid BETWEEN 20 and 120. I think having garbage in rows in more common in datasets (e.g. csv files) than in databases, but I guess that could be useful.

Not sure if you have alternatives to those, or if this wasn't needed until now because the use cases are not so common.

@dharhas
Copy link

dharhas commented Aug 17, 2020

@jreback that is a bit unfair, this is the first time you have clearly mentioned what you are looking for in terms of a response and what the actual concern is.

In terms of what you are now asking for in terms of a use case and a reason, we have provided those to an extent, maybe not as clearly as you would prefer i.e. holoviews integration and improved performance.

Holoviews and Datashader requires efficient access to selections of data , i.e. locations of the rows is required for interactive plotting of very large datasets. Based on panning and zooming, Holoviews will request different subsets of data from the larger dataset. We do have a workaround without it based on making a new column in the database but the performance is pretty bad. The row_id pseudocolumn approach improves performance significantly and makes Holoviews integration much more feasible.

If there is still a strong concern around adding this in, please let us know what you would like in terms of justification past what @datapythonista and others have already provided.

@jreback
Copy link
Contributor

jreback commented Aug 17, 2020

@dharhas I mentioned it #2251 (review), #2251 (review), and #2251 (comment). I don't actually see any responses except 'its in the code here'.

@datapythonista explanation was the clearest and actually pretty reasonable request actually. That was what I was looking for. If someone needs a 'feature' just because they think they needs it is not reasonable. Showing that you tried alternative and need more idiomatic api IS.

The reason I am concerned about this api expansion, is that a) it is now implicitly providing ordering for non-orderable tables which is not the sql standard, and b) this is a very sql type expression and not needed / necessary in other backends; ibis is quite pythonic and this is going the other way.

that said i guess this is fine, will review the actual code soon.

@dharhas
Copy link

dharhas commented Aug 18, 2020

@jreback Thank you for your time responding to this. The point I was trying to make was that we were unclear on what the concern was and unsure what you were looking for in a response. That is clearer now and we will try to do a better job with justification in the future.

One clarification, although we are initially implementing holoviews support for two backends, we hope to add others so that it will allow for a really good path for visualization and dashboards backed by Ibis.

@jreback
Copy link
Contributor

jreback commented Aug 18, 2020

@jreback Thank you for your time responding to this. The point I was trying to make was that we were unclear on what the concern was and unsure what you were looking for in a response. That is clearer now and we will try to do a better job with justification in the future.

One clarification, although we are initially implementing holoviews support for two backends, we hope to add others so that it will allow for a really good path for visualization and dashboards backed by Ibis.

fair enough


"""

def _validate(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RowID was implemented using TableColumn, and it will need it, without that it will raise:

~/dev/quansight/ibis-project/ibis/ibis/expr/signature.py in __init__(self, *args, **kwargs)
    181         for name, value in self.signature.validate(*args, **kwargs):
    182             setattr(self, name, value)
--> 183         self._validate()
    184 
    185     def _validate(self):

IbisTypeError: 'rowid' is not a field in ['index', 'Unnamed: 0', 'id', 'bool_col', 'tinyint_col', 'smallint_col', 'int_col', 'bigint_col', 'float_col', 'double_col', 'date_string_col', 'string_col', 'timestamp_col', 'year', 'month']

klass = self.output_type()
return klass(self, name=self.name)

def output_type(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these need actual types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using here _make_expr to keep the name parameter. maybe it is not the best way to do that. any recommendations would be very appreciated.

return klass(self, name=self.name)

def output_type(self):
return functools.partial(ir.IntegerColumn, dtype=dt.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just return the datatype class itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I changed to return dt.int64.column_type()

@@ -235,6 +234,23 @@ def _rpad(t, expr):
return arg + _generic_pad(arg, length, pad)


def _row_id(t, expr: ir.Expr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this only for sqlite or is it generally available for sql?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for postgresql we could use rowid and translate that to ctid (I didn't test that yet), about MySQL, I found some discussions but I didn't find a good answer, so not sure if there is a rowid for mysql.

Copy link
Contributor Author

@xmnlab xmnlab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments inline. just an extra comment about the current implementation:
as it is a TableColumn it has some challenges (at least to me), the operation resolving is a little bit special because it occurs in 2 different places: 1) in compiling time as also normally happens with common operations and 2) in data preparing time (eg from cursor to dataframe), that is why I implemented it using the approach t.row_id("rowid") (where "rowid" is used just for the data preparing time). not sure if there is a way to just get that directly from the translation.

maybe another approach would be the creation of this expression as a common unary operation and add it directly to ibis (ie: ibis.row_id()) ... I didn't find a way to allow an expression to be used without alias (eg.: t[ibis.row_id(), t]) ... so maybe it would need always to be used with an alias (eg. t[ibis.row_id().name('rowid'), t]).

any recommendations would be very appreciated. thanks!


"""

def _validate(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RowID was implemented using TableColumn, and it will need it, without that it will raise:

~/dev/quansight/ibis-project/ibis/ibis/expr/signature.py in __init__(self, *args, **kwargs)
    181         for name, value in self.signature.validate(*args, **kwargs):
    182             setattr(self, name, value)
--> 183         self._validate()
    184 
    185     def _validate(self):

IbisTypeError: 'rowid' is not a field in ['index', 'Unnamed: 0', 'id', 'bool_col', 'tinyint_col', 'smallint_col', 'int_col', 'bigint_col', 'float_col', 'double_col', 'date_string_col', 'string_col', 'timestamp_col', 'year', 'month']

return klass(self, name=self.name)

def output_type(self):
return functools.partial(ir.IntegerColumn, dtype=dt.int64)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I changed to return dt.int64.column_type()

klass = self.output_type()
return klass(self, name=self.name)

def output_type(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using here _make_expr to keep the name parameter. maybe it is not the best way to do that. any recommendations would be very appreciated.

@@ -235,6 +234,23 @@ def _rpad(t, expr):
return arg + _generic_pad(arg, length, pad)


def _row_id(t, expr: ir.Expr):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for postgresql we could use rowid and translate that to ctid (I didn't test that yet), about MySQL, I found some discussions but I didn't find a good answer, so not sure if there is a rowid for mysql.

@jreback
Copy link
Contributor

jreback commented Sep 1, 2020

superseded by #2345

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Issues or PRs related to the expression API feature Features or general enhancements sqlite The SQLite backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RowId operation
5 participants