-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add row_id pseudocolumn support #2251
Conversation
this PR is ready for review. thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of questions...
Do you have any other pseudocolumn in mind that you want to implement? I'm wondering if having the PseudoColumn
class is worth.
Why do we need the col_name
parameter? I guess I'm missing something, but table.row_id().name('id')
seems more natural/standard than table.row_id('id')
.
for now it is just rowid, but as it has a different concept of the rest of the other operations, I think it would be better to define the rules using a base class.
the main goal is allowing the operation to be translated to something like: normally, a common operation need alias (I didn't find a way to do it without that, any suggestion would be very welcome) the ideal would be use just I have tried a lot of different ways to do it ... but with no much success ... if there is a better way to implement that I would be glad to receive any suggestion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give a concrete example of where this is useful?
@jreback here there is some use cases for oracle (that maybe could be implemented as row_id_hex on ibis). in general, also for omniscidb, it can be used to improve performance. for example, in the link above, there is the following comment (https://stackoverflow.com/a/2701811/3715476):
|
an example from @jp-harvey import pymapd
import pandas as pd
import os
from datetime import datetime, timedelta
import time
username = os.environ.get('omnisciusername', 'admin')
password = os.environ.get('omniscipassword', 'HyperInteractive')
dbname = os.environ.get('omniscidbname', 'omnisci')
port = os.environ.get('omnisciport', 6274)
protocol = os.environ.get('omnisciprotocol', 'binary')
omniscihost = os.environ.get('omniscihost', 'localhost')
newtablename = 'newtablenamehere'
lefttable = 'sometable'
batchsize = 10000
offset = 0
con = pymapd.connect(user=username, password=password, dbname=dbname, host=omniscihost, port=port, protocol=protocol)
c = con.cursor()
countquery = 'SELECT COUNT(*) FROM {0};'.format(lefttable)
tablesize = list(c.execute(countquery))[0][0]
print('Record count of {0} is: {1}'.format(lefttable,tablesize))
print('Batch size is {0}'.format(batchsize))
print('Offset is {0}'.format(offset))
dropquery = 'DROP table IF EXISTS {0}'.format(newtablename)
# uncomment this if you are testing and want to drop the table before load
# c.execute(dropquery)
batchstarttime = datetime.now()
print('{0}: CTAS/ITAS started'.format(batchstarttime))
starttimestamp = time.time()
while offset < tablesize:
joinquery = '''
SELECT * FROM sometable WHERE rowid > {1} and rowid <= {0}
'''.format(offset+batchsize,offset).replace("\n", " ")
if offset == 0:
query = 'CREATE TABLE {0} AS ({1})'.format(newtablename,joinquery)
else:
query = 'INSERT INTO {0} ({1})'.format(newtablename,joinquery)
result = c.execute(query)
offset += batchsize
secondselapsed = time.time() - starttimestamp
timeperrecord = secondselapsed / offset
secondsremaining = (tablesize - offset) * timeperrecord
projectedend = datetime.now() + timedelta(seconds=secondsremaining)
print('{0}: {1} of {2} - {3}% (ETA {4})'.format(datetime.now(),offset,tablesize,int(offset/tablesize*100),projectedend)) |
@jreback any feedback about this? thanks |
@jreback @datapythonista can we get your feedback? We have a PR to add Ibis to Holoviews (holoviz/holoviews#4517) but this PR is required before we can do that. I'd love to get this merged in soon. :) |
@xmnlab can you rebase please? Also, I'd prefer that we don't divide this in two levels of abstraction at this point, and we leave the concept of pseudocolumn for later, if we ever add a new one. A comment in the code mentioning that if we add another pseudocolumn we should create a base class would be great. @jreback you've got an example of this being used in https://github.com/holoviz/holoviews/pull/4517/files#diff-7839668a002e8874d14a2e3debc2f79cR135, I think you were adding for concrete examples. |
@datapythonista thanks for the review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xmnlab, does this need to be exposed in the public documentation?
# pseudocolumn needs to be used by a table expression directly | ||
# alltypes fixture from some backends maybe apply some operation on it | ||
t = con.table('functional_alltypes') | ||
backend_col_name = backend_pseudocolumn.get(backend.name, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
is the default, not needed.
backend_col_name = backend_pseudocolumn.get(backend.name, None) | |
backend_col_name = backend_pseudocolumn.get(backend.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. thanks for catching that. I am going to change that right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the usecase of this?
@jreback, as @kcpevey and @datapythonista mentioned before, there is a holoview PR (https://github.com/holoviz/holoviews/pull/4517/files#diff-7839668a002e8874d14a2e3debc2f79cR135) that depends on this feature. |
@xmnlab i mean in words what the purpose is |
quoting from https://oracle-base.com/articles/misc/rowids-for-plsql-performance
the OmniSciDB in one of the examples above ( not sure if I answered your question, but I will be happy to provide more information, just let me know what point you want more details. |
This issue is blocker on having Holoviews integration with Ibis. It seems like we've tried to explain a few times the reasoning behind why we would like this feature in this thread, but we seem to be talking past each other. What do we need to to to get unblocked? Is there a specific concern about pseudocolumn support that needs to be addressed or do we need to have a higher bandwidth meeting about it? Connecting Ibis and Holoviews opens up some great capabilities to build interactive dashboards and pipelines backed by databases through Ibis and we are eager to get unblocked and move this forward. Please let me know what we can do to address y'alls concerns and expedite the process of merging this pr. |
@dharhas I dont' think I am talking past you at all. I have not heard:
The reason I concerned is that this is a very sql specific feature, though its actually very easy to emulate in pandas. So again I'll ask the question. What exactly are you doing that this is necessary. Pointing to code and saying, we are using it is not enough. |
@jreback my understanding from the holoviews PR is that the goal is to join two datasets by position. What in pandas would be: >>> import pandas
>>> fruits = pandas.DataFrame({'name': ['Orange', 'Melon', 'Banana']})
>>> colors = pandas.DataFrame({'color_name': ['orange', 'green', 'yellow']})
>>> pandas.concat([fruits, colors], axis='columns')
name color_name
0 Orange orange
1 Melon green
2 Banana yellow I think to do that in SQL, the equivalent would be:
I'm personally not aware of any alternative to achieve the same in SQL without
Not sure if you have alternatives to those, or if this wasn't needed until now because the use cases are not so common. |
@jreback that is a bit unfair, this is the first time you have clearly mentioned what you are looking for in terms of a response and what the actual concern is. In terms of what you are now asking for in terms of a use case and a reason, we have provided those to an extent, maybe not as clearly as you would prefer i.e. holoviews integration and improved performance. Holoviews and Datashader requires efficient access to selections of data , i.e. locations of the rows is required for interactive plotting of very large datasets. Based on panning and zooming, Holoviews will request different subsets of data from the larger dataset. We do have a workaround without it based on making a new column in the database but the performance is pretty bad. The row_id pseudocolumn approach improves performance significantly and makes Holoviews integration much more feasible. If there is still a strong concern around adding this in, please let us know what you would like in terms of justification past what @datapythonista and others have already provided. |
@dharhas I mentioned it #2251 (review), #2251 (review), and #2251 (comment). I don't actually see any responses except 'its in the code here'. @datapythonista explanation was the clearest and actually pretty reasonable request actually. That was what I was looking for. If someone needs a 'feature' just because they think they needs it is not reasonable. Showing that you tried alternative and need more idiomatic api IS. The reason I am concerned about this api expansion, is that a) it is now implicitly providing ordering for non-orderable tables which is not the sql standard, and b) this is a very sql type expression and not needed / necessary in other backends; ibis is quite pythonic and this is going the other way. that said i guess this is fine, will review the actual code soon. |
@jreback Thank you for your time responding to this. The point I was trying to make was that we were unclear on what the concern was and unsure what you were looking for in a response. That is clearer now and we will try to do a better job with justification in the future. One clarification, although we are initially implementing holoviews support for two backends, we hope to add others so that it will allow for a really good path for visualization and dashboards backed by Ibis. |
fair enough |
|
||
""" | ||
|
||
def _validate(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RowID was implemented using TableColumn
, and it will need it, without that it will raise:
~/dev/quansight/ibis-project/ibis/ibis/expr/signature.py in __init__(self, *args, **kwargs)
181 for name, value in self.signature.validate(*args, **kwargs):
182 setattr(self, name, value)
--> 183 self._validate()
184
185 def _validate(self):
IbisTypeError: 'rowid' is not a field in ['index', 'Unnamed: 0', 'id', 'bool_col', 'tinyint_col', 'smallint_col', 'int_col', 'bigint_col', 'float_col', 'double_col', 'date_string_col', 'string_col', 'timestamp_col', 'year', 'month']
klass = self.output_type() | ||
return klass(self, name=self.name) | ||
|
||
def output_type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these need actual types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using here _make_expr
to keep the name parameter. maybe it is not the best way to do that. any recommendations would be very appreciated.
ibis/expr/operations.py
Outdated
return klass(self, name=self.name) | ||
|
||
def output_type(self): | ||
return functools.partial(ir.IntegerColumn, dtype=dt.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just return the datatype class itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I changed to return dt.int64.column_type()
@@ -235,6 +234,23 @@ def _rpad(t, expr): | |||
return arg + _generic_pad(arg, length, pad) | |||
|
|||
|
|||
def _row_id(t, expr: ir.Expr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this only for sqlite or is it generally available for sql?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for postgresql we could use rowid and translate that to ctid
(I didn't test that yet), about MySQL, I found some discussions but I didn't find a good answer, so not sure if there is a rowid for mysql.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments inline. just an extra comment about the current implementation:
as it is a TableColumn
it has some challenges (at least to me), the operation resolving is a little bit special because it occurs in 2 different places: 1) in compiling time as also normally happens with common operations and 2) in data preparing time (eg from cursor to dataframe), that is why I implemented it using the approach t.row_id("rowid")
(where "rowid" is used just for the data preparing time
). not sure if there is a way to just get that directly from the translation.
maybe another approach would be the creation of this expression as a common unary operation and add it directly to ibis
(ie: ibis.row_id()
) ... I didn't find a way to allow an expression to be used without alias (eg.: t[ibis.row_id(), t]
) ... so maybe it would need always to be used with an alias (eg. t[ibis.row_id().name('rowid'), t]
).
any recommendations would be very appreciated. thanks!
|
||
""" | ||
|
||
def _validate(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RowID was implemented using TableColumn
, and it will need it, without that it will raise:
~/dev/quansight/ibis-project/ibis/ibis/expr/signature.py in __init__(self, *args, **kwargs)
181 for name, value in self.signature.validate(*args, **kwargs):
182 setattr(self, name, value)
--> 183 self._validate()
184
185 def _validate(self):
IbisTypeError: 'rowid' is not a field in ['index', 'Unnamed: 0', 'id', 'bool_col', 'tinyint_col', 'smallint_col', 'int_col', 'bigint_col', 'float_col', 'double_col', 'date_string_col', 'string_col', 'timestamp_col', 'year', 'month']
ibis/expr/operations.py
Outdated
return klass(self, name=self.name) | ||
|
||
def output_type(self): | ||
return functools.partial(ir.IntegerColumn, dtype=dt.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I changed to return dt.int64.column_type()
klass = self.output_type() | ||
return klass(self, name=self.name) | ||
|
||
def output_type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using here _make_expr
to keep the name parameter. maybe it is not the best way to do that. any recommendations would be very appreciated.
@@ -235,6 +234,23 @@ def _rpad(t, expr): | |||
return arg + _generic_pad(arg, length, pad) | |||
|
|||
|
|||
def _row_id(t, expr: ir.Expr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for postgresql we could use rowid and translate that to ctid
(I didn't test that yet), about MySQL, I found some discussions but I didn't find a good answer, so not sure if there is a rowid for mysql.
superseded by #2345 |
This PR aims to start a discussion about pseudocolumn support (https://en.wikipedia.org/wiki/Pseudocolumn).
Initially it adds:
example:
limitations:
A pseudocolumn needs to be used by a table expression directly, if it is used by a selection it will raise
NotImplementedError
error.Resolves: #1462