Add row_id pseudocolumn support #2251

xmnlab · 2020-06-16T17:26:36Z

This PR aims to start a discussion about pseudocolumn support (https://en.wikipedia.org/wiki/Pseudocolumn).

Initially it adds:

initial support for pseudocolumns
support for the row_id pseudocolumn in the omniscidb and sqlite backends.

example:

>>> con = ibis.sqlite.connect(**conf['sqlite'])
>>> t = con.table('functional_alltypes')
>>> rowid = t.row_id('rowid')  # it needs the name used for the backend as a parameter
>>> expr = t[rowid, t.index].head()
>>> print(expr.compile())
SELECT rowid - ? AS rowid, t0."index" 
FROM base.functional_alltypes AS t0
LIMIT ? OFFSET ?

limitations:

A pseudocolumn needs to be used by a table expression directly; if it is used by a selection, it will raise NotImplementedError.

Resolves: #1462

xmnlab · 2020-06-20T21:41:04Z

this PR is ready for review. thanks!

datapythonista

Couple of questions...

Do you have any other pseudocolumn in mind that you want to implement? I'm wondering if having the PseudoColumn class is worth it.

Why do we need the col_name parameter? I guess I'm missing something, but table.row_id().name('id') seems more natural/standard than table.row_id('id').

xmnlab · 2020-06-21T21:01:14Z

Do you have any other pseudocolumn in mind that you want to implement? I'm wondering if having the PseudoColumn class is worth it.

for now it is just rowid, but since it is conceptually different from the other operations, I think it would be better to define the rules using a base class.

Why do we need the col_name parameter? I guess I'm missing something, but table.row_id().name('id') seems more natural/standard than table.row_id('id').

the main goal is to allow the operation to be translated to something like: select row_id from table
without an alias (name operation).

normally, a common operation needs an alias (I didn't find a way to do it without one; any suggestion would be very welcome)

ideally we would use just table.row_id(), but I needed to keep the column name consistent with the translation and to avoid conflicts with name resolution during expression execution, so I had to add it as a parameter.

I have tried a lot of different ways to do it, but without much success. if there is a better way to implement that, I would be glad to receive any suggestions

jreback

can you give a concrete example of where this is useful?

xmnlab · 2020-06-29T15:14:40Z

@jreback here are some use cases for oracle (which maybe could be implemented as row_id_hex in ibis).

in general, also for omniscidb, it can be used to improve performance. for example, in the link above, there is the following comment (https://stackoverflow.com/a/2701811/3715476):

ROWID is the physical location of a row. Consequently it is the fastest way of locating a row, faster even than a primary key lookup. So it can be useful in certain types of transaction where we select some rows, store their ROWIDs and then later on use the ROWIDs in where clauses for DML against those same rows.
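The select-then-update pattern described in that quote can be sketched with SQLite's rowid, which is what the sqlite backend in this PR exposes. This is only an illustration under assumptions: the `accounts` table and its data are made up, and it says nothing about how Oracle or OmniSciDB perform.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
con.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")
con.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("ann", 10), ("bob", 250), ("cy", 400)],
)

# Step 1: select the rows of interest and remember their rowids.
rowids = [
    row[0]
    for row in con.execute(
        "SELECT rowid FROM accounts WHERE balance > 100 ORDER BY rowid"
    )
]

# Step 2: later, address exactly those rows by rowid in a DML statement.
con.executemany(
    "UPDATE accounts SET balance = balance - 100 WHERE rowid = ?",
    [(rid,) for rid in rowids],
)

balances = con.execute(
    "SELECT owner, balance FROM accounts ORDER BY rowid"
).fetchall()
print(rowids)    # [2, 3]
print(balances)  # [('ann', 10), ('bob', 150), ('cy', 300)]
```

The stored rowids are only reused within the same short-lived session here, which matches the "single transaction" advice quoted later in this thread.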

xmnlab · 2020-07-26T01:19:34Z

an example from @jp-harvey

import pymapd
import pandas as pd
import os
from datetime import datetime, timedelta
import time

username = os.environ.get('omnisciusername', 'admin')
password = os.environ.get('omniscipassword', 'HyperInteractive')
dbname = os.environ.get('omniscidbname', 'omnisci')
port = os.environ.get('omnisciport', 6274)
protocol = os.environ.get('omnisciprotocol', 'binary')
omniscihost = os.environ.get('omniscihost', 'localhost')


newtablename = 'newtablenamehere'
lefttable = 'sometable'

batchsize = 10000   
offset = 0

con = pymapd.connect(user=username, password=password, dbname=dbname, host=omniscihost, port=port, protocol=protocol)
c = con.cursor()

countquery = 'SELECT COUNT(*) FROM {0};'.format(lefttable)
tablesize = list(c.execute(countquery))[0][0]

print('Record count of {0} is: {1}'.format(lefttable,tablesize))
print('Batch size is {0}'.format(batchsize))
print('Offset is {0}'.format(offset))

dropquery = 'DROP table IF EXISTS {0}'.format(newtablename)
# uncomment this if you are testing and want to drop the table before load
# c.execute(dropquery)
batchstarttime = datetime.now()
print('{0}: CTAS/ITAS started'.format(batchstarttime))
starttimestamp = time.time()

while offset < tablesize:
    joinquery = '''
    SELECT * FROM {2} WHERE rowid > {1} and rowid <= {0}
        '''.format(offset+batchsize,offset,lefttable).replace("\n", " ")
    if offset == 0:
        query = 'CREATE TABLE {0} AS ({1})'.format(newtablename,joinquery)
    else:
        query = 'INSERT INTO {0} ({1})'.format(newtablename,joinquery)
    result = c.execute(query)

    offset += batchsize
    secondselapsed = time.time() - starttimestamp
    timeperrecord = secondselapsed / offset
    secondsremaining = (tablesize - offset) * timeperrecord
    projectedend = datetime.now() + timedelta(seconds=secondsremaining)
    print('{0}: {1} of {2} - {3}% (ETA {4})'.format(datetime.now(),offset,tablesize,int(offset/tablesize*100),projectedend))
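For readers without an OmniSciDB server, the same rowid-windowed batching idea can be tried against an in-memory SQLite database. This is a reduced sketch, not the script above: the `src` and `dst` table names, the batch size, and the data are assumptions, and it relies on SQLite assigning rowids starting at 1 for plain inserts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical source and destination tables.
con.execute("CREATE TABLE src (val INTEGER)")
con.executemany("INSERT INTO src VALUES (?)", [(i,) for i in range(25)])
con.execute("CREATE TABLE dst (val INTEGER)")

batch_size = 10
# Use the largest rowid rather than COUNT(*) as the upper bound,
# so that gaps left by deleted rows do not cut the copy short.
max_rowid = con.execute("SELECT MAX(rowid) FROM src").fetchone()[0]

offset = 0
batches = 0
while offset < max_rowid:
    # Copy one window of rows: offset < rowid <= offset + batch_size.
    con.execute(
        "INSERT INTO dst SELECT val FROM src WHERE rowid > ? AND rowid <= ?",
        (offset, offset + batch_size),
    )
    offset += batch_size
    batches += 1

copied = con.execute("SELECT COUNT(*) FROM dst").fetchone()[0]
print(batches, copied)  # 3 25
```

Each window is a simple range predicate on rowid, so no batch has to skip over earlier rows the way LIMIT/OFFSET paging does.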

xmnlab · 2020-07-27T17:02:41Z

@jreback any feedback about this? thanks

kcpevey · 2020-08-07T15:48:22Z

@jreback @datapythonista can we get your feedback? We have a PR to add Ibis to Holoviews (holoviz/holoviews#4517) but this PR is required before we can do that. I'd love to get this merged in soon. :)

datapythonista · 2020-08-13T08:42:58Z

@xmnlab can you rebase please? Also, I'd prefer that we don't divide this into two levels of abstraction at this point, and that we leave the concept of pseudocolumn for later, if we ever add a new one. A comment in the code mentioning that if we add another pseudocolumn we should create a base class would be great.

@jreback you've got an example of this being used in https://github.com/holoviz/holoviews/pull/4517/files#diff-7839668a002e8874d14a2e3debc2f79cR135, I think you were asking for concrete examples.

xmnlab · 2020-08-13T15:35:22Z

@datapythonista thanks for the review.
I applied your suggestions. thanks!

datapythonista

Thanks @xmnlab, does this need to be exposed in the public documentation?

datapythonista · 2020-08-13T17:17:21Z

ibis/tests/all/test_column.py

+    # pseudocolumn needs to be used by a table expression directly
+    # alltypes fixture from some backends maybe apply some operation on it
+    t = con.table('functional_alltypes')
+    backend_col_name = backend_pseudocolumn.get(backend.name, None)


None is the default, not needed.

Suggested change

backend_col_name = backend_pseudocolumn.get(backend.name, None)

backend_col_name = backend_pseudocolumn.get(backend.name)

you're right. thanks for catching that. I am going to change that right now.
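A quick check of the point behind this suggestion: `dict.get` already returns `None` when the key is missing, so the explicit default is redundant. The mapping contents below are an assumption for illustration; the real `backend_pseudocolumn` dictionary in the test may differ.

```python
# Hypothetical mapping; the actual test fixture may hold other entries.
backend_pseudocolumn = {"sqlite": "rowid", "omniscidb": "rowid"}

# Both spellings behave the same for a missing key.
explicit = backend_pseudocolumn.get("pandas", None)
implicit = backend_pseudocolumn.get("pandas")
print(explicit is None and implicit is None)  # True

# And the same for a key that is present.
print(backend_pseudocolumn.get("sqlite"))  # rowid
```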

jreback

what is the usecase of this?

xmnlab · 2020-08-13T20:29:28Z

@jreback, as @kcpevey and @datapythonista mentioned before, there is a holoview PR (https://github.com/holoviz/holoviews/pull/4517/files#diff-7839668a002e8874d14a2e3debc2f79cR135) that depends on this feature.
let me know if you want more information. thanks

jreback · 2020-08-13T20:30:28Z

@xmnlab i mean in words what the purpose is

xmnlab · 2020-08-14T01:37:45Z

quoting from https://oracle-base.com/articles/misc/rowids-for-plsql-performance

Using a ROWID is the quickest way to access a row of data. If you are planning to retrieve some data, process it, then subsequently update the row in the same transaction, you can improve performance by using the ROWID.

ROWIDs are the fastest way to access a row of data, but if you can do an operation in a single DML statement, that is faster than selecting the data first, then supplying the ROWID to the DML statement.

If rows are moved, the ROWID will change. Rows can move due to maintenance operations like shrinks and table moves. As a result, storing ROWIDs for long periods of time is a bad idea. They should only be used in a single transaction, preferably as part of a SELECT ... FOR UPDATE, where the row is locked, preventing row movement.
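The quote is about Oracle, but a related caveat can be observed in SQLite: a rowid is not a permanent identity for a row. The sketch below assumes a plain table without an INTEGER PRIMARY KEY or AUTOINCREMENT column; the table `t` and its values are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table with an implicit rowid only.
con.execute("CREATE TABLE t (name TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("c",)])

# Remember the rowid of row 'c', as a long-lived reference would.
saved = con.execute("SELECT rowid FROM t WHERE name = 'c'").fetchone()[0]

# The row is deleted and an unrelated row is inserted afterwards.
con.execute("DELETE FROM t WHERE rowid = ?", (saved,))
con.execute("INSERT INTO t VALUES ('d')")

# The saved rowid now points at a different row.
now = con.execute("SELECT name FROM t WHERE rowid = ?", (saved,)).fetchone()[0]
print(saved, now)  # 3 d
```

SQLite picks one more than the largest rowid currently in use, so the freed value is handed out again. Code that stores rowids should therefore use them promptly rather than keep them as durable keys.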

the OmniSciDB rowid is similar. in other words, it is useful for solving specific problems where performance is critical.

in one of the examples above (https://github.com/ibis-project/ibis/pull/2251#issuecomment-663923706), rowid was used to create chunks of data and it is faster than the other similar approaches.

not sure if I answered your question, but I will be happy to provide more information, just let me know which points you want more details on.

dharhas · 2020-08-17T18:04:45Z

@datapythonista @jreback

This issue is a blocker for Holoviews integration with Ibis. It seems like we've tried to explain a few times the reasoning behind why we would like this feature in this thread, but we seem to be talking past each other. What do we need to do to get unblocked? Is there a specific concern about pseudocolumn support that needs to be addressed or do we need to have a higher bandwidth meeting about it?

Connecting Ibis and Holoviews opens up some great capabilities to build interactive dashboards and pipelines backed by databases through Ibis and we are eager to get unblocked and move this forward. Please let me know what we can do to address y'alls concerns and expedite the process of merging this pr.

jreback · 2020-08-17T18:13:30Z

@datapythonista @jreback

This issue is a blocker for Holoviews integration with Ibis. It seems like we've tried to explain a few times the reasoning behind why we would like this feature in this thread, but we seem to be talking past each other. What do we need to do to get unblocked? Is there a specific concern about pseudocolumn support that needs to be addressed or do we need to have a higher bandwidth meeting about it?

Connecting Ibis and Holoviews opens up some great capabilities to build interactive dashboards and pipelines backed by databases through Ibis and we are eager to get unblocked and move this forward. Please let me know what we can do to address y'alls concerns and expedite the process of merging this pr.

@dharhas I don't think I am talking past you at all. I have not heard:

why you actually need this, when we haven't needed this 'feature' anywhere else
what alternative you tried and decided, oh I need row_id support

The reason I am concerned is that this is a very sql specific feature, though it's actually very easy to emulate in pandas. So again I'll ask the question: what exactly are you doing that makes this necessary?

Pointing to code and saying, we are using it is not enough.

datapythonista · 2020-08-17T19:01:41Z

@jreback my understanding from the holoviews PR is that the goal is to join two datasets by position. What in pandas would be:

>>> import pandas

>>> fruits = pandas.DataFrame({'name': ['Orange', 'Melon', 'Banana']})
>>> colors = pandas.DataFrame({'color_name': ['orange', 'green', 'yellow']})

>>> pandas.concat([fruits, colors], axis='columns')
     name color_name
0  Orange     orange
1   Melon      green
2  Banana     yellow

I think to do that in SQL, the equivalent would be:

sqlite> CREATE TABLE fruits (name TEXT);
sqlite> INSERT INTO fruits VALUES ('Orange');
sqlite> INSERT INTO fruits VALUES ('Melon');
sqlite> INSERT INTO fruits VALUES ('Banana');
sqlite> SELECT * FROM fruits;
Orange
Melon
Banana

sqlite> CREATE TABLE colors (color_name TEXT);
sqlite> INSERT INTO colors VALUES ('orange');
sqlite> INSERT INTO colors VALUES ('green');
sqlite> INSERT INTO colors VALUES ('yellow');
sqlite> SELECT * FROM colors;
orange
green
yellow

sqlite> SELECT name, color_name FROM fruits INNER JOIN colors ON colors.rowid = fruits.rowid;
Orange|orange
Melon|green
Banana|yellow

I'm personally not aware of any alternative to achieve the same in SQL without rowid. I don't think there is an equivalent to UNION to concatenate horizontally.
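The SQLite session above can be reproduced from Python with the standard-library `sqlite3` module. This is a sketch of the same positional join, and it only lines rows up because both tables were filled in the same order with no deletions in between.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fruits (name TEXT)")
con.execute("CREATE TABLE colors (color_name TEXT)")
con.executemany(
    "INSERT INTO fruits VALUES (?)", [("Orange",), ("Melon",), ("Banana",)]
)
con.executemany(
    "INSERT INTO colors VALUES (?)", [("orange",), ("green",), ("yellow",)]
)

# Join the two tables by position, using rowid as the implicit row index.
pairs = con.execute(
    """
    SELECT name, color_name
    FROM fruits
    INNER JOIN colors ON colors.rowid = fruits.rowid
    ORDER BY fruits.rowid
    """
).fetchall()
print(pairs)
# [('Orange', 'orange'), ('Melon', 'green'), ('Banana', 'yellow')]
```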

rowid would also work as an equivalent of pandas .iloc[N] (something like WHERE rowid = N). For example, if we know a table has garbage in the first 20 rows, and we are interested in the next 100 (rows from 20 to 120), we could use WHERE rowid BETWEEN 20 and 120. I think having garbage in rows is more common in datasets (e.g. csv files) than in databases, but I guess that could be useful.
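One detail to watch when treating rowid as a positional index in SQLite: rowids assigned by plain inserts start at 1, while Python and pandas positions start at 0, and BETWEEN is inclusive at both ends. The sketch below uses a plain list slice as a stand-in for `.iloc[20:120]`; the `readings` table and its values are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table with 200 rows inserted in order, no deletions.
con.execute("CREATE TABLE readings (val INTEGER)")
values = list(range(200))
con.executemany("INSERT INTO readings VALUES (?)", [(v,) for v in values])

# Positional selection, like df.iloc[20:120]: skip 20 rows, take 100.
by_position = values[20:120]

# The matching rowid window is shifted by one: position 20 is rowid 21,
# and the inclusive upper bound 120 corresponds to position 119.
by_rowid = [
    row[0]
    for row in con.execute(
        "SELECT val FROM readings WHERE rowid BETWEEN 21 AND 120 ORDER BY rowid"
    )
]
print(len(by_rowid), by_rowid == by_position)  # 100 True
```

Whether other backends number rows from 0 or 1 is not covered here, so the offset would need checking per backend.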

Not sure if you have alternatives to those, or if this wasn't needed until now because the use cases are not so common.

dharhas · 2020-08-17T21:18:40Z

@jreback that is a bit unfair, this is the first time you have clearly mentioned what you are looking for in terms of a response and what the actual concern is.

In terms of what you are now asking for in terms of a use case and a reason, we have provided those to an extent, maybe not as clearly as you would prefer i.e. holoviews integration and improved performance.

Holoviews and Datashader require efficient access to selections of data, i.e. the locations of the rows are required for interactive plotting of very large datasets. Based on panning and zooming, Holoviews will request different subsets of data from the larger dataset. We do have a workaround without it based on making a new column in the database but the performance is pretty bad. The row_id pseudocolumn approach improves performance significantly and makes Holoviews integration much more feasible.

If there is still a strong concern around adding this in, please let us know what you would like in terms of justification past what @datapythonista and others have already provided.

jreback · 2020-08-17T22:48:23Z

@dharhas I mentioned it #2251 (review), #2251 (review), and #2251 (comment). I don't actually see any responses except 'its in the code here'.

@datapythonista's explanation was the clearest and is actually a pretty reasonable request. That was what I was looking for. Asking for a 'feature' just because someone thinks they need it is not reasonable. Showing that you tried alternatives and need a more idiomatic api IS.

The reason I am concerned about this api expansion, is that a) it is now implicitly providing ordering for non-orderable tables which is not the sql standard, and b) this is a very sql type expression and not needed / necessary in other backends; ibis is quite pythonic and this is going the other way.

that said i guess this is fine, will review the actual code soon.

dharhas · 2020-08-18T13:19:34Z

@jreback Thank you for your time responding to this. The point I was trying to make was that we were unclear on what the concern was and unsure what you were looking for in a response. That is clearer now and we will try to do a better job with justification in the future.

One clarification, although we are initially implementing holoviews support for two backends, we hope to add others so that it will allow for a really good path for visualization and dashboards backed by Ibis.

jreback · 2020-08-18T14:50:08Z

@jreback Thank you for your time responding to this. The point I was trying to make was that we were unclear on what the concern was and unsure what you were looking for in a response. That is clearer now and we will try to do a better job with justification in the future.

One clarification, although we are initially implementing holoviews support for two backends, we hope to add others so that it will allow for a really good path for visualization and dashboards backed by Ibis.

fair enough

jreback · 2020-08-19T14:50:52Z

ibis/expr/operations.py

+
+    """
+
+    def _validate(self):


what is this here?

RowID was implemented as a TableColumn, so it needs this override; without it, it raises:

~/dev/quansight/ibis-project/ibis/ibis/expr/signature.py in __init__(self, *args, **kwargs)
    181         for name, value in self.signature.validate(*args, **kwargs):
    182             setattr(self, name, value)
--> 183         self._validate()
    184
    185     def _validate(self):

IbisTypeError: 'rowid' is not a field in ['index', 'Unnamed: 0', 'id', 'bool_col', 'tinyint_col', 'smallint_col', 'int_col', 'bigint_col', 'float_col', 'double_col', 'date_string_col', 'string_col', 'timestamp_col', 'year', 'month']

jreback · 2020-08-19T14:51:02Z

ibis/expr/operations.py

+        klass = self.output_type()
+        return klass(self, name=self.name)
+
+    def output_type(self):


these need actual types

I am using here _make_expr to keep the name parameter. maybe it is not the best way to do that. any recommendations would be very appreciated.

jreback · 2020-08-19T14:51:23Z

ibis/expr/operations.py

+        return klass(self, name=self.name)
+
+    def output_type(self):
+        return functools.partial(ir.IntegerColumn, dtype=dt.int64)


you can just return the datatype class itself

ok, I changed to return dt.int64.column_type()

jreback · 2020-08-19T14:51:48Z

ibis/sql/sqlite/compiler.py

@@ -235,6 +234,23 @@ def _rpad(t, expr):
    return arg + _generic_pad(arg, length, pad)


+def _row_id(t, expr: ir.Expr):


is this only for sqlite or is it generally available for sql?

for postgresql we could use rowid and translate that to ctid (I didn't test that yet). about MySQL, I found some discussions but no good answer, so I'm not sure if there is a rowid for mysql.

xmnlab

I added some comments inline. just an extra comment about the current implementation:
as it is a TableColumn it has some challenges (at least for me). the operation resolution is a little bit special because it occurs in 2 different places: 1) at compile time, as normally happens with common operations, and 2) at data preparation time (e.g. from cursor to dataframe). that is why I implemented it using the approach t.row_id("rowid") (where "rowid" is used just at data preparation time). not sure if there is a way to get that directly from the translation.

maybe another approach would be to create this expression as a common unary operation and add it directly to ibis (i.e. ibis.row_id()). I didn't find a way to allow an expression to be used without an alias (e.g. t[ibis.row_id(), t]), so maybe it would always need to be used with an alias (e.g. t[ibis.row_id().name('rowid'), t]).

any recommendations would be very appreciated. thanks!

xmnlab · 2020-08-19T22:11:50Z

ibis/expr/operations.py

+
+    """
+
+    def _validate(self):


RowID was implemented as a TableColumn, so it needs this override; without it, it raises:

~/dev/quansight/ibis-project/ibis/ibis/expr/signature.py in __init__(self, *args, **kwargs)
    181         for name, value in self.signature.validate(*args, **kwargs):
    182             setattr(self, name, value)
--> 183         self._validate()
    184
    185     def _validate(self):

IbisTypeError: 'rowid' is not a field in ['index', 'Unnamed: 0', 'id', 'bool_col', 'tinyint_col', 'smallint_col', 'int_col', 'bigint_col', 'float_col', 'double_col', 'date_string_col', 'string_col', 'timestamp_col', 'year', 'month']

xmnlab · 2020-08-19T22:15:04Z

ibis/expr/operations.py

+        return klass(self, name=self.name)
+
+    def output_type(self):
+        return functools.partial(ir.IntegerColumn, dtype=dt.int64)


ok, I changed to return dt.int64.column_type()

xmnlab · 2020-08-19T22:18:43Z

ibis/expr/operations.py

+        klass = self.output_type()
+        return klass(self, name=self.name)
+
+    def output_type(self):


I am using here _make_expr to keep the name parameter. maybe it is not the best way to do that. any recommendations would be very appreciated.

xmnlab · 2020-08-19T23:40:29Z

ibis/sql/sqlite/compiler.py

@@ -235,6 +234,23 @@ def _rpad(t, expr):
    return arg + _generic_pad(arg, length, pad)


+def _row_id(t, expr: ir.Expr):


for postgresql we could use rowid and translate that to ctid (I didn't test that yet). about MySQL, I found some discussions but no good answer, so I'm not sure if there is a rowid for mysql.

jreback · 2020-09-01T21:02:22Z

superseded by #2345

xmnlab force-pushed the add-rowid-op-3 branch from 1675341 to a7ab659 Compare June 18, 2020 15:07

xmnlab marked this pull request as ready for review June 20, 2020 21:40

datapythonista reviewed Jun 21, 2020

jreback suggested changes Jun 26, 2020

datapythonista added the backends - omnisci, sqlite, feature, and expressions labels Jul 2, 2020

kcpevey mentioned this pull request Jul 24, 2020

[Ibis] Add row_id pseudocolumn support Quansight/omnisci#131

Closed

xmnlab added 3 commits August 13, 2020 11:27

Add row_id pseudocolumn support

e204f28

Add release note

749c844

Apply suggestion from review.

828590f

xmnlab force-pushed the add-rowid-op-3 branch from a7ab659 to 828590f Compare August 13, 2020 15:34

datapythonista approved these changes Aug 13, 2020

jreback suggested changes Aug 13, 2020

jreback suggested changes Aug 19, 2020

Change output_type approach

edb569b

xmnlab commented Aug 19, 2020

datapythonista mentioned this pull request Aug 27, 2020

ENH: Implementing rowid #2345

Merged

jreback closed this Sep 1, 2020

jcrist mentioned this pull request Dec 9, 2022

fix: clarify and normalize behavior of Table.rowid #4991

Merged
