Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DM-17872] Parsing for math and coercing types #116

Merged
merged 1 commit into from
Feb 22, 2024
Merged

[DM-17872] Parsing for math and coercing types #116

merged 1 commit into from
Feb 22, 2024

Conversation

cbanek
Copy link
Contributor

@cbanek cbanek commented Jul 15, 2021

We had some problems at LSST where if someone was doing math in a select statement, such as doing a column plus a constant, and this would end up coming out as a string representation of the number, which causes some problems in portals and other things with smart sorting for number types. This is a pretty rough stab, and isn't well tested, but I was wondering what you think.

@pdowler
Copy link
Member

pdowler commented Jul 15, 2021

The approach looks sound: figure out the obvious cases and fall back to string (like before)... can be extended in future if we encounter something more obscure.

The one thing that will cause pain is the resulting names in the TapSelectItem. From the test changes, these look to be the full function(arg) string, which is informative but doesn't follow the recommendation on names for expressions (https://www.ivoa.net/documents/TAP/20190927/REC-TAP-1.1.html#tth_sEc3, second paragraph of 3.2). The purpose of that is so that an output table can (probably) be used as an input (upload) table in the same or another tap service without having to modify field names. A "simple" solution allowed by the spec would be to use quoted identifiers (double quotes) for expressions, but I can't say for sure that our parser actually handles quoted identifiers correctly. taplint doesn't complain and I'm pretty sure Mark would test that, but you might want to try them (eg in an upload table) before going that way.

@cbanek
Copy link
Contributor Author

cbanek commented Jul 15, 2021

Thanks for the double check on the methodology, and yes I was wondering about the names in the TapSelection. I think I can get this back to the way it was before, although now knowing why we do it is certainly helpful. My type coercion is still broken though, I think what I'm going to do is say that the output from an arithmetic expression is either long or double, and try to avoid the smaller datatypes just because I don't want to have to worry about computing overflow into a bigger type (since I don't know the results, I don't know if anything has actually overflowed). I guess those types will be a bit bigger, but should be fine to still upload back as a temporary table like you mention without having to modify the types. I'll keep working on this and let you know!

@cbanek
Copy link
Contributor Author

cbanek commented Jul 15, 2021

I also just noticed that apparently the computed names of columns have to be unique! That'll be interesting. It seems like it might not exactly be unique now, if I have like two max calls, it seems like there will be two columns named max. I guess I could go for some kind of made up name like col42, col43, etc. and just increment the number. I liked how I had the entire subexpression since it seemed a little more useful in the finding the columns, but in reality users should be using aliases. I'll keep thinking about this a bit more.

@pdowler
Copy link
Member

pdowler commented Jul 15, 2021

At some point in the code the description field from tap_schema.columns gets into the output... if you could get the original expression into the description that would help clarity. Might have to augment TapSelectItem to do that and track down where that info gets merged with the ColumnDesc and put into the output (votable field).

Another thing is if the ADQL has an alias for the expression that always takes precedence over a generated column name.

Fun times with SQL!

@cbanek
Copy link
Contributor Author

cbanek commented Jul 17, 2021

Okay so an update here - I've gone ahead and gone back to the way the tests were with the function names and got that to pass. The only changes are the types where it's now a number and not a string.

Sorry this change got a bit big and I ended up kind of rewriting most of this. It's a bit nervewracking for me, but having the tests show green makes me feel pretty good. The one architectural thing I'd like to do that I'm struggling with is I think it'd be better if TapSelectItem could have a setAlias function or something to set the alias on an already created TapSelectItem. That'd get rid of some of the spread out alias handling and would be able to get it down to one spot. But when I make those changes, since they are in a different java library, I was having the trouble with the builds not finding the new interface. Anyway, this one is up for another round of review when you are ready!

@cbanek
Copy link
Contributor Author

cbanek commented Sep 23, 2021

Any more thoughts on taking this one?

cbanek added a commit to lsst-sqre/lsst-tap-service that referenced this pull request Oct 11, 2021
So this PR hasn't been accepted yet upstream, but we have one that
fixes some of the parsing and type coercian for math in sql.  Until
then we should pin this version that I've checked in and built myself
with the PR branch.

opencadc/tap#116
cbanek added a commit to lsst-sqre/lsst-tap-service that referenced this pull request Oct 11, 2021
So this PR hasn't been accepted yet upstream, but we have one that
fixes some of the parsing and type coercian for math in sql.  Until
then we should pin this version that I've checked in and built myself
with the PR branch.

opencadc/tap#116
Copy link
Member

@pdowler pdowler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay on this... finally getting back to it.

I think it would be good to add a member var colIndex that starts at 0 that would be incremented in the loop over select list items (eg line incremented up in the loop over selectListItems (around line 182 where the item is added). Then the colIndex could be used to make sure all generated TapSelectItem name(s) are unique and as a side effect they'd have a number indicating which column it was. Usage in other comments below.

The ExtractorTest would benefit from at least one uniqueness test, eg

select max(table_name), max(column_name) from tap_schema.columns

which with the suggested changes would have columns max0 and max1. Now that I write this comment, I could see an argument for a 1-based colIndex :-)

TapDataType rightType = right.getDatatype();

log.info("leftType: " + leftType + " rightType: " + rightType);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is tricky. My instinct is to default to DOUBLE and see if the types allow one to go back to LONG. Then as a second order improvement see if it's more correct to go back to a smaller type (FLOAT, INT, SHORT, ...). I'm not sure that second step is worth any effort so I might stop at DOUBLE and LONG, because:

  • jsqlparser only has long and double
  • a lot of numeric functions are hard-coded to double in TapSchemaDAO.getFunctionDescs
  • it will get plenty more complicated when we get to adding ADQL-2.1 cast operation

Yeah, unless you specifically have a use case for FLOAT, I'd remove that bit. If you do have a use case, then you do need to remove the else below or check for DOUBLE before FLOAT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, l just pulled out this whole thing. It's really hard to infer the type even if there's a column in there, since I can think of at least 1 case for each + - * /. And deefaulting to double seems too small since say double x double could be a long. And we don't have the number to test its range or anything. I think just a numbered col is where I'm at.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to worry about double * double going to anything but double, certainly not to long.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think fallback to double in SelectListExpressionExtractor.getItemFromExxpression() will work more often, Certainly expressions can output real values and if we use long the resulting VOTable will be invalid. Chosing long over double has to be the extra effort to improve the metadata quality when it is possible to infer that is safe/correct.

Now that I look at the complete function, is there also the possibility that an expression that's not a column or function results in a string (concat? anything else?) and that's not yet handled? I think that's why it was originally using string, but maybe that was overly cautious/lazy on my part :-)

There is a Concatenate extends BinaryExpression in another package but I don't recall where that gets shows up exactly... I think it's something that can be injected in the select list after the SelectListExpressionExtractor so the internal SQL concatenates multiple values into one output column (eg replace a Column with a Concatenate and probably have a custom Format for output). The presence of that class makes me thing jsqlparser doesn't have a concatenate.

@cbanek
Copy link
Contributor Author

cbanek commented Dec 21, 2023

OK hopefully I've taken care of the points you mentioned.

@cbanek
Copy link
Contributor Author

cbanek commented Jan 10, 2024

I'm still a little stumped by the concat operator - I've gone back to what you were doing with the default being string, and maybe that will catch that?

I've tried doing some queries in the unit tests with "||" which I think is the concat operator but yeah, jsqlparser did not like that and didn't parse the query. I do see that concatenate operator class but I'm not sure how to get to it.

Anyway, hopefully this is a bit better after some vacation to rest the ol' brain pan.

@pdowler
Copy link
Member

pdowler commented Jan 22, 2024

I don't think that operator is valid SQL92 (hence ADQL), so that class can be used internally to replace something in the ADQL with a concatenate so the executed SQL has the || operator. So I guess it is safe to assume you won't see this in the SelectListExtractor because the extractor runs before any query mangling happens. That was just an example that came to mind of an "expression" that produces a string...

I still wonder if there are other expressions that can result in a string rather than a number... but maybe this is good enough: we can just go with double, attempt to deduce that long is correct and better, and then wait and see if anyone finds that obscure use case where the result is a string (and the votable is invalid)

@cbanek
Copy link
Contributor Author

cbanek commented Jan 22, 2024

Well I am on line 384 of the SLEE defaulting back to string, so if there's another operator, and it's not one that I've already picked up above there in the if statements it should fall through to be a string. I can't think of any reason why a math expression would return a string, but I was thinking that maybe a function could return a string, like some databases I think have concat() But if someone used a function like that then they'd say it was a string output from that function, and we should be able to handle that no problem.

@pdowler pdowler merged commit 5039841 into opencadc:master Feb 22, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants