Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rationalize agate in dbt #1639

Closed
drewbanin opened this issue Jul 30, 2019 · 6 comments
Closed

Rationalize agate in dbt #1639

drewbanin opened this issue Jul 30, 2019 · 6 comments
Labels
dependencies Changes to the version of dbt dependencies enhancement New feature or request

Comments

@drewbanin
Copy link
Contributor

drewbanin commented Jul 30, 2019

List of open (or recently closed) issues pertaining to Agate:

Let's rip out agate where it's not needed and make it generally less clever (type inference) where we can.

@drewbanin drewbanin added enhancement New feature or request dependencies Changes to the version of dbt dependencies labels Jul 30, 2019
@Aylr
Copy link

Aylr commented Aug 6, 2019

I'm happy to pitch in here after dealing with this in the new run_query() function.

@drewbanin
Copy link
Contributor Author

Hey @Aylr - I appreciate your support :)

This is going to be a pretty sizable change to dbt and I want to make sure we treat it with the respect it deserves! I think a good place to start is: What should we use instead? dbt operates on tabular data (either from loaded up seed files, or from the results of queries). I think there are parts of the agate API that are sensible. I like that you can operate on columns, or iterate over rows in the data frame, for instance.

Before we picked agate, we considered using something like Pandas. In practice, Pandas is a real bear to install, so we chose not to ship that with dbt. Are you aware of any existing "data frame" libs that are easy to install, or do you think this is something that makes sense for us to build ourselves?

@Aylr
Copy link

Aylr commented Aug 8, 2019

@drewbanin I totally agree that this should be very thoughtful!

From my perspective (which is likely different that a majority of the dbt target audience as I understand it - so I want to call out that bias), I think that choosing something other than pandas doesn't make much sense since it is effectively the de-facto dataframe library. I do want to pay particular attention that we keep installation as easy as possible for everyone who wants to try dbt. However, from where I sit (data science & engineering) installing pandas is not an issue. But again, I do want to be thoughtful about that.

@Aylr
Copy link

Aylr commented Aug 8, 2019

And FWIW pandas has all those kinds of operators and a wide user base to draw expertise from (myself included).

@beckjake
Copy link
Contributor

I've been looking into this, and I have a few thoughts!

First, removing agate entirely seems like overkill, especially because of how useful it is in userspace. Instead I think we should ramp down the amount of type inference it does during parsing - realistically speaking, type inference is where all the bugs live in dbt's interaction with agate. So the rest of my suggestions revolve around that. I'm reluctant to use pandas here as while it's really powerful and has a great ecosystem, it requires a C compiler on some platforms and I really don't think core should have to require that. We've spent a lot of energy to avoid requiring it in most circumstances for psycopg2, and I think that was energy well spent for many of our less-technical users.

csv table parsing should know more about the user's schema.yml settings, in particular if there are column types defined. If there are, we should tell agate that the column is in question is "text", and then do a cast. It does appear possible to tell agate that "these rows are fixed, otherwise infer types", so it's not impossible. We could try to get extra clever and use the specified type to guess what to use, and then pass that to agate, but I'm sure that we'll just mess that up some percentage of the time so I'd prefer not to.

We should come up with a date format (or set of formats) we allow in seeds and not support anything else. I'm in favor of %Y-%m-%d %H:%M:%S with no timezone because that's what all our integration test seeds use. We can pass that to agate.data_types.DateTime as the datetime_format format when building our TypeTester and do similar for agate.data_types.Date. This avoids the whole 'tomorrow' issue by bypassing the parsedatetime.Calendar behavior, which is locale specific and therefore an even more horrible minefield than you probably imagined. I'm pretty confident seeds should not parse differently depending upon your locale! The downside is that parsedatetime is also how agate supports such a wide range of date/datetime formats. I spent a good amount of time seeing if we could subclass Calendar and override things to only avoid natural language dates like 'tomorrow' - my takeaway was that it's a lot of work, if it's possible at all!

We should just drop agate.data_types.TimeDelta from our TypeTester. I'm sure it causes more problems than it solves (does it solve any actual problems?)

To do all this we'll have to do some plumbing work I've explored and don't think is too bad: we have to construct our TypeTester at csv-parse-time in order to tell it about the fields we want to force to text so we can cast them later. I don't think that's a big deal.

We could alternatively set the type tester to parse everything as agate.data_types.Text and cast later. I'm mostly against this because it broke... not technically every test, but close enough to really feel like it.

I'll have a PR up for some of this at a later point, but I wanted to get these thoughts out there.

@drewbanin drewbanin changed the title Remove agate from dbt Rationalize agate in dbt Nov 20, 2019
@drewbanin drewbanin added this to the 0.15.2: Barbara Gittings milestone Nov 20, 2019
beckjake added a commit that referenced this issue Dec 13, 2019
@beckjake
Copy link
Contributor

I think this is fixed in #1920 - please re-open it if that's incorrect!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Changes to the version of dbt dependencies enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants