Rationalize agate in dbt #1639

drewbanin · 2019-07-30T14:00:51Z

List of open (or recently closed) issues pertaining to Agate:

Handle seeds with utf-8 BOM (#1177) #1452
Seed compilation fails with non-ascii character in v0.14.0 #1632
[Bug] tuple.index(x): x not in tuple #1626
Remove agate_table from seed nodes #1566
Seed with custom column type of string fails with boolean like seed values #999
- more
a database named tomorrow will cause dbt to fail

Let's rip out agate where it's not needed and make it generally less clever (type inference) where we can.

The text was updated successfully, but these errors were encountered:

Aylr · 2019-08-06T03:23:44Z

I'm happy to pitch in here after dealing with this in the new run_query() function.

drewbanin · 2019-08-06T20:50:50Z

Hey @Aylr - I appreciate your support :)

This is going to be a pretty sizable change to dbt and I want to make sure we treat it with the respect it deserves! I think a good place to start is: What should we use instead? dbt operates on tabular data (either from loaded up seed files, or from the results of queries). I think there are parts of the agate API that are sensible. I like that you can operate on columns, or iterate over rows in the data frame, for instance.

Before we picked agate, we considered using something like Pandas. In practice, Pandas is a real bear to install, so we chose not to ship that with dbt. Are you aware of any existing "data frame" libs that are easy to install, or do you think this is something that makes sense for us to build ourselves?

Aylr · 2019-08-08T22:51:46Z

@drewbanin I totally agree that this should be very thoughtful!

From my perspective (which is likely different that a majority of the dbt target audience as I understand it - so I want to call out that bias), I think that choosing something other than pandas doesn't make much sense since it is effectively the de-facto dataframe library. I do want to pay particular attention that we keep installation as easy as possible for everyone who wants to try dbt. However, from where I sit (data science & engineering) installing pandas is not an issue. But again, I do want to be thoughtful about that.

Aylr · 2019-08-08T22:52:56Z

And FWIW pandas has all those kinds of operators and a wide user base to draw expertise from (myself included).

beckjake · 2019-11-13T15:17:55Z

I've been looking into this, and I have a few thoughts!

First, removing agate entirely seems like overkill, especially because of how useful it is in userspace. Instead I think we should ramp down the amount of type inference it does during parsing - realistically speaking, type inference is where all the bugs live in dbt's interaction with agate. So the rest of my suggestions revolve around that. I'm reluctant to use pandas here as while it's really powerful and has a great ecosystem, it requires a C compiler on some platforms and I really don't think core should have to require that. We've spent a lot of energy to avoid requiring it in most circumstances for psycopg2, and I think that was energy well spent for many of our less-technical users.

csv table parsing should know more about the user's schema.yml settings, in particular if there are column types defined. If there are, we should tell agate that the column is in question is "text", and then do a cast. It does appear possible to tell agate that "these rows are fixed, otherwise infer types", so it's not impossible. We could try to get extra clever and use the specified type to guess what to use, and then pass that to agate, but I'm sure that we'll just mess that up some percentage of the time so I'd prefer not to.

We should come up with a date format (or set of formats) we allow in seeds and not support anything else. I'm in favor of %Y-%m-%d %H:%M:%S with no timezone because that's what all our integration test seeds use. We can pass that to agate.data_types.DateTime as the datetime_format format when building our TypeTester and do similar for agate.data_types.Date. This avoids the whole 'tomorrow' issue by bypassing the parsedatetime.Calendar behavior, which is locale specific and therefore an even more horrible minefield than you probably imagined. I'm pretty confident seeds should not parse differently depending upon your locale! The downside is that parsedatetime is also how agate supports such a wide range of date/datetime formats. I spent a good amount of time seeing if we could subclass Calendar and override things to only avoid natural language dates like 'tomorrow' - my takeaway was that it's a lot of work, if it's possible at all!

We should just drop agate.data_types.TimeDelta from our TypeTester. I'm sure it causes more problems than it solves (does it solve any actual problems?)

To do all this we'll have to do some plumbing work I've explored and don't think is too bad: we have to construct our TypeTester at csv-parse-time in order to tell it about the fields we want to force to text so we can cast them later. I don't think that's a big deal.

We could alternatively set the type tester to parse everything as agate.data_types.Text and cast later. I'm mostly against this because it broke... not technically every test, but close enough to really feel like it.

I'll have a PR up for some of this at a later point, but I wanted to get these thoughts out there.

…e-values Agate fixes (#999) (#1639)

beckjake · 2020-01-13T15:21:51Z

I think this is fixed in #1920 - please re-open it if that's incorrect!

drewbanin added enhancement New feature or request dependencies Changes to the version of dbt dependencies labels Jul 30, 2019

drewbanin modified the milestone: Louisa May Alcott Sep 13, 2019

drewbanin mentioned this issue Sep 18, 2019

Force agate to load custom types as text #1541

Closed

beckjake mentioned this issue Nov 13, 2019

Agate fixes (#999) (#1639) #1920

Merged

drewbanin changed the title ~~Remove agate from dbt~~ Rationalize agate in dbt Nov 20, 2019

drewbanin added this to the 0.15.2: Barbara Gittings milestone Nov 20, 2019

beckjake added a commit that referenced this issue Dec 13, 2019

Merge pull request #1920 from fishtown-analytics/feature/tighten-agat…

62e22c2

…e-values Agate fixes (#999) (#1639)

beckjake closed this as completed Jan 13, 2020

drewbanin mentioned this issue Jan 27, 2020

Cannot load the string 'null' in seed data #1466

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rationalize agate in dbt #1639

Rationalize agate in dbt #1639

drewbanin commented Jul 30, 2019 •

edited

Loading

Aylr commented Aug 6, 2019

drewbanin commented Aug 6, 2019

Aylr commented Aug 8, 2019

Aylr commented Aug 8, 2019

beckjake commented Nov 13, 2019

beckjake commented Jan 13, 2020

Rationalize agate in dbt #1639

Rationalize agate in dbt #1639

Comments

drewbanin commented Jul 30, 2019 • edited Loading

Aylr commented Aug 6, 2019

drewbanin commented Aug 6, 2019

Aylr commented Aug 8, 2019

Aylr commented Aug 8, 2019

beckjake commented Nov 13, 2019

beckjake commented Jan 13, 2020

drewbanin commented Jul 30, 2019 •

edited

Loading