Rationalize agate in dbt #1639
Comments
I'm happy to pitch in here after dealing with this in the new |
Hey @Aylr - I appreciate your support :) This is going to be a pretty sizable change to dbt and I want to make sure we treat it with the respect it deserves! I think a good place to start is: What should we use instead? dbt operates on tabular data (either from loaded up seed files, or from the results of queries). I think there are parts of the agate API that are sensible. I like that you can operate on columns, or iterate over rows in the data frame, for instance. Before we picked agate, we considered using something like Pandas. In practice, Pandas is a real bear to install, so we chose not to ship that with dbt. Are you aware of any existing "data frame" libs that are easy to install, or do you think this is something that makes sense for us to build ourselves? |
@drewbanin I totally agree that this should be very thoughtful! From my perspective (which is likely different from that of the majority of the dbt target audience as I understand it - so I want to call out that bias), I think that choosing something other than pandas doesn't make much sense, since it is effectively the de facto dataframe library. I do want to make sure that we keep installation as easy as possible for everyone who wants to try dbt. However, from where I sit (data science & engineering), installing pandas is not an issue. But again, I do want to be thoughtful about that. |
And FWIW pandas has all those kinds of operators and a wide user base to draw expertise from (myself included). |
I've been looking into this, and I have a few thoughts! First, removing csv table parsing should know more about the user's schema.yml settings, in particular whether column types are defined. If they are, we should tell agate that the column in question is "text", and then do a cast. It does appear possible to tell agate "these columns are fixed, otherwise infer types", so it's not impossible. We could try to get extra clever and use the specified type to guess what to use, and then pass that to agate, but I'm sure that we'd just mess that up some percentage of the time, so I'd prefer not to. We should come up with a date format (or set of formats) we allow in seeds and not support anything else. I'm in favor of […]. We should just drop […]. To do all this we'll have to do some plumbing work that I've explored and don't think is too bad: we have to construct our […]. We could alternatively set the type tester to parse everything as […]. I'll have a PR up for some of this at a later point, but I wanted to get these thoughts out there. |
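To make the "these columns are fixed, otherwise infer types" idea concrete, here is a rough standard-library sketch. It is not dbt's or agate's code: `load_seed`, `infer_value`, `ALLOWED_DATE_FORMATS`, and the sample seed are all hypothetical names invented for illustration, and the single ISO date format is only an assumption, since the thread does not settle on one.

```python
import csv
import io
from datetime import datetime

# Hypothetical allow-list of seed date formats (an assumption, not dbt's choice).
ALLOWED_DATE_FORMATS = ("%Y-%m-%d",)


def infer_value(raw):
    """Deliberately un-clever inference: int, float, strict date, else text."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    for fmt in ALLOWED_DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            pass
    # Anything that matched none of the above stays a plain string.
    return raw


def load_seed(csv_text, text_columns=()):
    """Parse CSV text; columns listed in text_columns are never inferred."""
    forced = set(text_columns)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            name: (raw if name in forced else infer_value(raw))
            for name, raw in row.items()
        }
        for row in reader
    ]


# Hypothetical seed file: zip_code is declared as text by the user.
seed = "id,zip_code,due\n1,02134,2019-07-25\n2,10001,tomorrow\n"
rows = load_seed(seed, text_columns=["zip_code"])
```

With this sketch the forced `zip_code` column keeps its leading zero, and a value like `tomorrow` stays plain text because it matches none of the allowed formats. In agate itself, the closest mechanism I'm aware of is the `force` argument to `agate.TypeTester`, which pins the types of named columns and infers the rest.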
I think this is fixed in #1920 - please re-open it if that's incorrect! |
List of open (or recently closed) issues pertaining to Agate:
tomorrow
will cause dbt to fail

Let's rip out agate where it's not needed and make it generally less clever (type inference) where we can.