Alternatives to custom data management interfaces #35
LOVE IT!
I stumbled across this one time and was intrigued. https://github.com/wq/django-data-wizard
I took a quick look at it. The major problem I ran into is a non-intuitive workflow coupled with a lack of docs. The only documentation I could find is the README of the repo; it provides some screenshots of the app under an "API documentation" section, but it doesn't actually explain what the major entities are or how to configure them to perform an upload from start to finish. There are a couple of different entities available for editing, including Identifiers, Runs, Serializers, and Loaders, and after reading the docs and poking around for half an hour I still don't understand how they relate to one another. Another issue is that the docs appear to imply that the data wizard assumes a 1:1 relationship between a spreadsheet and a particular model instance.
This would mean that we can't upload additional spreadsheets and use them to create new entities. It also would mean that a spreadsheet has to fully encompass the attributes of the model instance. We might be able to get around these requirements in the UofM case, but it seems restrictive and non-intuitive to me. One thing I did really like is that the uploader can use Celery as a task queue backend, but by default it just uses Python threads. That default is pretty compelling to me, particularly in a containerized deployment context, where we'd need entirely separate Celery and Redis containers if we wanted to run a queue. Multiprocessing may actually be a better choice than threads if we can pull off the complexity, but I'm definitely inspired to try this approach for UofM.
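For concreteness, here is a rough sketch of the dispatch pattern described above. This is not django-data-wizard's actual API; the function and argument names are hypothetical, and it only illustrates why a threads-by-default design avoids extra Celery and Redis containers.

```python
# Hypothetical sketch, not django-data-wizard's API: run import jobs on a
# daemon thread by default, and only hand off to Celery when a task is provided.
import threading


def dispatch_import(import_fn, *args, celery_task=None):
    """Run an import job via a Celery task if one is configured, else a thread."""
    if celery_task is not None:
        # celery_task would be a @shared_task; this path needs separate
        # worker and broker (e.g. Redis) containers.
        celery_task.delay(*args)
    else:
        # Single-container deployment: hand the work to a background thread.
        threading.Thread(target=import_fn, args=args, daemon=True).start()
```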
Note that we're trialling Airflow for LA Metro; we may have different thoughts about this after that project is complete.
Barring an all-in-one solution, I think it would be useful to chunk off work from this issue based on pain points in the past. One big one is schema validation and, beyond that, useful error messages for users when their data doesn't meet expectations. This was a part of the IHS data import, and will play a role in the Dedupe.io API refactor and BGA Salaries Phase II. I'd like to spend a day working with the schema validation libraries I outline here [internal link], selecting the one that's most comfortable, and setting out a few patterns for common operations, e.g., defining a schema and handling exceptions when expectations are violated. Separate, but related: @jeancochrane mentioned schema locking and foreign key import as functionality to investigate in Airtable, potentially as a replacement for the Google Sheets workflow we've used a couple of times.
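As a concrete starting point, here is a minimal sketch of the "define a schema, surface useful errors" pattern. It uses marshmallow purely as a stand-in for whichever validation library we settle on, and the schema and field names are hypothetical.

```python
# Minimal sketch of schema validation with row-level error reporting.
# marshmallow is a stand-in here; SalaryRowSchema and its fields are hypothetical.
from marshmallow import Schema, ValidationError, fields, validate


class SalaryRowSchema(Schema):
    employee_name = fields.String(required=True)
    department = fields.String(required=True)
    annual_salary = fields.Decimal(required=True, validate=validate.Range(min=0))


def validate_rows(rows):
    """Validate each incoming row, collecting human-readable errors keyed by row number."""
    schema = SalaryRowSchema()
    errors = {}
    for line_no, row in enumerate(rows, start=1):
        try:
            schema.load(row)
        except ValidationError as exc:
            # e.g. {"annual_salary": ["Not a valid number."]}
            errors[line_no] = exc.messages
    return errors
```

The resulting errors dict can then be rendered back to the user as "row 12: annual_salary is not a valid number" rather than a bare stack trace.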
There are a lot of really smart things going on in @jeancochrane's importer for the Minnesota Elections Archive: https://github.com/datamade/mn-election-archive. Chief among them is a nice mechanism for logging and reporting errors to the user. I also like that they defined extract, transform, and load management commands, so the processing steps remain separate, but we can still tackle them in Python rather than teeing off a Make process. Overall, the code is simple, elegant, well-organized, and, as a result, easy to follow and use! I'm curious how the client will find this interface. If he likes it, I'd be interested in using some research time to spike out some of these patterns in the payroll import. It'd be super high value to the client to be able to manage that data themselves on a recurring and reliable basis.
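To illustrate the separation described above, here is a minimal sketch of an orchestrating Django management command. The command names (extract, transform, load) and module path are hypothetical, not the actual commands from mn-election-archive.

```python
# Hypothetical sketch of splitting ETL into Django management commands, e.g.
# myapp/management/commands/import_data.py. Each step is its own command, so
# it can be run (and re-run) independently, while orchestration stays in Python.
from django.core.management import call_command
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run extract, transform, and load as separate steps."

    def add_arguments(self, parser):
        parser.add_argument("--skip-extract", action="store_true",
                            help="Reuse previously downloaded raw files.")

    def handle(self, *args, **options):
        if not options["skip_extract"]:
            call_command("extract")    # download raw source files
        call_command("transform")      # clean and reshape into load-ready tables
        call_command("load")           # upsert into Django models, logging row errors
```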
We might consider taking a look at https://django-import-export.readthedocs.io/en/latest/.
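For reference, a basic django-import-export setup looks roughly like the following; the Payroll model and its fields are hypothetical. The library adds import/export actions to the Django admin, including a preview of incoming rows before anything is committed.

```python
# Rough sketch of a django-import-export setup; the Payroll model is hypothetical.
from django.contrib import admin
from import_export import resources
from import_export.admin import ImportExportModelAdmin

from myapp.models import Payroll  # hypothetical model


class PayrollResource(resources.ModelResource):
    class Meta:
        model = Payroll
        import_id_fields = ("employee_id",)  # match uploaded rows to existing records
        fields = ("employee_id", "name", "department", "annual_salary")


@admin.register(Payroll)
class PayrollAdmin(ImportExportModelAdmin):
    resource_classes = [PayrollResource]
```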
@fgregg will write up some thoughts on splitting this up into the key requirements for a system.
so, the client wants a custom data management system. This has typically meant two things.
Let's first talk about 2. Here's my imagination of what the client wants.
TK
I had a case matching scenario 1 from @fgregg's comment above with Lugar. We have a single model with a couple of fields the client needs to edit themselves. This access was quick and easy to set up with Wagtail's native ModelAdmin. If they'd needed a bulk import process of any kind, this would have been more complex. It looks like ModelAdmin can support foreign key relationships, but I haven't tried to set that up. For the relatively simple need of editing a couple fields on a single model for a Wagtail site, this approach worked great.
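For reference, the Wagtail ModelAdmin registration for a case like this is only a few lines; the Grant model and field names below are hypothetical stand-ins, not the actual Lugar models.

```python
# Hypothetical sketch of registering a model with Wagtail's ModelAdmin
# (wagtail_hooks.py) so clients can edit its fields from the Wagtail admin.
from wagtail.contrib.modeladmin.options import ModelAdmin, modeladmin_register

from .models import Grant  # hypothetical model the client edits


class GrantAdmin(ModelAdmin):
    model = Grant
    menu_label = "Grants"
    list_display = ("title", "amount", "year")
    search_fields = ("title",)


modeladmin_register(GrantAdmin)
```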
@beamalsky is investigating Airtable with South Side Weekly; if it happens, we'll see if there are any lessons for this issue.
I made SSW a Gatsby site with an Airtable backend and it worked beautifully! Repo here and final site here. This was a case that falls under category #1 of Forest's comment above. The Weekly is already very comfortable using Airtable, so this stack was ideal for them. This is what the backend Airtable looks like:

[screenshot of the backend Airtable]

Unlike a CMS like Wagtail, there's no easy way for users to update the Airtable and have it reflected immediately on the site. This is also a limitation of something like Google Sheets; for the CPS SSCE dashboard we've just set up a crontask that pulls in new data from the sheet every night. While the Weekly was actively editing the Airtable with new content, I set up a GitHub Action to rebuild the site every 5 minutes, but that would run through Netlify's free build minutes very quickly if always left on. In retrospect, I could have given them access to the Netlify account and shown them how to rebuild the site when they wanted to see an update. I think that approach would be worth trying for a relatively tech-comfortable client but would probably be alienating to others.

An attractive part of Airtable is that it has a concept of foreign keys to link multiple tables. It seems like this is better for us than something like Google Sheets, which has no built-in way to link records, but the Airtable API can't currently return true nested GraphQL queries: the relationships get lost and the foreign key field just returns an ID. I didn't need multiple tables for this project and haven't explored this deeply, but it would be key for adopting Airtable in a project with a more complex data structure. David Eads is currently exploring BaseQL to improve on the Airtable API's handling of foreign keys.
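To make the foreign key limitation concrete, here is a small sketch against the Airtable REST API (the base ID, table, and field names are hypothetical). Linked record fields come back as lists of opaque record IDs, so resolving a relationship means extra requests per linked table or a layer like BaseQL on top.

```python
# Hypothetical sketch: fetching records from the Airtable REST API to show how
# linked-record ("foreign key") fields are returned.
import requests

API_KEY = "YOUR_AIRTABLE_API_KEY"   # placeholder credential
BASE_ID = "appXXXXXXXXXXXXXX"       # placeholder base ID

response = requests.get(
    f"https://api.airtable.com/v0/{BASE_ID}/Articles",  # hypothetical table
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()

for record in response.json()["records"]:
    # A linked-record field returns a list of record IDs, e.g. ["recAbC123..."],
    # not the nested rows themselves; each linked table needs its own request.
    print(record["fields"].get("Authors"))
```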
Clients often ask us to build them a custom data management interface (CPS SSCE, SFM, Erikson, BGA, etc.). Very rarely does the client wind up using the interface successfully without substantial help from DataMade.
Investigate alternative approaches to this problem by choosing one and refactoring an existing project to use a different data CMS. Some possible approaches might be: