Kafka integration #84

Open
wants to merge 14 commits into
base: master

geoffreyaldebert
Contributor

  • Kafka integration (consumer only)
  • Read messages from udata-analysis-service
  • Parse the file (it could be read from minio instead of downloading the resource again)
  • Add csv-detective type detection to help agate store the resource in sqlite
  • Add a (minimal) pandas profiling analysis and generate a JSON report
  • Store the new information in sqlite, in new tables:
    • general_infos: basic info on the resource
    • column_infos: basic info on each column of the resource
    • categorical_infos: categorical values for each column (limited to 10)
    • top_infos: top values for each column (limited to 10)
    • numeric_infos: basic info on each numeric column of the resource (mean, std, min, max)
    • numeric_plot_infos: distribution of the values of each numeric column, for plotting
  • Update the API to list this new information when it is available
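
To make the storage step above concrete, here is a minimal sketch of how per-column statistics could be written to two of the tables listed (`numeric_infos` and `top_infos`). It is not the code from this PR: it uses only the standard library instead of pandas profiling, and the function name `store_profile` and every column name in the schemas are assumptions made for illustration.

```python
import sqlite3
import statistics
from collections import Counter


def store_profile(conn, resource_id, columns):
    """Store simple per-column statistics for one resource.

    `columns` maps a column name to its list of parsed values.
    The table names come from the PR description; the schemas are assumed.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS numeric_infos ("
        "resource_id TEXT, column_name TEXT, "
        "mean REAL, std REAL, min_value REAL, max_value REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS top_infos ("
        "resource_id TEXT, column_name TEXT, value TEXT, occurrences INTEGER)"
    )
    for name, values in columns.items():
        present = [v for v in values if v is not None]

        # Top values for the column, capped at 10 as in the description.
        for value, occurrences in Counter(present).most_common(10):
            conn.execute(
                "INSERT INTO top_infos VALUES (?, ?, ?, ?)",
                (resource_id, name, str(value), occurrences),
            )

        # Only columns whose non-null values are all numbers get numeric stats.
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if numeric and len(numeric) == len(present):
            # Sample standard deviation needs at least two values.
            std = statistics.stdev(numeric) if len(numeric) > 1 else 0.0
            conn.execute(
                "INSERT INTO numeric_infos VALUES (?, ?, ?, ?, ?, ?)",
                (resource_id, name, statistics.fmean(numeric), std,
                 min(numeric), max(numeric)),
            )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_profile(conn, "demo", {"age": [1, 2, 3, None], "city": ["a", "a", "b"]})
    print(conn.execute("SELECT * FROM numeric_infos").fetchall())
```

Keeping each kind of statistic in its own table, keyed by resource and column, lets the API return only the sections that exist for a given resource.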

#url = r.json()['url']
if((message is not None) & (message['service'] == 'csvdetective')):
#try:
url = 'https://www.data.gouv.fr/fr/datasets/r/{}'.format(key)

Why do we need to build a URL instead of using the minio location?
This data.gouv.fr URL is environment-dependent (dev / demo / prod).
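
One possible way to address this, sketched under assumptions: prefer a storage location carried by the message, and fall back to a configurable base URL only when it is missing. The `location` field, the `resolve_resource_url` helper and the `DEFAULT_RESOURCE_BASE` constant are hypothetical; the actual message structure is not shown in this thread. The sketch also uses `isinstance` and `.get()` rather than `&`, since `&` does not short-circuit and would still evaluate `message['service']` when `message` is `None`.

```python
from typing import Optional

# Assumed default; in practice this would come from per-environment settings.
DEFAULT_RESOURCE_BASE = "https://www.data.gouv.fr/fr/datasets/r/"


def resolve_resource_url(message, key, base_url=DEFAULT_RESOURCE_BASE) -> Optional[str]:
    """Return where to fetch the resource from, or None if the message is not for us."""
    # Reject missing or malformed messages before reading any field.
    if not isinstance(message, dict) or message.get("service") != "csvdetective":
        return None

    # Hypothetical field: a storage (e.g. minio) location provided by the producer.
    location = message.get("location")
    if location:
        return location

    # Fallback: build the URL from a configurable, environment-specific base.
    return base_url.rstrip("/") + "/" + key


if __name__ == "__main__":
    print(resolve_resource_url({"service": "csvdetective"}, "some-resource-id"))
```

With this shape, switching between dev, demo and prod only requires changing the configured base URL, and no URL is built at all when the message already says where the file is.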

* Switch to poetry

- Explicitly upgrade to python >= 3.9
- Upgrade pandas and pandas-profiling

* Cleanup

- Overhaul CI file
- Remove useless files
- Update License attribution
- Remove obsolete ansible roles

* trigger CI

* fix local tests on macos

* add linting

* add linting

* upgrade flake8 and pytest

* lint all the thingz

* fix tests with strict asyncio mode

* really really fix the tests

* Update README

* Use CI template, publish kafka-integration, bump 1.3.0

* poetry update

* invalidate cache

* CI: cache-prefix param
@abulte
Contributor

abulte commented Aug 26, 2022

This branch is now published on PyPI: https://app.circleci.com/pipelines/github/etalab/csvapi/91/workflows/09dba6e2-b91f-4cf2-af03-71a9daee9bbb/jobs/605

⚠️ Remove this publication once merged into master

* Check message structure to prevent errors

* Add pandas profiling analysis (optional) in api

* Update requirements (csv-detective)

* Update message structure format

* Add requirements

* Remove requirements, switch to poetry

* Add poetry lock file

* Lint code

* setuptools

* upgrade and clean deps

* lint test

Co-authored-by: Geoffrey Aldebert <[email protected]>
Co-authored-by: Alexandre Bulté <[email protected]>
3 participants