DOC: Proposal to remove 'Help' pages and gather and update content on 'Large Datasets' #2738

Closed
3 tasks done
binste opened this issue Dec 1, 2022 · 7 comments

Comments

@binste
Contributor

binste commented Dec 1, 2022

I have two suggestions for how we could improve the documentation even further. I'm happy to create a PR for both of them, but I first wanted to get some input on whether I'm on the right track and whether this is of general interest.

Remove the "Help" pages

These pages are currently nested under "More" -> "Help" in the top navigation bar. Their content is either duplicated or could be reorganised, which would lead to a better structure in my opinion:

  • Getting help is already fully contained in the start page (last two paragraphs)
  • Display Troubleshooting could be added as a "Troubleshooting" section at the end of Displaying Altair Charts. As a user, this is where I would look for it in the first place. Thanks to the great "table of contents" on the right side of a page, I think the navigation on "Displaying Altair Charts" would still be good despite the increased length
  • FAQ:
    • "Does Altair work with IPython Terminal/PyCharm/Spyder/": all of the information here is already contained in Displaying Altair Charts
    • "I tried to make a plot but it doesn’t show up": All information is contained in "Display Troubleshooting"
    • The remaining questions deal with large datasets. See next section on how this could be handled

Gather content on "Large Datasets" in one place

There is already some content in the documentation on how to deal with larger datasets. There have also been new developments in third-party packages such as vl-convert and vegafusion, which open up possibilities that are not yet documented. I think it would make sense to gather this information in one place, especially as I don't think it is obvious and there are many options by now with various tradeoffs. I'd suggest doing this in a new page called "Large Datasets" under "Advanced Usage" and linking to it from Specifying Data.

Existing pages from which some content could be consolidated:

  • Last 2 questions in FAQ: Explain the problem with large notebooks and mention disabling the MaxRowsError, the json transformer, etc.
  • Data Transformers: Has more information on data transformers, which is relevant for dealing with larger datasets. We could keep this page but reference it from "Large Datasets" when talking about the json transformer.
  • Customizing Renderers: This might not be an obvious candidate for dealing with large datasets for many users. However, thanks to vl-convert we can set the renderer to "svg"; all calculations then happen on the server and only the SVG image is transmitted. This is of course still not as fast as reimplementing the transformations in Python (e.g. altair_transform), but it gives the full functionality and gets you pretty far. This is something I'm very excited about: for data exploration I want to be able to quickly create some charts without having to worry about too much data being transmitted, and therefore without having to preaggregate data and drop redundant columns (FEAT: Include only referenced data columns in chart specs #2586). I often also don't need the interactivity and am fine with a static image to start out with.

So the new page could start with some explanations of why large datasets can be challenging and then discuss the following recommendations (in this order?) with their pros and cons:

  1. Native Altair: Use json transformer and potentially preaggregate data
  2. Third-party package but minimal dependencies: As an alternative, use vl-convert and svg renderer if you don't need any interactivity
  3. If you have very large datasets and want to use the interactivity features, take a look at VegaFusion. Could also mention altair_transform as an alternative for simple charts.
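
To make the "preaggregate data" part of the first recommendation concrete, here is a minimal sketch in plain Python. It is not taken from the Altair docs; the helper name `preaggregate` and the record keys `bin_start` and `count` are made up for illustration. The idea is only that the chart receives one row per bin instead of one row per observation.

```python
from collections import Counter


def preaggregate(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`.

    Hypothetical helper: returns one record per non-empty bin, which is
    what would be handed to the chart instead of the raw rows.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return [{"bin_start": start, "count": n} for start, n in sorted(counts.items())]


# 100,000 raw rows collapse into 10 records that a bar chart can consume.
raw = [i % 100 for i in range(100_000)]
binned = preaggregate(raw, bin_width=10)
print(len(raw), "->", len(binned))
```

The resulting list of records (or a DataFrame built from it) could then be passed to the chart as its data, so only the aggregated rows end up in the chart spec. The tradeoff is that the binning is fixed up front and can no longer be changed interactively in the browser.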

Please follow these steps to make it more efficient to respond to your feature request.

  • Since Altair is a Python wrapper around the Vega-Lite visualization grammar, most feature requests should be reported directly to Vega-Lite. You can click the Action Button of your Altair chart and "Open in Vega Editor" to create a reproducible Vega-Lite example.
  • Search for duplicate issues.
  • Describe the feature's goal, motivating use cases, and its expected behavior.
@joelostblom
Contributor

Thanks for these two suggestions and all your help with the docs lately! 🚀

I agree with you that there is duplicated info in the Help pages. I am leaning towards still keeping a visible Help indicator in some form, because it makes it easier for new readers to discover the help, which I think is important. It doesn't have to be in the navbar, but I don't have a better idea myself currently. I could possibly see moving it under "Getting started", but I'm not sure... I do agree that it is not ideal to have the hidden "More" section. We could probably reorganize this since we are also planning to add a Resource page (although that might be merged with ecosystems #2415).

I am very much in favor of adding a page for working with large data and moving things out of the help pages into it (leaving just a link on the help page). However, I don't think we should recommend the json transformer. I remember Jake mentioning that in a recent comment that I can't find, but I found this older one that has the same message. In general, I think the data_server transformer or possibly VegaFusion should be the first recommendation for working with big data in Altair. I haven't had time to play around with VegaFusion yet, but maybe @jonmmease can fill us in on whether there are any big gaps compared to the data server transformer (not working with dashboarding frameworks such as Panel seems to be one, from reading the docs). If my understanding is correct, altair-transform is not that complete and hasn't been updated in a while, so I'm hesitant about making it a primary recommendation.

Related issues:

@jonmmease
Contributor

Hi @joelostblom, VegaFusion's support for the Vega specs that Vega-Lite generates is fairly complete. Transforms that are not supported are left in the Vega spec that the Vega renderer handles, so it falls back gracefully in these situations.

The biggest ecosystem limitation is that VegaFusion currently depends on a custom Jupyter Widget to render the resulting Vega specs and communicate with the VegaFusion runtime.

Part of my motivation for writing vl-convert is that I'd like to add support to VegaFusion for pre-evaluating and optimizing transforms on the server so that the resulting Vega specs can be rendered by regular Vega mimetype renderers.

I'll certainly be interested in your feedback when you have a chance to try it out!

@binste
Contributor Author

binste commented Dec 15, 2022

Thank you Joel for the feedback! I started the implementation in #2755

@jonmmease I find the developments around VegaFusion very exciting and vl-convert is a huge upgrade for Altair in terms of usability so thank you very much for putting in all this effort! Looking forward to what comes next, especially the combination of vl-convert and VegaFusion that you described.

@jonmmease
Contributor

@binste FYI, I'm working on a VegaFusion PR over in vega/vegafusion#195 that implements the workflow described above.

@binste
Contributor Author

binste commented Dec 29, 2022

Wow, that looks very, very exciting, thank you @jonmmease for putting in the work to make this happen! Seems like a much more complete altair_transform. Great screencast and PR documentation; they clearly explain what's happening and how it differs from the existing widget-based approach. I really like that the data is fully inlined and that it produces standalone Vega specs. It might also solve the need for #2586.

Once you release a new version, I'll definitely try this out, maybe even use it as a default when working with Altair if it does not have any downsides.

Closing this issue now, but if we all feel comfortable with it once the new version of VegaFusion is out, I think we should also promote VegaFusion and specifically this approach much more on the new "Large Datasets" page in the documentation.

@binste binste closed this as completed Dec 29, 2022
@jonmmease
Contributor

Might also solve the need for #2586

Yeah, I think it does. In addition to pre-applying data transforms, VegaFusion has fairly complete support for projection pushdown (removing unused columns as early as possible in the data pipeline). So even for non-aggregated charts (e.g. scatter plots) it can reduce the bundle size by trimming out unused columns.
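
As a rough illustration of the projection pushdown idea described above (and of what #2586 asks for), the sketch below drops every column that a chart's encoding does not reference before the data would be serialized. This is not VegaFusion's implementation, which works on the full Vega spec; the helper names `used_fields` and `prune_columns` and the simplified encoding dict are assumptions made for this example.

```python
def used_fields(encoding):
    """Collect the field names referenced by a (simplified) encoding dict."""
    return {channel["field"] for channel in encoding.values() if "field" in channel}


def prune_columns(rows, encoding):
    """Keep only the columns that the encoding actually references."""
    keep = used_fields(encoding)
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


# A scatter plot of x against y never touches "label" or "notes",
# so those columns do not need to be shipped with the chart.
rows = [
    {"x": 1, "y": 2.5, "label": "a", "notes": "long free text ..."},
    {"x": 2, "y": 3.1, "label": "b", "notes": "more free text ..."},
]
encoding = {
    "x": {"field": "x", "type": "quantitative"},
    "y": {"field": "y", "type": "quantitative"},
}
pruned = prune_columns(rows, encoding)
print(pruned)
```

A real implementation also has to account for fields used in transforms, selections, tooltips and so on, which is why this only covers the simplest case.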

@joelostblom
Contributor

Wow, this is so helpful! Thanks for implementing that functionality and for pinging us here!
