Plotting large numbers of sequences/time series together: dealing with fixed length numpy arrays #512
Conversation
…mpy arrays to a pandas dataframe with NaNs separating individual sequences 2.) An example in tseries.ipynb showing the use of this function while plotting thousands of sequences together
@jbednar @philippjfr Can you take a look and let me know what you think?
Thanks for the PR! In most of our use cases for datashading curves, we want to be able to distinguish between the curves, which is only feasible for up to a few dozen curves if we use count_cat to colorize them. Here there doesn't seem to be a way to convey the identity of each curve, but it seems like you have an application in mind where that doesn't matter? E.g. maybe you could talk about how this approach lets you discover underlying periodicities in pseudorandom number generators? That's what it looks like your example is showing.
My specific use case involves looking at 1-2 ms long voltage traces from neurons (action potentials) and determining whether they are coming from the same neuron. Each neuron produces stereotypical action potentials, and plotting all the action potentials recorded on an electrode on top of each other lets us know whether they are coming from one or multiple neurons. So, yes, in my case the exact identity of each curve doesn't really matter. Check out our SciPy paper again for severely overplotted examples of this kind: http://conference.scipy.org/proceedings/scipy2017/narendra_mukherjee.html

I think this sort of use case isn't that uncommon: the original question in #286 was trying to achieve exactly this sort of thing. I just used a pseudorandom number generator as an easy way to generate 'dummy' data of the kind I am plotting. I could just as well have used my specific use case, with action potentials from a neuron as an example, but that would mean including some actual data that I have recorded to make those plots work, and I didn't know how to do that with an IPython notebook. Let me know what you think!
I had forgotten that you were the one with the SciPy paper, which I do remember now! A use case something like that was what I was imagining. But in that case, won't you want to know the identity of the inappropriately sorted curves, the ones with shapes suggesting that they are not action potentials from this neuron, so that you can exclude them from the group? I agree that a visualization like this is a good first step, to at least be able to see them, but if it were my data I'd immediately want to start pulling out the outlier curves and see why they ended up in this bucket inappropriately, which is difficult if I can't identify them.

Maybe in practice what you do is just adjust some threshold, never dealing with individual curves by name or id? In that case, a good visualization would be to overlay a datashaded plot of the traces included by the threshold in one color over a datashaded plot of the excluded ones in another color, adjusting the threshold until the two groups are quite visibly distinct. Doing that shouldn't require anything further from datashader, but it sure seems like it would be helpful to have an example showing a workflow like this. I wonder if there's a good way to do that with synthetic data: synthesize a bunch of curves from different categories, pool them all together, show how to use datashader to see visually that the categories exist, and then adjust thresholds until a clustering algorithm correctly sorts out each category. Hmm; probably too ambitious, so I guess I should just merge this utility as-is and think about that later!
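The threshold-and-overlay workflow suggested above could be sketched roughly as follows. This is not code from the PR; the function name `split_by_threshold` and the peak-amplitude criterion are made up for illustration, and the datashader rendering step is indicated only in comments so the sketch stays dependency-light:

```python
# Hypothetical sketch of the threshold workflow discussed above:
# split candidate waveforms by peak amplitude, then datashade the
# two groups in different colors.
import numpy as np

def split_by_threshold(waveforms, peak_threshold):
    """waveforms: (n, m) array; keep curves whose peak amplitude
    exceeds the threshold, and separate out the rest."""
    peaks = np.abs(waveforms).max(axis=1)
    keep = peaks >= peak_threshold
    return waveforms[keep], waveforms[~keep]

rng = np.random.default_rng(1)
# Synthetic data: 80 spike-like curves plus 20 low-amplitude noise curves.
spikes = 5.0 * np.sin(np.linspace(0, np.pi, 40)) + rng.normal(0, 0.3, (80, 40))
noise = rng.normal(0, 0.5, (20, 40))
included, excluded = split_by_threshold(np.vstack([spikes, noise]), 3.0)
print(included.shape[0], excluded.shape[0])  # → 80 20

# Each group would then be converted to a NaN-separated dataframe and
# rendered separately, e.g. with Canvas.line aggregates shaded in two
# different colors and overlaid, adjusting the threshold until the
# groups look visibly distinct.
```

Adjusting `peak_threshold` interactively and re-rendering the two overlays is the feedback loop described in the comment above.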
Ok, I tidied up the example notebook a bit to remove extraneous changes and to use an example where each datapoint was countable for clarity, and merged it. Thanks for your contribution! |
…h fixed length numpy arrays (#512)

* Added a function to ds.utils to convert sequences stored as 2D numpy arrays to a pandas dataframe with NaNs separating individual sequences
* Added an example in tseries.ipynb showing the use of this function while plotting thousands of sequences together
Added a function in ds.utils to convert time series/sequences stored as 2D numpy arrays to a dataframe with NaN separators between individual sequences. Also added an example in tseries.ipynb showing the use of this function. This is in response to issue #286.
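The NaN-separator idea behind the utility can be sketched in plain numpy/pandas. This is not the exact implementation added to ds.utils (the name `sequences_to_dataframe` is hypothetical); it only illustrates the technique: flatten an `(n_sequences, seq_len)` array into one long x/y column pair, inserting a NaN row after each sequence so a line renderer treats them as separate curves:

```python
# Sketch of the NaN-separator technique (illustrative, not the
# ds.utils implementation).
import numpy as np
import pandas as pd

def sequences_to_dataframe(ys):
    """ys: 2D array, one sequence per row, all the same length."""
    n, m = ys.shape
    x = np.arange(m, dtype=float)
    # Append a NaN column to every sequence, then flatten row-major,
    # so each curve ends with a (NaN, NaN) separator row.
    ys_nan = np.column_stack([ys, np.full(n, np.nan)])
    xs_nan = np.tile(np.append(x, np.nan), n)
    return pd.DataFrame({"x": xs_nan, "y": ys_nan.ravel()})

rng = np.random.default_rng(0)
df = sequences_to_dataframe(rng.normal(size=(1000, 50)))
# 1000 sequences of length 50, each followed by one NaN separator row:
print(len(df))               # → 51000
print(int(df["y"].isna().sum()))  # → 1000
```

The resulting single dataframe can then be passed to datashader's line rendering in one call, rather than plotting thousands of curves individually.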