Plotting large numbers of sequences/time series together: dealing with fixed length numpy arrays (#512)

* Added a  function to ds.utils to convert sequences stored as 2D numpy arrays to a pandas dataframe with NaNs separating individual sequences
* Added an example in tseries.ipynb showing the use of this function while plotting thousands of sequences together
narendramukherjee authored and jbednar committed Oct 30, 2017
1 parent 924744c commit ca9a525
Showing 2 changed files with 101 additions and 0 deletions.
31 changes: 31 additions & 0 deletions datashader/utils.py
@@ -352,3 +352,34 @@ def dshape_from_dask(df):
categoricals_in_dtypes = np.vectorize(lambda dtype: dtype.name == 'category', otypes='?')
def categorical_in_dtypes(dtype_arr):
return categoricals_in_dtypes(dtype_arr).any()

def dataframe_from_multiple_sequences(x_values, y_values):
    """
    Convert a set of multiple sequences (e.g. time series), stored as a
    2-dimensional numpy array, into a pandas dataframe that can be plotted
    by datashader. The resulting dataframe contains two columns ('x' and 'y')
    with the data, and each sequence is separated from the next by a row of
    NaNs so that no line is drawn between them.
    Discussion at: https://github.com/bokeh/datashader/issues/286#issuecomment-334619499

    x_values: 1D numpy array with the values to be plotted on the x axis (e.g. time)
    y_values: 2D numpy array with the sequences to be plotted, of shape
              (number of sequences, length of each sequence)
    """

    # Add a NaN at the end of the array of x values
    x = np.zeros(x_values.shape[0] + 1)
    x[-1] = np.nan
    x[:-1] = x_values

    # Tile this array of x values: number of repeats = number of sequences/time series in the data
    x = np.tile(x, y_values.shape[0])

    # Add a NaN at the end of every sequence in y_values
    y = np.zeros((y_values.shape[0], y_values.shape[1] + 1))
    y[:, -1] = np.nan
    y[:, :-1] = y_values

    # Return a dataframe with this new set of x and y values
    return pd.DataFrame({'x': x, 'y': y.flatten()})


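For reference, a minimal sketch of how the new helper behaves (a standalone reimplementation of the same NaN-separation logic, not part of this commit, so it can be run without datashader installed):

```python
import numpy as np
import pandas as pd

def dataframe_from_multiple_sequences(x_values, y_values):
    # Append a NaN to x, then tile it once per sequence
    x = np.append(x_values.astype(float), np.nan)
    x = np.tile(x, y_values.shape[0])
    # Append a NaN column to y, then flatten row by row
    y = np.hstack([y_values, np.full((y_values.shape[0], 1), np.nan)])
    return pd.DataFrame({'x': x, 'y': y.flatten()})

# Three sequences of length 4 sharing the same x axis
xs = np.arange(4)
ys = np.arange(12).reshape(3, 4)
df = dataframe_from_multiple_sequences(xs, ys)

# Each sequence contributes 5 rows: 4 data points plus 1 NaN separator
print(len(df))               # 15
print(df['y'].isna().sum())  # 3
```

Because each separator row has NaN in both columns at the same position, any line-drawing backend that breaks on NaN will treat the flattened data as independent curves.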
70 changes: 70 additions & 0 deletions examples/tseries.ipynb
@@ -390,6 +390,76 @@
"source": [
"Here the three groups can clearly be seen, at least once they diverge sufficiently, as well as the areas of high overlap (high probability of being in that state at that time). Additional patterns are visible when zooming in, all the way down to the individual datapoints, and again it may be useful to zoom first on the x axis (to make enough room on your screen to distinguish the datapoints, since there are 100,000 of them and only a few thousand pixels at most on your screen!). And note again that the interactive plots require a running server if you want to see more datapoints as you zoom in; static exports of this notebook won't support full zooming."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plotting large numbers of time series together\n",
"\n",
"The examples above all used a small number of very long time series, which is one important use case for Datashader. Another important use case is visualizing very large numbers of time series, even if each individual curve is relatively short. If you have hundreds of thousands of timeseries, putting each one into a Pandas dataframe column and aggregating it individually will not be very efficient. \n",
"\n",
"Luckily, Datashader can render arbitrarily many separate curves, limited only by what you can fit into a Dask dataframe (which in turn is limited only by your system's total disk storage). Instead of having a dataframe with one column per curve, you would instead use a single column for 'x' and one for 'y', with an extra row containing a NaN value to separate each curve from its neighbor (so that no line will connect between them). In this way you can plot millions or billions of curves efficiently.\n",
"\n",
"To make it simpler to construct such a dataframe for the special case of having multiple time series of the same length, Datashader includes a utility function accepting a 2D Numpy array and returning a NaN-separated dataframe. (See [datashader issue 286](https://github.com/bokeh/datashader/issues/286#issuecomment-334619499) for background.) \n",
"\n",
"As an example, let's generate 100,000 sequences, each with 10 points, as a Numpy array:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n = 100000\n",
"points = 10\n",
"data = np.random.normal(0, 100, size = (n, points))\n",
"data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a suitable Datashader-compatible tidy dataframe using the utility:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = ds.utils.dataframe_from_multiple_sequences(np.arange(points), data)\n",
"df.head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then render it as usual:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cvs = ds.Canvas(plot_height=400, plot_width=1000)\n",
"agg = cvs.line(df, 'x', 'y', ds.count()) \n",
"img = tf.shade(agg, how='eq_hist')\n",
"img"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Here, the 10 high and low peaks each represent one of the 10 values in each sequence, with lines connecting those random values to the next one in the sequence. Thanks to the `eq_hist` colorization, you can see subtle differences in the likelihood of any particular pixel being crossed by these line segments, with the values towards the middle of each gap most heavily crossed, as you would expect. You'll see a similar plot for 1,000,000 or 10,000,000 curves, and much more interesting plots if you have real data to show!"
]
}
],
"metadata": {
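As a rough sanity check on the notebook's workflow (a sketch, not part of the commit: it rebuilds the NaN-separated layout directly with numpy/pandas rather than calling datashader, and uses a smaller `n` than the notebook's 100,000 for speed):

```python
import numpy as np
import pandas as pd

n, points = 1000, 10
data = np.random.normal(0, 100, size=(n, points))

# NaN-separated tidy layout, matching what dataframe_from_multiple_sequences produces
x = np.tile(np.append(np.arange(points, dtype=float), np.nan), n)
y = np.hstack([data, np.full((n, 1), np.nan)]).ravel()
df = pd.DataFrame({'x': x, 'y': y})

# One row per data point plus one separator row per sequence
assert len(df) == n * (points + 1)
# Every (points+1)-th row is a NaN separator, so lines never connect sequences
assert df['y'][points::points + 1].isna().all()
```

The same dataframe could then be passed to `ds.Canvas.line(df, 'x', 'y', ds.count())` exactly as in the notebook cell above.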
