Scatter Plot Matrix (aka SPLOM) discussion #2372

Closed
etpinard opened this issue Feb 15, 2018 · 21 comments · Fixed by #2505
Labels
feature something new

Comments

@etpinard
Contributor

etpinard commented Feb 15, 2018

SPLOMs are coming to plotly.js.

For the uninitiated, docs on the Python API scatterplotmatrix figure factory are here. Seaborn calls it a pairplot. Matlab has a plotmatrix function.

Some might say that SPLOMs are already part of plotly.js: all we have to do is generate traces for each combination of variables and plot them on an appropriate axis layout (example).

But, this technique has a few limitations and inconveniences:

  • data arrays are duplicated, which impacts performance when the number of variables and/or the data array lengths are large
  • creating the axes layout and correctly linking the scatter traces is tedious. Note that the python api exposes a few tools to make this smoother, but these aren't available to plotly.js users.
  • ....
  • feel free to edit and append this list

Numerous solutions are available. This issue will attempt to spec out the best one.
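
To make the duplication point concrete, here is a rough sketch (plain JavaScript, not plotly.js internals) that builds one scatter trace per pair of variables and counts how many data values end up in the serialized figure. The helper names `buildPairwiseTraces` and `countSerializedValues` are made up for illustration, and the shared-axis naming (one x axis per column, one y axis per row) is only one possible layout.

```javascript
// Build one scatter trace per (row, col) pair of variables.
function buildPairwiseTraces(fields) {
  var n = fields.length;
  var data = [];
  for (var row = 0; row < n; row++) {
    for (var col = 0; col < n; col++) {
      data.push({
        type: 'scatter',
        mode: 'markers',
        x: fields[col],
        y: fields[row],
        // one x axis per column, one y axis per row
        xaxis: 'x' + (col ? col + 1 : ''),
        yaxis: 'y' + (row ? row + 1 : '')
      });
    }
  }
  return data;
}

// Count the data values present once the figure is serialized to JSON.
function countSerializedValues(data) {
  var figure = JSON.parse(JSON.stringify(data));
  return figure.reduce(function(sum, trace) {
    return sum + trace.x.length + trace.y.length;
  }, 0);
}

var fields = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]];
var traces = buildPairwiseTraces(fields);
console.log(traces.length);                 // N * N traces
console.log(countSerializedValues(traces)); // 2 * N * N * (values per column)
```

With N variables, every column is serialized 2N times (N times as x, N times as y), which is the growth the first bullet above refers to.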

cc @dfcreative @alexcjohnson @cldougl @chriddyp

@etpinard
Contributor Author

etpinard commented Feb 15, 2018

Solution 1 (aka splom overlord)

Add a new do-it-all splom (and possibly a splomgl too) trace type that generates its own internal scatter traces and its own axes, with an API similar to parcoords:

trace = {
  dimensions: [{
     values: [/* */],
     // some scatter style props ...
     // some axis props reused from cartesian axes
  }],
  // some splom-wide options e.g.:
  showdiagonal: true || false,
  showupperhalf: true || false,
  showlowerhalf: true || false,
  direction: 'top-left-to-bottom-right' || 'bottom-left-to-top-right',
  // ...
}

PROs

  • simple cases are easy to make

CONs

  • not compatible with other cartesian traces (e.g. cannot overlay additional traces on a particular subplot)
  • not compatible with data-ref layout features (e.g. annotations, shapes and images)

@etpinard
Contributor Author

etpinard commented Feb 15, 2018

Solution 2 (tooling)

Port make_subplots and append_traces from the Python API to plotly.js (docs). For example:

var Plotly = require('plotly.js')

var fields = [
   [/* */],
   [/* */],
   // ...
]

var layout = Plotly.makeSubplots({rows: fields.length, cols: fields.length})
var data = []

for (var i = 0; i < fields.length; i++) {
  for (var j = 0; j < fields.length; j++) {
    var trace = {
        mode: 'markers',
        x: fields[i],
        y: fields[j]
    }
    Plotly.linkToSubplot(trace, i, j)    
    data.push(trace)
  }
}

Plotly.newPlot(gd, data, layout)

PROs

  • easy subplot generation
  • does not restrict the user: other trace types and layout features can still be added

CONs

  • still somewhat tedious trace-to-subplot linking
  • does not address the duplicate array problem
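
For illustration only, the two proposed helpers could look roughly like the sketch below. Neither `makeSubplots` nor `linkToSubplot` exists in plotly.js; both are hypothetical stand-ins written as plain functions, and the sketch ignores gaps between cells and axis anchors, which a real port would need to handle.

```javascript
// Axis id used in traces ('x', 'x2', ...) and layout key ('xaxis', 'xaxis2', ...).
function axisId(letter, index) {
  return letter + (index ? index + 1 : '');
}
function axisKey(letter, index) {
  return letter + 'axis' + (index ? index + 1 : '');
}

// Hypothetical stand-in for the proposed Plotly.makeSubplots:
// splits the plot area evenly, one x axis per column and one y axis per row.
function makeSubplots(opts) {
  var layout = {};
  for (var c = 0; c < opts.cols; c++) {
    layout[axisKey('x', c)] = {domain: [c / opts.cols, (c + 1) / opts.cols]};
  }
  for (var r = 0; r < opts.rows; r++) {
    // row 0 is placed at the top of the plot area
    layout[axisKey('y', r)] = {
      domain: [1 - (r + 1) / opts.rows, 1 - r / opts.rows]
    };
  }
  return layout;
}

// Hypothetical stand-in for the proposed Plotly.linkToSubplot.
function linkToSubplot(trace, row, col) {
  trace.xaxis = axisId('x', col);
  trace.yaxis = axisId('y', row);
  return trace;
}

var layout = makeSubplots({rows: 2, cols: 2});
var trace = linkToSubplot({mode: 'markers', x: [1, 2], y: [3, 4]}, 1, 0);
console.log(layout.xaxis2.domain, trace.xaxis, trace.yaxis);
```

Even with helpers like these, every trace still carries its own x and y arrays, so the duplicate-array concern is untouched.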

@etpinard
Contributor Author

Solution 3 (data-array reusing)

This could be combined with solution 2 to solve the data-array-duplication problem. But this would also require some backend work for plot.ly support.

In short, we could add a new top-level argument to Plotly.newPlot and Plotly.react

var columns = [
  {name: 'col 0', values: [/* */]},
  {name: 'col 1', values: [/* */]},
  // ...
]

// unfortunately, in this paradigm columns should really be labeled data, 
// and data -> traces
var data = [{
   x: 'col 0',
   y: 'col 1'
}, {
  x: 'col 1',
  y: 'col 0'
}]

Plotly.newPlot(gd, {
  columns: columns,
  data: data,
  layout: {}
})

PROs

  • all trace types could benefit from not duplicating data arrays

CONs

  • probably the hardest to implement, especially when considering plot.ly backend work.
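
As a rough sketch of what the column lookup might involve, the helper below swaps column names for the shared arrays without copying them. `resolveColumnRefs` is a hypothetical name, and treating only `x` and `y` as possible column references is an assumption made to keep the example short.

```javascript
// Replace column names in traces with references to the shared arrays.
function resolveColumnRefs(columns, traces) {
  var byName = {};
  columns.forEach(function(col) { byName[col.name] = col.values; });

  return traces.map(function(trace) {
    var out = {};
    Object.keys(trace).forEach(function(key) {
      var val = trace[key];
      // only x and y are treated as column references in this sketch
      var isRef = (key === 'x' || key === 'y') && typeof val === 'string';
      if (isRef && !Object.prototype.hasOwnProperty.call(byName, val)) {
        throw new Error('Unknown column: ' + val);
      }
      out[key] = isRef ? byName[val] : val;
    });
    return out;
  });
}

var columns = [
  {name: 'col 0', values: [1, 2, 3]},
  {name: 'col 1', values: [4, 5, 6]}
];
var resolved = resolveColumnRefs(columns, [
  {x: 'col 0', y: 'col 1'},
  {x: 'col 1', y: 'col 0'}
]);
// Both traces point at the same two arrays; nothing was copied.
console.log(resolved[0].x === resolved[1].y);
```

This only avoids duplication in the figure description; later steps that build per-trace buffers could still copy the values.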

@alexcjohnson
Collaborator

I think it's clear we want to encapsulate a splom in a single trace, like solution 1. Solution 2 won't give the necessary performance benefits. Solution 3 may give some of the performance we need, and may be useful for more generalized trace linking in the future (for example, things like 2dhistogram_contour_subplots where the x and y data are duplicated in the scatter and histogram2dcontour traces, then x and y each get another copy in the 1D histograms), but it will still suffer from duplication at the calc/plot level, which I suspect will be prohibitive for us. Likewise, it seems to me that it's only reasonable to make this a WebGL trace type.

The question in my mind is whether we can do it by linking the splom trace to regular cartesian axes, and using it to tailor the defaults for those axes, or if we need to have even the axes encapsulated in the trace itself. If we can do the former, then we retain the flexibility to display other traces on those same subplots. Extra data that we only have for one attribute pair, for example, or a curve fit, or some different type of display on the diagonal. Or even another splom that might even have a disjoint set of dimensions from the first (might be a huge headache but see below for more thoughts)

Preferred option: refer to regular cartesian axes

trace = {
  dimensions: [{
    values: [/* */],
    name: 'Sepal Width', // used as default x/y axis titles
    xaxis: 'x', // or 'x2', ... defaults to ith x axis ID for dimension i
    yaxis: 'y'  // or 'y2', ...
  }],
  marker: {
    // just like scatter, and all the same ones are arrayOk.
    // goes outside the `dimensions` array because the same data point should get
    // the same marker in all subplots.
  },
  // domain settings - not used directly, just fed into the defaults for all the
  // individual x/y axis domains
  domain: {
    // total domain to be divided among all x / y axes
    x: [0, 1],
    y: [0, 1],
    // blank space between x axes, as a fraction of the length of each axis
    // possibly xgutter and ygutter?
    gutter: 0.1
  },
  // some splom-wide options e.g.:

  // maybe turn these into a flaglist 'upper+lower+diagonal'?
  // these and related attrs will affect the default x/y axis anchor and/or side attributes
  showdiagonal: true || false,
  showupperhalf: true || false,
  showlowerhalf: true || false,

  // maybe xdirection and ydirection?
  direction: 'top-left-to-bottom-right' || 'bottom-left-to-top-right'
  // ...
};

layout = {
  xaxis: { /* overriding any of the defaults set by SPLOM */ },
  xaxis2: { /* */ },
  xaxis3: { /* */ },
  // ...
  yaxis: { /* */ }
  // ...
};

One variation that might be nice but I'm not sure: separate the list of axes from the dimensions. This could make it easier for example to reorder the dimensions without having to do all sorts of gymnastics with swapping axis attributes (though we might need to swap axis titles still, if they're not inherited from the dimension names):

trace = {
  dimensions: [{
    values: [/* */],
    name: 'Sepal Width' // used as default x/y axis titles
    // some scatter style props ...
  }],
  xaxes: ['x', 'x2', 'x3' /* , ... */], // defaults to the first N x axis IDs. info_array, not data_array.
  yaxes: ['y', 'y2', 'y3' /* , ... */]
  // ...
}

Bonus: layout.grid

Also, it might be nice to move the axis arrangement to layout, but still have splom provide defaults for this. That way we could reuse it for other cases that want a grid of axes, not just splom:

// splom trace would still have axis ids in it but no axis layout info (domain or gutter)
layout = {
  grid: {
    xaxes: ['x', 'x2', 'x3' /* , ... */],
    yaxes: ['y', 'y2', 'y3' /* , ... */],
    domain: { x: [0, 1], y: [0, 1] },
    gutter: 0.1
  }
}

Cases like splom would use 1D arrays of x/y axes, as all rows share the same x axes and all columns share the same y axes, but we could also allow 2D arrays for when you want a grid of uncoupled axes. And if you put '' in any entry it leaves that row/col/cell blank, and at some point we can make a way to refer to empty cells in other trace/subplot types, so in a pie trace or a 3d scene etc. you could add something like gridcell: [1, 2] which would automatically generate the appropriate domain for you.

Actually, this would make it easy to support multiple splom traces regardless of whether they have the same or different dimensions:

  1. At the beginning of supplyDefaults we'd look through all splom traces and find the full set of xaxes and yaxes to use as the defaults in fullLayout.grid (but the user could override these lists if they wanted) as well as to populate the axis and subplot lists in fullLayout._subplots.
  2. Since there's now a list of axes in fullLayout.grid, we'd coerce grid.domain and grid.gutter.
  3. Then when supplying defaults for the individual axes (as well as other subplots and traces with gridcell attributes), default domain values would be generated based on grid.
  4. After the supplyDefaults step, grid and gridcell attributes would be ignored because the appropriate domain values would have been filled in already.

That way all of this would happen automatically if you just make a splom trace with N dimensions and don't say anything about its layout, but you could alter it all at various stages if you want to.
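
A minimal sketch of steps 1 and 3, under stated assumptions: `collectAxes` and `gridDomains` are hypothetical helper names (not plotly.js functions), and `gutter` is read as the gap between neighbouring axes expressed as a fraction of each axis length, as in the `domain.gutter` comment earlier.

```javascript
// Step 1: gather the axis ids used by all splom traces, in order of appearance.
function collectAxes(traces, attr) {
  var seen = [];
  traces.forEach(function(trace) {
    if (trace.type !== 'splom') return;
    (trace[attr] || []).forEach(function(id) {
      if (seen.indexOf(id) === -1) seen.push(id);
    });
  });
  return seen;
}

// Step 3: split [domain[0], domain[1]] among n axes, leaving gaps of
// `gutter` times the axis length between neighbours:
//   n * len + (n - 1) * gutter * len = total
function gridDomains(n, domain, gutter) {
  var total = domain[1] - domain[0];
  var len = total / (n + (n - 1) * gutter);
  var step = len * (1 + gutter);
  var out = [];
  for (var i = 0; i < n; i++) {
    var start = domain[0] + i * step;
    out.push([start, start + len]);
  }
  return out;
}

var traces = [
  {type: 'splom', xaxes: ['x', 'x2']},
  {type: 'splom', xaxes: ['x2', 'x3']},
  {type: 'scatter', xaxis: 'x'}
];
var gridXaxes = collectAxes(traces, 'xaxes');
var xDomains = gridDomains(gridXaxes.length, [0, 1], 0.1);
console.log(gridXaxes, xDomains);
```

The per-axis defaults would then read their `domain` from `xDomains[i]` unless the user set one explicitly.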

Alternative: axes also encapsulated in the trace

What I'm trying to avoid above, but might be even higher performance at the expense of flexibility, as the axis rendering could be tailored to the splom case:

trace = {
  dimensions: [{
    values: [/* */],
    xaxis: { /* all the x axis attributes like title, tick/grid specs, fonts, etc */ },
    yaxis: { /* same for y - or these could go in xaxes/yaxes arrays but still in the trace */ }
  }]
}

or in trace.xaxes and trace.yaxes which would be arrays of objects rather than arrays of IDs... either way the point is no other traces would be able to use these axes, which means they could use stripped down rendering machinery for better performance but less flexibility.

My hope though is that the SVG axis machinery is fast enough, especially if we avoid having splom contribute to fullLayout._subplots.cartesian or fullLayout._subplots.gl2d (which would scale quadratically with number of dimensions, vs the number of x/y axes, fullLayout._subplots.(x|y)axis, which scale linearly) so we only draw the axes in SVG, and let splom draw gridlines (if required) in WebGL.
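
The quadratic-versus-linear point can be checked with a few lines of arithmetic. This is only a counting sketch; `countSplomParts` is a hypothetical helper, and the option names follow the `showdiagonal` / `showupperhalf` / `showlowerhalf` attributes proposed above.

```javascript
// Count subplots (quadratic in nDims) and axes (linear in nDims) for a splom.
function countSplomParts(nDims, opts) {
  var half = nDims * (nDims - 1) / 2; // cells strictly above (or below) the diagonal
  var subplots = 0;
  if (opts.showdiagonal) subplots += nDims;
  if (opts.showupperhalf) subplots += half;
  if (opts.showlowerhalf) subplots += half;
  return {subplots: subplots, axes: 2 * nDims};
}

var full = {showdiagonal: true, showupperhalf: true, showlowerhalf: true};
[10, 20, 40].forEach(function(n) {
  var c = countSplomParts(n, full);
  console.log(n + ' dims: ' + c.subplots + ' subplots, ' + c.axes + ' axes');
});
```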

@etpinard
Contributor Author

Thanks for the 📚 @alexcjohnson

I'm a big fan of those xaxes and yaxes info arrays in the traces 👍 Using the plural here is great as they won't conflict with the current xaxis / yaxis trace attributes.

About your grid proposal, I'm curious to see if we could combine the numerous xy subplot-wide but not graph-wide requested settings in them (#1468, #233, #2274 and per subplot plot_bgcolor to name a few).


Now, to give a more concrete example (to e.g. @dfcreative 😉), the Iris splom (e.g. https://codepen.io/etpinard/pen/Vbzxqa) would be declared as:

var url = 'https://cdn.rawgit.com/plotly/datasets/master/iris.csv'
var colors = ['red', 'green', 'blue']

Plotly.d3.csv(url, (err, rows) => {
  // every column except 'Name' is an observed field
  var keys = Object.keys(rows[0]).filter(k => k !== 'Name')
  var names = rows.map(r => r.Name).filter((v, i, self) => self.indexOf(v) === i)

  var xaxes = keys.map((_, i) => 'x' + (i ? i + 1 : ''))
  var yaxes = keys.map((_, i) => 'y' + (i ? i + 1 : ''))

  var data = names.map((name, i) => {
    var rowsOfName = rows.filter(r => r.Name === name)

    var trace = {
      type: 'splom',
      name: name,

      // 'label' would be better here than 'name' (parcoords uses 'label')
      dimensions: keys.map(k => ({
        label: k,
        values: rowsOfName.map(r => r[k])
      })),

      marker: {color: colors[i]},

      // the default (for clarity)
      showlegend: true,

      xaxes: xaxes,
      yaxes: yaxes
    }

    return trace
  })

  var layout = {
    grid: {
      xaxes: xaxes,
      yaxes: yaxes,
      domain: { x: [0, 1], y: [0, 1] },
      gutter: 0.1
    }
  }

  Plotly.newPlot('graph', data, layout)
})

That is, one splom trace per 🥀 type and one dimension per observed field in each trace.

@etpinard
Contributor Author

My hope though is that the SVG axis machinery is fast enough, especially if we avoid having splom contribute to fullLayout._subplots.cartesian or fullLayout._subplots.gl2d (which would scale quadratically with number of dimensions, vs the number of x/y axes, fullLayout._subplots.(x|y)axis, which scale linearly) so we only draw the axes in SVG, and let splom draw gridlines (if required) in WebGL.

Interesting point here about the grid lines. It shouldn't be too hard to draw them in WebGL (much easier than axis labels 😉 at least), if we find SVG too slow.

@dy
Contributor

dy commented Feb 16, 2018

May I add my 2¢?
Why don't we just use existing scatter trace data/naming convention as

Plotly.newPlot(document.body, [{
  type: 'scattermatrix',
  x: [[], [], ...xdata],
  y: [[], [], [], ...ydata]
}])

That would be familiar already for the users who know trace types and options.

@alexcjohnson
Collaborator

May I add my 50 cents?

Usually it's 2¢ but we like you so sure :)

Why don't we just use existing scatter trace data/naming convention

Two things I don't like about this:

  1. A given data value isn't x or y, it's used for both in different subplots.
  2. We need labels associated with each dimension, and we may want to be able to rearrange dimensions, both of which are a bit awkward if the data are in a 2D array.

Anyway, we do have a precedent for the structure I'm proposing, in parcoords. Then the marker attributes would be inherited directly from scatter.

@alexcjohnson
Collaborator

About your grid proposal, I'm curious to see if we could combine the numerous xy subplot-wide but not graph-wide requested settings in them (#1468, #233, #2274 and per subplot plot_bgcolor to name a few).

I suppose we could let grid provide these settings, the same way grid would be providing domain values for individual axes. But I wouldn't want this to be the only way to provide per-subplot settings, because not every multi-subplot layout can be described as a grid - think of insets, or layouts like

+-------+ +---+
|       | |   |
|       | +---+
|       | +---+
|       | |   |
+-------+ +---+

I guess ^^ could be massaged into the grid format with concepts like colspan / rowspan, and maybe we'll do that, but that would still make it awkward to provide per-subplot attributes, and insets would still be difficult to describe this way.

So I still think we'll need something like #2274 (comment) but perhaps grid would be allowed to provide defaults to that when the layout is conducive to it.

@dfcreative don't worry about grid while implementing splom - just use explicitly positioned x and y axes, and I'll work on grid separately, then once it and splom are both ready we can integrate them.

This was referenced Feb 19, 2018
@dy dy mentioned this issue Mar 8, 2018
7 tasks
@etpinard
Contributor Author

Branch splom has some preliminary work on the user-attributes-full-attributes side of things (i.e. pretty much everything except the regl-scatter2d calls).

@etpinard
Contributor Author

Things to note:

  • splom traces have their own basePlotModule (similar to pie, parcoords, ...) that reuses some Cartesian methods
  • the splom default step generates default xaxes and yaxes list using the number of dimensions the trace has
  • we keep track of all splom axes to then use them as grid.xaxes and grid.yaxes defaults
  • even though splom traces have their own base plot module, we fill in fullLayout._subplots.cartesian and fullLayout._subplots.(x|y)axis so that things just work.
  • we'll make one regl-scatter2d (or equivalent) call per splom trace

@alexcjohnson
Collaborator

Just a couple of clarifying questions:

splom traces have their own basePlotModule

Sounds great, just as long as this doesn't restrict us from displaying other data (be it splom or some other trace type) on the same axes.

we'll make one regl-scatter2d (or equivalent) call per splom trace

I'm not really sure what a regl-scatter2d call entails, but the key optimization we need over making a million scattergl subplots is to only upload the values data for each dimension to the GPU once, even though it will appear in somewhere between N-1 and 2N subplots. Does this strategy do that?
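
One way to picture the "upload once" requirement is to describe each subplot as a pair of dimension indices rather than a pair of arrays. The sketch below only does the bookkeeping (no WebGL); `buildPasses` and `usesOfDimension` are hypothetical names and say nothing about how regl-scatter2d is actually organized.

```javascript
// List the subplots to draw as pairs of dimension indices (no data copies).
function buildPasses(nDims, opts) {
  var passes = [];
  for (var row = 0; row < nDims; row++) {
    for (var col = 0; col < nDims; col++) {
      var keep = row === col ? opts.showdiagonal :
        row > col ? opts.showlowerhalf : opts.showupperhalf;
      if (keep) passes.push({x: col, y: row});
    }
  }
  return passes;
}

// Number of subplots that read from a given dimension's (single) buffer.
function usesOfDimension(passes, dim) {
  return passes.filter(function(p) {
    return p.x === dim || p.y === dim;
  }).length;
}

var N = 5;
var fullPasses = buildPasses(N, {showdiagonal: true, showupperhalf: true, showlowerhalf: true});
var lowerPasses = buildPasses(N, {showdiagonal: false, showupperhalf: false, showlowerhalf: true});
console.log(usesOfDimension(fullPasses, 2));  // 2N - 1 subplots
console.log(usesOfDimension(lowerPasses, 2)); // N - 1 subplots
```

With one buffer per dimension and index pairs per subplot, the number of uploads grows with N while the number of draws grows with the number of visible subplots.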

@etpinard
Contributor Author

Sounds great, just as long as this doesn't restrict us from displaying other data (be it splom or some other trace type) on the same axes.

Yes, for sure 👌

I'm not really sure what a regl-scatter2d call entails, but the key optimization we need over making a million scattergl subplots is to only upload the values data for each dimension to the GPU once, even though it will appear in somewhere between N-1 and 2N subplots. Does this strategy do that?

Here's a sneak peek:

@etpinard
Contributor Author

etpinard commented Mar 13, 2018

Here are some observations on splom-generated cartesian subplots:

Off the splom branch with commits from #2474 and using the following script:

var Nvars = ???
var Nrows = 2e4 // makes no difference for now
var dims = []

for(var i = 0; i < Nvars; i++) {
  dims.push({values: []})
  
  for(var j = 0; j < Nrows; j++) {
     dims[i].values.push(Math.random())
  }
}

Plotly.purge(gd);

console.time('splom')
Plotly.plot(gd, [{
  type: 'splom',
  dimensions: dims
}])
console.timeEnd('splom')

I got:

image

where I added console.time / console.timeEnd pairs in the slowest subroutines i.e. the ones that scale with the total number of subplots or Math.pow(dimensions.length, 2)

A few quick hits:

  • initInteractions execution can be 🔪 by setting staticPlot: true (duh), but even setting the more obscure config options showAxisDragHandles and showAxisRangeEntryBoxes to false can reduce its execution time by a factor of 4
  • lsInner is currently called twice via layoutStyles here and here (and a third time on graphs with margin-pushing things). At 40 dimensions (that's 1600 subplots), it takes a whopping 2700ms to execute. That is, more than half of the total plotting time is spent in there. I'll try to first make sure the slow parts are called only once. But we might need more aggressive optimization at some point
  • Removing the grid-drawing step in Axes.doTicks speeds up the doAxes step by a factor of 2. That's good because we can probably use regl-line2d to draw those lines more efficiently. That said, we'll also have to speed up the label-drawing step, mostly via #1988 (Replace getBoundingClientRect calls in axes.js) and fixOverlappingLabels.
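
For reference, the two axis-related config options from the first bullet would be set roughly as follows. This is a sketch of the config object only; the `Plotly.newPlot` call is left as a comment because it assumes a page with plotly.js loaded and a graph div `gd`.

```javascript
// Config options that turn off the axis drag handles and range entry boxes.
var config = {
  showAxisDragHandles: false,
  showAxisRangeEntryBoxes: false
};

// In a page with plotly.js loaded, the config goes in as the fourth argument:
// Plotly.newPlot(gd, data, layout, config)
console.log(JSON.stringify(config));
```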

@dy
Contributor

dy commented Mar 13, 2018

Work in progress https://dfcreative.github.io/regl-scattermatrix/

@etpinard
Contributor Author

etpinard commented Mar 14, 2018

Quick update:

  • halving the number of lsInner calls was easy enough in e810c1e. Next, I'll try to merge as much logic as possible from Cartesian.drawFramework with lsInner so that we can hopefully loop over all the <g subplot> only once.

@etpinard
Contributor Author

Interesting finding:

  • Commenting out this particular Drawing.setClipUrl call can speed up lsInner by 10x at 40 dimensions (or 1600 subplots)! Even when the page has no <base>! I suspect that traversing the DOM when you have 1600 <g subplot> is slow 🐢 (duh!). This should be an easy fix: call d3.select('base') once (i.e. not for every Drawing.setClipUrl call ) and stash it somewhere.
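
The "look it up once and stash it" idea could be sketched like this. It is not the actual Drawing.setClipUrl code: `once` and `makeClipUrlBuilder` are hypothetical helpers, the `<base>` lookup is replaced by a fake function so the snippet runs outside a browser, and the rule "prefix the page URL only when a `<base href>` exists" is an assumption about what the lookup is for.

```javascript
// Wrap an expensive lookup so it runs at most once.
function once(fn) {
  var called = false;
  var result;
  return function() {
    if (!called) {
      result = fn();
      called = true;
    }
    return result;
  };
}

// Build clip-path URLs from a cached "does the page have a <base href>?" answer.
function makeClipUrlBuilder(hasBaseHref, pageUrl) {
  var cachedHasBase = once(hasBaseHref);
  return function(localId) {
    var url = '#' + localId;
    if (cachedHasBase()) url = pageUrl.split('#')[0] + url;
    return 'url(' + url + ')';
  };
}

// Stand-in for the d3.select('base') query, counting how often it would hit the DOM.
var lookups = 0;
function fakeHasBaseHref() {
  lookups++;
  return false;
}

var clipUrl = makeClipUrlBuilder(fakeHasBaseHref, 'https://example.com/page#top');
for (var i = 0; i < 1600; i++) clipUrl('clip' + i);
console.log(lookups); // 1
```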

@alexcjohnson
Collaborator

There's also document.baseURI. Perhaps we can bypass base and just check if
document.baseURI === window.location.href

@etpinard
Contributor Author

image

too bad. Although ⤴️ is from w3school 😆

https://developer.mozilla.org/en-US/docs/Web/API/Node/baseURI is incomplete:

image

@etpinard
Contributor Author

etpinard commented Mar 15, 2018

New benchmarks post 5887104 (which I pushed to #2474 - hopefully @alexcjohnson won't mind):

image

Things are looking up 🎸

Next steps:

  • Minimize the number of loops over subplots
  • Minimize subplot load (i.e. things that scale as Math.pow(dimensions.length, 2))

@etpinard
Contributor Author

A first attempt at drawing grid lines using @dfcreative 's regl-line2d was positive.

Here are the numbers (in ms) with all axes having the same gridcolor and gridwidth:

# of dims   SVG (ms)   regl-line2d (ms)
10          70         80-100
20          200        140-150
30          500        150-200
40          800        300
50          1500       350

In brief, we start to see improvements over SVG at around 15 dimensions (i.e. 15x15=225 subplots).
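
A quick way to read those numbers is as a speedup ratio per row. The snippet below just re-enters the timings quoted above and, as a conservative assumption, uses the upper end of each regl-line2d range.

```javascript
// Timings (ms) copied from the numbers above; regl is [min, max] of the range.
var timings = [
  {dims: 10, svg: 70, regl: [80, 100]},
  {dims: 20, svg: 200, regl: [140, 150]},
  {dims: 30, svg: 500, regl: [150, 200]},
  {dims: 40, svg: 800, regl: [300, 300]},
  {dims: 50, svg: 1500, regl: [350, 350]}
];

var summary = timings.map(function(t) {
  var worstRegl = t.regl[1]; // upper end of the range
  return {
    dims: t.dims,
    subplots: t.dims * t.dims,
    speedup: +(t.svg / worstRegl).toFixed(2) // > 1 means regl-line2d is faster
  };
});

summary.forEach(function(s) {
  console.log(s.dims + ' dims (' + s.subplots + ' subplots): x' + s.speedup);
});
```

The ratio crosses 1 somewhere between the 10- and 20-dimension rows, in line with the "around 15 dimensions" estimate.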
