Work with Table Loaded Using Arquero #116

Closed
jaredlander opened this issue Mar 26, 2024 · 16 comments

@jaredlander

jaredlander commented Mar 26, 2024

Data manipulation capabilities were removed from Arrow a while back and they now suggest using Arquero for operations such as filtering, summarizing, etc.

I would love to be able to manipulate the table before passing it to various layers, for example, I might want to filter out certain rows.

From what I can tell, the layers require that the data be loaded with Arrow.tableFromIPC() while, of course, Arquero requires the data be loaded with aq.loadArrow().

I can convert from the Arrow-loaded data to an Arquero-type and use Arquero functions. I can even send that back to Arrow-style, but then it is unusable with geoarrow/deck.gl-layers.

For instance (again, you need to download the data file somewhere first, since my website doesn't support CORS):

const GEOARROW_POINT_DATA = "https://jaredlander.com/data/hood_centers.arrow";

const hoods = await Arrow.tableFromIPC(fetch(GEOARROW_POINT_DATA));

hoods.numRows
// outputs 195

// convert to Arquero, filter then show the number of rows
aq.from(hoods).filter(d => d.BoroName == 'Manhattan').numRows()
// outputs 29

If we read the data directly into Arquero we can filter it and transfer it to Arrow and see we get the correct number of rows. (For some reason we can't convert back to Arrow if we first imported FROM Arrow.)

zones = await aq.loadArrow(GEOARROW_POINT_DATA)
zones_arrow = zones.filter(d => d.BoroName == 'Manhattan').toArrow()
zones_arrow.numRows

However, this newly created zones_arrow cannot be used for mapping, resulting in this error.

deck: initialization of GeoArrowScatterplotLayer({id: 'scatter'}): getPosition not GeoArrow point or multipoint Error: getPosition not GeoArrow point or multipoint
    at It.renderLayers (deckgl-layers.js:1:20226)
    at It._postUpdate (deckgl.min.js:1759:35234)
    at It._update (deckgl.min.js:1759:29349)
    at It._initialize (deckgl.min.js:1759:28672)
    at Bn._initializeLayer (deckgl.min.js:1705:8621)
    at Bn._updateSublayersRecursively (deckgl.min.js:1705:8364)
    at Bn._updateLayers (deckgl.min.js:1705:7901)
    at Bn.setLayers (deckgl.min.js:1705:7063)
    at Bn.updateLayers (deckgl.min.js:1705:7186)
    at fp._onRenderFrame (deckgl.min.js:1706:56585)

I'm guessing this is because Arquero doesn't know how to properly handle geometry columns. Would be really nice if we can perform some sort of manipulation on the data before plotting, but I'm not quite sure how to accomplish this.

@kylebarron
Member

Flexible: query over arrays, typed arrays, array-like objects, or Apache Arrow columns.

It's hard to tell if Arquero is storing Arrow internally or not. It looks like they may not be.

You should print out the schema of the table you're trying to pass in to deck-layers, but yeah presumably Arquero is messing with the nested geometry column.

"Soon" we'll have another release of geoarrow-rust for JavaScript, which will bring Arrow-backed geospatial operations to WebAssembly. Though it won't have general dataframe operations.

@jaredlander
Author

You know, now that I think about it, I think Arquero stores it as an array, not Arrow, which seems less than awesome from a memory point of view.

My real goal is to be able to plot some subset of the data, say based on drop down select inputs the user can select. I know there is DataFilterExtension but that only filters ranges of numeric data, won't work for categorical data.
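For the drop-down case, one possible starting point is to work out which row indices match the selected categories before touching the Arrow table at all. The sketch below is plain JavaScript with no dependencies. `matchingIndices` is a hypothetical helper, and it assumes the category column has already been extracted as an ordinary array of strings; the borough list is made-up sample data standing in for the BoroName column.

```javascript
// Hypothetical helper: return the indices of rows whose category is selected.
// `values` is assumed to be a plain array (or any iterable) of category strings.
function matchingIndices(values, selected) {
  const allowed = new Set(selected); // constant-time membership checks
  const indices = [];
  let i = 0;
  for (const value of values) {
    if (allowed.has(value)) indices.push(i);
    i += 1;
  }
  return indices;
}

// Made-up sample values standing in for the BoroName column.
const boroughs = ['Brooklyn', 'Queens', 'Manhattan', 'Queens', 'Bronx'];
const keep = matchingIndices(boroughs, ['Manhattan', 'Queens']);
console.log(keep); // [1, 2, 3]
```

The resulting indices could then drive whatever row selection the table library offers; how they are applied to the Arrow table is left open here.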

@jaredlander
Author

Digging deeper, it looks like Arquero stores the columns as arrow columns. So I'm guessing it's not able to do that with the geometry column.

https://uwdata.github.io/arquero/api/#:~:text=For%20most%20data%20types%2C%20Arquero,is%20with%20zero%20data%20copying.

@kylebarron
Member

Digging deeper, it looks like Arquero stores the columns as arrow columns. So I'm guessing it's not able to do that with the geometry column.

You should import then export back to Arrow, and print out the change in the table schema from that transformation.

My real goal is to be able to plot some subset of the data, say based on drop down select inputs the user can select. I know there is DataFilterExtension but that only filters ranges of numeric data, won't work for categorical data.

deck.gl v9 was just released, and while the official website hasn't been updated yet, you can see from https://felixpalmer.github.io/deck.gl/docs/whats-new that category filtering was added to the DataFilterExtension. So whenever we get to updating here to v9, we should support that as well, and then expose it in Lonboard in the next release.

@jaredlander
Author

jaredlander commented Mar 27, 2024

OMG, that will be great. You should see the hoops I'm jumping through to filter the Arrow data.

const GEOARROW_POINT_DATA = "https://jaredlander.com/data/hood_centers.arrow";

const hoods = await Arrow.tableFromIPC(fetch(GEOARROW_POINT_DATA));

function filter_text(df, column, values){
    let result = df.slice(0, 0);
    let df_aq = aq.from(df);
    let rows_to_keep = df_aq.
        params({ 
            vals_to_check: values,
            column_to_use: column
        }).
        filter((d, $) => op.includes($.vals_to_check, d[$.column_to_use]))
        .indices()

    rows_to_keep.forEach(d => result = result.concat(df.slice(d, d + 1)));

    return result;
}

let reduced_data = filter_text(hoods, 'BoroName', ['Manhattan', 'Queens']);

And that seems to work, but so fragile.
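One way to make this less fragile might be to slice contiguous runs of rows instead of one row at a time. The helper below is a hypothetical, plain-JavaScript sketch (`toRanges` is not part of Arrow or Arquero). It assumes the indices are sorted in ascending order and unique, and it groups them into half-open `[start, end)` ranges, so that each range could be passed to a single `df.slice(start, end)` call.

```javascript
// Hypothetical helper: group sorted, unique row indices into [start, end) ranges.
function toRanges(sortedIndices) {
  const ranges = [];
  for (const index of sortedIndices) {
    const last = ranges[ranges.length - 1];
    if (last !== undefined && last[1] === index) {
      last[1] = index + 1; // extend the current run by one row
    } else {
      ranges.push([index, index + 1]); // start a new run
    }
  }
  return ranges;
}

const ranges = toRanges([0, 1, 2, 5, 7, 8]);
console.log(ranges); // [[0, 3], [5, 6], [7, 9]]
```

With data that is grouped by borough, this would reduce the number of slice and concat calls from one per row to one per run.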

@jaredlander
Author

jaredlander commented Apr 1, 2024

Took some time to look at the schemas. They are beyond me, but putting it here in case someone who knows this better than me can chime in.

First we get the data.

const GEOARROW_POINT_DATA = "https://jaredlander.com/data/hood_centers_small.arrow";
const hoods = await Arrow.tableFromIPC(fetch(GEOARROW_POINT_DATA));

Looking at the schema

hoods.schema

Results in this

{
    "fields": [
        {
            "name": "BoroName",
            "type": {
                "typeId": 5
            },
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "NTAName",
            "type": {
                "typeId": 5
            },
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "Area",
            "type": {
                "typeId": 3,
                "precision": 2
            },
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "geometry",
            "type": {
                "typeId": 16,
                "listSize": 2,
                "children": [
                    {
                        "name": "xy",
                        "type": {
                            "typeId": 3,
                            "precision": 2
                        },
                        "nullable": true,
                        "metadata": {}
                    }
                ]
            },
            "nullable": true,
            "metadata": {}
        }
    ],
    "metadata": {},
    "dictionaries": {},
    "Ja": 4
}

Then we send to arquero.

hoods_aq = aq.fromArrow(hoods)
hoods_aq.slice(0,9)

I don't know if there is a schema, but this is the first few rows.

"{\"schema\":{\"fields\":[{\"name\":\"BoroName\"},{\"name\":\"NTAName\"},{\"name\":\"Area\"},{\"name\":\"geometry\"}]},\"data\":{\"BoroName\":[\"Brooklyn\",\"Queens\",\"Queens\",\"Queens\",\"Brooklyn\",\"Queens\",\"Queens\",\"Brooklyn\",\"Brooklyn\"],\"NTAName\":[\"Borough Park\",\"Murray Hill\",\"East Elmhurst\",\"Hollis\",\"Homecrest\",\"Fresh Meadows-Utopia\",\"St. Albans\",\"Madison\",\"Kensington-Ocean Parkway\"],\"Area\":[54005019.0479584,52488277.5914307,19726945.9468689,22887772.9909821,29991967.2829895,27774853.6240387,77412763.7422485,27379162.0824585,15893305.1584473],\"geometry\":[{\"0\":-73.98866200755816,\"1\":40.63095748522966},{\"0\":-73.80954678606533,\"1\":40.76835984691723},{\"0\":-73.86839679452368,\"1\":40.76336034768915},{\"0\":-73.76113832065818,\"1\":40.71064745380929},{\"0\":-73.96433497250935,\"1\":40.59996208640553},{\"0\":-73.78371789566769,\"1\":40.73490269487121},{\"0\":-73.76314740843404,\"1\":40.69120944987827},{\"0\":-73.94813704281002,\"1\":40.604921699485104},{\"0\":-73.97620022870832,\"1\":40.64059825634118}]}}"

Then we change it back to Arrow and look at the schema.

hoods_arrow = hoods_aq.toArrow()
hoods_arrow.schema

Resulting in

{
    "fields": [
        {
            "name": "BoroName",
            "type": {
                "typeId": 5
            },
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "NTAName",
            "type": {
                "typeId": 5
            },
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "Area",
            "type": {
                "typeId": 3,
                "precision": 2
            },
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "geometry",
            "type": {
                "typeId": 16,
                "listSize": 2,
                "children": [
                    {
                        "name": "xy",
                        "type": {
                            "typeId": 3,
                            "precision": 2
                        },
                        "nullable": true,
                        "metadata": {}
                    }
                ]
            },
            "nullable": true,
            "metadata": {}
        }
    ],
    "metadata": {},
    "dictionaries": {},
    "Ja": 4
}

While the copied objects look the same here, the original hoods.schema shows the metadata is different than for hoods_arrow.schema as seen in these screenshots.

[Two screenshots: the metadata of hoods.schema and of hoods_arrow.schema]

So I tried something hacky and set the metadata for the hoods_arrow object to match that of the hoods object.

hoods_arrow.schema.metadata.set('geo', hoods.schema.metadata.get('geo'));

But when I try plotting that I get the same error as before.

new geoarrowLayers.GeoArrowScatterplotLayer({
            id: 'morepoints',
            data: hoods_arrow,
            pickable: true,
            getFillColor: [180, 40, 120],
            getPosition: d => d.getChild('geometry'),
            getRadius: 5,
            radiusUnits: 'pixels',
            radiusScale: 1,
            radiusMinPixels: 4,
            radiusMaxPixels: 7,
        })
deckgl.min.js:1706 deck: initialization of GeoArrowScatterplotLayer({id: 'morepoints'}): getPosition not GeoArrow point or multipoint Error: getPosition not GeoArrow point or multipoint
    at It.renderLayers (deckgl-layers.js:1:20226)
    at It._postUpdate (deckgl.min.js:1759:35234)
    at It._update (deckgl.min.js:1759:29349)
    at It._initialize (deckgl.min.js:1759:28672)
    at Bn._initializeLayer (deckgl.min.js:1705:8621)
    at Bn._updateSublayersRecursively (deckgl.min.js:1705:8364)
    at Bn._updateLayers (deckgl.min.js:1705:7901)
    at Bn.setLayers (deckgl.min.js:1705:7063)
    at Bn.updateLayers (deckgl.min.js:1705:7186)
    at fp._onRenderFrame (deckgl.min.js:1706:56585)

But, if I change the getPosition argument to not be a function, but to directly reference the geometry columns of hoods_arrow, it works!

new geoarrowLayers.GeoArrowScatterplotLayer({
            id: 'morepoints',
            data: hoods_arrow,
            pickable: true,
            getFillColor: [180, 40, 120],
            getPosition: hoods_arrow.getChild('geometry'),
            getRadius: 5,
            radiusUnits: 'pixels',
            radiusScale: 1,
            radiusMinPixels: 4,
            radiusMaxPixels: 7,
        })

So my hack of setting the schema metadata gets us part of the way there. But then, I just ran it again without changing the schema metadata and it still worked by simply changing the getPosition argument. So this leads me to believe it has something to do with the way getPosition: d => d.getChild('geometry') is searching through the table.

I know that was a lot of stuff but hopefully something in there is useful.

@kylebarron
Member

getPosition: d => d.getChild('geometry'),

Oh that's your problem.

hoods_arrow.getChild('geometry')

This is how the API is expected to be used. You should've gotten a type error when passing in d => d.getChild('geometry'). The function form is a per-row callback, whereas hoods_arrow.getChild('geometry') passes the entire column to deck.gl. These layers are fast because they copy all the data from the column to the GPU at once, instead of looping over the table.

If you look at any example from the readme, it calls getChild on the table. The object passed into the callback is a single row.
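To make the distinction concrete, here is a hypothetical guard (not the library's actual validation) that rejects a function accessor and accepts a column object. `requireColumn` and `FakeColumn` are made-up names; `FakeColumn` is only a stand-in for an Arrow vector so that the sketch runs without any dependencies.

```javascript
// Stand-in for an Arrow vector; only used to have a non-function value to pass.
class FakeColumn {
  constructor(name) {
    this.name = name;
  }
}

// Hypothetical guard: accept a column object, reject a per-row function.
function requireColumn(propName, value) {
  if (typeof value === 'function') {
    throw new TypeError(
      `${propName} must be a column (e.g. table.getChild('geometry')), not a per-row function`,
    );
  }
  return value;
}

const column = new FakeColumn('geometry');
requireColumn('getPosition', column); // accepted: a whole column

let message = null;
try {
  // Rejected: a callback that would be invoked once per row.
  requireColumn('getPosition', (d) => d.getChild('geometry'));
} catch (err) {
  message = err.message;
}
console.log(message);
```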

@kylebarron
Member

Am I right that if you use table.getChild your problem is solved?

@kylebarron
Member

hoods_arrow.schema.metadata.set('geo', hoods.schema.metadata.get('geo'));

This is irrelevant. That's GeoParquet metadata that is unused here.

@jaredlander
Author

You are correct, it works with table.getChild(). But that makes me ask, why does it work with d => d.getChild() for a regularly loaded Arrow table?

@kylebarron
Member

why does it work with d => d.getChild() for a regularly loaded Arrow table?

It would unintentionally work for a GeoArrow table, that is, if a column exists that is tagged with GeoArrow metadata, because we check for a GeoArrow column first (which we probably shouldn't):

const pointVector = getGeometryVector(table, EXTENSION_NAME.POINT);
if (pointVector !== null) {
  return this._renderLayersPoint(pointVector);
}
const multiPointVector = getGeometryVector(
  table,
  EXTENSION_NAME.MULTIPOINT,
);
if (multiPointVector !== null) {
  return this._renderLayersMultiPoint(multiPointVector);
}

If your schema above is correct, it says that there's no metadata on the geometry field, but maybe that's an error in maintaining that metadata when exporting to JSON.

Otherwise, it's hard to imagine how that would work. I specifically check against geoarrow types:

const geometryColumn = this.props.getPosition;
if (
  geometryColumn !== undefined &&
  ga.vector.isPointVector(geometryColumn)
) {
  return this._renderLayersPoint(geometryColumn);
}
if (
  geometryColumn !== undefined &&
  ga.vector.isMultiPointVector(geometryColumn)
) {
  return this._renderLayersMultiPoint(geometryColumn);
}

@jaredlander
Author

Yep, the metadata on the 'geometry' column is different.

hoods.schema.fields[3].metadata
new Map([
    [
        "ARROW:extension:metadata",
        "\u0001\u0000\u0000\u0000\u0003\u0000\u0000\u0000crs�\u0000\u0000\u0000GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Latitude\",NORTH],AXIS[\"Longitude\",EAST],AUTHORITY[\"EPSG\",\"4326\"]]"
    ],
    [
        "ARROW:extension:name",
        "geoarrow.point"
    ]
])

Vs

hoods_arrow.schema.fields[3].metadata
new Map()

So basically, it's an accident that getPosition: d => d.getChild('geometry') works and I should stop doing that? Though that does seem more in line with the way regular deck.gl works.
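A comparison like the one above can be automated. The sketch below is a hypothetical helper, `missingFieldMetadata`, that lists metadata keys present on a field in one schema but absent from the same-named field in another. It assumes only that each schema exposes a `fields` array whose entries have a `name` and a `metadata` `Map`, as in the output shown above. The two schema objects here are hand-written stand-ins, not real Arrow schemas.

```javascript
// Hypothetical helper: report metadata keys that a round trip has dropped.
function missingFieldMetadata(original, roundTripped) {
  const byName = new Map(roundTripped.fields.map((f) => [f.name, f]));
  const missing = [];
  for (const field of original.fields) {
    const other = byName.get(field.name);
    for (const key of field.metadata.keys()) {
      if (other === undefined || !other.metadata.has(key)) {
        missing.push({ field: field.name, key });
      }
    }
  }
  return missing;
}

// Hand-written stand-ins mirroring the shape of the schemas shown above.
const before = {
  fields: [
    { name: 'BoroName', metadata: new Map() },
    { name: 'geometry', metadata: new Map([['ARROW:extension:name', 'geoarrow.point']]) },
  ],
};
const after = {
  fields: [
    { name: 'BoroName', metadata: new Map() },
    { name: 'geometry', metadata: new Map() },
  ],
};

console.log(missingFieldMetadata(before, after));
// [{ field: 'geometry', key: 'ARROW:extension:name' }]
```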

@kylebarron
Member

So basically, it's an accident that getPosition: d => d.getChild('geometry') works and I should stop doing that?

Yes. I'll "fix" it so that always fails.

Though that does seem more in line with the way regular deck.gl works.

The reason this library is fast is that it can copy data directly from Arrow memory to the GPU. Deck.gl's function accessors are indeed a simple and approachable API, but they can carry significant overhead and memory use because new buffers have to be created.

This means that whenever possible we want entire columns, not functions. E.g. lonboard works solely in terms of buffers and never does any row-based function accessors on the frontend. Even for stuff like colors and radii, we serialize an entire column of values into a single buffer, and then on the frontend copy those directly to the GPU.
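The difference between the two styles can be illustrated without deck.gl. In the sketch below, which uses made-up coordinates and hypothetical names throughout, a per-row accessor forces a new buffer to be allocated and filled in a loop, whereas a column that already stores interleaved `xy` values in one typed array can be handed over as-is.

```javascript
// Column-style storage: one flat buffer of interleaved x, y values (made-up points).
const coords = new Float64Array([-73.98, 40.63, -73.80, 40.76, -73.86, 40.76]);

// Row-style storage: one small object per row, read through a function accessor.
const rows = [
  { position: [-73.98, 40.63] },
  { position: [-73.80, 40.76] },
  { position: [-73.86, 40.76] },
];

// Hypothetical packing step: call the accessor once per row and copy into a new buffer.
function packWithAccessor(data, getPosition) {
  const out = new Float64Array(data.length * 2);
  data.forEach((row, i) => {
    const [x, y] = getPosition(row);
    out[2 * i] = x;
    out[2 * i + 1] = y;
  });
  return out;
}

const packed = packWithAccessor(rows, (d) => d.position);

// Both hold the same values, but `packed` required a fresh allocation and a loop.
const sameValues = packed.every((v, i) => v === coords[i]);
console.log(sameValues, packed.buffer === coords.buffer); // true false
```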

@jaredlander
Author

In that case, can we change the default? For a regular ScatterplotLayer, the default is object => object.position. Could the default for GeoArrowScatterplotLayer be object.getChild('geometry')? That way the user is discouraged from filling it out?

@kylebarron
Member

The default is to infer the geometry column based on the GeoArrow metadata. So in your initial example you don't need to pass anything because the metadata exists on the column. Any time that you have one geometry column and it's tagged with GeoArrow metadata, you don't need to pass in getPosition or similar.
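As a rough illustration of that kind of inference (not the library's actual `getGeometryVector` implementation), the hypothetical helper below looks for fields whose `ARROW:extension:name` metadata matches a given GeoArrow extension name and returns a field name only when exactly one field matches. The schema objects are hand-written stand-ins, and returning `null` for zero or multiple matches is a choice made for this sketch.

```javascript
// Hypothetical helper: infer the geometry column from field-level metadata.
function inferGeometryField(schema, extensionName) {
  const matches = schema.fields.filter(
    (f) => f.metadata.get('ARROW:extension:name') === extensionName,
  );
  // Only infer when the choice is unambiguous.
  return matches.length === 1 ? matches[0].name : null;
}

// Stand-in for a table whose geometry field kept its GeoArrow tag.
const tagged = {
  fields: [
    { name: 'BoroName', metadata: new Map() },
    { name: 'geometry', metadata: new Map([['ARROW:extension:name', 'geoarrow.point']]) },
  ],
};

// Stand-in for a table whose field metadata was dropped along the way.
const untagged = {
  fields: [
    { name: 'BoroName', metadata: new Map() },
    { name: 'geometry', metadata: new Map() },
  ],
};

console.log(inferGeometryField(tagged, 'geoarrow.point')); // geometry
console.log(inferGeometryField(untagged, 'geoarrow.point')); // null
```

In the untagged case nothing can be inferred, which is when passing the column explicitly via `table.getChild('geometry')` becomes necessary.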

I'm not going to change that default behavior because that's much more stable than using the name "geometry".

@jaredlander
Author

Makes sense. Thanks!
