test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs #1407

jameslamb · 2024-07-17T17:06:21Z

Description

Contributes to #1406.

Unblocks CI.

CI in https://github.com/rapidsai/docker has been failing because of errors in the cuspatial_api_examples.ipynb notebook. Specifically, that notebook refers unconditionally to local files that are not guaranteed to exist.

To fix this, that proposes the following:

in the notebook, conditionally recreate those files if they don't yet exist
always test the notebooks in docs/ in CI here

CI here is failing because the nyc_taxi_years_correlation notebook recently started taking a prohibitively long time to run. This proposes:

skipping nyc_taxi_years_correlation notebook in CI

And some other small things noticed while doing all this:

fixing typo (rint -> ring) and argument ordering in GeoSeries.from_polygons_xy() docs

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Notes for Reviewers

I ran these notebooks on a system with CUDA 12.2, using the latest 24.8.* nightlies of RAPIDS conda packages.

review-notebook-app · 2024-07-17T17:06:26Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

jameslamb · 2024-07-17T20:14:55Z

(edited Jul 21): moved these details to a separate issue, #1413

I think that the nyc_taxi_years_correlation.ipynb notebook has just become prohibitively expensive to run in CI.

this PR: 2.5+ hours before I killed it (build link)
most recent successful run (July 11): 8m38s (build link)

Possible reasons I can think of:

cuspatial or cudf is now falling back to the CPU for some operation it was previously able to do entirely on the GPU
the input data hosted at those remote storage locations has grown significantly
some other performance regression in cudf, cuspatial, or their dependencies
some change in the CI environment

On a machine with CUDA 12.2, latest nightlies of RAPIDS libraries (conda), and 8 V100s, I reduced that notebook down to this minimal example:

imports and data setup (click me)

import cuspatial
import geopandas as gpd
import cudf
import numpy as np

# (shell commands run in notebook cells)
# !if [ ! -f "tzones_lonlat.json" ]; then curl "https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON" -o tzones_lonlat.json; else echo "tzones_lonlat.json found"; fi
# !if [ ! -f "taxi2016.csv" ]; then curl https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-01.csv -o taxi2016.csv; else echo "taxi2016.csv found"; fi   

taxi2016 = cudf.read_csv("taxi2016.csv")
tzones = gpd.GeoDataFrame.from_file('tzones_lonlat.json')
taxi_zones = cuspatial.from_geopandas(tzones).geometry
taxi_zone_rings = cuspatial.GeoSeries.from_polygons_xy(
    taxi_zones.polygons.xy,
    taxi_zones.polygons.ring_offset,
    taxi_zones.polygons.part_offset,
    cudf.Series(range(len(taxi_zones.polygons.part_offset)))
)

def make_geoseries_from_lonlat(lon, lat):
    lonlat = cudf.DataFrame({"lon": lon, "lat": lat}).interleave_columns()
    return cuspatial.GeoSeries.from_points_xy(lonlat)

pickup2016 = make_geoseries_from_lonlat(taxi2016['pickup_longitude'] , taxi2016['pickup_latitude'])
dropoff2016 = make_geoseries_from_lonlat(taxi2016['dropoff_longitude'] , taxi2016['dropoff_latitude'])

pip_iterations = list(np.arange(0, 263, 31))
pip_iterations.append(263)
print(pip_iterations)

# --- begin point-in-polygon calculation ---#
# %%time
taxi2016['PULocationID'] = 264
taxi2016['DOLocationID'] = 264

start = pip_iterations[0]
end = pip_iterations[1]

zone = taxi_zone_rings[start:end]

# find all pickups in that zone
pickups = cuspatial.point_in_polygon(pickup2016, zone)
print(pickups)
print("---")
dropoffs = cuspatial.point_in_polygon(dropoff2016, zone)
print(dropoffs)

Just that bit of code for one combination of polygons took 21 minutes to run. And in the notebook, we're running that for 10 combinations.

cuspatial/notebooks/nyc_taxi_years_correlation.ipynb

Lines 168 to 169 in c8616c1

    
           "pip_iterations = list(np.arange(0, 263, 31))\n", 
        
           "pip_iterations.append(263)\n",

[0, 31, 62, 93, 124, 155, 186, 217, 248, 263]

cuspatial/notebooks/nyc_taxi_years_correlation.ipynb

Lines 207 to 209 in c8616c1

    
           "for i in range(len(pip_iterations)-1):\n", 
        
           "    start = pip_iterations[i]\n", 
        
           "    end = pip_iterations[i+1]\n",

So conservatively, it might take 3.5 hours for the notebook to finish in my setup. And that's making a LOT of assumptions.

…) docs

jameslamb · 2024-07-17T20:18:56Z

notebooks/nyc_taxi_years_correlation.ipynb

@@ -62,7 +62,7 @@
   "source": [
    "!if [ ! -f \"tzones_lonlat.json\" ]; then curl \"https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON\" -o tzones_lonlat.json; else echo \"tzones_lonlat.json found\"; fi\n",
    "!if [ ! -f \"taxi2016.csv\" ]; then curl https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-01.csv -o taxi2016.csv; else echo \"taxi2016.csv found\"; fi   \n",
-    "!if [ ! -f \"taxi2017.csv\" ]; then curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2017-01.parquet -o taxi2017.parquet; else echo \"taxi2017.csv found\"; fi"
+    "!if [ ! -f \"taxi2017.parquet\" ]; then curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2017-01.parquet -o taxi2017.parquet; else echo \"taxi2017.csv found\"; fi"


This -f test was checking a different filename than is actually downloaded, so it re-downloaded the data every time this cell was run.

…mit, drop cuspatial notebooks (#690) CI has been failing here for a few weeks, as a result of some failing `cuspatial` notebooks. As described in rapidsai/cuspatial#1406, I observed those notebooks failing like this: ```text cannot access local variable 'execution_time' where it is not associated with a value ``` Turns out that that was coming from the `test_notebooks.py` script in this repo. It calls `return execution_time` unconditionally, even though it's possible for that variable to not exist. This fixes that, and proposes adding a `pre-commit` configuration and corresponding CI job, to help catch such things earlier in the future. This also proposes skipping some `cuspatial` notebooks, to get CI working. I'd hoped to address that in rapidsai/cuspatial#1407, but most `cuspatial` maintainers are busy or unavailable this week. I think we should get CI working here again and deal with the `cuspatial` notebooks as a separate concern. ## Notes for Reviewers Any other changes you see in files here are the result of running `pre-commit run --all-files` (with some manual fixes applied to fix `ruff` warnings). Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Ray Douglass (https://github.com/raydouglass) - https://github.com/jakirkham - AJ Schmidt (https://github.com/ajschmidt8) URL: #690

jakirkham · 2024-07-19T19:13:45Z

Thanks all! 🙏

#1407 skipped the taxi dropoffs notebook due to performance regression fixed in #1418, so this PR re-enables the notebook in CI by removing it from SKIPNBS in ci/test_noteboks.sh. Authors: - Mark Harris (https://github.com/harrism) Approvers: - https://github.com/jakirkham - Michael Wang (https://github.com/isVoid) - James Lamb (https://github.com/jameslamb) URL: #1422

The notebook-testing CI job in this repo does not actually cause a loud CI failure if any errors are detected in notebooks. That's because of a `bash` mistake I made in #1407... that PR moved notebook-checking into a function, but didn't add a `set -E` to be sure errors from inside that function were appropriately trapped. This PR fixes that: * ensures that notebook failures actually cause CI failures * fixes 2 typos in `nyc_taxi_years_correlation.ipynb` code - *(not caught in #1422 because of this CI script bug)* Context: #1422 (review) ## Notes for Reviewers ### How I tested this Originally did not skip any notebooks. Saw the failures in one of the notebooks cause an actual merge-blocking CI failure. Build link: https://github.com/rapidsai/cuspatial/actions/runs/10199784404/job/28219162698?pr=1424 # Authors: - James Lamb (https://github.com/jameslamb) - Bradley Dice (https://github.com/bdice) Approvers: - Ray Douglass (https://github.com/raydouglass) - Mark Harris (https://github.com/harrism) - https://github.com/jakirkham URL: #1424

jameslamb added 2 commits July 17, 2024 09:52

make cuspatial_api_examples notebook self-contained

7c30a17

also test notebooks in docs

f9f60d2

jameslamb added bug Something isn't working non-breaking Non-breaking change labels Jul 17, 2024

github-actions bot added conda Related to conda and conda configuration ci labels Jul 17, 2024

jakirkham mentioned this pull request Jul 17, 2024

Build and test with CUDA 12.5.1 #1405

Merged

This was referenced Jul 17, 2024

fix return-before-assignment bug in notebook test script, add pre-commit, drop cuspatial notebooks rapidsai/docker#690

Merged

[BUG]: notebooks broken: "cannot access local variable 'execution_time'" #1406

Closed

skip nyc_taxi_years_correlation notebook in CI, fix from_polygons_xy(…

6bd1958

…) docs

github-actions bot added the Python Related to Python code label Jul 17, 2024

jameslamb commented Jul 17, 2024

View reviewed changes

jakirkham mentioned this pull request Jul 17, 2024

Update latest CUDA version for build/test to 12.5 rapidsai/build-planning#73

Closed

13 tasks

jameslamb added 3 commits July 17, 2024 14:20

deal with nested notebooks

1acb017

fix copyright

c4ab0cc

fix script

acd4338

jameslamb changed the title ~~WIP: test notebooks in 'docs/', make cuspatial_api_examples self-contained~~ test notebooks in 'docs/', make cuspatial_api_examples self-contained Jul 18, 2024

jameslamb marked this pull request as ready for review July 18, 2024 14:22

jameslamb requested review from a team as code owners July 18, 2024 14:22

jameslamb requested review from AyodeAwe, trxcllnt and isVoid July 18, 2024 14:22

jameslamb mentioned this pull request Jul 18, 2024

add CUDA 12.5 images rapidsai/docker#689

Merged

jakirkham requested a review from harrism July 18, 2024 20:35

jameslamb changed the title ~~test notebooks in 'docs/', make cuspatial_api_examples self-contained~~ test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs Jul 18, 2024

raydouglass approved these changes Jul 19, 2024

View reviewed changes

raydouglass merged commit 8636b03 into rapidsai:branch-24.08 Jul 19, 2024
69 checks passed

jameslamb deleted the fix/example-notebook branch July 19, 2024 17:57

jameslamb mentioned this pull request Jul 22, 2024

[BUG]: possible performance regression in points_in_polygon() #1413

Closed

harrism mentioned this pull request Jul 31, 2024

Unskip taxi notebook from CI #1420

Closed

3 tasks

jakirkham mentioned this pull request Jul 31, 2024

Readd nyc_taxi_years_correlation.ipynb test #1421

Closed

3 tasks

harrism mentioned this pull request Jul 31, 2024

Unskip taxi notebook from CI #1422

Merged

3 tasks

jameslamb mentioned this pull request Aug 2, 2024

ensure notebook-testing script fails when notebooks fail #1424

Merged

3 tasks

jakirkham mentioned this pull request Aug 8, 2024

Readd fixed cuSpatial notebooks rapidsai/docker#703

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs #1407

test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs #1407

jameslamb commented Jul 17, 2024 •

edited

Loading

review-notebook-app bot commented Jul 17, 2024

jameslamb commented Jul 17, 2024 •

edited

Loading

jameslamb Jul 17, 2024

jakirkham commented Jul 19, 2024

test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs #1407

test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs #1407

Conversation

jameslamb commented Jul 17, 2024 • edited Loading

Description

Checklist

Notes for Reviewers

review-notebook-app bot commented Jul 17, 2024

jameslamb commented Jul 17, 2024 • edited Loading

jameslamb Jul 17, 2024

Choose a reason for hiding this comment

jakirkham commented Jul 19, 2024

jameslamb commented Jul 17, 2024 •

edited

Loading

jameslamb commented Jul 17, 2024 •

edited

Loading