-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs #1407
test notebooks in 'docs/', make cuspatial_api_examples self-contained, skip long-running notebook, fix some docs #1407
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
(edited Jul 21): moved these details to a separate issue, #1413 I think that the
Possible reasons I can think of:
On a machine with CUDA 12.2, latest nightlies of RAPIDS libraries (conda), and 8 V100s, I reduced that notebook down to this minimal example: imports and data setup (click me)import cuspatial
import geopandas as gpd
import cudf
import numpy as np
# (shell commands run in notebook cells)
# !if [ ! -f "tzones_lonlat.json" ]; then curl "https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON" -o tzones_lonlat.json; else echo "tzones_lonlat.json found"; fi
# !if [ ! -f "taxi2016.csv" ]; then curl https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-01.csv -o taxi2016.csv; else echo "taxi2016.csv found"; fi
taxi2016 = cudf.read_csv("taxi2016.csv")
tzones = gpd.GeoDataFrame.from_file('tzones_lonlat.json')
taxi_zones = cuspatial.from_geopandas(tzones).geometry
taxi_zone_rings = cuspatial.GeoSeries.from_polygons_xy(
taxi_zones.polygons.xy,
taxi_zones.polygons.ring_offset,
taxi_zones.polygons.part_offset,
cudf.Series(range(len(taxi_zones.polygons.part_offset)))
)
def make_geoseries_from_lonlat(lon, lat):
lonlat = cudf.DataFrame({"lon": lon, "lat": lat}).interleave_columns()
return cuspatial.GeoSeries.from_points_xy(lonlat)
pickup2016 = make_geoseries_from_lonlat(taxi2016['pickup_longitude'] , taxi2016['pickup_latitude'])
dropoff2016 = make_geoseries_from_lonlat(taxi2016['dropoff_longitude'] , taxi2016['dropoff_latitude'])
pip_iterations = list(np.arange(0, 263, 31))
pip_iterations.append(263)
print(pip_iterations) # --- begin point-in-polygon calculation ---#
# %%time
taxi2016['PULocationID'] = 264
taxi2016['DOLocationID'] = 264
start = pip_iterations[0]
end = pip_iterations[1]
zone = taxi_zone_rings[start:end]
# find all pickups in that zone
pickups = cuspatial.point_in_polygon(pickup2016, zone)
print(pickups)
print("---")
dropoffs = cuspatial.point_in_polygon(dropoff2016, zone)
print(dropoffs) Just that bit of code for one combination of polygons took 21 minutes to run. And in the notebook, we're running that for 10 combinations. cuspatial/notebooks/nyc_taxi_years_correlation.ipynb Lines 168 to 169 in c8616c1
cuspatial/notebooks/nyc_taxi_years_correlation.ipynb Lines 207 to 209 in c8616c1
So conservatively, it might take 3.5 hours for the notebook to finish in my setup. And that's making a LOT of assumptions. |
@@ -62,7 +62,7 @@ | |||
"source": [ | |||
"!if [ ! -f \"tzones_lonlat.json\" ]; then curl \"https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON\" -o tzones_lonlat.json; else echo \"tzones_lonlat.json found\"; fi\n", | |||
"!if [ ! -f \"taxi2016.csv\" ]; then curl https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-01.csv -o taxi2016.csv; else echo \"taxi2016.csv found\"; fi \n", | |||
"!if [ ! -f \"taxi2017.csv\" ]; then curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2017-01.parquet -o taxi2017.parquet; else echo \"taxi2017.csv found\"; fi" | |||
"!if [ ! -f \"taxi2017.parquet\" ]; then curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2017-01.parquet -o taxi2017.parquet; else echo \"taxi2017.csv found\"; fi" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This -f
test was checking a different filename than is actually downloaded, so it re-downloaded the data every time this cell was run.
…mit, drop cuspatial notebooks (#690) CI has been failing here for a few weeks, as a result of some failing `cuspatial` notebooks. As described in rapidsai/cuspatial#1406, I observed those notebooks failing like this: ```text cannot access local variable 'execution_time' where it is not associated with a value ``` Turns out that that was coming from the `test_notebooks.py` script in this repo. It calls `return execution_time` unconditionally, even though it's possible for that variable to not exist. This fixes that, and proposes adding a `pre-commit` configuration and corresponding CI job, to help catch such things earlier in the future. This also proposes skipping some `cuspatial` notebooks, to get CI working. I'd hoped to address that in rapidsai/cuspatial#1407, but most `cuspatial` maintainers are busy or unavailable this week. I think we should get CI working here again and deal with the `cuspatial` notebooks as a separate concern. ## Notes for Reviewers Any other changes you see in files here are the result of running `pre-commit run --all-files` (with some manual fixes applied to fix `ruff` warnings). Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Ray Douglass (https://github.com/raydouglass) - https://github.com/jakirkham - AJ Schmidt (https://github.com/ajschmidt8) URL: #690
Thanks all! 🙏 |
#1407 skipped the taxi dropoffs notebook due to performance regression fixed in #1418, so this PR re-enables the notebook in CI by removing it from SKIPNBS in ci/test_noteboks.sh. Authors: - Mark Harris (https://github.com/harrism) Approvers: - https://github.com/jakirkham - Michael Wang (https://github.com/isVoid) - James Lamb (https://github.com/jameslamb) URL: #1422
The notebook-testing CI job in this repo does not actually cause a loud CI failure if any errors are detected in notebooks. That's because of a `bash` mistake I made in #1407... that PR moved notebook-checking into a function, but didn't add a `set -E` to be sure errors from inside that function were appropriately trapped. This PR fixes that: * ensures that notebook failures actually cause CI failures * fixes 2 typos in `nyc_taxi_years_correlation.ipynb` code - *(not caught in #1422 because of this CI script bug)* Context: #1422 (review) ## Notes for Reviewers ### How I tested this Originally did not skip any notebooks. Saw the failures in one of the notebooks cause an actual merge-blocking CI failure. Build link: https://github.com/rapidsai/cuspatial/actions/runs/10199784404/job/28219162698?pr=1424 # Authors: - James Lamb (https://github.com/jameslamb) - Bradley Dice (https://github.com/bdice) Approvers: - Ray Douglass (https://github.com/raydouglass) - Mark Harris (https://github.com/harrism) - https://github.com/jakirkham URL: #1424
Description
Contributes to #1406.
Unblocks CI.
CI in https://github.com/rapidsai/docker has been failing because of errors in the
cuspatial_api_examples.ipynb
notebook. Specifically, that notebook refers unconditionally to local files that are not guaranteed to exist.To fix this, that proposes the following:
docs/
in CI hereCI here is failing because the
nyc_taxi_years_correlation
notebook recently started taking a prohibitively long time to run. This proposes:nyc_taxi_years_correlation
notebook in CIAnd some other small things noticed while doing all this:
rint -> ring
) and argument ordering inGeoSeries.from_polygons_xy()
docsChecklist
Notes for Reviewers
I ran these notebooks on a system with CUDA 12.2, using the latest
24.8.*
nightlies of RAPIDS conda packages.