-
-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read shapefiles in UTF-8 mode #137
Conversation
Current (bad): With this patch (good): This example rendered the following shapefile: https://data.geo.admin.ch/ch.bafu.schutzgebiete-aulav_jagdbanngebiete/ |
https://github.com/onthegomap/planetiler/actions/runs/1997273067 ℹ️ Base Logs 0a06479
ℹ️ This Branch Logs f367bf3
|
@@ -125,7 +126,9 @@ private ShapefileDataStore open(Path path) { | |||
} else { | |||
throw new IllegalArgumentException("Invalid shapefile input: " + path + " must be zip or shp"); | |||
} | |||
return new ShapefileDataStore(uri.toURL()); | |||
var store = new ShapefileDataStore(uri.toURL()); | |||
store.setCharset(Charset.forName("UTF8")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up! I noticed the same issue when trying to read the natural earth shapefile. One concern here is that standard encoding for shapefiles is technically ISO8859-1 so I'd hesitate to always use UTF-8. It looks like shapefiles that have a different encoding actually contain a ".cpg" file next to the .shp file with the text of the encoding though, so maybe we could use that if it exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm you're right.. the shapefile reader should be doing this automatically. From the geotools release page it looks like starting with version 26 (we're on 26.3) they attempt to use cpg file to determine the encoding. I wonder why that's not working here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.geotools.org/stable/javadocs/org/geotools/data/shapefile/ShapefileDataStore.html#isTryCPGFile-- returns false
, i.e., by default it does not access the .cpg
file. Let's check if I can change the behavior with setTryCPGFile()
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the change! Would it be possible to add a minimal test case that verifies a UTF-8 string from a shapefiles is decoded properly? |
Works without hardcoding utf-8 now, by using the |
The unit test would be helpful! I'm hoping to switch to Apache SIS at some point so it would be good to make sure it has the same handling. There is a tiny Also, it might be useful to have an extra option on |
I edited
I updated the test file |
Are you suggesting to overload |
Awesome, looks great! Thank you!
Yeah I could go either way on this, since there are now 2 optional parameters (source encoding and projection) there could be 4 signatures based on which ones you want to set and 8 if we add a third...I should refactor this to allow come kind of parameter object to make it these kinds of settings easier to add. Maybe a compromise would be to just add a character encoding argument to the existing method that includes a projection, and if it's null then ignore? Also I'd be fine doing nothing if adding the cpg file is an OK workaround. |
I think the |
Sounds good to me! Thanks for making this change. |
Currently, if a shapefile has attributes with letters like ä, ö, ü, etc, the resulting tiles show some strange letters. This pull requests fixes this issue by hard-coding the shapefile mode to UTF-8.