feat(geoparquet)!: GeoParquet reader refactor to avoid making duplicate wrappers of upstream structs by kylebarron · Pull Request #1089 · geoarrow/geoarrow-rs (original) (raw)
Right now, we expose a bunch of wrapper structs: GeoParquetRecordBatchReaderBuilder
, GeoParquetRecordBatchReader
, GeoParquetRecordBatchStreamBuilder
, GeoParquetRecordBatchStream
that literally are just wrapping their upstream counterparts. It's a lot of duplication and maintenance overhead for us and doesn't allow users the full amount of control of reading Parquet that the upstream implementation provides. Currently, these wrapper readers don't provide much flexibility. You can't manually choose row group indices nor apply multiple row filters.
For example, @gadomski might want to push down other predicates down into STAC-GeoParquet, such as datetime. But right now, we'd have to add more code to our wrappers to pipe all those predicates through to the underlying readers ourselves.
@paleolimbot got me started thinking about this in #1050 (review), where for example in #1040 a user might want to write GeoJSON with UUID types, where the end user might want more control over encoding different types. Likewise here, if we remove our own wrapper structs and just insert our functionality into the upstream readers via traits, then that gives users more control and greatly simplifies our maintenance here.
Hopefully this will also make it easier to integrate GeoParquet into DataFusion without duplicating all of the upstream Parquet TableProvider
.
Change list
- Removes wrapper structs for builders. So we no longer have a
GeoParquetRecordBatchReaderBuilder
or aGeoParquetRecordBatchStreamBuilder
. Users will use upstreamparquet
sync/asyncBuilder
structs directly. Means a significantly lower maintenance overhead here, and we don't need to do anything to support new upstream functionality. - Adds
GeoParquetReaderBuilder
trait that extends the upstream ArrowReaderBuilder. This allows users to implement a spatial RowFilter directly on an upstream builder instance. It also allows low-level access, so that if a user wanted a spatial filter plus something else, that would be doable. - Allows user to choose when the WKB column should be parsed to a native GeoArrow type or not in the
geoarrow_schema
method. (theparse_to_native
parameter) - Keeps wrapper structs for readers. So we still have a
GeoParquetRecordBatchReader
and aGeoParquetRecordBatchStream
, but these are very lightweight wrappers.
The benefit of having these wrapper structs is that we can ensure GeoArrow metadata is always applied onto the emittedRecordBatch
es. - Updated
example
andexample_crs
tests
Todo
- Expose public function for an end user to create a spatial ArrowPredicate. So if the user wants to apply multiple predicates in a single
RowFilter
, they can do that. - Restore docs, existing tests