feat(geoparquet)!: GeoParquet reader refactor to avoid making duplicate wrappers of upstream structs by kylebarron · Pull Request #1089 · geoarrow/geoarrow-rs (original) (raw)

Right now, we expose a bunch of wrapper structs: GeoParquetRecordBatchReaderBuilder, GeoParquetRecordBatchReader, GeoParquetRecordBatchStreamBuilder, GeoParquetRecordBatchStream that literally are just wrapping their upstream counterparts. It's a lot of duplication and maintenance overhead for us and doesn't allow users the full amount of control of reading Parquet that the upstream implementation provides. Currently, these wrapper readers don't provide much flexibility. You can't manually choose row group indices nor apply multiple row filters.

For example, @gadomski might want to push down other predicates down into STAC-GeoParquet, such as datetime. But right now, we'd have to add more code to our wrappers to pipe all those predicates through to the underlying readers ourselves.

@paleolimbot got me started thinking about this in #1050 (review), where for example in #1040 a user might want to write GeoJSON with UUID types, where the end user might want more control over encoding different types. Likewise here, if we remove our own wrapper structs and just insert our functionality into the upstream readers via traits, then that gives users more control and greatly simplifies our maintenance here.

Hopefully this will also make it easier to integrate GeoParquet into DataFusion without duplicating all of the upstream Parquet TableProvider.

Change list

Removes wrapper structs for builders. So we no longer have a GeoParquetRecordBatchReaderBuilder or a GeoParquetRecordBatchStreamBuilder. Users will use upstream parquet sync/async Builder structs directly. Means a significantly lower maintenance overhead here, and we don't need to do anything to support new upstream functionality.
Adds GeoParquetReaderBuilder trait that extends the upstream ArrowReaderBuilder. This allows users to implement a spatial RowFilter directly on an upstream builder instance. It also allows low-level access, so that if a user wanted a spatial filter plus something else, that would be doable.
Allows user to choose when the WKB column should be parsed to a native GeoArrow type or not in the geoarrow_schema method. (the parse_to_native parameter)
Keeps wrapper structs for readers. So we still have a GeoParquetRecordBatchReader and a GeoParquetRecordBatchStream, but these are very lightweight wrappers.
The benefit of having these wrapper structs is that we can ensure GeoArrow metadata is always applied onto the emitted RecordBatches.
Updated example and example_crs tests

Todo

Expose public function for an end user to create a spatial ArrowPredicate. So if the user wants to apply multiple predicates in a single RowFilter, they can do that.
Restore docs, existing tests