feat(geoparquet)!: GeoParquet reader refactor to avoid making duplicate wrappers of upstream structs by kylebarron · Pull Request #1089 · geoarrow/geoarrow-rs (original) (raw)

Right now, we expose a bunch of wrapper structs: GeoParquetRecordBatchReaderBuilder, GeoParquetRecordBatchReader, GeoParquetRecordBatchStreamBuilder, GeoParquetRecordBatchStream that literally are just wrapping their upstream counterparts. It's a lot of duplication and maintenance overhead for us and doesn't allow users the full amount of control of reading Parquet that the upstream implementation provides. Currently, these wrapper readers don't provide much flexibility. You can't manually choose row group indices nor apply multiple row filters.

For example, @gadomski might want to push down other predicates down into STAC-GeoParquet, such as datetime. But right now, we'd have to add more code to our wrappers to pipe all those predicates through to the underlying readers ourselves.

@paleolimbot got me started thinking about this in #1050 (review), where for example in #1040 a user might want to write GeoJSON with UUID types, where the end user might want more control over encoding different types. Likewise here, if we remove our own wrapper structs and just insert our functionality into the upstream readers via traits, then that gives users more control and greatly simplifies our maintenance here.

Hopefully this will also make it easier to integrate GeoParquet into DataFusion without duplicating all of the upstream Parquet TableProvider.

Change list

Todo