ENH: explicit filters parameter in pd.read_parquet by mrastgoo · Pull Request #53212 · pandas-dev/pandas

@mrastgoo would you have time to add a test?

We have an existing test for the columns keyword (test_read_columns), so you could add an equivalent test_read_filters test case next to it. Something like:

    def test_read_filters(self, engine, tmp_path):
        df = pd.DataFrame({"int": list(range(4)), "part": list("aabb")})

        # Only the rows in partition part == "a" should be read back.
        expected = pd.DataFrame({"int": [0, 1]})
        check_round_trip(
            df,
            engine,
            path=tmp_path,
            expected=expected,
            write_kwargs={"partition_cols": ["part"]},
            # Filter on the partition column "part" and select only "int".
            read_kwargs={"filters": [("part", "==", "a")], "columns": ["int"]},
            repeat=1,
        )

I used partition_cols so that the filter has a visible effect with fastparquet as well, which confirms that the keyword is passed through correctly. Because of that, repeat=1 is needed: with partition_cols, a second write does not overwrite a single file but adds more files to the directory.
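To illustrate why filtering on the partition column gives a visible result, here is a rough, standard-library-only sketch of partition pruning over a hive-style layout (directories named like part=a). It is a simplified model and not how pyarrow or fastparquet implement filtering; the helpers parse_partition and prune and the layout dict are hypothetical names made up for this illustration.

```python
import operator

# Comparison operators supported by this sketch for (column, op, value) filters.
OPS = {"==": operator.eq, "!=": operator.ne}


def parse_partition(dirname):
    """Split a hive-style directory name such as 'part=a' into (key, value)."""
    key, _, value = dirname.partition("=")
    return key, value


def prune(dirnames, filters):
    """Keep directories whose partition value satisfies every filter on its key.

    Filters on columns other than the partition key are ignored here.
    """
    kept = []
    for dirname in dirnames:
        key, value = parse_partition(dirname)
        if all(OPS[op](value, target) for col, op, target in filters if col == key):
            kept.append(dirname)
    return kept


# Hypothetical layout for the test frame written with partition_cols=["part"]:
# each directory holds the "int" values of the rows in that partition.
layout = {"part=a": [0, 1], "part=b": [2, 3]}

selected = prune(layout, [("part", "==", "a")])
rows = [value for dirname in selected for value in layout[dirname]]
print(selected, rows)
```

In this model the filter is decided from the directory name alone, so whole partitions are skipped without looking at the rows inside them.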