Is `_FillValue` really the same as zarr's `fill_value`? · Issue #5475 · pydata/xarray
The zarr backend uses the `fill_value` of zarr's `.zarray` key as if it were the `_FillValue` according to the CF conventions:

    attributes["_FillValue"] = zarr_array.fill_value

I think this interpretation of the `fill_value` is wrong and creates problems. Here's why:
The zarr v2 spec is still a little vague, but states that `fill_value` is

> A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

Accordingly, this value should be used to fill all areas of a variable that are not backed by a stored chunk. This is also different from what the CF conventions state (emphasis mine):

> The scalar attribute with the name `_FillValue` and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. This value is considered to be a special value that indicates undefined or missing data, and is returned when reading values that were not written.

The difference between the two is that `fill_value` is only a background value, which just isn't stored as a chunk. `_FillValue`, by contrast, is (possibly) a background value and is also interpreted as not being valid data. In my opinion, this mix of `_FillValue` and `missing_value` could be considered a defect in the CF conventions, but it's probably far too late to change that, as many depend on it.
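
To make the two readings concrete, here is a minimal pure-Python sketch. The function names `decode_background` and `decode_cf_missing` are made up for illustration and are not part of zarr or xarray; the second one only mimics the kind of masking a CF-aware reader applies.

```python
import math


def decode_background(raw, fill_value):
    """zarr-style reading: fill_value is an ordinary value, nothing is masked.

    fill_value is accepted only to mirror the signature of decode_cf_missing.
    """
    return list(raw)


def decode_cf_missing(raw, fill_value):
    """CF-style reading: every element equal to _FillValue becomes missing (NaN)."""
    return [math.nan if x == fill_value else float(x) for x in raw]


# The same stored values, read under both interpretations with fill value 0.
raw = [0, 3, 0, 7]
as_background = decode_background(raw, fill_value=0)
as_missing = decode_cf_missing(raw, fill_value=0)

print(as_background)
print(as_missing)
```

Under the first reading the zeros survive as data; under the second they are indistinguishable from values that were never written.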
As an example, when storing a density field (e.g. water droplets forming clouds) in a zarr dataset, it might be perfectly valid to set the `fill_value` to `0` and then store only chunks in regions of the atmosphere where clouds are actually present. In that case, `0` (i.e. no drops) would be a perfectly valid value, which just isn't stored. As most parts of the atmosphere are indeed cloud-free, this may save a lot of storage. Other formats (e.g. OpenVDB) commonly use this trick.
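
The storage trick can be sketched with a toy chunked array in plain Python. `SparseChunked1D` is a hypothetical class written for this illustration; it is not zarr's API and only imitates the "unwritten chunks read back as `fill_value`" behaviour described in the spec.

```python
class SparseChunked1D:
    """Toy 1-D chunked array: chunks that were never written are not stored."""

    def __init__(self, length, chunk_size, fill_value):
        self.length = length
        self.chunk_size = chunk_size
        self.fill_value = fill_value
        self.chunks = {}  # chunk index -> list of values

    def write_chunk(self, index, values):
        if len(values) != self.chunk_size:
            raise ValueError("chunk has the wrong size")
        self.chunks[index] = list(values)

    def read(self):
        """Assemble the full array, using fill_value for absent chunks."""
        out = []
        n_chunks = -(-self.length // self.chunk_size)  # ceiling division
        for i in range(n_chunks):
            background = [self.fill_value] * self.chunk_size
            out.extend(self.chunks.get(i, background))
        return out[: self.length]


# Cloud droplet density along a column: only the middle chunk contains cloud.
density = SparseChunked1D(length=12, chunk_size=4, fill_value=0.0)
density.write_chunk(1, [0.0, 0.2, 0.5, 0.1])

values = density.read()
print(values)
print(len(density.chunks), "of 3 chunks stored")
```

Every `0.0` returned here is a physically meaningful "no droplets" value, whether it came from the stored chunk or from the background. Masking all of them as missing would discard valid data.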
The issue gets worse when looking into the upcoming zarr v3 spec, where `fill_value` is described as:

> Provides an element value to use for uninitialised portions of the Zarr array.
>
> If the data type of the Zarr array is Boolean then the value must be the literal `false` or `true`. If the data type is one of the integer data types defined in this specification, then the value must be a number with no fraction or exponent part and must be within the range of the data type.
>
> For any data type, if the `fill_value` is the literal `null` then the fill value is undefined and the implementation may use any arbitrary value that is consistent with the data type as the fill value. [...]

Thus for boolean arrays, if the `fill_value` were interpreted as a missing value indicator, only (missing, `True`) or (`False`, missing) arrays could be represented. A (`False`, `True`) array would not be possible. The issue applies similarly to integer types.
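
The counting argument can be checked with a short sketch. `representable_values` is a hypothetical helper for this illustration only; it enumerates which values of a data type remain usable as real data once the fill value is reserved as a missing marker.

```python
def representable_values(dtype_values, fill_value, fill_is_missing):
    """Return the set of values that can still be stored as valid data.

    If the fill value doubles as a missing-data marker, it is removed
    from the set of usable data values.
    """
    if fill_is_missing:
        return {v for v in dtype_values if v != fill_value}
    return set(dtype_values)


BOOL = (False, True)

# Mandatory fill value treated as a missing marker: one boolean is lost.
for fill in BOOL:
    print(fill, representable_values(BOOL, fill, fill_is_missing=True))

# Fill value treated as a plain background value: both booleans remain.
print(representable_values(BOOL, False, fill_is_missing=False))

# The same effect for an 8-bit unsigned integer type: 255 of 256 values remain.
uint8_left = len(representable_values(range(256), 0, fill_is_missing=True))
print(uint8_left)
```

For wide integer types, losing one value is often tolerable. For booleans it halves the domain, so a missing-value reading of a mandatory fill value leaves no way to store both `False` and `True`.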