Unicode strings unexpectedly transformed to byte strings upon open_dataset · Issue #1638 · pydata/xarray

When I first create the dataset, all the metadata is stored as unicode strings (yay!):

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * gene                          (gene) object '0610005C13Rik' ...
    Uniquely mapped reads number  (cell) int64 1017682 634557 941828 1392029 ...
    Number of input reads         (cell) int64 1229254 730274 1075370 ...
    EXP_ID                        (cell) <U29 '170925_A00111_0066_AH3TKNDMXX' ...
    TAXON                         (cell) <U3 'mus' 'mus' 'mus' 'mus' 'mus' ...
    WELL_MAPPING                  (cell) <U9 'B000126' 'B000126' 'B000126' ...
    Lysis Plate Batch             (cell) <U32 '20' '20' '20' '20' '20' '20' ...
    dNTP.batch                    (cell) <U38 '457912' '457912' '457912' ...
    oligodT.order.no              (cell) <U32 '6/23/17 12757296' ...
    plate.type                    (cell) <U32 'Biorad HSP3901' ...
    preparation.site              (cell) <U32 'Biohub' 'Biohub' 'Biohub' ...
    date.prepared                 (cell) <U32 '07-06-17' '07-06-17' ...
    date.sorted                   (cell) <U6 '170707' '170707' '170707' ...
    tissue                        (cell) <U13 'Skin' 'Skin' 'Skin' 'Skin' ...
    subtissue                     (cell) <U32 'nan' 'nan' 'nan' 'nan' 'nan' ...
    mouse.id                      (cell) <U13 '3_39_F' '3_39_F' '3_39_F' ...
    FACS.selection                (cell) <U52 'Multiple' 'Multiple' ...
    nozzle.size                   (cell) <U32 '100' '100' '100' '100' '100' ...
    FACS.instument                (cell) <U32 'Sony SIM1' 'Sony SIM1' ...
    Experiment ID                 (cell) <U32 'exp22' 'exp22' 'exp22' ...
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    Plate                         (cell) <U32 '1' '1' '1' '1' '1' '1' '1' ...
    Location                      (cell) <U32 'MACA20_3' 'MACA20_3' ...
    Comments                      (cell) <U32 'nan' 'nan' 'nan' 'nan' 'nan' ...
    mouse.age                     (cell) <U1 '3' '3' '3' '3' '3' '3' '3' '3' ...
    mouse.number                  (cell) <U32 '39' '39' '39' '39' '39' '39' ...
    mouse.sex                     (cell) <U1 'F' 'F' 'F' 'F' 'F' 'F' 'F' 'F' ...
  * cell                          (cell) object 'A17-B000126-3_39_F-1-1' ...
Data variables:
    counts                        (cell, gene) int64 0 0 0 0 442 0 0 0 0 0 0 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...

But when I save it with to_netcdf using the default arguments and then reopen the same file with xr.open_dataset, also using the default arguments, all of the strings come back as byte strings:

<xarray.Dataset>
Dimensions:                       (cell: 53760, gene: 23438)
Coordinates:
  * cell                          (cell) |S24 b'A17-B000126-3_39_F-1-1' ...
  * gene                          (gene) |S22 b'0610005C13Rik' ...
Data variables:
    counts                        (cell, gene) int32 0 0 0 0 442 0 0 0 0 0 0 ...
    log2                          (cell, gene) float64 0.0 0.0 0.0 0.0 8.791 ...
    log10                         (cell, gene) float64 0.0 0.0 0.0 0.0 2.646 ...
    FACS.selection                (cell) |S52 b'Multiple' b'Multiple' ...
    dNTP.batch                    (cell) |S38 b'457912' b'457912' b'457912' ...
    EXP_ID                        (cell) |S29 b'170925_A00111_0066_AH3TKNDMXX' ...
    subtissue                     (cell) |S19 b'nan' b'nan' b'nan' b'nan' ...
    oligodT.order.no              (cell) |S17 b'6/23/17 12757296' ...
    plate.type                    (cell) |S14 b'Biorad HSP3901' ...
    tissue                        (cell) |S13 b'Skin' b'Skin' b'Skin' ...
    mouse.id                      (cell) |S13 b'3_39_F' b'3_39_F' b'3_39_F' ...
    FACS.instument                (cell) |S13 b'Sony SIM1' b'Sony SIM1' ...
    Comments                      (cell) |S11 b'nan' b'nan' b'nan' b'nan' ...
    WELL_MAPPING                  (cell) |S9 b'B000126' b'B000126' ...
    date.prepared                 (cell) |S9 b'07-06-17' b'07-06-17' ...
    Location                      (cell) |S9 b'MACA20_3' b'MACA20_3' ...
    preparation.site              (cell) |S8 b'Biohub' b'Biohub' b'Biohub' ...
    date.sorted                   (cell) |S6 b'170707' b'170707' b'170707' ...
    Experiment ID                 (cell) |S6 b'exp22' b'exp22' b'exp22' ...
    TAXON                         (cell) |S3 b'mus' b'mus' b'mus' b'mus' ...
    Lysis Plate Batch             (cell) |S3 b'20' b'20' b'20' b'20' b'20' ...
    nozzle.size                   (cell) |S3 b'100' b'100' b'100' b'100' ...
    Plate                         (cell) |S3 b'1' b'1' b'1' b'1' b'1' b'1' ...
    mouse.number                  (cell) |S3 b'39' b'39' b'39' b'39' b'39' ...
    Uniquely mapped reads number  (cell) int32 1017682 634557 941828 1392029 ...
    Number of input reads         (cell) int32 1229254 730274 1075370 ...
    Columns sorted                (cell) float64 nan nan nan nan nan nan nan ...
    Double check                  (cell) float64 nan nan nan nan nan nan nan ...
    mouse.age                     (cell) |S1 b'3' b'3' b'3' b'3' b'3' b'3' ...
    mouse.sex                     (cell) |S1 b'F' b'F' b'F' b'F' b'F' b'F' ...

As a result, things I expect to work, like selecting on gene with ds.sel(gene="Ins1"), fail unless I pass a byte string instead: ds.sel(gene=b"Ins1") works just fine.
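
To illustrate the mismatch outside of xarray, here is a minimal NumPy-only sketch. The array `genes_bytes` is a hypothetical stand-in for the gene index as it looks after reopening the file, and `decode_if_bytes` is a helper name made up for this example, not part of xarray. It assumes the stored bytes are UTF-8 encoded.

```python
import numpy as np

# Hypothetical stand-in for the "gene" index after open_dataset (dtype |S13).
genes_bytes = np.array([b"0610005C13Rik", b"Ins1"])


def decode_if_bytes(values, encoding="utf-8"):
    """Return a unicode array if `values` holds byte strings, else return it unchanged.

    Assumption: the bytes were written with `encoding` (UTF-8 by default).
    """
    arr = np.asarray(values)
    if arr.dtype.kind == "S":  # fixed-width byte strings
        return np.char.decode(arr, encoding)
    return arr


genes_str = decode_if_bytes(genes_bytes)

# In Python 3 a str never compares equal to bytes, which is why a lookup
# with "Ins1" misses while b"Ins1" matches.
found_with_str = "Ins1" in genes_bytes.tolist()
found_with_bytes = b"Ins1" in genes_bytes.tolist()
found_after_decode = "Ins1" in genes_str.tolist()
```

The same decoding could presumably be applied to ds["gene"].values and to any other variable whose dtype kind is "S" as a stopgap after loading, though that only hides the symptom and does not explain why the round trip changes the dtype.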

Do you know why this may be happening?