"Attempted to access the data pointer on an invalid python storage" when saving model in TPU mode (Kaggle) · Issue #27578 · huggingface/transformers (original) (raw)

System Info

It keeps happening whenever I try to use TPU mode to fine-tune BERT model for sentiment analysis. Everything works fine in GPU mode. I even tried to downgrade/upgrade TensorFlow & safetensors, but it didn't work either. Can you give me any suggestion?

Link to that notebook: https://www.kaggle.com/code/phttrnnguyngia/final

trainer.save_model('final-result')

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /kaggle/working/env/safetensors/torch.py:13, in storage_ptr(tensor)
     12 try:
---> 13     return tensor.untyped_storage().data_ptr()
     14 except Exception:
     15     # Fallback for torch==1.10

RuntimeError: Attempted to access the data pointer on an invalid python storage.

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[21], line 2
      1 # save the model
----> 2 trainer.save_model('final-result')

File /kaggle/working/env/transformers/trainer.py:2804, in Trainer.save_model(self, output_dir, _internal_call)
   2801     output_dir = self.args.output_dir
   2803 if is_torch_tpu_available():
-> 2804     self._save_tpu(output_dir)
   2805 elif is_sagemaker_mp_enabled():
   2806     # Calling the state_dict needs to be done on the wrapped model and on all processes.
   2807     os.makedirs(output_dir, exist_ok=True)

File /kaggle/working/env/transformers/trainer.py:2873, in Trainer._save_tpu(self, output_dir)
   2871         xm.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
   2872 else:
-> 2873     self.model.save_pretrained(output_dir, is_main_process=self.args.should_save, save_function=xm.save)
   2874 if self.tokenizer is not None and self.args.should_save:
   2875     self.tokenizer.save_pretrained(output_dir)

File /kaggle/working/env/transformers/modeling_utils.py:2187, in PreTrainedModel.save_pretrained(self, save_directory, is_main_process, state_dict, save_function, push_to_hub, max_shard_size, safe_serialization, variant, token, save_peft_format, **kwargs)
   2183 for shard_file, shard in shards.items():
   2184     if safe_serialization:
   2185         # At some point we will need to deal better with save_function (used for TPU and other distributed
   2186         # joyfulness), but for now this enough.
-> 2187         safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
   2188     else:
   2189         save_function(shard, os.path.join(save_directory, shard_file))

File /kaggle/working/env/safetensors/torch.py:281, in save_file(tensors, filename, metadata)
    250 def save_file(
    251     tensors: Dict[str, torch.Tensor],
    252     filename: Union[str, os.PathLike],
    253     metadata: Optional[Dict[str, str]] = None,
    254 ):
    255     """
    256     Saves a dictionary of tensors into raw bytes in safetensors format.
    257 
   (...)
    279     ```
    280     """
--> 281     serialize_file(_flatten(tensors), filename, metadata=metadata)

File /kaggle/working/env/safetensors/torch.py:460, in _flatten(tensors)
    453 if invalid_tensors:
    454     raise ValueError(
    455         f"You are trying to save a sparse tensors: `{invalid_tensors}` which this library does not support."
    456         " You can make it a dense tensor before saving with `.to_dense()` but be aware this might"
    457         " make a much larger file than needed."
    458     )
--> 460 shared_pointers = _find_shared_tensors(tensors)
    461 failing = []
    462 for names in shared_pointers:

File /kaggle/working/env/safetensors/torch.py:72, in _find_shared_tensors(state_dict)
     70 tensors = defaultdict(set)
     71 for k, v in state_dict.items():
---> 72     if v.device != torch.device("meta") and storage_ptr(v) != 0 and storage_size(v) != 0:
     73         # Need to add device as key because of multiple GPU.
     74         tensors[(v.device, storage_ptr(v), storage_size(v))].add(k)
     75 tensors = list(sorted(tensors.values()))

File /kaggle/working/env/safetensors/torch.py:17, in storage_ptr(tensor)
     14 except Exception:
     15     # Fallback for torch==1.10
     16     try:
---> 17         return tensor.storage().data_ptr()
     18     except NotImplementedError:
     19         # Fallback for meta storage
     20         return 0

File /kaggle/working/env/torch/storage.py:909, in TypedStorage.data_ptr(self)
    907 def data_ptr(self):
    908     _warn_typed_storage_removal()
--> 909     return self._data_ptr()

File /kaggle/working/env/torch/storage.py:913, in TypedStorage._data_ptr(self)
    912 def _data_ptr(self):
--> 913     return self._untyped_storage.data_ptr()

RuntimeError: Attempted to access the data pointer on an invalid python storage.

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Run in Kaggle TPU, Environment: Always use latest environment. Input data is included in the notebook

Expected behavior

Expected to save successfully like when using GPU.