Using pd.DataFrame(tensor) is abnormally slow, you can make the following modifications · Issue #44616 · pandas-dev/pandas

Reproducible Example

import time

import numpy as np
import pandas as pd
import torch

row = 700000
col = 64
val_numpy = np.random.rand(row, col)
val_tensor = torch.randn(row, col)

# NumPy array -> DataFrame
numpy_pd_start_time = time.time()
val_numpy_pd = pd.DataFrame(val_numpy)
numpy_pd_end_time = time.time()
print("numpy to pd time:{:.4f}s".format(numpy_pd_end_time - numpy_pd_start_time))

# Tensor -> NumPy array -> DataFrame
tensor_numpy_pd_start_time = time.time()
val_tensor_pd1 = pd.DataFrame(val_tensor.numpy())
tensor_numpy_pd_end_time = time.time()
print("tensor to numpy to pd time:{:.4f} s".format(tensor_numpy_pd_end_time - tensor_numpy_pd_start_time))

# Tensor -> DataFrame directly (the slow path)
tensor_pd_start_time = time.time()
val_tensor_pd2 = pd.DataFrame(val_tensor)
tensor_pd_end_time = time.time()
print("tensor to pd time:{:.4f} s".format(tensor_pd_end_time - tensor_pd_start_time))

Issue Description

I recently found that passing a torch.Tensor directly to pd.DataFrame() is very slow, while converting the tensor to a NumPy array first and then to a pandas DataFrame is very fast. The test code is shown in the Reproducible Example.
It prints the following:

numpy to pd time: 0.0013s
tensor to numpy to pd time:0.0005s
tensor to pd time:220.5251s

I then read the source code and found that when the data passed to pd.DataFrame() is a tensor, it is processed as list_like (line 682 in https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py).
Most of the time is spent in the following three stages:

data = list(data): 2.5952s
nested_data_to_arrays: 214.7532s
arrays_to_mgr: 2.5987s

The nested_data_to_arrays stage involves a large number of data type conversions: the list of rows is converted into a list of columns, and the data is read row by row. This takes a long time.
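To illustrate the kind of work described above, here is a simplified pure-Python sketch of turning a list of rows into a list of columns one element at a time. It is not the actual pandas implementation; `rows_to_columns` is a hypothetical name, and the row count is reduced so it finishes quickly.

```python
import time


def rows_to_columns(rows):
    """Transpose a list of rows into a list of columns, element by element.

    Illustration only: a hypothetical helper, not pandas' internal code.
    """
    n_cols = len(rows[0])
    columns = [[] for _ in range(n_cols)]
    for row in rows:                      # data is read row by row ...
        for j, value in enumerate(row):
            columns[j].append(value)      # ... and written column by column
    return columns


# Smaller than the 700000 x 64 case in the report, to keep the run short.
rows = [[float(i * 64 + j) for j in range(64)] for i in range(20_000)]

start = time.perf_counter()
columns = rows_to_columns(rows)
elapsed = time.perf_counter() - start
print(f"transposed {len(rows)} rows x {len(columns)} cols in {elapsed:.4f}s")
```

Every element is touched individually in Python here, so the cost grows with row * col. With a tensor as input, each row and each element is itself a tensor object, which probably adds further per-element overhead on top of this.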

Admittedly, passing a tensor this way may not be appropriate, but torch.Tensor is now widely used, and it is inevitable that people will pass it in directly like this, with poor performance as a result. So could you add a comment at line 467 in frame.py, something like: If data is a torch.Tensor, convert it to a NumPy array first (tensor.numpy()).
Or could I submit a PR? When the input is detected to be a tensor, it would be converted first and then handled by the "elif isinstance(data, (np.ndarray, Series, Index))" branch.
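A minimal user-side sketch of that idea, assuming the tensor exposes `detach()`, `cpu()` and `numpy()` as torch.Tensor does. `to_frame` is a hypothetical wrapper name, not part of pandas, and the duck-typed check is my assumption about how detection could be done without importing torch:

```python
import pandas as pd


def to_frame(data, **kwargs):
    """Build a DataFrame, unwrapping tensor-like inputs to NumPy first.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Duck-typed check (assumption): anything with detach/cpu/numpy
    # methods is treated as a tensor, so torch is not imported here.
    if all(hasattr(data, name) for name in ("detach", "cpu", "numpy")):
        # detach() drops autograd tracking; cpu() moves data off the GPU.
        data = data.detach().cpu().numpy()
    return pd.DataFrame(data, **kwargs)
```

With the variables from the Reproducible Example, `to_frame(val_tensor)` would then take the same fast path as `pd.DataFrame(val_tensor.numpy())`, while other inputs are passed through unchanged.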

Looking forward to your reply ~

Installed Versions

pandas.__version__ == 1.3.4