torch_frame.datasets.MultimodalTextBenchmark — pytorch-frame documentation

class MultimodalTextBenchmark(root: str, name: str, text_stype: torch_frame.stype = stype.text_embedded, col_to_text_embedder_cfg: dict[str, TextEmbedderConfig] | TextEmbedderConfig | None = None, col_to_text_tokenizer_cfg: dict[str, TextTokenizerConfig] | TextTokenizerConfig | None = None)

Bases: Dataset

Benchmark datasets of tabular data with text columns, as used in “Benchmarking Multimodal AutoML for Tabular Data with Text Fields”. For some regression datasets, the target column is transformed from log scale back to the original scale.
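To illustrate what moving a target from log scale back to the original scale means, here is a minimal sketch. The helper `from_log_scale` is hypothetical and not part of pytorch-frame, and the assumption that the stored target is a plain logarithm `log_base(y)` is ours; this page does not state which datasets are affected or the exact transform applied to each.

```python
import math


def from_log_scale(log_values, base=math.e):
    """Map log-scale targets back to the original scale (hypothetical helper).

    Assumes each stored value is log_base(y). The transform the library
    actually applies to a given dataset may differ (e.g. an extra offset).
    """
    return [base ** v for v in log_values]


# Example: prices that were stored as natural logarithms.
log_prices = [math.log(10.0), math.log(250.0)]
prices = from_log_scale(log_prices)
```

Metrics such as RMSE computed on the original scale are not comparable with ones computed on the log scale, so it helps to know which scale a reported number refers to.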

Parameters:

STATS:

Name                                        #rows    #cols (numerical)  #cols (categorical)  #cols (text)  #cols (other)  #classes  Task                       Missing value ratio
product_sentiment_machine_hack              6,364    0                  1                    1             0              4         multiclass_classification  0.0%
jigsaw_unintended_bias100K                  125,000  29                 0                    1             0              2         binary_classification      41.4%
news_channel                                25,355   14                 0                    1             0              6         multiclass_classification  0.0%
wine_reviews                                105,154  2                  2                    1             0              30        multiclass_classification  1.0%
data_scientist_salary                       19,802   0                  3                    2             1              6         multiclass_classification  12.3%
melbourne_airbnb                            22,895   26                 47                   13            3              10        multiclass_classification  9.6%
imdb_genre_prediction                       1,000    7                  1                    2             1              2         binary_classification      0.0%
kick_starter_funding                        108,128  1                  3                    3             2              2         binary_classification      0.0%
fake_job_postings2                          15,907   0                  3                    2             0              2         binary_classification      23.8%
google_qa_answer_type_reason_explanation    6,079    0                  1                    3             0              1         regression                 0.0%
google_qa_question_type_reason_explanation  6,079    0                  1                    3             0              1         regression                 0.0%
bookprice_prediction                        6,237    2                  3                    3             0              1         regression                 1.7%
jc_penney_products                          13,575   2                  1                    2             0              1         regression                 13.7%
women_clothing_review                       23,486   1                  3                    2             0              1         regression                 1.8%
news_popularity2                            30,009   3                  0                    1             0              1         regression                 0.0%
ae_price_prediction                         28,328   2                  5                    1             3              1         regression                 6.1%
california_house_price                      47,439   18                 8                    2             11             1         regression                 13.8%
mercari_price_suggestion100K                125,000  0                  6                    2             1              1         regression                 3.4%
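
As a rough usage sketch for the constructor signature above, the following loads one of the listed datasets with `stype.text_embedded`. Several parts are assumptions rather than anything stated on this page: the import path `torch_frame.config.text_embedder`, the `text_embedder` and `batch_size` fields of `TextEmbedderConfig`, and the `materialize()` call should be checked against the installed pytorch-frame version. The helpers `hash_features` and `load_benchmark` are hypothetical, and the hashing embedder is only a placeholder for a real sentence encoder.

```python
import zlib


def hash_features(sentences, dim=16):
    """Toy bag-of-words features: count tokens per hashed bucket.

    Hypothetical placeholder for a real text encoder; returns one
    list of `dim` floats per input sentence.
    """
    rows = []
    for sentence in sentences:
        row = [0.0] * dim
        for token in sentence.split():
            row[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
        rows.append(row)
    return rows


def load_benchmark(root, name, dim=16):
    """Load and materialize a dataset (sketch; downloads data on first use)."""
    # Imported lazily so the pure-Python helper above works without torch.
    import torch
    from torch_frame import stype
    from torch_frame.config.text_embedder import TextEmbedderConfig  # assumed path
    from torch_frame.datasets import MultimodalTextBenchmark

    def embed(sentences):
        # The embedder is assumed to map list[str] to a 2-D float tensor.
        return torch.tensor(hash_features(sentences, dim), dtype=torch.float)

    cfg = TextEmbedderConfig(text_embedder=embed, batch_size=32)
    dataset = MultimodalTextBenchmark(
        root=root,
        name=name,
        text_stype=stype.text_embedded,
        col_to_text_embedder_cfg=cfg,
    )
    dataset.materialize()  # assumed to run the embedder over the text columns
    return dataset
```

A call such as `load_benchmark("/tmp/mmtb", "wine_reviews")` would be the typical entry point; the root path here is arbitrary. Passing a single config applies it to every text column, while the signature also accepts a `dict[str, TextEmbedderConfig]` for per-column settings.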