BaseModel vs HSTU for sequential recommendations

In February 2024, Meta AI researchers posted a preprint to arXiv titled "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" [1]. The preprint introduced a novel recommender called HSTU, which stands for "Hierarchical Sequential Transduction Units", promising new state-of-the-art results on sequential recommendation tasks as well as scalability to exceptionally large datasets.

Overview of HSTU

HSTU is yet another attempt at adapting (modified) Transformers to generative recommendation, after DeepMind’s TIGER model (benchmarked in a previous post). The most interesting properties of Meta AI’s HSTU architecture are:

Details of HSTU Architecture

HSTU utilizes a few interesting components:

Source: [1]

The HSTU model uses an intricate scheme that represents categorical features as auxiliary events in a time series.

This is best illustrated by the following diagram from the original preprint:

Source: [1]

On the left-hand side is a classic Deep Learning Recommender Model, while on the right side is the generative causal setup proposed by Meta AI. It can be seen how categorical features are transformed into auxiliary events which are then incorporated into the main time-series.
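As a rough illustration of the idea of folding categorical features into the event stream, the sketch below interleaves item interactions with feature changes by timestamp. It is not the preprint's implementation: the `Event` type, the `merge_into_timeline` helper, and the example feature names (`country`, `device`) are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: int  # e.g. seconds since some reference point
    kind: str       # "action" for a main interaction, otherwise the feature name
    value: str      # item id for actions, new feature value for auxiliary events


def merge_into_timeline(
    actions: List[Tuple[int, str]],
    feature_changes: List[Tuple[int, str, str]],
) -> List[Event]:
    """Interleave item interactions with categorical-feature changes by time."""
    events = [Event(ts, "action", item) for ts, item in actions]
    events += [Event(ts, name, value) for ts, name, value in feature_changes]
    # sorted() is stable: on equal timestamps, actions (added first) stay first.
    return sorted(events, key=lambda e: e.timestamp)


# Hypothetical user history: two interactions and two feature changes.
timeline = merge_into_timeline(
    actions=[(10, "item_42"), (30, "item_7")],
    feature_changes=[(0, "country", "PL"), (20, "device", "mobile")],
)
tokens = [f"{e.kind}={e.value}" for e in timeline]
print(tokens)
```

The resulting single sequence is what a causal sequence model would consume in place of separate feature columns.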

HSTU’s Benchmark Results

HSTU was benchmarked on 3 public datasets: MovieLens-1M, MovieLens-20M and Amazon Books.

HSTU outperformed prior strong baselines on the datasets used. Results below:

HSTU vs SASRec result comparison [1]

HSTU was able to achieve significant improvements over prior state-of-the-art (SASRec) on all metrics on all datasets.

It is worth noting that, based on our own experiments at Synerise AI, MovieLens is an atypical dataset that should not be used as a benchmark for sequential recommendations because of its underlying data generation process. The temporal order in which movies are rated is only loosely correlated with the order in which they were watched: one may rate a movie seen a long time ago. Ratings are often entered in short bursts, even though the reviewer watched the movies one by one before the rating session, sometimes many years earlier.
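One simple way to check for this kind of burstiness is to measure how many consecutive ratings of a user fall within a short time window. The sketch below is only an illustrative diagnostic, not the analysis we ran; the `burst_fraction` helper, the 60-second threshold, and the example timestamps are assumptions made for this example.

```python
from typing import Sequence


def burst_fraction(timestamps: Sequence[int], max_gap_seconds: int = 60) -> float:
    """Share of consecutive rating pairs whose time gap is at most max_gap_seconds."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0  # fewer than two ratings: no gaps to measure
    return sum(gap <= max_gap_seconds for gap in gaps) / len(gaps)


# Hypothetical user: five ratings entered within two minutes, one a day later.
example_timestamps = [0, 20, 45, 70, 110, 86_510]
print(burst_fraction(example_timestamps))
```

A value close to 1 for most users would suggest that rating timestamps reflect data-entry sessions rather than the order of consumption.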

BaseModel vs HSTU Performance

To evaluate BaseModel against HSTU, we replicated the exact data preparation, training, validation, and testing protocols described in the HSTU paper. The exact same implementations of HitRate and NDCG metrics were used for consistency.
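For readers unfamiliar with the two metrics, here is a minimal sketch of their conventional definitions for next-item evaluation with a single held-out item per user. It is not the code used in the HSTU evaluation, which may differ in details; the function names and the example ranks are assumptions made for illustration.

```python
import math
from typing import Sequence


def hit_rate_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of cases whose held-out item is ranked in the top k (1-based ranks)."""
    return sum(rank <= k for rank in ranks) / len(ranks)


def ndcg_at_k(ranks: Sequence[int], k: int) -> float:
    """Mean NDCG@k with one relevant item per case.

    With a single relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 1) when the item is in the top k, and 0 otherwise.
    """
    gains = (1.0 / math.log2(rank + 1) for rank in ranks if rank <= k)
    return sum(gains) / len(ranks)


# Hypothetical ranks of the held-out item for four users.
example_ranks = [1, 3, 12, 50]
print(hit_rate_at_k(example_ranks, k=10))
print(ndcg_at_k(example_ranks, k=10))
```

Both metrics reward placing the held-out item in the top k, but NDCG additionally discounts hits that appear lower in the list.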

For comparison of the models a few steps were performed:

The entire process took 5 hours from start to finish. BaseModel's parameters were left at their defaults for the Amazon Books dataset and slightly modified for the MovieLens datasets to reflect their not-really-sequential structure. The results look as follows:

BaseModel vs HSTU and SASRec

Despite limited optimization of BaseModel's parameters, the results are remarkably interesting. BaseModel achieved a 55.5% to 292.4% improvement over SASRec's results on Amazon Books, significantly outperforming HSTU.
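Improvement percentages of this kind are relative changes of a metric over the baseline's value. The sketch below shows the arithmetic under that assumption; the `relative_improvement` helper is hypothetical, and the numbers in the example are placeholders, not results from our benchmark.

```python
def relative_improvement(candidate: float, baseline: float) -> float:
    """Percentage change of a candidate metric value over a baseline value."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100.0


# Placeholder metric values, NOT taken from the benchmark in this post.
placeholder_baseline = 0.030
placeholder_candidate = 0.045
print(f"{relative_improvement(placeholder_candidate, placeholder_baseline):+.1f}%")
```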

On MovieLens-1M, BaseModel had a clear advantage over both SASRec and HSTU. On MovieLens-20M, BaseModel and HSTU were tied, while SASRec remained far behind.

We have internally confirmed MovieLens is a pathologically constructed dataset, and its sequential/temporal structure does not correspond to typical sequence recommendation scenarios in other public and private datasets. We hypothesize that both BaseModel and HSTU reach near-perfect achievable scores on MovieLens-20M.

While exact HSTU training and inference times are not reported, the model is based on a modified Transformer architecture. Meta AI's team has optimized the architecture significantly, allowing training 2-15x faster than Transformer++. Yet even with those optimizations, BaseModel's training and inference are orders of magnitude faster.

Conclusion

The comparison between BaseModel and HSTU reveals substantial differences in their architectural choices and performance. While HSTU represents a notable advancement in generative retrieval recommender systems, BaseModel’s approach demonstrates superior efficiency and effectiveness in sequential recommendation tasks. In addition, we conclude that usage of MovieLens datasets should be discouraged for sequential recommendations, as the sequential/temporal information contained therein is very noisy.

We are continuously improving our methods to push the boundaries of what behavioral models can achieve, and comparisons with alternative approaches are a vital part of that work.

References

[1] Zhai et al., "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations", https://arxiv.org/abs/2402.17152