
Random Features for Polynomial Kernels and Dot Product Kernels

This repository contains PyTorch implementations of various random feature maps for polynomial and general dot product kernels. In particular, we provide implementations using complex random projections that can improve the kernel approximation significantly. For dot product kernels, we provide an algorithm that minimizes the variance of the random feature approximation and can also be used in conjunction with the Gaussian kernel.

The basic building block of random features for dot product kernels is the Polynomial Sketch, which approximates the polynomial kernel of degree p. Such random projections can be seen as linear random projections applied to _p_-times self-tensored input features. For p=1, these sketches reduce to standard linear random projections.
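To make this concrete, here is a minimal sketch (plain PyTorch, not this repository's API) that approximates the degree-2 kernel (x^T y)^2 by applying a Rademacher projection to the self-tensored inputs:

import torch

torch.manual_seed(0)
d, D = 16, 2000                                      # input dimension, number of random features
x, y = torch.randn(d), torch.randn(d)

W = torch.randint(0, 2, (D, d * d)).float() * 2 - 1  # Rademacher (+/-1) projection matrix
z = lambda v: W @ torch.outer(v, v).reshape(-1) / D ** 0.5

print((z(x) * z(y)).sum().item())                    # approximate kernel value
print((x @ y).item() ** 2)                           # exact polynomial kernel (x^T y)^2

Increasing D reduces the variance of the estimate. The sketches in this repository implement the same idea more efficiently, without explicitly forming the d^p-dimensional tensored features.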

For more information and a description of the novelties of this work, consult the associated technical reports:

@article{wacker2022a,
  title={Improved Random Features for Dot Product Kernels},
  author={Wacker, Jonas and Kanagawa, Motonobu and Filippone, Maurizio},
  journal={arXiv preprint arXiv:2201.08712},
  year={2022}
}

@article{wacker2022b,
  title={Complex-to-Real Random Features for Polynomial Kernels},
  author={Wacker, Jonas and Ohana, Ruben and Filippone, Maurizio},
  journal={arXiv preprint arXiv:2202.02031},
  year={2022}
}

@article{wacker2022c,
  title={Local Random Feature Approximations of the Gaussian Kernel},
  author={Wacker, Jonas and Filippone, Maurizio},
  journal={arXiv preprint arXiv:2204.05667},
  year={2022}
}

Please cite these works if you find them useful.

In addition, this repository implements and extends works from the following related papers:

Requirements

We recommend:

Several complex operations on tensors that we use in our code were introduced in PyTorch version 1.8. If you have an older version, you cannot use the complex projections in our code.

A note on using GPU-accelerated SRHT sketches

PyTorch does not natively support the Fast Walsh-Hadamard Transform. This repository contains an implementation including a CUDA kernel in util/hadamard_cuda. This kernel is needed for GPU-accelerated TensorSRHT sketches. If you want to use this implementation, you need to specify your CUDA path inside util/fwht/__init__.py for its compilation:

if torch.cuda.is_available():
    sources.extend([filedir + 'fwht_kernel.cu'])
    flags.extend(['-DIS_CUDA_AVAILABLE'])

    if os.environ.get('CUDA_HOME', '-1') == '-1':
        warnings.warn('CUDA_HOME variable not set. Setting CUDA_HOME=/usr/local/cuda-9.0...')
        os.environ['CUDA_HOME'] = '/usr/local/cuda-9.0'

We used CUDA version 9.0 in our experiments, but other versions should work too.
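For example, assuming your toolkit is installed under /usr/local/cuda-11.3 (a placeholder path), you can either edit the default path in util/fwht/__init__.py or set the environment variable yourself before the package is imported, so that the check above picks it up:

import os

# placeholder path -- adjust to your local CUDA installation
os.environ['CUDA_HOME'] = '/usr/local/cuda-11.3'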

Getting started

Polynomial Sketches

A Polynomial Sketch is a random feature approximation of the polynomial kernel. It can be used on its own or as a module inside a larger deep learning framework. It is initialized as follows:

from random_features.polynomial_sketch import PolynomialSketch

feature_encoder = PolynomialSketch(
    input_dimension,          # data input dimension (power of 2 for srht projection_type)
    projection_dimension,     # output dimension of the random sketch
    degree=degree,            # degree of the polynomial kernel
    bias=bias,                # bias parameter of the polynomial kernel
    lengthscale=lengthscale,  # inverse scale of the data (like the lengthscale of the Gaussian kernel)
    projection_type='gaussian'/'rademacher'/'srht'/'countsketch_scatter',
    complex_weights=False/True,  # whether to use complex random projections (without complex_real, outputs are complex-valued)
    complex_real=False/True,     # whether to use complex-to-real sketches (outputs are real-valued)
    hierarchical=False/True,     # whether to use hierarchical sketching as proposed in https://arxiv.org/abs/1909.01410
    device='cpu'/'cuda'          # whether to use CPU or GPU
)

feature_encoder.resample()            # initialize random feature sample
feature_encoder.forward(input_data)   # project input data

projection_type has a strong impact on approximation quality and computation speed. projection_type='srht' uses the subsampled randomized Hadamard transform, which makes use of structured matrix products. These are faster to compute (especially on the GPU) and give lower variances than Rademacher and Gaussian sketches most of the time. countsketch_scatter uses CountSketch as the base projection, which yields TensorSketch (https://chbrown.github.io/kdd-2013-usb/kdd/p239.pdf).

Depending on the scaling of your data, high-degree sketches can give very large variances. In this case, complex_weights usually improves the approximation significantly. hierarchical sketches (https://arxiv.org/abs/1909.01410) are not the focus of this work, but they can be helpful too. Complex random features can be computed about as fast as real ones, in particular when using projection_type='srht'. However, the downstream task is usually more expensive when working with complex data. Complex-to-Real (CtR) sketches (complex_real=True) return real-valued random features instead and thus alleviate this problem.
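As an end-to-end illustration, the following sketch compares a CtR-SRHT Polynomial Sketch against the exact degree-3 polynomial kernel. The parameter values are illustrative, and we assume that with bias=0 and lengthscale=1 the target kernel reduces to (x^T y)^3:

import torch
from random_features.polynomial_sketch import PolynomialSketch

torch.manual_seed(0)
n, d, D = 100, 64, 1024                # number of points, input dim (power of 2 for srht), feature dim
data = torch.randn(n, d) / d ** 0.5    # roughly unit-norm inputs

feature_encoder = PolynomialSketch(
    d, D, degree=3, bias=0, lengthscale=1.0,
    projection_type='srht',
    complex_weights=True, complex_real=True,  # complex-to-real sketch, real-valued output
    device='cpu'
)
feature_encoder.resample()
features = feature_encoder.forward(data)      # (n, D) random features

kernel_approx = features @ features.t()       # approximate kernel matrix
kernel_exact = (data @ data.t()) ** 3         # exact polynomial kernel for bias=0, lengthscale=1
print((kernel_approx - kernel_exact).abs().max().item())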

Spherical Random Features (SRF)

We also provide an implementation of Spherical Random Features for polynomial kernels. They can be run similarly to the code above, but the random feature distribution needs to be optimized first. Please have a look at the code at the bottom of random_features/spherical.py to understand the procedure for obtaining this distribution and the random feature map afterwards. Once the distribution is obtained, it is saved under saved_models (some parameterizations are already contained in this repository).

Approximating the Gaussian kernel using Polynomial Sketches

The Gaussian kernel can be approximated well through a randomized Maclaurin expansion with Polynomial Sketches assuming that the data is zero-centered and a proper lengthscale is used.

Note: This method is not implemented for Complex-to-Real (CtR) sketches yet.

The approximator is initialized as follows:

from random_features.gaussian_approximator import GaussianApproximator

feature_encoder = GaussianApproximator(
    input_dimension,          # data input dimension (power of 2 for srht projection_type)
    projection_dimension,     # output dimension of the random sketch
    approx_degree=10,         # maximum degree of the Taylor approximation
    lengthscale=lengthscale,  # lengthscale of the Gaussian kernel
    var=kernel_var,           # variance of the Gaussian kernel
    method='maclaurin',       # approximation method (other possibilities: rff/poly_sketch)
    projection_type='gaussian'/'rademacher'/'srht'/'countsketch_scatter',
    complex_weights=False/True,  # whether to use complex random projections (without complex_real, outputs are complex-valued)
    hierarchical=False/True,     # whether to use hierarchical sketching as proposed in https://arxiv.org/abs/1909.01410
    device='cpu'/'cuda'          # whether to use CPU or GPU
)

# find the optimal sampling distribution when using the optimized maclaurin method
feature_encoder.initialize_sampling_distribution(input_data_sample, min_sampling_degree=2)

feature_encoder.resample()            # initialize random feature sample
feature_encoder.forward(input_data)   # project input data

Here, we chose the optimized Maclaurin method (method='maclaurin'). initialize_sampling_distribution finds the optimal random feature distribution between degrees min_sampling_degree=2 and approx_degree=10. method='rff' (random Fourier features) and method='poly_sketch' do not require this step. method='maclaurin_p' gives the Maclaurin approximation according to (Kar and Karnick, 2012), which is less optimal than the optimized Maclaurin method but requires no preprocessing.
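A minimal end-to-end sketch could look as follows. The values are illustrative, and we assume the usual exp(-||x-y||^2 / (2 * lengthscale^2)) convention for the Gaussian kernel:

import torch
from random_features.gaussian_approximator import GaussianApproximator

torch.manual_seed(0)
n, d, D = 200, 32, 2048
data = torch.randn(n, d)
data = data - data.mean(dim=0)        # zero-center the data, as recommended above
lengthscale = 5.0                     # illustrative value

feature_encoder = GaussianApproximator(
    d, D, approx_degree=10, lengthscale=lengthscale, var=1.0,
    method='maclaurin', projection_type='srht', complex_weights=True, device='cpu'
)
feature_encoder.initialize_sampling_distribution(data[:100], min_sampling_degree=2)
feature_encoder.resample()
features = feature_encoder.forward(data)

kernel_approx = features @ features.conj().t()            # approximate kernel matrix
if kernel_approx.is_complex():
    kernel_approx = kernel_approx.real
kernel_exact = torch.exp(-torch.cdist(data, data) ** 2 / (2 * lengthscale ** 2))
print((kernel_approx - kernel_exact).abs().max().item())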

Localized random features for Gaussian kernels with short length scales

Approximating the Gaussian kernel on high-frequency data, typically modelled using short length scales, is very challenging for random feature methods. The local_approximations folder contains implementations where the above Maclaurin approximations are centered around individual test points to improve results considerably in such scenarios. These implementations follow the work of (Wacker, 2022c) and serve to reproduce the experiments in the original paper.

Approximating other Dot Product Kernels using the Maclaurin method

The GaussianApproximator module above is a thin wrapper around the Maclaurin module contained in random_features.maclaurin. When approximating a general dot product kernel, e.g., the exponential dot product kernel, we do not need this wrapper but can use the Maclaurin module directly. Follow the code inside GaussianApproximator to see how the Maclaurin module is used.
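For intuition (plain Python, not the repository's Maclaurin module): the exponential dot product kernel expands as exp(x^T y) = sum_{n>=0} (x^T y)^n / n!, so every term is a polynomial kernel of degree n that can be approximated with a Polynomial Sketch, and the Maclaurin method distributes the random features across the degrees:

import math
import torch

torch.manual_seed(0)
x, y = torch.randn(8) / 4, torch.randn(8) / 4
z = (x @ y).item()

# truncated Maclaurin series of the exponential dot product kernel (degrees 0..10)
truncated = sum(z ** n / math.factorial(n) for n in range(11))
print(truncated, math.exp(z))   # the truncation is already very accurate for small |z|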

Reproducing the Gaussian Process Classification/Regression experiments

If you would like to reproduce the results from the experiments in our paper, download the desired datasets from the UCI Machine Learning Repository. The cod_rna dataset is available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. Then load the associated CSV files and save them using PyTorch. The dataset paths are specified in config/datasets and the datasets to be used in the experiments are listed in config/active_datasets.json.
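A hypothetical conversion sketch is shown below. The file names and the (X, y) layout are assumptions; check the dataset configurations in config/datasets for the format the loaders actually expect:

import numpy as np
import torch

# assumed layout: features in all but the last column, targets in the last column
raw = np.loadtxt('data/some_uci_dataset.csv', delimiter=',', skiprows=1)
X = torch.from_numpy(raw[:, :-1]).float()
y = torch.from_numpy(raw[:, -1]).float()
torch.save((X, y), 'data/some_uci_dataset.pth')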

Then run the following command:

python run_rf_gp_experiments.py --rf_parameter_file [rf_parameter_config] --datasets_file config/active_datasets.json --use_gpu

--use_gpu is optional. [rf_parameter_config] needs to be replaced by one of the files located in config/rf_parameters, depending on which kernel should be approximated. These files can easily be adapted to your needs.

The output logs of the experiments are saved in the csv and logs folders.

The bar plots can be created from the csv files using the Jupyter notebook notebooks/bar-plot-visualization.ipynb.