4. Dataset transformations — scikit-learn 0.20.4 documentation
scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.
Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
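As a minimal sketch of this fit/transform contract, consider StandardScaler (covered under Preprocessing data) applied to a small toy array; the training-set statistics learned by fit are then reused on unseen data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[2.0, 3.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learn per-feature mean and std from the training set
X_test_scaled = scaler.transform(X_test)  # apply the learned statistics to unseen data

# fit_transform fits and transforms the training data in one step:
X_train_scaled = scaler.fit_transform(X_train)
```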
Combining such transformers, either in parallel or in series, is covered in Pipelines and composite estimators. Pairwise metrics, Affinities and Kernels covers transforming feature spaces into affinity matrices, while Transforming the prediction target (y) considers transformations of the target space (e.g. categorical labels) for use in scikit-learn.
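As a rough illustration of chaining transformers in series (details in Pipelines and composite estimators), here is a minimal sketch using make_pipeline; parallel combination (e.g. with FeatureUnion) follows the same estimator interface:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformers run in series; the final step may be a predictor.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(solver='lbfgs', multi_class='auto'))
pipe.fit(X, y)
print(pipe.predict(X[:3]))
```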
- 4.1. Pipelines and composite estimators
- 4.2. Feature extraction
- 4.2.1. Loading features from dicts
- 4.2.2. Feature hashing
* 4.2.2.1. Implementation details
- 4.2.3. Text feature extraction
* 4.2.3.1. The Bag of Words representation
* 4.2.3.2. Sparsity
* 4.2.3.3. Common Vectorizer usage
* 4.2.3.3.1. Using stop words
* 4.2.3.4. Tf–idf term weighting
* 4.2.3.5. Decoding text files
* 4.2.3.6. Applications and examples
* 4.2.3.7. Limitations of the Bag of Words representation
* 4.2.3.8. Vectorizing a large text corpus with the hashing trick
* 4.2.3.9. Performing out-of-core scaling with HashingVectorizer
* 4.2.3.10. Customizing the vectorizer classes
- 4.2.4. Image feature extraction
* 4.2.4.1. Patch extraction
* 4.2.4.2. Connectivity graph of an image
- 4.3. Preprocessing data
- 4.3.1. Standardization, or mean removal and variance scaling
* 4.3.1.1. Scaling features to a range
* 4.3.1.2. Scaling sparse data
* 4.3.1.3. Scaling data with outliers
* 4.3.1.4. Centering kernel matrices
- 4.3.2. Non-linear transformation
* 4.3.2.1. Mapping to a Uniform distribution
* 4.3.2.2. Mapping to a Gaussian distribution
- 4.3.3. Normalization
- 4.3.4. Encoding categorical features
- 4.3.5. Discretization
* 4.3.5.1. K-bins discretization
* 4.3.5.2. Feature binarization
- 4.3.6. Imputation of missing values
- 4.3.7. Generating polynomial features
- 4.3.8. Custom transformers
- 4.4. Imputation of missing values
- 4.5. Unsupervised dimensionality reduction
- 4.6. Random Projection
- 4.7. Kernel Approximation
- 4.8. Pairwise metrics, Affinities and Kernels
- 4.9. Transforming the prediction target (y)