LIBSVM Data: Classification (Multi Class)
This page contains many classification, regression, multi-label and string data sets stored in LIBSVM format. For some sets, raw materials (e.g., original texts) are also available. These data sets are from UCI, Statlog, StatLib and other collections. We thank them for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) are adjusted accordingly. Some training data are further separated into "training" (tr) and "validation" (val) sets. Details can be found in the description of each data set. To read data via MATLAB, you can use "libsvmread" in the LIBSVM package.
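For readers not using MATLAB, the following is a minimal sketch of reading one line of LIBSVM-format text in Python. It assumes the usual layout of a label followed by sparse `index:value` pairs, where omitted indices are treated as zero. The helper name `parse_libsvm_line` is hypothetical and is not part of the LIBSVM package.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-format line into (label, {index: value}).

    Assumes the layout: <label> <index>:<value> <index>:<value> ...
    """
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# A made-up example line: class 3 with three nonzero features.
label, features = parse_libsvm_line("3 1:0.5 4:-1 10:0.25")
```

Because zero-valued features are omitted, the largest index seen in a file can be smaller than the nominal number of features.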
aloi
- Source: aloi [AR14a]
- # of classes: 1,000
- # of data: 108,000
- # of features: 128
- Files:
- aloi.bz2
- aloi.scale.bz2 (scaled to [0,1])
cifar10
- Source: The CIFAR-10 dataset [AK09a]
- Preprocessing: We combine five training batches in CIFAR-10 Matlab version from the cifar10 website to produce the training data. For every image, we convert the 32x32 pixels to feature values channel by channel in RGB order, and row by row within each channel. That is, (row 1, R), (row 2, R), ..., (row 1, G), ...
- # of classes: 10
- # of data: 50,000 / 10,000 (testing)
- # of features: 3,072
- Files:
- cifar10.bz2
- cifar10.t.bz2 (testing)
- cifar10.mat (dense matlab format)
- cifar10.t.mat (testing, dense matlab format)
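The cifar10 feature ordering described in the preprocessing note can be sketched as follows. This is an illustration only: the nested-list layout `image[row][col][channel]`, the left-to-right order within a row, and the helper name `image_to_features` are assumptions, not the script used to build the files.

```python
def image_to_features(image):
    """Flatten image[row][col][channel] channel by channel, then row by row."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    features = []
    for c in range(channels):          # R, then G, then B
        for r in range(height):        # row 1, row 2, ...
            for col in range(width):   # assumed left-to-right within a row
                features.append(image[r][col][c])
    return features


# Tiny 2x2 RGB image whose values encode position as row*100 + col*10 + channel.
tiny = [[[r * 100 + col * 10 + c for c in range(3)] for col in range(2)]
        for r in range(2)]
flat = image_to_features(tiny)
```

Applied to a 32x32 RGB image, the same loop yields 3 * 32 * 32 = 3,072 values, matching the feature count listed above.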
connect-4
- Source: UCI / Connect-4
- Preprocessing: We used binary encoding for each feature (o, b, x), so the number of features is 42*3 = 126.
- # of classes: 3
- # of data: 67,557
- # of features: 126
- Files:
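The binary encoding mentioned for connect-4 can be sketched as below: each of the 42 board positions takes one of three symbols (o, b, x) and becomes three 0/1 indicator features. The order of the indicators within each triple and the helper name `encode_board` are assumptions made for illustration.

```python
SYMBOLS = ("o", "b", "x")  # order within each triple is an assumption


def encode_board(board):
    """Turn a list of 42 symbols into 42 * 3 = 126 binary features."""
    features = []
    for cell in board:
        features.extend(1 if cell == symbol else 0 for symbol in SYMBOLS)
    return features


# A made-up board: all blank except the first two positions.
board = ["b"] * 42
board[0] = "x"
board[1] = "o"
encoded = encode_board(board)
```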
covtype
- Source: UCI / Covertype
- # of classes: 7
- # of data: 581,012
- # of features: 54
- Files:
- covtype.bz2
- covtype.scale01.bz2 (scaled to [0,1])
- covtype.scale.bz2 (scaled to mean zero and standard deviation one (first 10 attributes))
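Scaling only the leading attributes to mean zero and standard deviation one, as described for covtype.scale.bz2, might look like the sketch below. It uses 2 columns of a toy table rather than the first 10 attributes, and it assumes the population standard deviation; the page does not say which estimator was used. The helper name `standardize_columns` is hypothetical.

```python
import statistics


def standardize_columns(rows, n_columns):
    """Standardize the first n_columns of every row; leave the rest unchanged."""
    result = [list(row) for row in rows]
    for j in range(n_columns):
        column = [row[j] for row in rows]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)  # population std: an assumption
        for row in result:
            row[j] = (row[j] - mean) / std if std > 0 else 0.0
    return result


# Toy data: two numeric attributes followed by one binary indicator.
rows = [[1.0, 10.0, 1], [2.0, 20.0, 0], [3.0, 30.0, 1]]
scaled = standardize_columns(rows, 2)
```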
dna
- Source: Statlog / Dna
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 3
- # of data: 2,000 / 1,186 (testing) / 1,400 (tr) / 600 (val)
- # of features: 180
- Files:
- dna.scale
- dna.scale.t (testing)
- dna.scale.tr (tr)
- dna.scale.val (val)
glass
- Source: UCI / Glass Identification
- # of classes: 6
- # of data: 214
- # of features: 9
- Files:
imdb-rating
- Source: Jointly Modelling Aspects, Ratings and Sentiments for Movie Recommendation
- Preprocessing: The original dataset can be downloaded from Zenodo. We replaced any sequence of whitespace characters \s (a shorthand for [ \t\n\r\f\v]) with a space.
- # of classes: 10
- # of data: 348,415
- # of features:
- Files:
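The whitespace replacement described for imdb-rating can be reproduced with a single regular expression. This is a minimal sketch; the `re.ASCII` flag is used so that `\s` matches exactly `[ \t\n\r\f\v]` on Python strings, and the helper name `normalize_whitespace` is hypothetical.

```python
import re


def normalize_whitespace(text):
    """Replace every run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", text, flags=re.ASCII)


cleaned = normalize_whitespace("Great\tmovie.\n\nWould  watch\r\nagain.")
```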
iris
- Source: UCI / Iris Plant
- # of classes: 3
- # of data: 150
- # of features: 4
- Files:
LEDGAR (LexGLUE)
- Source: [IC22b]
- Preprocessing: The procedure is the same as that for [ECtHR (A) (LexGLUE)](multilabel.html#ECtHR %28A%29 %28LexGLUE%29).
- # of classes: 100
- # of data: 60,000 / 10,000 (valid) / 10,000 (testing)
- # of features: 19,996
- Files:
letter
- Source: Statlog / Letter
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 26
- # of data: 15,000 / 5,000 (testing) / 10,500 (tr) / 4,500 (val)
- # of features: 16
- Files:
- letter.scale
- letter.scale.t (testing)
- letter.scale.tr (tr)
- letter.scale.val (val)
mnist
- Source: [YL98a]
- Preprocessing: Feature values are stored by rows of each image
- # of classes: 10
- # of data: 60,000 / 10,000 (testing)
- # of features: 780 / 778 (testing)
- Files:
- mnist.bz2
- mnist.t.bz2 (testing)
- mnist.scale.bz2 (scaled to [0,1] by dividing each feature by 255)
- mnist.scale.t.bz2 (testing) (scaled to [0,1] by dividing each feature by 255)
- mnist.mat (dense matlab format)
- mnist.t.mat (testing, dense matlab format)
mnist8m
- Source: Invariant SVM [GL07b]
- # of classes: 10
- # of data: 8,100,000
- # of features: 784
- Files:
- mnist8m.xz
- mnist8m.scale.xz (scaled to [0,1] by dividing each feature by 255)
news20
- Source: [KL95a]
- Preprocessing: First 80/20 training/testing split. Also see this page [JR01a]
- # of classes: 20
- # of data: 15,935 / 3,993 (testing)
- # of features: 62,061 / 62,060 (testing)
- Files:
- news20.bz2
- news20.t.bz2 (testing)
- news20.scale.bz2 (scaled to binary encoding; then unit length for each instance)
- news20.t.scale.bz2 (testing) (scaled to binary encoding; then unit length for each instance)
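The scaling noted for news20.scale.bz2 (binary encoding, then unit length for each instance) can be sketched as follows. The sketch assumes that "unit length" means unit Euclidean norm and that an instance is a sparse `{index: value}` mapping; `binarize_and_normalize` is a hypothetical helper name.

```python
import math


def binarize_and_normalize(features):
    """Set nonzero values to 1, then scale the instance to unit Euclidean norm."""
    nonzero = [index for index, value in features.items() if value != 0]
    if not nonzero:
        return {}
    # A 0/1 vector with k ones has norm sqrt(k), so each entry becomes 1/sqrt(k).
    weight = 1.0 / math.sqrt(len(nonzero))
    return {index: weight for index in nonzero}


# A made-up instance with four nonzero term counts.
scaled = binarize_and_normalize({3: 5.0, 17: 1.0, 42: 2.0, 99: 7.0})
```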
news20 (18,846)
- Source: [KL95a]
- Preprocessing: The data are downloaded from sklearn. We have made sure the data provided by sklearn are the same as the 18,846 set at this page. In addition, all newlines are replaced with white spaces. The raw data are in the format of labels<TAB>texts. We do a random 80/20 split to generate the validation set from the whole training set (raw texts only). We also provide data with tf-idf features, which are calculated from the raw texts provided here using TfidfVectorizer from sklearn with default configurations. The code used to generate the raw texts and tf-idf features is provided.
- # of classes: 20
- # of data: 9,051 / 2,263 (valid) / 7,532 (testing)
- # of features: 130,107
- Files:
pendigits
- Source: UCI / Pen-Based Recognition of Handwritten Digits Data Set
- # of classes: 10
- # of data: 7,494 / 3,498 (testing)
- # of features: 16
- Files:
- pendigits
- pendigits.t (testing)
poker
- Source: UCI / Poker Hand
- # of classes: 10
- # of data: 25,010 / 1,000,000 (testing)
- # of features: 10
- Files:
protein
- Source: [JYW02a]
- # of classes: 3
- # of data: 17,766 / 6,621 (testing) / 14,895 (training) / 2,871 (validation)
- # of features: 357
- Files:
- protein.bz2
- protein.t.bz2 (testing)
- protein.tr.bz2 (tr)
- protein.val.bz2 (val)
rcv1.multiclass
- Source: [DL04b]
- Preprocessing: First, the label hierarchy is reorganized by mapping the data set to the second level of the RCV1 topic hierarchy. The documents that have labels of the third or fourth level only are mapped to their parent category of the second level. The documents that only have labels of the first level are not mapped onto any category. Second, we remove multi-labeled instances. [RB08a]
- # of classes: 53
- # of data: 15,564 / 518,571 (testing)
- # of features: 47,236
- Files:
SCOTUS (LexGLUE)
- Source: [IC22b]
- Preprocessing: The procedure is the same as that for [ECtHR (A) (LexGLUE)](multilabel.html#ECtHR %28A%29 %28LexGLUE%29).
- # of classes: 13
- # of data: 5,000 / 1,400 (validation) / 1,400 (testing)
- # of features: 126,405
- Files:
satimage
- Source: Statlog / Satimage
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 6
- # of data: 4,435 / 2,000 (testing) / 3,104 (tr) / 1,331 (val)
- # of features: 36
- Files:
- satimage.scale
- satimage.scale.t (testing)
- satimage.scale.tr (tr)
- satimage.scale.val (val)
sector
- Source: [AM98a]
- Preprocessing: The scaled data was used in our KDD 08 paper. For unknown reasons, we can now only generate something close to it. The sources are from this page. We select train-0.tc and test-0.tc from ecoc-svm-data.tar.gz. A 2/1 training/testing split gives the training and testing sets below. They are in the original format instead of the libsvm format: in each row, the 2nd value gives the class label and subsequent numbers give pairs of feature IDs and values. We then do a kind of tf-idf transformation, ln(1+tf)*log_2(#docs/#coll_freq_of_term), and normalize each instance to unit length. [JR01b,SSK08a]
- # of classes: 105
- # of data: 6,412 / 3,207 (testing)
- # of features: 55,197 / 55,197 (testing)
- Files:
- sector.bz2
- sector.t.bz2 (testing)
- sector.scale.bz2
- sector.t.scale.bz2 (testing)
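The tf-idf-style transformation given for sector, ln(1+tf)*log_2(#docs/#coll_freq_of_term) followed by unit-length normalization, can be sketched as below. How #coll_freq_of_term is counted is not spelled out in the description, so the sketch takes it as an input table. Unit length is assumed to mean unit Euclidean norm, and all names and numbers are hypothetical.

```python
import math


def transform_instance(term_counts, n_docs, term_freq_in_collection):
    """Apply ln(1 + tf) * log2(n_docs / freq), then normalize to unit length."""
    weights = {}
    for term, tf in term_counts.items():
        idf = math.log2(n_docs / term_freq_in_collection[term])
        weights[term] = math.log(1 + tf) * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return {term: w / norm for term, w in weights.items()}


# Made-up numbers: 8 documents; term 1 has frequency 2, term 2 has frequency 4.
doc = {1: 3, 2: 1}
collection_freq = {1: 2, 2: 4}
vec = transform_instance(doc, n_docs=8, term_freq_in_collection=collection_freq)
```

As the source itself notes, the published scaled files could only be approximately regenerated, so this sketch should not be expected to reproduce them exactly.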
segment
- Source: Statlog / Segment
- # of classes: 7
- # of data: 2,310
- # of features: 19
- Files:
Sensorless
- Source: UCI / Dataset for Sensorless Drive Diagnosis
- Preprocessing: The original data does not have test instances. For the [0,1]-scaled version we have a random split (.tr and .val) used in our paper. [CCW16a]
- # of classes: 11
- # of data: 58,509
- # of features: 48
- Files:
- Sensorless
- Sensorless.scale (scaled to [0,1])
- Sensorless.scale.tr
- Sensorless.scale.val
shuttle
- Source: Statlog / Shuttle
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 7
- # of data: 43,500 / 14,500 (testing) / 30,450 (tr) / 13,050 (val)
- # of features: 9
- Files:
- shuttle.scale
- shuttle.scale.t (testing)
- shuttle.scale.tr (tr)
- shuttle.scale.val (val)
smallNORB
- Source: The Small NORB Dataset [YL04b]
- Preprocessing: Each instance contains a pair of 96x96 grayscale images, one from each of two cameras, which we treat as two channels. We downsample each channel of the original data from 96x96 to 32x32 by selecting the maximum pixel value within every 3x3 disjoint region. Feature values are generated by (row 1, channel 1), (row 2, channel 1), ..., (row 1, channel 2), ... [CCW18a]
- # of classes: 5
- # of data: 24,300 / 24,300 (testing)
- # of features: 18,432 / 2,048 (downsampled)
- Files:
- smallNORB.xz
- smallNORB.t.xz (testing)
- smallNORB-32x32.xz (downsampled)
- smallNORB-32x32.t.xz (downsampled, testing)
- smallNORB-32x32.mat (dense matlab format)
- smallNORB-32x32.t.mat (testing, dense matlab format)
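The smallNORB downsampling step (maximum over every disjoint 3x3 region) can be sketched as follows for a single channel. The nested-list image layout and the helper name `max_pool` are assumptions, and the image sides are assumed to be divisible by the block size.

```python
def max_pool(image, block=3):
    """Downsample by taking the maximum over disjoint block x block regions."""
    height = len(image)
    width = len(image[0])
    pooled = []
    for r in range(0, height, block):
        row = []
        for c in range(0, width, block):
            row.append(max(image[i][j]
                           for i in range(r, r + block)
                           for j in range(c, c + block)))
        pooled.append(row)
    return pooled


# Synthetic 96x96 channel whose pixel value encodes its position.
image = [[r * 96 + c for c in range(96)] for r in range(96)]
small = max_pool(image)
```

Two 32x32 channels give 2 * 32 * 32 = 2,048 features, matching the downsampled feature count listed above.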
SVHN
- Source: SVHN [YN11a]
- Preprocessing: We consider format 2 (cropped digits) of the data set. For every image, we convert the 32x32 pixels to feature values channel by channel in RGB order, and row by row within each channel. That is, (row 1, R), (row 2, R), ..., (row 1, G), ... [YN11a]
- # of classes: 10
- # of data: 73,257 / 26,032 (testing) / 531,131 (extra)
- # of features: 3,072
- Files:
- SVHN.xz
- SVHN.t.xz (testing)
- SVHN.extra.xz (extra data from the original source)
- SVHN.scale.xz (scaled to [0,1] by dividing each feature by 255)
- SVHN.scale.t.xz (testing) (scaled to [0,1] by dividing each feature by 255)
- SVHN.scale.extra.xz (scaled to [0,1] by dividing each feature by 255)
- SVHN.mat (dense matlab format)
- SVHN.t.mat (testing, dense matlab format)
svmguide2
- Source: [CWH03a]
- Preprocessing: Original data: a bioinformatics application from Simon Fraser University, Canada. [JLG03a]
- # of classes: 3
- # of data: 391
- # of features: 20
- Files:
svmguide4
- Source: [CWH03a]
- Preprocessing: Original data: an application on traffic light signals from Georges Bonga at University of Applied Sciences, Berlin.
- # of classes: 6
- # of data: 300 / 312 (testing)
- # of features: 10
- Files:
- svmguide4
- svmguide4.t (testing)
usps
- Source: [JJH94a]
- # of classes: 10
- # of data: 7,291 / 2,007 (testing)
- # of features: 256
- Files:
- usps.bz2
- usps.t.bz2 (testing)
SensIT Vehicle (acoustic)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features with the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets are from a random 80%/20% split of the data. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 50
- Files:
- acoustic
- acoustic.t (testing)
- acoustic_scale (scaled to [-1,1])
- acoustic_scale.t (testing)
SensIT Vehicle (seismic)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features with the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets are from a random 80%/20% split of the data. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 50
- Files:
- seismic
- seismic.t (testing)
- seismic_scale (scaled to [-1,1])
- seismic_scale.t (testing)
SensIT Vehicle (combined)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features with the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets are from a random 80%/20% split of the data. The first 50 features are acoustic, while the rest are seismic. Due to the random selection, files here are not the direct concatenation of the "SensIT Vehicle (acoustic)" and "SensIT Vehicle (seismic)" sets. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 100
- Files:
- combined
- combined.t (testing)
- combined_scale (scaled to [-1,1])
- combined_scale.t (testing)
vehicle
- Source: Statlog / Vehicle
- # of classes: 4
- # of data: 846
- # of features: 18
- Files:
- vehicle.original (original)
- vehicle.scale (scaled to [-1,1])
vowel
- Source: UCI / Vowel
- Preprocessing: The first 528 instances are used for training and the remaining instances are used for testing. The training data are scaled first, and the testing data are then adjusted accordingly.
- # of classes: 11
- # of data: 528 / 462 (testing)
- # of features: 10
- Files:
- vowel
- vowel.t (testing)
- vowel.scale (scaled to [-1,1])
- vowel.scale.t (testing)
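Scaling the training data first and then adjusting the testing data accordingly, as described for vowel, can be sketched as below. The per-feature minimum and maximum are taken from the training rows only and reused for the test rows. The helper names, the toy numbers, and the handling of constant features are assumptions, not the tool used to build the files.

```python
def fit_range(train_rows):
    """Record the (min, max) of each feature on the training data only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature to [lower, upper] using the stored ranges."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                new_row.append(0.0)  # constant feature: this choice is an assumption
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]
ranges = fit_range(train)
train_scaled = scale_rows(train, ranges)
test_scaled = scale_rows(test, ranges)
```

Because the test rows reuse the training ranges, a test value outside the training minimum or maximum maps outside [-1,1], as the second test feature does here.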
wine
- Source: UCI / Wine Recognition
- # of classes: 3
- # of data: 178
- # of features: 13
- Files: