UCI Machine Learning datasets for R
Go to file
Andrzej Wójtowicz f7debcd154 in census-income grouped education, filtered occupation and removed native country variable;
added script to make release zip
2016-08-19 21:51:21 +02:00
data-collection in census-income grouped education, filtered occupation and removed native country variable; 2016-08-19 21:51:21 +02:00
.gitignore in census-income grouped education, filtered occupation and removed native country variable; 2016-08-19 21:51:21 +02:00
README.md in census-income grouped education, filtered occupation and removed native country variable; 2016-08-19 21:51:21 +02:00
config.R in census-income grouped education, filtered occupation and removed native country variable; 2016-08-19 21:51:21 +02:00
init.R in census-income grouped education, filtered occupation and removed native country variable; 2016-08-19 21:51:21 +02:00
s1-download-data.R added mushroom and census income datasets; 2016-08-11 18:15:25 +02:00
s2-preprocess-data.R added mushroom and census income datasets; 2016-08-11 18:15:25 +02:00
s3-make-readme.Rmd in census-income grouped education, filtered occupation and removed native country variable; 2016-08-19 21:51:21 +02:00
s4-make-release.sh in census-income grouped education, filtered occupation and removed native country variable; 2016-08-19 21:51:21 +02:00
uci-ml-to-r.Rproj Added first version for 10 UCI datasets 2016-04-15 15:44:49 +02:00
utils.R added mushroom and census income datasets; 2016-08-11 18:15:25 +02:00

README.md

UCI Machine Learning datasets for R

Andrzej Wójtowicz

Document generation date: 2016-08-19 21:47:14.

This project preprocesses a few datasets from UC Irvine Machine Learning Repository into tidy R object files. It focuses on the binary classification datasets and saves only complete cases within a dataset.

R software: Microsoft R Open (3.3.0)

Reproducibility library: checkpoint

Reproducibility procedure:

  1. Run s1-download-data.R to download original datasets.
  2. Run s2-preprocess-data.R to preprocess the datasets.

Optionally:

  1. knit s3-make-readme.Rmd to get an overview of the preprocessed datasets,
  2. run s4-make-release.sh to create zip file with preprocessed datasets.

Table of Contents

  1. Bank Marketing
  2. Breast Cancer Wisconsin (Diagnostic)
  3. Breast Cancer Wisconsin (Original)
  4. Cardiotocography
  5. Census income
  6. Default of credit card clients
  7. ILPD (Indian Liver Patient Dataset)
  8. MAGIC Gamma Telescope
  9. Mushroom
  10. Seismic bumps
  11. Spambase
  12. Wine Quality

Bank Marketing

Local directory: bank-marketing

Details: link

Source data files:

Cite:

S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Dataset:

'data.frame':	43193 obs. of  16 variables:
 $ age      : int  58 44 33 35 28 42 58 43 41 29 ...
 $ job      : Factor w/ 11 levels "admin","blue.collar",..: 5 10 3 5 5 3 6 10 1 1 ...
 $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 1 2 3 1 3 ...
 $ education: Ord.factor w/ 3 levels "primary"<"secondary"<..: 3 2 2 3 3 3 1 2 2 2 ...
 $ balance  : int  2143 29 2 231 447 2 121 593 270 390 ...
 $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
 $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
 $ month    : Ord.factor w/ 12 levels "jan"<"feb"<"mar"<..: 5 5 5 5 5 5 5 5 5 5 ...
 $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays    : int  999 999 999 999 999 999 999 999 999 999 ...
 $ pdays.bin: Factor w/ 2 levels "successful","never": 2 2 2 2 2 2 2 2 2 2 ...
 $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Predictors:

Type Frequency
factor 7
integer 6
ordered factor 2

Class imbalance:

class A class B
12 % 88 %
5021 38172

Breast Cancer Wisconsin (Diagnostic)

Local directory: breast-cancer-wisconsin-diagnostic

Details: link

Source data files:

Cite:

https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" } 

Dataset:

'data.frame':	569 obs. of  31 variables:
 $ mean.radius            : num  18 20.6 19.7 11.4 20.3 ...
 $ mean.texture           : num  10.4 17.8 21.2 20.4 14.3 ...
 $ mean.perimeter         : num  122.8 132.9 130 77.6 135.1 ...
 $ mean.area              : num  1001 1326 1203 386 1297 ...
 $ mean.smoothness        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
 $ mean.compactness       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
 $ mean.concavity         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
 $ mean.concave.points    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
 $ mean.symmetry          : num  0.242 0.181 0.207 0.26 0.181 ...
 $ mean.fractal.dimension : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
 $ se.radius              : num  1.095 0.543 0.746 0.496 0.757 ...
 $ se.texture             : num  0.905 0.734 0.787 1.156 0.781 ...
 $ se.perimeter           : num  8.59 3.4 4.58 3.44 5.44 ...
 $ se.area                : num  153.4 74.1 94 27.2 94.4 ...
 $ se.smoothness          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
 $ se.compactness         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
 $ se.concavity           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
 $ se.concave.points      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
 $ se.symmetry            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
 $ se.fractal.dimension   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
 $ worst.radius           : num  25.4 25 23.6 14.9 22.5 ...
 $ worst.texture          : num  17.3 23.4 25.5 26.5 16.7 ...
 $ worst.perimeter        : num  184.6 158.8 152.5 98.9 152.2 ...
 $ worst.area             : num  2019 1956 1709 568 1575 ...
 $ worst.smoothness       : num  0.162 0.124 0.144 0.21 0.137 ...
 $ worst.compactness      : num  0.666 0.187 0.424 0.866 0.205 ...
 $ worst.concavity        : num  0.712 0.242 0.45 0.687 0.4 ...
 $ worst.concave.points   : num  0.265 0.186 0.243 0.258 0.163 ...
 $ worst.symmetry         : num  0.46 0.275 0.361 0.664 0.236 ...
 $ worst.fractal.dimension: num  0.1189 0.089 0.0876 0.173 0.0768 ...
 $ diagnosis              : Factor w/ 2 levels "b","m": 2 2 2 2 2 2 2 2 2 2 ...

Predictors:

Type Frequency
numeric 30

Class imbalance:

class A class B
37 % 63 %
212 357

Breast Cancer Wisconsin (Original)

Local directory: breast-cancer-wisconsin-original

Details: link

Source data files:

Cite:

O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

Dataset:

'data.frame':	683 obs. of  10 variables:
 $ clump.thickness            : int  5 5 3 6 4 8 1 2 2 4 ...
 $ uniformity.of.cell.size    : int  1 4 1 8 1 10 1 1 1 2 ...
 $ uniformity.of.cell.shape   : int  1 4 1 8 1 10 1 2 1 1 ...
 $ marginal.adhesion          : int  1 5 1 1 3 8 1 1 1 1 ...
 $ single.epithelial.cell.size: int  2 7 2 3 2 7 2 2 2 2 ...
 $ bare.nuclei                : int  2 3 4 6 2 3 3 2 2 2 ...
 $ bland.chromatin            : int  3 3 3 3 3 9 3 3 1 2 ...
 $ normal.nucleoli            : int  1 2 1 7 1 7 1 1 1 1 ...
 $ mitoses                    : int  1 1 1 1 1 1 1 1 5 1 ...
 $ class                      : Factor w/ 2 levels "x2","x4": 1 1 1 1 1 2 1 1 1 1 ...

Predictors:

Type Frequency
integer 9

Class imbalance:

class A class B
35 % 65 %
239 444

Cardiotocography

Local directory: cardiotocography

Details: link

Source data files:

Cite:

Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318 

Dataset:

'data.frame':	2126 obs. of  30 variables:
 $ lb      : int  120 132 133 134 132 134 134 122 122 122 ...
 $ ac      : int  0 4 2 2 4 1 1 0 0 0 ...
 $ fm      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ uc      : int  0 4 5 6 5 10 9 0 1 3 ...
 $ astv    : int  73 17 16 16 16 26 29 83 84 86 ...
 $ mstv    : num  0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
 $ altv    : int  43 0 0 0 0 0 0 6 5 6 ...
 $ mltv    : num  2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
 $ dl      : int  0 2 2 2 0 9 6 0 0 0 ...
 $ dp      : int  0 0 0 0 0 2 2 0 0 0 ...
 $ width   : int  64 130 130 117 117 150 150 68 68 68 ...
 $ min     : int  62 68 68 53 53 50 50 62 62 62 ...
 $ max     : int  126 198 198 170 170 200 200 130 130 130 ...
 $ nmax    : int  2 6 5 11 9 5 6 0 0 1 ...
 $ nzeros  : int  0 1 1 0 0 3 3 0 0 0 ...
 $ mode    : int  120 141 141 137 137 76 71 122 122 122 ...
 $ mean    : int  137 136 135 134 136 107 107 122 122 122 ...
 $ median  : int  121 140 138 137 138 107 106 123 123 123 ...
 $ variance: int  73 12 13 13 11 170 215 3 3 1 ...
 $ tendency: Ord.factor w/ 3 levels "x.1"<"x0"<"x1": 3 2 2 3 3 2 2 3 3 3 ...
 $ a       : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
 $ b       : Factor w/ 2 levels "x0","x1": 1 1 1 1 2 1 1 1 1 1 ...
 $ c       : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
 $ d       : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
 $ e       : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ad      : Factor w/ 2 levels "x0","x1": 1 2 2 2 1 1 1 1 1 1 ...
 $ de      : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ld      : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 2 2 1 1 1 ...
 $ fs      : Factor w/ 2 levels "x0","x1": 2 1 1 1 1 1 1 2 2 2 ...
 $ nsp     : Factor w/ 2 levels "x1","x3": 2 1 1 1 1 2 2 2 2 2 ...

Predictors:

Type Frequency
factor 9
integer 17
numeric 2
ordered factor 1

Class imbalance:

class A class B
22 % 78 %
471 1655

Census income

Local directory: census-income

Details: link

Source data files:

Cite:

https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" } 

Dataset:

'data.frame':	46018 obs. of  13 variables:
 $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
 $ workclass     : Factor w/ 7 levels "federal.gov",..: 6 5 3 3 3 3 3 5 3 3 ...
 $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
 $ education     : Ord.factor w/ 5 levels "school"<"highschool"<..: 4 4 2 1 4 5 1 2 5 4 ...
 $ marital.status: Factor w/ 7 levels "divorced","married.af.spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ occupation    : Factor w/ 13 levels "adm.clerical",..: 1 3 5 5 9 3 7 3 9 3 ...
 $ relationship  : Factor w/ 6 levels "husband","not.in.family",..: 2 1 2 1 6 6 2 1 2 1 ...
 $ race          : Factor w/ 5 levels "amer.indian.eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
 $ sex           : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 2 1 2 ...
 $ capital.gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
 $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hours.per.week: int  40 13 40 40 40 40 16 45 50 40 ...
 $ class         : Factor w/ 2 levels "x..50k","x.50k": 1 1 1 1 1 1 1 2 2 2 ...

Predictors:

Type Frequency
factor 6
integer 5
ordered factor 1

Class imbalance:

class A class B
25 % 75 %
11417 34601

Default of credit card clients

Local directory: credit-card

Details: link

Source data files:

Cite:

Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

Dataset:

'data.frame':	30000 obs. of  24 variables:
 $ limit.bal                 : int  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
 $ sex                       : Factor w/ 2 levels "x1","x2": 2 2 2 2 1 1 1 2 2 1 ...
 $ education                 : Factor w/ 7 levels "x0","x1","x2",..: 3 3 3 3 3 2 2 3 4 4 ...
 $ marriage                  : Factor w/ 4 levels "x0","x1","x2",..: 2 3 3 2 2 3 3 3 2 3 ...
 $ age                       : int  24 26 34 37 57 37 29 23 28 35 ...
 $ pay.0                     : int  2 0 0 0 0 0 0 0 0 0 ...
 $ pay.2                     : int  2 2 0 0 0 0 0 0 0 0 ...
 $ pay.3                     : int  0 0 0 0 0 0 0 0 2 0 ...
 $ pay.4                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pay.5                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pay.6                     : int  0 2 0 0 0 0 0 0 0 0 ...
 $ bill.amt1                 : int  3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
 $ bill.amt2                 : int  3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
 $ bill.amt3                 : int  689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
 $ bill.amt4                 : int  0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
 $ bill.amt5                 : int  0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
 $ bill.amt6                 : int  0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
 $ pay.amt1                  : int  0 0 1518 2000 2000 2500 55000 380 3329 0 ...
 $ pay.amt2                  : int  689 1000 1500 2019 36681 1815 40000 601 0 0 ...
 $ pay.amt3                  : int  0 1000 1000 1200 10000 657 38000 0 432 0 ...
 $ pay.amt4                  : int  0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
 $ pay.amt5                  : int  0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
 $ pay.amt6                  : int  0 2000 5000 1000 679 800 13770 1542 1000 0 ...
 $ default.payment.next.month: Factor w/ 2 levels "x0","x1": 2 2 1 1 1 1 1 1 1 1 ...

Predictors:

Type Frequency
factor 3
integer 20

Class imbalance:

class A class B
22 % 78 %
6636 23364

ILPD (Indian Liver Patient Dataset)

Local directory: indian-liver

Details: link

Source data files:

Cite:

https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" } 

Dataset:

'data.frame':	583 obs. of  11 variables:
 $ age      : int  65 62 62 58 72 46 26 29 17 55 ...
 $ gender   : Factor w/ 2 levels "female","male": 1 2 2 2 2 2 1 1 2 2 ...
 $ tb       : num  0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.9 0.7 ...
 $ db       : num  0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.3 0.2 ...
 $ alkphos  : int  187 699 490 182 195 208 154 202 202 290 ...
 $ sgpt     : int  16 64 60 14 27 19 16 14 22 53 ...
 $ sgot     : int  18 100 68 20 59 14 12 11 19 58 ...
 $ tp       : num  6.8 7.5 7 6.8 7.3 7.6 7 6.7 7.4 6.8 ...
 $ alb      : num  3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 4.1 3.4 ...
 $ a.g.ratio: num  0.9 0.74 0.89 1 0.4 1.3 1 1.1 1.2 1 ...
 $ selector : Factor w/ 2 levels "x1","x2": 1 1 1 1 1 1 1 1 2 1 ...

Predictors:

Type Frequency
factor 1
integer 4
numeric 5

Class imbalance:

class A class B
29 % 71 %
167 416

MAGIC Gamma Telescope

Local directory: magic

Details: link

Source data files:

Cite:

https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" } 

Dataset:

'data.frame':	19020 obs. of  11 variables:
 $ flength : num  28.8 31.6 162.1 23.8 75.1 ...
 $ fwidth  : num  16 11.72 136.03 9.57 30.92 ...
 $ fsize   : num  2.64 2.52 4.06 2.34 3.16 ...
 $ fconc   : num  0.3918 0.5303 0.0374 0.6147 0.3168 ...
 $ fconc1  : num  0.1982 0.3773 0.0187 0.3922 0.1832 ...
 $ fasym   : num  27.7 26.27 116.74 27.21 -5.53 ...
 $ fm3long : num  22.01 23.82 -64.86 -6.46 28.55 ...
 $ fm3trans: num  -8.2 -9.96 -45.22 -7.15 21.84 ...
 $ falpha  : num  40.09 6.36 76.96 10.45 4.65 ...
 $ fdist   : num  81.9 205.3 256.8 116.7 356.5 ...
 $ class   : Factor w/ 2 levels "g","h": 1 1 1 1 1 1 1 1 1 1 ...

Predictors:

Type Frequency
numeric 10

Class imbalance:

class A class B
35 % 65 %
6688 12332

Mushroom

Local directory: mushroom

Details: link

Source data files:

Cite:

https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" } 

Dataset:

'data.frame':	5644 obs. of  22 variables:
 $ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
 $ cap.surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
 $ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
 $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
 $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
 $ gill.attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
 $ gill.spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
 $ gill.size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
 $ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
 $ stalk.shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
 $ stalk.root              : Factor w/ 4 levels "b","c","e","r": 3 2 2 3 3 2 2 2 3 2 ...
 $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ stalk.color.above.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ ring.number             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
 $ spore.print.color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
 $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
 $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
 $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...

Predictors:

Type Frequency
factor 20
integer 1

Class imbalance:

class A class B
38 % 62 %
2156 3488

Seismic bumps

Local directory: seismic-bumps

Details: link

Source data files:

Cite:

Sikora M., Wrobel L.: Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines. Archives of Mining Sciences, 55(1), 2010, 91-114.

Dataset:

'data.frame':	2584 obs. of  16 variables:
 $ seismic       : Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
 $ seismoacoustic: Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
 $ shift         : Factor w/ 2 levels "n","w": 1 1 1 1 1 2 2 1 1 2 ...
 $ genergy       : int  15180 14720 8050 28820 12640 63760 207930 48990 100190 247620 ...
 $ gpuls         : int  48 33 30 171 57 195 614 194 303 675 ...
 $ gdenergy      : int  -72 -70 -81 -23 -63 -73 -6 -27 54 4 ...
 $ gdpuls        : int  -72 -79 -78 40 -52 -65 18 -3 52 25 ...
 $ ghazard       : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
 $ nbumps        : int  0 1 0 1 0 0 2 1 0 1 ...
 $ nbumps2       : int  0 0 0 0 0 0 2 0 0 1 ...
 $ nbumps3       : int  0 1 0 1 0 0 0 1 0 0 ...
 $ nbumps4       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nbumps5       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ energy        : int  0 2000 0 3000 0 0 1000 4000 0 500 ...
 $ maxenergy     : int  0 2000 0 3000 0 0 700 4000 0 500 ...
 $ class         : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...

Predictors:

Type Frequency
factor 4
integer 11

Class imbalance:

class A class B
7 % 93 %
170 2414

Spambase

Local directory: spambase

Details: link

Source data files:

Cite:

https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }

Dataset:

'data.frame':	4601 obs. of  58 variables:
 $ word.freq.make            : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
 $ word.freq.address         : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
 $ word.freq.all             : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
 $ word.freq.3d              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.our             : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
 $ word.freq.over            : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
 $ word.freq.remove          : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
 $ word.freq.internet        : num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
 $ word.freq.order           : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
 $ word.freq.mail            : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
 $ word.freq.receive         : num  0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
 $ word.freq.will            : num  0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
 $ word.freq.people          : num  0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
 $ word.freq.report          : num  0 0.21 0 0 0 0 0 0 0 0 ...
 $ word.freq.addresses       : num  0 0.14 1.75 0 0 0 0 0 0 0.12 ...
 $ word.freq.free            : num  0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
 $ word.freq.business        : num  0 0.07 0.06 0 0 0 0 0 0 0 ...
 $ word.freq.email           : num  1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
 $ word.freq.you             : num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
 $ word.freq.credit          : num  0 0 0.32 0 0 0 0 0 3.53 0.06 ...
 $ word.freq.your            : num  0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
 $ word.freq.font            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.000             : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
 $ word.freq.money           : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
 $ word.freq.hp              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.hpl             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.george          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.650             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.lab             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.labs            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.telnet          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.857             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.data            : num  0 0 0 0 0 0 0 0 0.15 0 ...
 $ word.freq.415             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.85              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.technology      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.1999            : num  0 0.07 0 0 0 0 0 0 0 0 ...
 $ word.freq.parts           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.pm              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.direct          : num  0 0 0.06 0 0 0 0 0 0 0 ...
 $ word.freq.cs              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.meeting         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.original        : num  0 0 0.12 0 0 0 0 0 0.3 0 ...
 $ word.freq.project         : num  0 0 0 0 0 0 0 0 0 0.06 ...
 $ word.freq.re              : num  0 0 0.06 0 0 0 0 0 0 0 ...
 $ word.freq.edu             : num  0 0 0.06 0 0 0 0 0 0 0 ...
 $ word.freq.table           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ word.freq.conference      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ char.freq..               : num  0 0 0.01 0 0 0 0 0 0 0.04 ...
 $ char.freq...1             : num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
 $ char.freq...2             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ char.freq...3             : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
 $ char.freq...4             : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
 $ char.freq...5             : num  0 0.048 0.01 0 0 0 0 0 0.022 0 ...
 $ capital.run.length.average: num  3.76 5.11 9.82 3.54 3.54 ...
 $ capital.run.length.longest: int  61 101 485 40 40 15 4 11 445 43 ...
 $ capital.run.length.total  : int  278 1028 2259 191 191 54 112 49 1257 749 ...
 $ class                     : Factor w/ 2 levels "x0","x1": 2 2 2 2 2 2 2 2 2 2 ...

Predictors:

Type Frequency
integer 2
numeric 55

Class imbalance:

class A class B
39 % 61 %
1813 2788

Wine Quality

Local directory: wine-quality

Details: link

Source data files:

Cite:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Dataset:

'data.frame':	6497 obs. of  13 variables:
 $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ ph                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ color               : Factor w/ 2 levels "red","white": 2 2 2 2 2 2 2 2 2 2 ...
 $ quality             : Factor w/ 2 levels "x0","x1": 2 2 2 2 2 2 2 2 2 2 ...

Predictors:

Type Frequency
factor 1
numeric 11

Class imbalance:

class A class B
37 % 63 %
2384 4113