# UCI Machine Learning datasets for R
Andrzej Wójtowicz
Document generation date: 2016-08-19 21:47:14.
This project preprocesses a few datasets from the [UC Irvine Machine Learning
Repository](https://archive.ics.uci.edu/ml/) into tidy R object files.
It focuses on binary classification datasets and keeps only the complete cases
(rows without missing values) of each dataset.
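
As an illustration of the complete-case rule, the sketch below drops every row that contains a missing value and stores the result as an R object file. The toy data frame, the `.rds` format, and the file name are assumptions made for this example; the project's own preprocessing is done by *s2-preprocess-data.R*.

```r
# Toy data with missing values (hypothetical, not taken from any UCI file).
raw <- data.frame(
  age = c(58L, 44L, NA, 35L),
  job = factor(c("admin", NA, "services", "admin")),
  y   = factor(c("no", "no", "yes", "no"))
)

# Keep only the rows in which every column is observed.
tidy <- raw[complete.cases(raw), , drop = FALSE]
rownames(tidy) <- NULL

# Save as a single R object file and read it back (assumed format: .rds).
path <- file.path(tempdir(), "toy-dataset.rds")
saveRDS(tidy, path)
restored <- readRDS(path)

print(nrow(raw))   # rows before filtering
print(nrow(tidy))  # rows after filtering
```

`complete.cases()` returns one logical value per row, so the same filter works regardless of how many columns a dataset has.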
**R software**: [Microsoft R Open](https://mran.microsoft.com/open/) (3.3.0)
**Reproducibility library**: [checkpoint](https://github.com/RevolutionAnalytics/checkpoint)
**Reproducibility procedure**:
1. Run *s1-download-data.R* to download the original datasets.
2. Run *s2-preprocess-data.R* to preprocess the datasets.
Optionally:
3. knit *s3-make-readme.Rmd* to get an overview of the preprocessed datasets,
4. run *s4-make-release.sh* to create a zip file with the preprocessed datasets.
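
The two mandatory steps can also be chained from a single R session. The helper below is a hypothetical convenience wrapper, not part of the repository; it assumes that the scripts sit in one directory and resolve their paths relative to it.

```r
# Hypothetical helper: run the given scripts in order, stopping early if
# any of them is missing. Not part of the repository.
run_steps <- function(scripts, dir = ".") {
  paths <- file.path(dir, scripts)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }
  for (p in paths) {
    message("Running ", p)
    # chdir = TRUE evaluates each script from its own directory;
    # a fresh environment keeps its variables out of the workspace.
    source(p, chdir = TRUE, local = new.env())
  }
  invisible(paths)
}

# Usage from the repository root (commented out, since it downloads data):
# run_steps(c("s1-download-data.R", "s2-preprocess-data.R"))
# Step 3 could then be rendered with rmarkdown::render("s3-make-readme.Rmd").
```

Stopping before the first script runs avoids a partial download followed by a failed preprocessing step.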
# Table of Contents
1. [Bank Marketing](#bank-marketing)
1. [Breast Cancer Wisconsin (Diagnostic)](#breast-cancer-wisconsin-diagnostic)
1. [Breast Cancer Wisconsin (Original)](#breast-cancer-wisconsin-original)
1. [Cardiotocography](#cardiotocography)
1. [Census income](#census-income)
1. [Default of credit card clients](#default-of-credit-card-clients)
1. [ILPD (Indian Liver Patient Dataset)](#ilpd-indian-liver-patient-dataset)
1. [MAGIC Gamma Telescope](#magic-gamma-telescope)
1. [Mushroom](#mushroom)
1. [Seismic bumps](#seismic-bumps)
1. [Spambase](#spambase)
1. [Wine Quality](#wine-quality)
---
# Bank Marketing
**Local directory**: bank-marketing
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)
**Source data files**:
* [bank.zip](https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip)
**Cite**:
```nohighlight
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
```
**Dataset**:
```nohighlight
'data.frame': 43193 obs. of 16 variables:
$ age : int 58 44 33 35 28 42 58 43 41 29 ...
$ job : Factor w/ 11 levels "admin","blue.collar",..: 5 10 3 5 5 3 6 10 1 1 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 1 2 3 1 3 ...
$ education: Ord.factor w/ 3 levels "primary"<"secondary"<..: 3 2 2 3 3 3 1 2 2 2 ...
$ balance : int 2143 29 2 231 447 2 121 593 270 390 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : Ord.factor w/ 12 levels "jan"<"feb"<"mar"<..: 5 5 5 5 5 5 5 5 5 5 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
$ pdays.bin: Factor w/ 2 levels "successful","never": 2 2 2 2 2 2 2 2 2 2 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
```
**Predictors**:
|Type | Frequency|
|:--------------|---------:|
|factor | 7|
|integer | 6|
|ordered factor | 2|
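
A predictor summary like the one above can be derived from the column classes of a data frame. The sketch below is an assumption about how such a table might be built, shown on a toy data frame; it treats the last column as the class label, which matches the layout of the datasets listed in this document.

```r
# Toy data frame (hypothetical); the last column plays the role of the label.
toy <- data.frame(
  age       = c(58L, 44L, 33L),
  balance   = c(2143L, 29L, 2L),
  marital   = factor(c("married", "single", "married")),
  education = factor(c("tertiary", "secondary", "secondary"),
                     levels = c("primary", "secondary", "tertiary"),
                     ordered = TRUE),
  y         = factor(c("no", "no", "yes"))
)

# Drop the label column, then describe each predictor by its class vector.
# An ordered factor has class c("ordered", "factor"), hence "ordered factor".
predictors <- toy[, -ncol(toy), drop = FALSE]
types <- vapply(predictors,
                function(col) paste(class(col), collapse = " "),
                character(1))

# Count how often each type occurs.
freq <- as.data.frame(table(Type = types), responseName = "Frequency")
print(freq)
```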
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 12 % | 88 % |
| 5021 | 38172 |
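
The percentages and counts above can be reproduced from a label vector with `table()`. The sketch below rebuilds a label vector from the two counts in the table; mapping the smaller group to `"yes"` is an assumption, since the table names the classes only as A and B.

```r
# Rebuild a label vector from the reported counts (label mapping assumed).
y <- factor(rep(c("no", "yes"), times = c(38172, 5021)))

# Absolute counts per class and their rounded percentage shares.
counts <- table(y)
imbalance <- data.frame(
  class   = names(counts),
  percent = as.vector(round(100 * counts / sum(counts))),
  count   = as.vector(counts)
)
print(imbalance)
```

The counts add up to the 43193 observations reported for the dataset.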
---
# Breast Cancer Wisconsin (Diagnostic)
**Local directory**: breast-cancer-wisconsin-diagnostic
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)
**Source data files**:
* [wdbc.data](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data)
* [wdbc.names](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names)
**Cite**:
```nohighlight
https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }
```
**Dataset**:
```nohighlight
'data.frame': 569 obs. of 31 variables:
$ mean.radius : num 18 20.6 19.7 11.4 20.3 ...
$ mean.texture : num 10.4 17.8 21.2 20.4 14.3 ...
$ mean.perimeter : num 122.8 132.9 130 77.6 135.1 ...
$ mean.area : num 1001 1326 1203 386 1297 ...
$ mean.smoothness : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
$ mean.compactness : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
$ mean.concavity : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
$ mean.concave.points : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
$ mean.symmetry : num 0.242 0.181 0.207 0.26 0.181 ...
$ mean.fractal.dimension : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
$ se.radius : num 1.095 0.543 0.746 0.496 0.757 ...
$ se.texture : num 0.905 0.734 0.787 1.156 0.781 ...
$ se.perimeter : num 8.59 3.4 4.58 3.44 5.44 ...
$ se.area : num 153.4 74.1 94 27.2 94.4 ...
$ se.smoothness : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
$ se.compactness : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
$ se.concavity : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
$ se.concave.points : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
$ se.symmetry : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
$ se.fractal.dimension : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
$ worst.radius : num 25.4 25 23.6 14.9 22.5 ...
$ worst.texture : num 17.3 23.4 25.5 26.5 16.7 ...
$ worst.perimeter : num 184.6 158.8 152.5 98.9 152.2 ...
$ worst.area : num 2019 1956 1709 568 1575 ...
$ worst.smoothness : num 0.162 0.124 0.144 0.21 0.137 ...
$ worst.compactness : num 0.666 0.187 0.424 0.866 0.205 ...
$ worst.concavity : num 0.712 0.242 0.45 0.687 0.4 ...
$ worst.concave.points : num 0.265 0.186 0.243 0.258 0.163 ...
$ worst.symmetry : num 0.46 0.275 0.361 0.664 0.236 ...
$ worst.fractal.dimension: num 0.1189 0.089 0.0876 0.173 0.0768 ...
$ diagnosis : Factor w/ 2 levels "b","m": 2 2 2 2 2 2 2 2 2 2 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|numeric | 30|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 37 % | 63 % |
| 212 | 357 |
---
# Breast Cancer Wisconsin (Original)
**Local directory**: breast-cancer-wisconsin-original
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29)
**Source data files**:
* [breast-cancer-wisconsin.data](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data)
* [breast-cancer-wisconsin.names](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names)
**Cite**:
```nohighlight
O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
```
**Dataset**:
```nohighlight
'data.frame': 683 obs. of 10 variables:
$ clump.thickness : int 5 5 3 6 4 8 1 2 2 4 ...
$ uniformity.of.cell.size : int 1 4 1 8 1 10 1 1 1 2 ...
$ uniformity.of.cell.shape : int 1 4 1 8 1 10 1 2 1 1 ...
$ marginal.adhesion : int 1 5 1 1 3 8 1 1 1 1 ...
$ single.epithelial.cell.size: int 2 7 2 3 2 7 2 2 2 2 ...
$ bare.nuclei : int 2 3 4 6 2 3 3 2 2 2 ...
$ bland.chromatin : int 3 3 3 3 3 9 3 3 1 2 ...
$ normal.nucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
$ mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
$ class : Factor w/ 2 levels "x2","x4": 1 1 1 1 1 2 1 1 1 1 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|integer | 9|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 35 % | 65 % |
| 239 | 444 |
---
# Cardiotocography
**Local directory**: cardiotocography
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/Cardiotocography)
**Source data files**:
* [CTG.xls](https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls)
**Cite**:
```nohighlight
Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318
```
**Dataset**:
```nohighlight
'data.frame': 2126 obs. of 30 variables:
$ lb : int 120 132 133 134 132 134 134 122 122 122 ...
$ ac : int 0 4 2 2 4 1 1 0 0 0 ...
$ fm : int 0 0 0 0 0 0 0 0 0 0 ...
$ uc : int 0 4 5 6 5 10 9 0 1 3 ...
$ astv : int 73 17 16 16 16 26 29 83 84 86 ...
$ mstv : num 0.5 2.1 2.1 2.4 2.4 5.9 6.3 0.5 0.5 0.3 ...
$ altv : int 43 0 0 0 0 0 0 6 5 6 ...
$ mltv : num 2.4 10.4 13.4 23 19.9 0 0 15.6 13.6 10.6 ...
$ dl : int 0 2 2 2 0 9 6 0 0 0 ...
$ dp : int 0 0 0 0 0 2 2 0 0 0 ...
$ width : int 64 130 130 117 117 150 150 68 68 68 ...
$ min : int 62 68 68 53 53 50 50 62 62 62 ...
$ max : int 126 198 198 170 170 200 200 130 130 130 ...
$ nmax : int 2 6 5 11 9 5 6 0 0 1 ...
$ nzeros : int 0 1 1 0 0 3 3 0 0 0 ...
$ mode : int 120 141 141 137 137 76 71 122 122 122 ...
$ mean : int 137 136 135 134 136 107 107 122 122 122 ...
$ median : int 121 140 138 137 138 107 106 123 123 123 ...
$ variance: int 73 12 13 13 11 170 215 3 3 1 ...
$ tendency: Ord.factor w/ 3 levels "x.1"<"x0"<"x1": 3 2 2 3 3 2 2 3 3 3 ...
$ a : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
$ b : Factor w/ 2 levels "x0","x1": 1 1 1 1 2 1 1 1 1 1 ...
$ c : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
$ d : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
$ e : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
$ ad : Factor w/ 2 levels "x0","x1": 1 2 2 2 1 1 1 1 1 1 ...
$ de : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
$ ld : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 2 2 1 1 1 ...
$ fs : Factor w/ 2 levels "x0","x1": 2 1 1 1 1 1 1 2 2 2 ...
$ nsp : Factor w/ 2 levels "x1","x3": 2 1 1 1 1 2 2 2 2 2 ...
```
**Predictors**:
|Type | Frequency|
|:--------------|---------:|
|factor | 9|
|integer | 17|
|numeric | 2|
|ordered factor | 1|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 22 % | 78 % |
| 471 | 1655 |
---
# Census income
**Local directory**: census-income
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/Census+Income)
**Source data files**:
* [adult.data](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data)
* [adult.test](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test)
* [adult.names](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names)
**Cite**:
```nohighlight
https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }
```
**Dataset**:
```nohighlight
'data.frame': 46018 obs. of 13 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 7 levels "federal.gov",..: 6 5 3 3 3 3 3 5 3 3 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Ord.factor w/ 5 levels "school"<"highschool"<..: 4 4 2 1 4 5 1 2 5 4 ...
$ marital.status: Factor w/ 7 levels "divorced","married.af.spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 13 levels "adm.clerical",..: 1 3 5 5 9 3 7 3 9 3 ...
$ relationship : Factor w/ 6 levels "husband","not.in.family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "amer.indian.eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 2 1 2 ...
$ capital.gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hours.per.week: int 40 13 40 40 40 40 16 45 50 40 ...
$ class : Factor w/ 2 levels "x..50k","x.50k": 1 1 1 1 1 1 1 2 2 2 ...
```
**Predictors**:
|Type | Frequency|
|:--------------|---------:|
|factor | 6|
|integer | 5|
|ordered factor | 1|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 25 % | 75 % |
| 11417 | 34601 |
---
# Default of credit card clients
**Local directory**: credit-card
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)
**Source data files**:
* [default of credit card clients.xls](https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls)
**Cite**:
```nohighlight
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
```
**Dataset**:
```nohighlight
'data.frame': 30000 obs. of 24 variables:
$ limit.bal : int 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
$ sex : Factor w/ 2 levels "x1","x2": 2 2 2 2 1 1 1 2 2 1 ...
$ education : Factor w/ 7 levels "x0","x1","x2",..: 3 3 3 3 3 2 2 3 4 4 ...
$ marriage : Factor w/ 4 levels "x0","x1","x2",..: 2 3 3 2 2 3 3 3 2 3 ...
$ age : int 24 26 34 37 57 37 29 23 28 35 ...
$ pay.0 : int 2 0 0 0 0 0 0 0 0 0 ...
$ pay.2 : int 2 2 0 0 0 0 0 0 0 0 ...
$ pay.3 : int 0 0 0 0 0 0 0 0 2 0 ...
$ pay.4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ pay.5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ pay.6 : int 0 2 0 0 0 0 0 0 0 0 ...
$ bill.amt1 : int 3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
$ bill.amt2 : int 3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
$ bill.amt3 : int 689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
$ bill.amt4 : int 0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
$ bill.amt5 : int 0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
$ bill.amt6 : int 0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
$ pay.amt1 : int 0 0 1518 2000 2000 2500 55000 380 3329 0 ...
$ pay.amt2 : int 689 1000 1500 2019 36681 1815 40000 601 0 0 ...
$ pay.amt3 : int 0 1000 1000 1200 10000 657 38000 0 432 0 ...
$ pay.amt4 : int 0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
$ pay.amt5 : int 0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
$ pay.amt6 : int 0 2000 5000 1000 679 800 13770 1542 1000 0 ...
$ default.payment.next.month: Factor w/ 2 levels "x0","x1": 2 2 1 1 1 1 1 1 1 1 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|factor | 3|
|integer | 20|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 22 % | 78 % |
| 6636 | 23364 |
---
# ILPD (Indian Liver Patient Dataset)
**Local directory**: indian-liver
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset))
**Source data files**:
* [Indian Liver Patient Dataset (ILPD).csv](https://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian%20Liver%20Patient%20Dataset%20(ILPD).csv)
**Cite**:
```nohighlight
https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }
```
**Dataset**:
```nohighlight
'data.frame': 583 obs. of 11 variables:
$ age : int 65 62 62 58 72 46 26 29 17 55 ...
$ gender : Factor w/ 2 levels "female","male": 1 2 2 2 2 2 1 1 2 2 ...
$ tb : num 0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.9 0.7 ...
$ db : num 0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.3 0.2 ...
$ alkphos : int 187 699 490 182 195 208 154 202 202 290 ...
$ sgpt : int 16 64 60 14 27 19 16 14 22 53 ...
$ sgot : int 18 100 68 20 59 14 12 11 19 58 ...
$ tp : num 6.8 7.5 7 6.8 7.3 7.6 7 6.7 7.4 6.8 ...
$ alb : num 3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 4.1 3.4 ...
$ a.g.ratio: num 0.9 0.74 0.89 1 0.4 1.3 1 1.1 1.2 1 ...
$ selector : Factor w/ 2 levels "x1","x2": 1 1 1 1 1 1 1 1 2 1 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|factor | 1|
|integer | 4|
|numeric | 5|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 29 % | 71 % |
| 167 | 416 |
---
# MAGIC Gamma Telescope
**Local directory**: magic
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope)
**Source data files**:
* [magic04.data](https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data)
* [magic04.names](https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names)
**Cite**:
```nohighlight
https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }
```
**Dataset**:
```nohighlight
'data.frame': 19020 obs. of 11 variables:
$ flength : num 28.8 31.6 162.1 23.8 75.1 ...
$ fwidth : num 16 11.72 136.03 9.57 30.92 ...
$ fsize : num 2.64 2.52 4.06 2.34 3.16 ...
$ fconc : num 0.3918 0.5303 0.0374 0.6147 0.3168 ...
$ fconc1 : num 0.1982 0.3773 0.0187 0.3922 0.1832 ...
$ fasym : num 27.7 26.27 116.74 27.21 -5.53 ...
$ fm3long : num 22.01 23.82 -64.86 -6.46 28.55 ...
$ fm3trans: num -8.2 -9.96 -45.22 -7.15 21.84 ...
$ falpha : num 40.09 6.36 76.96 10.45 4.65 ...
$ fdist : num 81.9 205.3 256.8 116.7 356.5 ...
$ class : Factor w/ 2 levels "g","h": 1 1 1 1 1 1 1 1 1 1 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|numeric | 10|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 35 % | 65 % |
| 6688 | 12332 |
---
# Mushroom
**Local directory**: mushroom
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/Mushroom)
**Source data files**:
* [agaricus-lepiota.data](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data)
* [agaricus-lepiota.names](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names)
**Cite**:
```nohighlight
https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }
```
**Dataset**:
```nohighlight
'data.frame': 5644 obs. of 22 variables:
$ cap.shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ cap.surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ cap.color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
$ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ gill.attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ gill.spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ gill.size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ gill.color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ stalk.shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ stalk.root : Factor w/ 4 levels "b","c","e","r": 3 2 2 3 3 2 2 2 3 2 ...
$ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk.color.above.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ stalk.color.below.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ veil.color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ ring.number : int 1 1 1 1 1 1 1 1 1 1 ...
$ ring.type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
$ spore.print.color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
$ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
$ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
$ class : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|factor | 20|
|integer | 1|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 38 % | 62 % |
| 2156 | 3488 |
---
# Seismic bumps
**Local directory**: seismic-bumps
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/seismic-bumps)
**Source data files**:
* [seismic-bumps.arff](https://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff)
**Cite**:
```nohighlight
Sikora M., Wrobel L.: Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines. Archives of Mining Sciences, 55(1), 2010, 91-114.
```
**Dataset**:
```nohighlight
'data.frame': 2584 obs. of 16 variables:
$ seismic : Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
$ seismoacoustic: Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
$ shift : Factor w/ 2 levels "n","w": 1 1 1 1 1 2 2 1 1 2 ...
$ genergy : int 15180 14720 8050 28820 12640 63760 207930 48990 100190 247620 ...
$ gpuls : int 48 33 30 171 57 195 614 194 303 675 ...
$ gdenergy : int -72 -70 -81 -23 -63 -73 -6 -27 54 4 ...
$ gdpuls : int -72 -79 -78 40 -52 -65 18 -3 52 25 ...
$ ghazard : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
$ nbumps : int 0 1 0 1 0 0 2 1 0 1 ...
$ nbumps2 : int 0 0 0 0 0 0 2 0 0 1 ...
$ nbumps3 : int 0 1 0 1 0 0 0 1 0 0 ...
$ nbumps4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ nbumps5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ energy : int 0 2000 0 3000 0 0 1000 4000 0 500 ...
$ maxenergy : int 0 2000 0 3000 0 0 700 4000 0 500 ...
$ class : Factor w/ 2 levels "x0","x1": 1 1 1 1 1 1 1 1 1 1 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|factor | 4|
|integer | 11|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 7 % | 93 % |
| 170 | 2414 |
---
# Spambase
**Local directory**: spambase
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/Spambase)
**Source data files**:
* [spambase.DOCUMENTATION](https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.DOCUMENTATION)
* [spambase.data](https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data)
* [spambase.names](https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names)
**Cite**:
```nohighlight
https://archive.ics.uci.edu/ml/citation_policy.html
@misc{Lichman:2013 , author = "M. Lichman", year = "2013", title = "{UCI} Machine Learning Repository", url = "http://archive.ics.uci.edu/ml", institution = "University of California, Irvine, School of Information and Computer Sciences" }
```
**Dataset**:
```nohighlight
'data.frame': 4601 obs. of 58 variables:
$ word.freq.make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
$ word.freq.address : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
$ word.freq.all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
$ word.freq.3d : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
$ word.freq.over : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
$ word.freq.remove : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
$ word.freq.internet : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
$ word.freq.order : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
$ word.freq.mail : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
$ word.freq.receive : num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
$ word.freq.will : num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
$ word.freq.people : num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
$ word.freq.report : num 0 0.21 0 0 0 0 0 0 0 0 ...
$ word.freq.addresses : num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
$ word.freq.free : num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
$ word.freq.business : num 0 0.07 0.06 0 0 0 0 0 0 0 ...
$ word.freq.email : num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
$ word.freq.you : num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
$ word.freq.credit : num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
$ word.freq.your : num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
$ word.freq.font : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
$ word.freq.money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
$ word.freq.hp : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.hpl : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.george : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.650 : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.lab : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.labs : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.telnet : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.857 : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.data : num 0 0 0 0 0 0 0 0 0.15 0 ...
$ word.freq.415 : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.85 : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.technology : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.1999 : num 0 0.07 0 0 0 0 0 0 0 0 ...
$ word.freq.parts : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.pm : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.direct : num 0 0 0.06 0 0 0 0 0 0 0 ...
$ word.freq.cs : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.meeting : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.original : num 0 0 0.12 0 0 0 0 0 0.3 0 ...
$ word.freq.project : num 0 0 0 0 0 0 0 0 0 0.06 ...
$ word.freq.re : num 0 0 0.06 0 0 0 0 0 0 0 ...
$ word.freq.edu : num 0 0 0.06 0 0 0 0 0 0 0 ...
$ word.freq.table : num 0 0 0 0 0 0 0 0 0 0 ...
$ word.freq.conference : num 0 0 0 0 0 0 0 0 0 0 ...
$ char.freq.. : num 0 0 0.01 0 0 0 0 0 0 0.04 ...
$ char.freq...1 : num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
$ char.freq...2 : num 0 0 0 0 0 0 0 0 0 0 ...
$ char.freq...3 : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
$ char.freq...4 : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
$ char.freq...5 : num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
$ capital.run.length.average: num 3.76 5.11 9.82 3.54 3.54 ...
$ capital.run.length.longest: int 61 101 485 40 40 15 4 11 445 43 ...
$ capital.run.length.total : int 278 1028 2259 191 191 54 112 49 1257 749 ...
$ class : Factor w/ 2 levels "x0","x1": 2 2 2 2 2 2 2 2 2 2 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|integer | 2|
|numeric | 55|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 39 % | 61 % |
| 1813 | 2788 |
---
# Wine Quality
**Local directory**: wine-quality
**Details**: [link](https://archive.ics.uci.edu/ml/datasets/Wine+Quality)
**Source data files**:
* [winequality-red.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv)
* [winequality-white.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)
* [winequality.names](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names)
**Cite**:
```nohighlight
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
```
**Dataset**:
```nohighlight
'data.frame': 6497 obs. of 13 variables:
$ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
$ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
$ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
$ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
$ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
$ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
$ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
$ density : num 1.001 0.994 0.995 0.996 0.996 ...
$ ph : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
$ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
$ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
$ color : Factor w/ 2 levels "red","white": 2 2 2 2 2 2 2 2 2 2 ...
$ quality : Factor w/ 2 levels "x0","x1": 2 2 2 2 2 2 2 2 2 2 ...
```
**Predictors**:
|Type | Frequency|
|:-------|---------:|
|factor | 1|
|numeric | 11|
**Class imbalance**:
| class A | class B |
|:-------:|:-------:|
| 37 % | 63 % |
| 2384 | 4113 |
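
The `color` factor in the dataset suggests that the red and white source files were stacked into one data frame. The sketch below shows one way this could be done; the toy rows and the reduced column set are hypothetical, and the rule used to turn `quality` into a two-level factor is not shown because this document does not state it.

```r
# Toy stand-ins for the two source files (hypothetical values and columns).
red   <- data.frame(alcohol = c(9.4, 9.8), ph = c(3.51, 3.20))
white <- data.frame(alcohol = c(8.8, 9.5, 10.1), ph = c(3.00, 3.30, 3.26))

# Tag each part with its origin before stacking.
red$color   <- "red"
white$color <- "white"

# Stack the rows and store the origin as a two-level factor.
wine <- rbind(red, white)
wine$color <- factor(wine$color, levels = c("red", "white"))

print(table(wine$color))
```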
---