mirror of https://github.com/andre-wojtowicz/r-distributed-computing synced 2024-11-21 15:35:28 +01:00
A practical solution to massive parallel computing in R with caret machine learning and foreach examples
Andrzej Wójtowicz 2afc48e622 Create README.md
2016-11-08 16:17:36 +01:00
.gitignore Added core files 2016-11-08 13:49:28 +01:00
config-parallel.R Added core files 2016-11-08 13:49:28 +01:00
config.R Added core files 2016-11-08 13:49:28 +01:00
init-parallel.R Added core files 2016-11-08 13:49:28 +01:00
init.R Added core files 2016-11-08 13:49:28 +01:00
r-distributed-computing.Rproj Added core files 2016-11-08 13:49:28 +01:00
README.md Create README.md 2016-11-08 16:17:36 +01:00
remote-commands.sh Added core files 2016-11-08 13:49:28 +01:00
test.R Added core files 2016-11-08 13:49:28 +01:00
utils-parallel.R Added core files 2016-11-08 13:49:28 +01:00
utils.R Added core files 2016-11-08 13:49:28 +01:00

R distributed computing

In this repository I show a practical solution to massive distributed computing in R. I have tested this approach in my research with ~200 remote hosts and ~800 remote connections. The brief examples cover the caret package for building predictive models and the foreach loop for more general parallel computing.

The solution is based on:

Quick example

Suppose you have three Linux machines with at least 4 GB RAM:

  1. server (192.168.0.1; some Debian-based distribution),
  2. hosts (192.168.0.2 and 192.168.0.3; preferably WMI Rescue, a small Linux image based on Debian).

On the server, install the necessary system packages and R together with the project libraries:

[~/r-distributed-computing]$ sudo bash remote-commands.sh install_env install_mro
[~/r-distributed-computing]$ Rscript init.R

Then you prepare remote hosts:

[~/r-distributed-computing]$ echo "192.168.0.2
192.168.0.3" | bash remote-commands.sh configure_hosts

Finally, you run your calculations:

[~/r-distributed-computing]$ Rscript test.R
Scanning for packages used in this project
  |====================================================================| 100%
- Discovered 14 packages
All detected packages already installed
checkpoint process complete
---
INFO [2016-11-08 13:30:46] Creating remote PSOCK cluster
INFO [2016-11-08 13:30:46] [1/2] Connecting to 192.168.0.2 ... 
INFO [2016-11-08 13:30:46] OK
INFO [2016-11-08 13:30:46] [2/2] Connecting to 192.168.0.3 ... 
INFO [2016-11-08 13:30:47] OK
INFO [2016-11-08 13:30:47] Working on 2 nodes
INFO [2016-11-08 13:30:47] Exporting checkpoint constants
INFO [2016-11-08 13:30:54] Setting cluster RNG kind
INFO [2016-11-08 13:30:54] Registering cluster
INFO [2016-11-08 13:30:54] *************************
INFO [2016-11-08 13:30:54] Test foreach
[1] 1 2
INFO [2016-11-08 13:30:55] Test caret
Support Vector Machines with Linear Kernel 

32 samples
10 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 28, 29, 29, 29, 28, 29, ... 
Resampling results:

  RMSE      Rsquared 
  3.276949  0.8328294

Tuning parameter 'C' was held constant at a value of 1

INFO [2016-11-08 13:31:54] Workers shut down
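The steps named in the log above (create a PSOCK cluster, set the cluster RNG, register it, run a foreach test, shut the workers down) can be sketched roughly as follows. This is a minimal illustration, not the contents of test.R: it assumes two local workers standing in for 192.168.0.2 and 192.168.0.3, and it falls back to parLapply when foreach and doParallel are not installed.

```r
library(parallel)

# Assumption: two local workers replace the remote hosts from the example.
# With reachable hosts you would pass their addresses instead of a count.
cl <- makePSOCKcluster(2)

# Give every worker its own reproducible L'Ecuyer-CMRG random number stream.
clusterSetRNGStream(cl, iseed = 1)

if (requireNamespace("foreach", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  library(foreach)
  # Register the cluster as the backend used by %dopar%.
  doParallel::registerDoParallel(cl)
  res <- foreach(i = 1:2, .combine = c) %dopar% i
} else {
  # Fallback that uses only the base 'parallel' package.
  res <- unlist(parLapply(cl, 1:2, function(i) i))
}

print(res)

# Always release the workers and their connections.
stopCluster(cl)
```

Once a cluster is registered this way, caret's train() can use the same backend for its resampling loop, because its parallel processing goes through foreach.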

Customization

Bash control script

R project files

Working with more than 125 connections

Currently R has a hard-coded limit on the number of connections, which is effectively 125. If you want to work with more connections (nodes), you have to recompile and install a customized R on the server. To see how to set a custom limit, refer to my repository Microsoft R Open compilation for customised cluster nodes connection limit.
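To see how close a session is to this limit, you can count the open connections. The sketch below assumes a stock build with 128 connection slots, three of which are taken by stdin, stdout and stderr (hence the effective 125). The constant total_slots is an assumption written into the example, not a value queried from R. Each PSOCK worker occupies one connection on the server.

```r
# Assumption: a stock R build with 128 connection slots in total.
total_slots <- 128L
reserved <- 3L  # stdin, stdout and stderr

# One row per connection, including the three standard ones.
all_conns <- showConnections(all = TRUE)

user_conns <- nrow(all_conns) - reserved
free_slots <- total_slots - nrow(all_conns)

cat("User connections open:", user_conns, "\n")
cat("Slots left for new connections (e.g. workers):", free_slots, "\n")
```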

Reproducibility

In caret you can control reproducibility by pre-setting seeds; see section 5.4, Notes on Reproducibility, in the caret documentation.
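As a rough sketch of pre-setting seeds: trainControl() accepts a seeds list with one integer vector per resample, plus a final single integer for the last model fit. The sizes below assume 10-fold cross-validation repeated 10 times, as in the quick example, and a single tuning-parameter combination. The names n_resamples and n_models are illustrative only.

```r
set.seed(42)

n_resamples <- 10 * 10  # 10 folds repeated 10 times
n_models <- 1           # assumption: one tuning-parameter combination

# One integer vector per resample, plus one integer for the final model.
seeds <- vector("list", n_resamples + 1)
for (i in seq_len(n_resamples)) {
  seeds[[i]] <- sample.int(1000000L, n_models)
}
seeds[[n_resamples + 1]] <- sample.int(1000000L, 1)

if (requireNamespace("caret", quietly = TRUE)) {
  # Pass 'ctrl' as trControl to caret::train() to fix the seed used
  # for every model fit, including fits done on parallel workers.
  ctrl <- caret::trainControl(method = "repeatedcv", number = 10,
                              repeats = 10, seeds = seeds)
}
```

If the tuning grid has more than one combination, each of the per-resample vectors needs at least that many integers.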

Intel MKL