mirror of https://github.com/andre-wojtowicz/blas-benchmarks synced 2025-04-15 16:00:31 +02:00

Timing results for BLAS (Basic Linear Algebra Subprograms) libraries in R


Andrzej Wójtowicz e92f87edbf added Intel Xeon E3-1275 v5		2016-11-29 11:48:07 +01:00
gen	added Intel Xeon E3-1275 v5	2016-11-29 11:48:07 +01:00
.gitignore	updated readme (first results)	2016-05-26 19:30:19 +02:00
benchmark-gcbd.R	added gcbd benchmark;	2016-05-20 12:54:32 +02:00
benchmark-revolution.R	added saving host info;	2016-05-19 19:11:33 +02:00
benchmark-sample.R	added naive cublas optimize	2016-05-30 17:49:09 +02:00
benchmark-urbanek.R	added saving host info;	2016-05-19 19:11:33 +02:00
master-ctrl-slaves.sh	added gcbd benchmark;	2016-05-20 12:54:32 +02:00
README.md	added Intel Xeon E3-1275 v5	2016-11-29 11:48:07 +01:00
results.Rmd	added Intel Xeon E3-1275 v5	2016-11-29 11:48:07 +01:00
slave-cmds.sh	added Intel Core i5-6500;	2016-07-14 17:26:58 +02:00


BLAS libraries benchmarks

Andrzej Wójtowicz

Document generation date: 2016-11-29 11:40:07

This document presents timing results for BLAS (Basic Linear Algebra Subprograms) libraries in R on diverse CPUs and GPUs.

Changelog

2016-11-29: results: added Intel Xeon E3-1275 v5.
2016-11-25: results: added Intel Atom C2758.
2016-07-14: results: added Intel Core i5-6500; changed the results view of the gcbd benchmark to relative performance gain; changed the reference CPU (Intel Pentium Dual-Core E5300) and GPU (NVIDIA GeForce GT 630M); code: fixed target architecture detection for Intel Core i5-6500-like CPUs in the multi-threaded Atlas library; added info on how to force the target architecture in the GotoBLAS2 and BLIS libraries.

Configuration
Results per host
Results per library
- Netlib
- Atlas (st)
- OpenBLAS
- Atlas (mt)
- GotoBLAS2
- MKL
- BLIS
- cuBLAS

Configuration

OS: Debian Jessie, kernel 4.4

R software: Microsoft R Open (3.2.4)

Libraries:

CPU (single-threaded)	CPU (multi-threaded)	GPU
Netlib (debian package, blas 1.2.20110419, lapack 3.5.0)	OpenBLAS (debian package, 0.2.12)	NVIDIA cuBLAS (NVBLAS 6.5 + Intel MKL)
ATLAS (debian package, 3.10.2)	ATLAS (dev branch, 3.11.38)
	GotoBLAS2 (Survive fork, 3.141)
	Intel MKL (part of RevoMath package, 3.2.4)
	BLIS (dev branch, 0.2.0+/17.05.2016)
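
When switching between the libraries listed above, it can help to confirm from inside R which BLAS/LAPACK build is actually loaded. The following is a minimal sketch of such a check and is not taken from the repository's scripts. It assumes that the `BLAS` and `LAPACK` fields of `sessionInfo()` are available, which is only the case from R 3.4.0 onwards; on older versions, such as the 3.2.4 used here, those fields are expected to be `NULL` and are skipped.

```r
# Report the R version and, where available, the linked BLAS/LAPACK libraries.
# NOTE: the BLAS/LAPACK path fields are an assumption about the running R
# version (R >= 3.4.0); on older versions they are NULL and nothing is printed.
info <- sessionInfo()
cat("R version:     ", info$R.version$version.string, "\n")
cat("LAPACK version:", La_version(), "\n")
if (!is.null(info$BLAS))   cat("BLAS library:  ", info$BLAS, "\n")
if (!is.null(info$LAPACK)) cat("LAPACK library:", info$LAPACK, "\n")
```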

Hosts:

No.	CPU	GPU
1.	Intel Xeon E3-1275 v5	-
2.	Intel Core i7-4790K (OC 4.5 GHz)	MSI GeForce GTX 980 Ti Lightning
3.	Intel Core i5-4590	NVIDIA GeForce GT 430
4.	Intel Core i5-4590	NVIDIA GeForce GTX 750 Ti
5.	Intel Core i5-6500	-
6.	Intel Core i5-3570	-
7.	Intel Core i3-2120	-
8.	Intel Core i3-3120M	-
9.	Intel Core i5-3317U	NVIDIA GeForce GT 630M
10.	Intel Atom C2758	-
11.	Intel Pentium Dual-Core E5300	-

Benchmarks: R-benchmark-25, Revolution, Gcbd.
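
As a rough illustration of what a single test measures, the sketch below times the cross-product of a random square matrix over repeated runs, in the spirit of the "2800x2800 cross-product matrix" test of R-benchmark-25. The helper name `time_crossprod` and its parameters are hypothetical and do not come from the repository; the actual benchmark scripts may generate data and collect timings differently.

```r
# Time crossprod() on an n x n random matrix; returns elapsed seconds per run.
# NOTE: time_crossprod() is a hypothetical helper, not part of the repository.
time_crossprod <- function(n = 2800, runs = 10) {
  vapply(seq_len(runs), function(i) {
    a <- matrix(rnorm(n * n) / 10, nrow = n, ncol = n)  # fresh data each run
    timing <- system.time(crossprod(a))                 # computes t(a) %*% a
    timing[["elapsed"]]                                 # wall-clock seconds
  }, numeric(1))
}

# Small example so the sketch finishes quickly with any BLAS:
times <- time_crossprod(n = 200, runs = 3)
summary(times)
```

Calling `time_crossprod()` with its defaults corresponds to the matrix size and run count quoted in the results, but it can take a long time with a reference BLAS such as Netlib.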

Results per host

Intel Xeon E3-1275 v5

R-benchmark-25

2800x2800 cross-product matrix

Time in seconds - 10 runs - lower is better

Linear regr. over a 2000x2000 matrix

Time in seconds - 10 runs - lower is better

Eigenvalues of a 600x600 random matrix

Time in seconds - 10 runs - lower is better

Determinant of a 2500x2500 random matrix

Time in seconds - 10 runs - lower is better

Cholesky decomposition of a 3000x3000 matrix

Time in seconds - 10 runs - lower is better

Inverse of a 1600x1600 random matrix

Time in seconds - 10 runs - lower is better

Escoufier's method on a 45x45 matrix

Time in seconds - 10 runs - lower is better

Revolution benchmark

Matrix Multiply

Time in seconds - 10 runs - lower is better

Cholesky Factorization

Time in seconds - 10 runs - lower is better

Singular Value Decomposition

Time in seconds - 10 runs - lower is better

Principal Components Analysis

Time in seconds - 10 runs - lower is better

Linear Discriminant Analysis

Time in seconds - 10 runs - lower is better

Gcbd benchmark

Matrix Multiply

Performance gain regarding matrix size - reference: Netlib - from 50 to 5 runs - higher is better

QR Decomposition

Performance gain regarding matrix size - reference: Netlib - from 50 to 5 runs - higher is better

Singular Value Decomposition

Performance gain regarding matrix size - reference: Netlib - from 50 to 5 runs - higher is better

Triangular Decomposition

Performance gain regarding matrix size - reference: Netlib - from 50 to 5 runs - higher is better
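
The gcbd plots report performance gain relative to Netlib rather than raw times. The exact formula is not stated here, so the sketch below assumes the gain is the ratio of the reference library's mean time to each library's mean time. The helper `relative_gain`, the library name `LibraryA` and the timings are hypothetical placeholders, not measured results.

```r
# Relative gain of each library against a reference library.
# ASSUMPTION: gain = mean time of reference / mean time of library.
# NOTE: relative_gain() is a hypothetical helper, not part of the repository.
relative_gain <- function(times, reference = "Netlib") {
  stopifnot(reference %in% names(times))
  mean_times <- vapply(times, mean, numeric(1))  # mean seconds per library
  mean_times[[reference]] / mean_times           # > 1 means faster than reference
}

# Placeholder timings in seconds (NOT measured results):
example_times <- list(Netlib = c(4.0, 4.2), LibraryA = c(1.0, 1.1))
relative_gain(example_times)
```

Under this definition the reference library always scores 1, which matches the "higher is better" reading of the plots.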