BLAS libraries benchmarks
Andrzej Wójtowicz
Document generation date: 2016-05-26 19:23:05
Configuration
R software: Microsoft R Open.
Libraries:
CPU (single-threaded) | CPU (multi-threaded) | GPU |
---|---|---|
Netlib (Debian package) | OpenBLAS (Debian package) | NVIDIA cuBLAS (NVBLAS + Intel MKL) |
ATLAS (Debian package) | ATLAS (dev branch) | |
| GotoBLAS2 (Survive fork) | |
| Intel MKL (part of Microsoft R Open) | |
| BLIS | |
Hosts:
No. | CPU | GPU |
---|---|---|
1. | Intel Core i5-4590 | NVIDIA GeForce GT 430 |
2. | Intel Core i5-3570 | - |
3. | Intel Core i3-2120 | - |
4. | Intel Core i3-3120M | - |
Benchmarks: Urbanek, Revolution, Gcbd.
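The captions in the results sections report wall-clock time in seconds over 10 runs. As a rough illustration of that kind of measurement, here is a minimal sketch in Python (not the R code used by the benchmarks themselves). The helper names `time_runs` and `crossprod` are hypothetical, and the cross-product is a naive pure-Python version that does not call any BLAS library; it only stands in for the operation being timed.

```python
import statistics
import time


def time_runs(operation, runs=10):
    """Call `operation` `runs` times; return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    return timings


def crossprod(a):
    """Naive cross-product A^T A for a matrix stored as a list of rows.

    Illustration only: no BLAS is involved, so the timings say nothing
    about the libraries compared in this report.
    """
    n_cols = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n_cols)]
            for i in range(n_cols)]


if __name__ == "__main__":
    # Small deterministic 40x40 matrix; the report uses far larger ones (e.g. 2800x2800).
    matrix = [[((r * 7 + c * 3) % 11) / 10.0 for c in range(40)] for r in range(40)]
    timings = time_runs(lambda: crossprod(matrix), runs=10)
    print(f"mean {statistics.mean(timings):.4f} s over {len(timings)} runs")
```

Keeping every per-run timing, rather than only a mean, makes it possible to show the spread between runs as well as a central value.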
Results per host
Intel Core i5-4590 + NVIDIA GeForce GT 430
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Intel Core i5-3570
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Intel Core i3-2120
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Intel Core i3-3120M
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Results per library
Netlib
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
ATLAS (st)
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
OpenBLAS
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
ATLAS (mt)
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
GotoBLAS2
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
MKL
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
BLIS
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
cuBLAS
Urbanek benchmark
2800x2800 cross-product matrix
Time in seconds - 10 runs - lower is better
Linear regr. over a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Eigenvalues of a 640x640 random matrix
Time in seconds - 10 runs - lower is better
Determinant of a 2500x2500 random matrix
Time in seconds - 10 runs - lower is better
Cholesky decomposition of a 3000x3000 matrix
Time in seconds - 10 runs - lower is better
Inverse of a 1600x1600 random matrix
Time in seconds - 10 runs - lower is better
Escoufier's method on a 45x45 matrix
Time in seconds - 10 runs - lower is better
Revolution benchmark
Matrix Multiply
Time in seconds - 10 runs - lower is better
Cholesky Factorization
Time in seconds - 10 runs - lower is better
Singular Value Decomposition
Time in seconds - 10 runs - lower is better
Principal Components Analysis
Time in seconds - 10 runs - lower is better
Linear Discriminant Analysis
Time in seconds - 10 runs - lower is better
Gcbd benchmark
Matrix Multiply
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
QR Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Singular Value Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better
Triangular Decomposition
Time in seconds versus matrix size - right panel on log scale - from 50 to 5 runs - lower is better