

Dear Spark developers,
I am exploring how to make linear algebra operations faster within Spark. One way of doing this is to use the Scala Breeze library that is bundled with Spark. For matrix operations, it employs netlib-java, a Java wrapper for native BLAS (Basic Linear Algebra Subprograms) and LAPACK binaries, which it uses if they are available on the worker node. It also has its own optimized Java implementation of BLAS. It is worth mentioning that native binaries provide better performance only for BLAS level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). This is confirmed by the GEMM test on the netlib-java page https://github.com/fommil/netlib-java. I also confirmed it in my experiments with training an artificial neural network https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I would like to boost performance further.
GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia GPU, and I was able to do the following. I linked cuBLAS (instead of a CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so Breeze/netlib-java is using it. Then I did some performance measurements of artificial neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications. It turns out that for matrices smaller than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS. cuBLAS becomes slower for bigger matrices. It is worth mentioning that this was not a test of multiplication ONLY, since other operations are involved. One of the reasons for the slowdown might be the overhead of copying the matrices from main memory to graphics card memory and back.
So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is the copy overhead, are there any libraries that allow forcing intermediate results to stay in graphics card memory, thus removing the overhead?
3) Are there any other options to speed up linear algebra in Spark?
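The copy-overhead hypothesis can be checked with a back-of-the-envelope model. The sketch below is only an illustration: the throughput and bandwidth figures are placeholder assumptions, not measurements from the server described above, and all function names are hypothetical. It compares the time to multiply an m x k matrix by a k x n matrix on the CPU with the time to copy both operands to the GPU, multiply there, and copy the result back.

```python
# Rough cost model for offloading one GEMM (C = A * B) to a GPU.
# All hardware numbers below are illustrative assumptions, not measurements.

BYTES_PER_DOUBLE = 8


def gemm_flops(m, k, n):
    """Floating-point operations for an (m x k) * (k x n) multiply."""
    return 2.0 * m * k * n


def transfer_bytes(m, k, n):
    """Bytes moved over the bus: A and B to the device, C back to the host."""
    return BYTES_PER_DOUBLE * (m * k + k * n + m * n)


def cpu_time(m, k, n, cpu_flops_per_s):
    """Estimated seconds for the multiply on the CPU (no copies needed)."""
    return gemm_flops(m, k, n) / cpu_flops_per_s


def gpu_time(m, k, n, gpu_flops_per_s, bus_bytes_per_s, launch_overhead_s=0.0):
    """Estimated seconds on the GPU: fixed overhead + copies + arithmetic."""
    compute = gemm_flops(m, k, n) / gpu_flops_per_s
    copy = transfer_bytes(m, k, n) / bus_bytes_per_s
    return launch_overhead_s + copy + compute


if __name__ == "__main__":
    # Hypothetical hardware: 50 GFLOP/s CPU BLAS, 500 GFLOP/s GPU BLAS,
    # 6 GB/s effective host<->device bandwidth, 20 us fixed overhead per call.
    cpu, gpu, bus = 50e9, 500e9, 6e9
    for size in (100, 1000, 10000):
        t_cpu = cpu_time(size, size, size, cpu)
        t_gpu = gpu_time(size, size, size, gpu, bus, launch_overhead_s=20e-6)
        print(f"{size}x{size}: cpu={t_cpu:.6f}s gpu={t_gpu:.6f}s")
```

In this model the copy cost grows with the number of matrix elements while the arithmetic grows with the product of all three dimensions, so for large square matrices the copy share shrinks. A slowdown on bigger matrices would then suggest looking beyond a single round trip, for example at transfers repeated on every call or at the non-GEMM operations in the benchmark.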
Thank you, Alexander

To unsubscribe, email: [hidden email]
For additional commands, email: [hidden email]


I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
many cases.
You might consider taking a look at the code paths that BIDMat
(https://github.com/BIDData/BIDMat) takes and comparing them to
netlib-java/Breeze. John Canny et al. have done a bunch of work optimizing
to make this work really fast from Scala. I've run it on my laptop and
compared it to MKL, and in certain cases it's 10x faster at matrix multiply.
There are a lot of layers of indirection here and you really want to avoid
data copying as much as possible.
We could also consider swapping out Breeze for BIDMat, but that would be a
big project, and if we can figure out how to get Breeze+cuBLAS to comparable
performance that would be a big win.
On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander < [hidden email]>
wrote:


Hi Evan,
Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java?
The same group has the BIDMach library, which implements machine learning. For some examples they use the Caffe convolutional neural network library, which is owned by another group at Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why not take BIDMach for optimization and learning?
Best regards, Alexander
From: Evan R. Sparks [mailto: [hidden email]]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra


I'd be surprised if BIDMat+OpenBLAS was significantly faster than
netlib-java+OpenBLAS, but if it is much faster it's probably due to data
layout and fewer levels of indirection - it's definitely a worthwhile
experiment to run. The main speedups I've seen from using it come from
highly optimized GPU code for linear algebra. I know that in the past Canny
has gone as far as to write custom GPU kernels for performance-critical
regions of code.[1]
BIDMach is highly optimized for single-node performance or performance on
small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
batched in that way) the performance tends to fall off. Canny argues for
hardware/software co-design and as such prefers machine configurations that
are quite different from what we find in most commodity cluster nodes -
e.g. 10 disk channels and 4 GPUs.
In contrast, MLlib was designed for horizontal scalability on commodity
clusters and works best on very big datasets - on the order of terabytes.
For the most part, these projects developed concurrently to address
slightly different use cases. That said, there may be bits of BIDMach we
could repurpose for MLlib - keep in mind we need to be careful about
maintaining cross-language compatibility for our Java and Python users,
though.
- Evan
[1] http://arxiv.org/abs/1409.5402
[2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:


Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny, and I am really inspired by his talk and the comparisons with Spark MLlib.
I am very interested in finding out which will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I run benchmarks on artificial neural networks in batch mode. While this is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.
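One possible shape for such a benchmark is sketched below in Python, purely for illustration. The candidates discussed in this thread are JVM libraries, and `naive_matmul` is a hypothetical stand-in for whichever backend is under test. The idea is to run a few untimed warm-up calls so that one-time costs such as JIT compilation or native library loading are excluded, then time several repetitions and report the median. All names here are assumptions.

```python
import statistics
import time


def naive_matmul(a, b):
    """Stand-in workload: multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def benchmark(fn, args, warmup=2, repeats=5):
    """Return the median wall time in seconds of fn(*args).

    Warm-up runs are executed but not timed.
    """
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    n = 50
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    print(f"median seconds for {n}x{n}: {benchmark(naive_matmul, (a, b)):.6f}")
```

Keeping the matrix sizes and element type identical across backends, and timing the multiply separately from data generation, would likely make the comparison fairer.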
From: Evan R. Sparks [mailto: [hidden email]]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra


Hi Alexander,
Using GPUs with Spark would be very exciting. Small comment: Concerning
your question earlier about keeping data stored on the GPU rather than
having to move it between main memory and GPU memory on each iteration, I
would guess this would be critical to getting good performance. If you
could do multiple local iterations before aggregating results, then the
cost of data movement to the GPU could be amortized (and I believe that is
done in practice). Having Spark be aware of the GPU and using it as
another part of memory sounds like a much bigger undertaking.
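The amortization argument can be made concrete with a small calculation. This is a sketch with made-up timings (the function name and the numbers are assumptions for illustration only): if each host-to-GPU round trip costs a fixed time and each local iteration costs a fixed compute time, the average cost per iteration falls toward the pure compute time as more iterations run between transfers.

```python
def per_iteration_cost(transfer_s, compute_s, local_iterations):
    """Average wall time per iteration when one host<->device round trip
    is shared by `local_iterations` iterations run on the device."""
    if local_iterations < 1:
        raise ValueError("local_iterations must be at least 1")
    return compute_s + transfer_s / local_iterations


if __name__ == "__main__":
    # Hypothetical timings: 40 ms per round trip, 5 ms of GPU compute per iteration.
    for k in (1, 10, 100):
        print(k, per_iteration_cost(0.040, 0.005, k))
```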
Joseph
On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander < [hidden email]>
wrote:


Hi Evan, Joseph,
I ran a few matrix multiplication tests, and BIDMat seems to be ~10x faster than netlib-java+Breeze (sorry for the weird table formatting):
A*B size                | BIDMat MKL  | Breeze+netlib-java native_system_linux_x86-64 | Breeze+netlib-java f2jblas
------------------------+-------------+-----------------------------------------------+---------------------------
100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556
1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459
10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228
Configuration: Intel(R) Xeon(R) CPU E3-1240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11.
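The ratios implied by the table can be computed directly. The snippet below only re-expresses the numbers reported above (the dictionary layout and the names are illustrative). Since the message does not state the time unit, it reports unitless ratios relative to BIDMat MKL.

```python
# Timings copied from the table above (unit not stated in the message).
# Each tuple: (BIDMat MKL, Breeze + netlib-java native, Breeze + netlib-java f2jblas)
TIMINGS = {
    100: (0.00205596, 0.03810324, 0.002556),
    1000: (0.018320947, 0.51803557, 1.638475459),
    10000: (23.78046632, 445.0935211, 1569.233228),
}


def slowdown_vs_bidmat(timings):
    """Map each size to (native / BIDMat, f2jblas / BIDMat) time ratios."""
    return {
        n: (native / bidmat, f2j / bidmat)
        for n, (bidmat, native, f2j) in timings.items()
    }


if __name__ == "__main__":
    for n, (native_ratio, f2j_ratio) in slowdown_vs_bidmat(TIMINGS).items():
        print(f"{n}x{n}: native {native_ratio:.1f}x, f2jblas {f2j_ratio:.1f}x "
              "the BIDMat MKL time")
```

On these numbers the native-backed Breeze path takes roughly 18 to 28 times as long as BIDMat MKL across the three sizes.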
Later I will run the tests with CUDA. I need to install a new CUDA version for this purpose.
Do you have any ideas why Breeze/netlib-java with native BLAS is so much slower than BIDMat MKL?
Best regards, Alexander
From: Joseph Bradley [mailto: [hidden email]]
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra
Hi Alexander,
Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
Joseph
On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander < [hidden email]> wrote:
Thank you for explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib.
I am very interested to find out what will be better within Spark: BIDMat or netlibjava with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.
From: Evan R. Sparks [mailto: [hidden email]]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra
I'd be surprised of BIDMat+OpenBLAS was significantly faster than netlibjava+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection  it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performancecritical regions of code.[1]
BIDMach is highly optimized for single node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or can be batched in that way) the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different than what we find in most commodity cluster nodes  e.g. 10 disk cahnnels and 4 GPUs.
In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets  order of terabytes.
For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib  keep in mind we need to be careful about maintaining crosslanguage compatibility for our Java and Pythonusers, though.
 Evan
[1]  http://arxiv.org/abs/1409.5402[2]  http://eecs.berkeley.edu/~hzhao/papers/BD.pdfOn Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander < [hidden email]<mailto: [hidden email]>> wrote:
Hi Evan,
Thank you for suggestion! BIDMat seems to have terrific speed. Do you know what makes them faster than netlibjava?
The same group has BIDMach library that implements machine learning. For some examples they use Caffe convolutional neural network library owned by another group in Berkeley. Could you elaborate on how these all might be connected with Spark Mllib? If you take BIDMat for linear algebra why don’t you take BIDMach for optimization and learning?
Best regards, Alexander
From: Evan R. Sparks [mailto: [hidden email]<mailto: [hidden email]>]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: [hidden email]<mailto: [hidden email]>
Subject: Re: Using CUDA within Spark / boosting linear algebra
I'd expect that we can make GPUaccelerated BLAS faster than CPU blas in many cases.
You might consider taking a look at the codepaths that BIDMat ( https://github.com/BIDData/BIDMat) takes and comparing them to netlibjava/breeze. John Canny et. al. have done a bunch of work optimizing to make this work really fast from Scala. I've run it on my laptop and compared to MKL and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here and you really want to avoid data copying as much as possible.
We could also consider swapping out BIDMat for Breeze, but that would be a big project and if we can figure out how to get breeze+cublas to comparable performance that would be a big win.


Getting breeze to pick up the right blas library is critical for
performance. I recommend using OpenBLAS (or MKL, if you already have it).
It might make sense to force BIDMat to use the same underlying BLAS library
as well.
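To see how large the gap in the benchmark table quoted below is, the timings can be converted into slowdown factors and effective GFLOP/s. This is only a sketch: the table does not state a unit, so seconds are assumed, a square GEMM is assumed to cost about 2n^3 floating-point operations, and the dictionary keys are labels invented here for the three columns.

```python
# Timings copied from the benchmark table quoted in this thread
# (decimal commas converted to points; unit ASSUMED to be seconds).
TIMINGS = {
    100:   {"bidmat_mkl": 0.00205596,  "netlib_native": 0.03810324,  "netlib_f2j": 0.002556},
    1000:  {"bidmat_mkl": 0.018320947, "netlib_native": 0.51803557,  "netlib_f2j": 1.638475459},
    10000: {"bidmat_mkl": 23.78046632, "netlib_native": 445.0935211, "netlib_f2j": 1569.233228},
}


def gemm_flops(n):
    """Approximate operation count of an n x n by n x n multiplication."""
    return 2.0 * n ** 3


def summarize(timings):
    """Per matrix size: effective GFLOP/s and slowdown relative to BIDMat+MKL."""
    rows = []
    for n, t in sorted(timings.items()):
        base = t["bidmat_mkl"]
        rows.append({
            "n": n,
            "gflops": {k: gemm_flops(n) / v / 1e9 for k, v in t.items()},
            "slowdown_vs_bidmat": {k: v / base for k, v in t.items()},
        })
    return rows


if __name__ == "__main__":
    for row in summarize(TIMINGS):
        for name in ("bidmat_mkl", "netlib_native", "netlib_f2j"):
            print(f"n={row['n']:5d} {name:14s} "
                  f"{row['gflops'][name]:8.2f} GFLOP/s  "
                  f"{row['slowdown_vs_bidmat'][name]:6.1f}x")
```

If the unit assumption holds, the native netlib-java column works out to roughly 4 GFLOP/s for the two larger sizes, which may indicate that an unoptimized BLAS was the one actually loaded.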
On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander < [hidden email]>
wrote:
> Hi Evan, Joseph,
>
> I ran a few matrix multiplication tests, and BIDMat seems to be ~10x faster
> than netlib-java+Breeze (sorry for the weird table formatting):
>
> A*B size                | BIDMat MKL  | Breeze+netlib-java native_system_linux_x86-64 | Breeze+netlib-java f2jblas
> ------------------------+-------------+-----------------------------------------------+---------------------------
> 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556
> 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459
> 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228
>
> Configuration: Intel(R) Xeon(R) CPU E3-1240 3.3 GHz, 6GB RAM, Fedora 19
> Linux, Scala 2.11.
>
> Later I will run tests with CUDA. I need to install a new CUDA version for
> this purpose.
>
> Do you have any idea why breeze-netlib with native BLAS is so much slower
> than BIDMat MKL?
>
> Best regards, Alexander
>
> From: Joseph Bradley [mailto: [hidden email]]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; [hidden email]
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alexander,
>
> Using GPUs with Spark would be very exciting. Small comment: Concerning
> your question earlier about keeping data stored on the GPU rather than
> having to move it between main memory and GPU memory on each iteration, I
> would guess this would be critical to getting good performance. If you
> could do multiple local iterations before aggregating results, then the
> cost of data movement to the GPU could be amortized (and I believe that is
> done in practice). Having Spark be aware of the GPU and using it as
> another part of memory sounds like a much bigger undertaking.
>
> Joseph
>
> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander < [hidden email]>
> wrote:
> Thank you for the explanation! I’ve watched the BIDMach presentation by John
> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>
> I am very interested to find out which will be better within Spark: BIDMat
> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
> benchmark them? Currently I run benchmarks on artificial neural networks in
> batch mode. While it is not a “pure” test of linear algebra, it involves
> some other things that are essential to machine learning.
>
> From: Evan R. Sparks [mailto: [hidden email]]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: [hidden email]
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> layout and fewer levels of indirection - it's definitely a worthwhile
> experiment to run. The main speedups I've seen from using it come from
> highly optimized GPU code for linear algebra. I know that in the past Canny
> has gone as far as to write custom GPU kernels for performance-critical
> regions of code.[1]
>
> BIDMach is highly optimized for single-node performance or performance on
> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
> batched in that way) the performance tends to fall off. Canny argues for
> hardware/software co-design and as such prefers machine configurations that
> are quite different from what we find in most commodity cluster nodes -
> e.g. 10 disk channels and 4 GPUs.
>
> In contrast, MLlib was designed for horizontal scalability on commodity
> clusters and works best on very big datasets - on the order of terabytes.
>
> For the most part, these projects developed concurrently to address
> slightly different use cases. That said, there may be bits of BIDMach we
> could repurpose for MLlib - keep in mind we need to be careful about
> maintaining cross-language compatibility for our Java and Python users,
> though.
>
> - Evan
>
> [1] http://arxiv.org/abs/1409.5402
> [2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf


Evan, could you elaborate on how to force BIDMat and netlib-java to load the right BLAS? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific BLAS (as opposed to a specific wrapper for BLAS).
Btw, I have installed OpenBLAS (yum install openblas), so I suppose that netlib is using it.
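As a sketch of how the JVM flag mentioned above could be passed to a Spark job, the helper below builds the option string and the matching `spark-submit` arguments. The property name and the `F2jBLAS` class come from the message; `NativeSystemBLAS` and `NativeRefBLAS` are the other netlib-java implementation classes, and `spark.driver.extraJavaOptions` / `spark.executor.extraJavaOptions` are standard Spark settings. The function names and the short keys (`f2j`, `native_system`, `native_ref`) are made up for this example.

```python
NETLIB_BLAS_PROPERTY = "com.github.fommil.netlib.BLAS"

# Short keys are illustrative; the values are netlib-java class names.
IMPLEMENTATIONS = {
    "f2j": "com.github.fommil.netlib.F2jBLAS",                     # pure Java
    "native_system": "com.github.fommil.netlib.NativeSystemBLAS",  # system libblas
    "native_ref": "com.github.fommil.netlib.NativeRefBLAS",        # bundled reference
}


def netlib_java_option(impl):
    """Return the -D flag that pins netlib-java to one wrapper class."""
    return f"-D{NETLIB_BLAS_PROPERTY}={IMPLEMENTATIONS[impl]}"


def spark_submit_conf(impl):
    """Arguments that apply the flag to both the driver and the executors."""
    option = netlib_java_option(impl)
    return [
        "--conf", f"spark.driver.extraJavaOptions={option}",
        "--conf", f"spark.executor.extraJavaOptions={option}",
    ]


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + spark_submit_conf("native_system")))
```

Note that this only selects the wrapper. Which native library the system wrapper ends up loading is decided by the operating system's library search, not by this flag.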


I would build OpenBLAS yourself, since good BLAS performance comes from
getting cache sizes, etc. set up correctly for your particular hardware -
this is often a very tricky process (see, e.g., ATLAS), but we found that on
relatively modern Xeon chips, OpenBLAS builds quickly and yields
performance competitive with MKL.
To make sure the right library is getting used, you have to make sure it's
first on the search path - export LD_LIBRARY_PATH=/path/to/blas/dir (the
directory that contains the library .so) will do the trick here.
For some examples of getting netlib-java set up on an EC2 node and some
example benchmarking code we ran a while back, see:
https://github.com/shivaram/matrixbench
In particular, buildopenblasec2.sh shows you how to build the library
and set up symlinks correctly, and scala/runnetlib.sh shows you how to get
the path set up and get that library picked up by netlib-java.
In this way, you could probably get cuBLAS set up to be used by
netlib-java as well.
- Evan
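One way to check which copy of a library would be found first on the search path is to walk `LD_LIBRARY_PATH` in order. The sketch below does only that: it ignores the linker cache and the default system directories that the dynamic loader consults afterwards, and the file name `libblas.so.3` is an assumption about what the native wrapper looks for on a given system. The function name `first_on_search_path` is invented for this example.

```python
import os


def first_on_search_path(library_name, search_path=None):
    """Return the resolved path of the first `library_name` found in the
    colon-separated `search_path` (defaults to $LD_LIBRARY_PATH), or None.

    Only LD_LIBRARY_PATH-style directories are checked; the dynamic loader
    also consults its cache and default directories, which are not modelled.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in search_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = os.path.join(directory, library_name)
        if os.path.exists(candidate):
            # Resolve symlinks so an OpenBLAS build behind libblas.so.3 shows up.
            return os.path.realpath(candidate)
    return None


if __name__ == "__main__":
    # 'libblas.so.3' is an assumed name; adjust for the system at hand.
    print(first_on_search_path("libblas.so.3"))
```

Because symlinks are resolved, the output shows the actual file behind the name, which helps confirm whether a self-built OpenBLAS or a distribution-provided BLAS would be picked up.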


Lemme butt in randomly here and say there is an interesting discussion on
this Spark PR <https://github.com/apache/spark/pull/4448> about
netlib-java, JBLAS, Breeze, and other things I know nothing of, that y'all
may find interesting. Among the participants is the author of netlib-java.
> From: Evan R. Sparks [mailto: [hidden email]<mailto:
> [hidden email]>]
> Sent: Thursday, February 05, 2015 12:09 PM
> To: Ulanov, Alexander
> Cc: [hidden email]<mailto: [hidden email]>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd expect that we can make GPUaccelerated BLAS faster than CPU blas in
> many cases.
>
> You might consider taking a look at the codepaths that BIDMat (
> https://github.com/BIDData/BIDMat) takes and comparing them to
> netlibjava/breeze. John Canny et. al. have done a bunch of work optimizing
> to make this work really fast from Scala. I've run it on my laptop and
> compared to MKL and in certain cases it's 10x faster at matrix multiply.
> There are a lot of layers of indirection here and you really want to avoid
> data copying as much as possible.
>
> We could also consider swapping out BIDMat for Breeze, but that would be a
> big project and if we can figure out how to get breeze+cublas to comparable
> performance that would be a big win.
>
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
> [hidden email]<mailto: [hidden email]>> wrote:
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark.
> One way of doing this is to use Scala Breeze library that is bundled with
> Spark. For matrix operations, it employs Netlibjava that has a Java
> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
> binaries if they are available on the worker node. It also has its own
> optimized Java implementation of BLAS. It is worth mentioning, that native
> binaries provide better performance only for BLAS level 3, i.e.
> matrixmatrix operations or general matrix multiplication (GEMM). This is
> confirmed by GEMM test on Netlibjava page https://github.com/fommil/> netlibjava. I also confirmed it with my experiments with training of
> artificial neural network https://github.com/apache/> spark/pull/1290#issuecomment70313952. However, I would like to boost
> performance more.
>
> GPU is supposed to work fast with linear algebra and there is Nvidia CUDA
> implementation of BLAS, called cublas. I have one Linux server with Nvidia
> GPU and I was able to do the following. I linked cublas (instead of
> cpubased blas) with Netlibjava wrapper and put it into Spark, so
> Breeze/Netlib is using it. Then I did some performance measurements with
> regards to artificial neural network batch learning in Spark MLlib that
> involves matrixmatrix multiplications. It turns out that for matrices of
> size less than ~1000x780 GPU cublas has the same speed as CPU blas. Cublas
> becomes slower for bigger matrices. It worth mentioning that it is was not
> a test for ONLY multiplication since there are other operations involved.
> One of the reasons for slowdown might be the overhead of copying the
> matrices from computer memory to graphic card memory and back.
>
> So, few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is with copy overhead, are there any libraries that
> allow to force intermediate results to stay in graphic card memory thus
> removing the overhead?
> 3) Any other options to speedup linear algebra in Spark?
>
> Thank you, Alexander
>
> 
> To unsubscribe, email: [hidden email]<mailto:
> [hidden email]>
> For additional commands, email: [hidden email]<mailto:
> [hidden email]>
>
>
>


Hi Evan,
Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with netlib-java, and run the benchmark again.
Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be why I am able to run BIDMat without having MKL BLAS installed on my server. If so, I wonder whether that is OK, because Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS, given that BIDMat and netlib-java are supposed to be on par in JNI overhead.
Still, it might be interesting to link netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (netlib-java) would be interested in comparing their libraries.
Best regards, Alexander
From: Evan R. Sparks [mailto: [hidden email]]
Sent: Friday, February 06, 2015 5:58 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra
I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.
To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
For some examples of getting netlib-java set up on an EC2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench
In particular - build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java.
In this way - you could probably get cuBLAS set up to be used by netlib-java as well.
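One way to sanity-check the search-path advice above is to look at which entry of LD_LIBRARY_PATH would supply a BLAS library first. The sketch below is only an illustration: the class and method names are made up here, and the file-name prefix "libblas" is an assumption about how the library is named on a given system. Note that the dynamic loader reads LD_LIBRARY_PATH as a colon-separated list of directories, so the value would normally be the directory holding the .so rather than the .so file itself.

```java
import java.io.File;
import java.util.Optional;

// Hypothetical helper, not part of netlib-java or Spark.
final class BlasPathCheck {
    /** Returns the first directory in a search path that holds a file whose name starts with prefix. */
    static Optional<String> firstDirContaining(String searchPath, String prefix) {
        if (searchPath == null || searchPath.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            // list() returns null when the entry is missing or is not a directory.
            String[] names = new File(dir).list();
            if (names == null) {
                continue;
            }
            for (String name : names) {
                if (name.startsWith(prefix)) {
                    return Optional.of(dir);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String path = System.getenv("LD_LIBRARY_PATH");
        // "libblas" is an assumed prefix; adjust it to the library actually installed.
        System.out.println(firstDirContaining(path, "libblas")
                .orElse("no libblas* file found on LD_LIBRARY_PATH"));
    }
}
```

This only inspects the environment variable. System library directories and the loader cache are searched as well, so an empty result does not prove that no BLAS will be found.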
- Evan
On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <[hidden email]> wrote:
Evan, could you elaborate on how to force BIDMat and netlib-java to load the right BLAS? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific BLAS (as opposed to a specific wrapper for BLAS).
Btw, I have installed OpenBLAS (yum install openblas), so I suppose that netlib is using it.
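Rather than supposing which implementation netlib is using, it can be asked directly. The sketch below is an illustration under assumptions: the helper class is hypothetical, and it uses reflection so that it still compiles and runs when netlib-java is not on the classpath. The property name and the F2jBLAS class name are the ones quoted in the message above. The result only shows which wrapper class was picked, not which native BLAS that wrapper is linked against.

```java
// Hypothetical helper, not part of netlib-java.
final class NetlibImplCheck {
    // Property and class names as quoted in the message above.
    static final String BLAS_PROPERTY = "com.github.fommil.netlib.BLAS";
    static final String F2J_BLAS = "com.github.fommil.netlib.F2jBLAS";

    /** Class name of the BLAS wrapper netlib-java loaded, or a note if netlib-java is unavailable. */
    static String loadedBlas() {
        try {
            // The abstract BLAS class shares its fully qualified name with the property key.
            Class<?> blas = Class.forName(BLAS_PROPERTY);
            Object instance = blas.getMethod("getInstance").invoke(null);
            return instance.getClass().getName();
        } catch (ReflectiveOperationException e) {
            return "netlib-java not available: " + e;
        }
    }

    public static void main(String[] args) {
        // Equivalent to passing -Dcom.github.fommil.netlib.BLAS=... on the command line,
        // provided this runs before netlib-java is first used.
        System.setProperty(BLAS_PROPERTY, F2J_BLAS);
        System.out.println(loadedBlas());
    }
}
```

Running the same check without setting the property should show whether a native wrapper or the pure-Java fallback was selected on that machine.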
From: Evan R. Sparks [mailto: [hidden email]]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra
Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.
On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <[hidden email]> wrote:
Hi Evan, Joseph
I did a few matrix multiplication tests, and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):
| A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
+-------------------------+-------------+-----------------------------------------------+----------------------------+
| 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
| 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
| 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
Configuration: Intel(R) Xeon(R) CPU E3-1240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11.
Later I will run tests with CUDA. I need to install a new CUDA version for this purpose.
Do you have any ideas why breeze-netlib with native BLAS is so much slower than BIDMat MKL?
Best regards, Alexander
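The timings in the table above can be turned into explicit ratios against the BIDMat MKL column. The sketch below simply copies the reported numbers (with decimal commas written as points) and divides them. The class and method names are illustrative, and no unit is assumed because the message does not state one.

```java
// Illustrative only: the arrays hold the numbers reported in the table above.
final class GemmTable {
    static final int[] N = {100, 1000, 10000};
    static final double[] BIDMAT_MKL = {0.00205596, 0.018320947, 23.78046632};
    static final double[] NETLIB_NATIVE = {0.03810324, 0.51803557, 445.0935211};
    static final double[] NETLIB_F2J = {0.002556, 1.638475459, 1569.233228};

    /** How many times longer `other` took than `reference`. */
    static double ratio(double other, double reference) {
        return other / reference;
    }

    public static void main(String[] args) {
        for (int i = 0; i < N.length; i++) {
            System.out.printf("n=%d  native/BIDMat=%.1f  f2j/BIDMat=%.1f%n",
                    N[i],
                    ratio(NETLIB_NATIVE[i], BIDMAT_MKL[i]),
                    ratio(NETLIB_F2J[i], BIDMAT_MKL[i]));
        }
    }
}
```

On these numbers the native netlib-java column comes out roughly 18 to 28 times slower than BIDMat MKL across the three sizes.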
From: Joseph Bradley [mailto: [hidden email]]
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra
Hi Alexander,
Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
Joseph
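The amortization argument above can be written as a one-line cost model: if one host-to-GPU round trip is shared by k local iterations, the average cost per iteration is transfer/k + compute. The sketch below encodes just that. The class name and the numbers in main are hypothetical and are not measurements from this thread.

```java
// A toy cost model; the inputs are placeholders, not measured values.
final class TransferAmortization {
    /** Average cost per iteration when one transfer is shared by `iterations` local iterations. */
    static double costPerIteration(double transferCost, double computeCost, int iterations) {
        if (iterations < 1) {
            throw new IllegalArgumentException("iterations must be >= 1");
        }
        return transferCost / iterations + computeCost;
    }

    public static void main(String[] args) {
        double transfer = 10.0; // hypothetical cost of moving the data to the GPU and back
        double compute = 1.0;   // hypothetical cost of one iteration on the GPU
        for (int k : new int[] {1, 10, 100}) {
            System.out.printf("k=%d  cost per iteration=%.2f%n",
                    k, costPerIteration(transfer, compute, k));
        }
    }
}
```

As k grows, the per-iteration cost approaches the pure compute cost, which is the sense in which the data movement is amortized.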
On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib.
I am very interested to find out what will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.
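On the question of a fair benchmark, one common starting point on the JVM is to time the same operation the same way for every backend: a few untimed warm-up calls so the JIT has compiled the hot path, then several timed runs summarized by the median. The sketch below is a minimal harness along those lines, not a method proposed in this thread. The naive multiply is only a stand-in for whichever library call is being compared, and all names are illustrative.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal timing harness; multiply() is a placeholder for the call under test.
final class GemmBench {
    /** Naive row-major C = A * B for square n x n matrices. */
    static double[] multiply(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    /** Median wall-clock seconds over `runs` timed calls, after `warmups` untimed calls. */
    static double medianSeconds(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        Arrays.sort(seconds);
        return seconds[runs / 2]; // upper median when runs is even
    }

    public static void main(String[] args) {
        int n = 256;
        Random rnd = new Random(42);
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = rnd.nextDouble();
            b[i] = rnd.nextDouble();
        }
        double[] sink = new double[1]; // keeps the result live so the work is not optimized away
        double s = medianSeconds(() -> sink[0] += multiply(a, b, n)[0], 5, 11);
        System.out.printf("n=%d  median=%.6f s  (sink=%.3f)%n", n, s, sink[0]);
    }
}
```

For a like-for-like comparison, the same matrix sizes, element type, and thread settings would need to be used for each backend.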
From: Evan R. Sparks [mailto: [hidden email]]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra
I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code.[1]
BIDMach is highly optimized for single node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or can be batched in that way) the performance tends to fall off. Canny argues for hardware/software co-design and as such prefers machine configurations that are quite different than what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - order of terabytes.
For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.
- Evan
[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
Hi Evan,
Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java?
The same group has the BIDMach library, which implements machine learning. For some examples they use the Caffe convolutional neural network library, owned by another group at Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t you take BIDMach for optimization and learning?
Best regards, Alexander
From: Evan R. Sparks [mailto: [hidden email]]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra
I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases.
You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to make this work really fast from Scala. I've run it on my laptop and compared to MKL and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here and you really want to avoid data copying as much as possible.
We could also consider swapping out BIDMat for Breeze, but that would be a big project and if we can figure out how to get breeze+cublas to comparable performance that would be a big win.
On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
Dear Spark developers,
I am exploring how to make linear algebra operations faster within Spark. One way of doing this is to use the Scala Breeze library that is bundled with Spark. For matrix operations, it employs netlib-java, which has a Java wrapper for BLAS (Basic Linear Algebra Subprograms) and LAPACK native binaries if they are available on the worker node. It also has its own optimized Java implementation of BLAS. It is worth mentioning that native binaries provide better performance only for BLAS level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). This is confirmed by the GEMM test on the netlib-java page https://github.com/fommil/netlib-java. I also confirmed it with my experiments with training of an artificial neural network https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I would like to boost performance more.
GPUs are supposed to work fast with linear algebra, and there is an Nvidia CUDA implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia GPU and I was able to do the following. I linked cuBLAS (instead of CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so Breeze/netlib is using it. Then I did some performance measurements with regard to artificial neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications. It turns out that for matrices of size less than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS. cuBLAS becomes slower for bigger matrices. It is worth mentioning that it was not a test for ONLY multiplication, since there are other operations involved. One of the reasons for the slowdown might be the overhead of copying the matrices from computer memory to graphics card memory and back.
So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is with copy overhead, are there any libraries that allow forcing intermediate results to stay in graphics card memory, thus removing the overhead?
3) Any other options to speed up linear algebra in Spark?
Thank you, Alexander

To unsubscribe, email: [hidden email]
For additional commands, email: [hidden email]


Maybe you can ask Prof. John Canny himself :) I have invited him to give a talk at Alpine Data Labs at the March meetup (a joint SF Big Analytics and SF Machine Learning meetup) on 3/11. It will be announced in the next day or so.
Chester



Great - perhaps we can move this discussion off-list and onto a JIRA
ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
It seems like this is going to be somewhat exploratory for a while (and
there's probably only a handful of us who really care about fast linear
algebra!)
- Evan
On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander < [hidden email]>
wrote:
> Hi Evan,
>
>
>
> Thank you for explanation and useful link. I am going to build OpenBLAS,
> link it with Netlibjava and perform benchmark again.
>
>
>
> Do I understand correctly that BIDMat binaries contain statically linked
> Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
> having MKL BLAS installed on my server. If it is true, I wonder if it is OK
> because Intel sells this library. Nevertheless, it seems that in my case
> precompiled MKL BLAS performs better than precompiled OpenBLAS given that
> BIDMat and Netlibjava are supposed to be on par with JNI overheads.
>
>
>
> Though, it might be interesting to link Netlibjava with Intel MKL, as you
> suggested. I wonder, are John Canny (BIDMat) and Sam Halliday (Netlibjava)
> interested to compare their libraries.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Evan R. Sparks [mailto: [hidden email]]
> *Sent:* Friday, February 06, 2015 5:58 PM
>
> *To:* Ulanov, Alexander
> *Cc:* Joseph Bradley; [hidden email]
> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>
>
>
> I would build OpenBLAS yourself, since good BLAS performance comes from
> getting cache sizes, etc. set up correctly for your particular hardware 
> this is often a very tricky process (see, e.g. ATLAS), but we found that on
> relatively modern Xeon chips, OpenBLAS builds quickly and yields
> performance competitive with MKL.
>
>
>
> To make sure the right library is getting used, you have to make sure it's
> first on the search path  export LD_LIBRARY_PATH=/path/to/blas/library.so
> will do the trick here.
>
>
>
> For some examples of getting netlibjava setup on an ec2 node and some
> example benchmarking code we ran a while back, see:
> https://github.com/shivaram/matrixbench>
>
>
> In particular  buildopenblasec2.sh shows you how to build the library
> and set up symlinks correctly, and scala/runnetlib.sh shows you how to get
> the path setup and get that library picked up by netlibjava.
>
>
>
> In this way  you could probably get cuBLAS set up to be used by
> netlibjava as well.
>
>
>
>  Evan
>
>
>
> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander < [hidden email]>
> wrote:
>
> Evan, could you elaborate on how to force BIDMat and netlibjava to
> force loading the right blas? For netlib, I there are few JVM flags, such
> as Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
> can force it to use Java implementation. Not sure I understand how to force
> use a specific blas (not specific wrapper for blas).
>
>
>
> Btw. I have installed openblas (yum install openblas), so I suppose that
> netlib is using it.
>
>
>
> *From:* Evan R. Sparks [mailto: [hidden email]]
> *Sent:* Friday, February 06, 2015 5:19 PM
> *To:* Ulanov, Alexander
> *Cc:* Joseph Bradley; [hidden email]
>
>
> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>
>
>
> Getting breeze to pick up the right blas library is critical for
> performance. I recommend using OpenBLAS (or MKL, if you already have it).
> It might make sense to force BIDMat to use the same underlying BLAS library
> as well.
>
>
>
> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander < [hidden email]>
> wrote:
>
> Hi Evan, Joseph
>
> I did few matrix multiplication test and BIDMat seems to be ~10x faster
> than netlibjava+breeze (sorry for weird table formatting):
>
> A*B size  BIDMat MKL  Breeze+Netlibjava native_system_linux_x8664
> Breeze+Netlibjava f2jblas 
> ++
> 100x100*100x100  0,00205596  0,03810324  0,002556 
> 1000x1000*1000x1000  0,018320947  0,51803557 1,638475459 
> 10000x10000*10000x10000  23,78046632  445,0935211  1569,233228 
>
> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
> Linux, Scala 2.11.
>
> Later I will make tests with Cuda. I need to install new Cuda version for
> this purpose.
>
> Do you have any ideas why breezenetlib with native blas is so much slower
> than BIDMat MKL?
>
> Best regards, Alexander
>
> From: Joseph Bradley [mailto: [hidden email]]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; [hidden email]
>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alexander,
>
> Using GPUs with Spark would be very exciting. Small comment: Concerning
> your question earlier about keeping data stored on the GPU rather than
> having to move it between main memory and GPU memory on each iteration, I
> would guess this would be critical to getting good performance. If you
> could do multiple local iterations before aggregating results, then the
> cost of data movement to the GPU could be amortized (and I believe that is
> done in practice). Having Spark be aware of the GPU and using it as
> another part of memory sounds like a much bigger undertaking.
>
> Joseph
>
> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander < [hidden email]>
> wrote:
> Thank you for the explanation! I’ve watched the BIDMach presentation by John
> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>
> I am very interested to find out what will be better within Spark: BIDMat
> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
> benchmark them? Currently I do benchmarks on artificial neural networks in
> batch mode. While it is not a “pure” test of linear algebra, it involves
> some other things that are essential to machine learning.
>
> From: Evan R. Sparks [mailto: [hidden email]]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: [hidden email]
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> layout and fewer levels of indirection - it's definitely a worthwhile
> experiment to run. The main speedups I've seen from using it come from
> highly optimized GPU code for linear algebra. I know that in the past Canny
> has gone as far as to write custom GPU kernels for performance-critical
> regions of code.[1]
>
> BIDMach is highly optimized for single node performance or performance on
> small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
> batched in that way) the performance tends to fall off. Canny argues for
> hardware/software co-design and as such prefers machine configurations that
> are quite different than what we find in most commodity cluster nodes -
> e.g. 10 disk channels and 4 GPUs.
>
> In contrast, MLlib was designed for horizontal scalability on commodity
> clusters and works best on very big datasets - order of terabytes.
>
> For the most part, these projects developed concurrently to address
> slightly different use cases. That said, there may be bits of BIDMach we
> could repurpose for MLlib - keep in mind we need to be careful about
> maintaining cross-language compatibility for our Java and Python users,
> though.
>
> - Evan
>
> [1] - http://arxiv.org/abs/1409.5402
> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>
> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander < [hidden email]
> <mailto: [hidden email]>> wrote:
> Hi Evan,
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
> know what makes it faster than netlib-java?
>
> The same group has the BIDMach library that implements machine learning. For
> some examples they use the Caffe convolutional neural network library owned
> by another group in Berkeley. Could you elaborate on how these all might be
> connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t
> you take BIDMach for optimization and learning?
>
> Best regards, Alexander
>
> From: Evan R. Sparks [mailto: [hidden email]<mailto:
> [hidden email]>]
> Sent: Thursday, February 05, 2015 12:09 PM
> To: Ulanov, Alexander
> Cc: [hidden email]<mailto: [hidden email]>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
> many cases.
>
> You might consider taking a look at the codepaths that BIDMat (
> https://github.com/BIDData/BIDMat) takes and comparing them to
> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
> to make this work really fast from Scala. I've run it on my laptop and
> compared to MKL and in certain cases it's 10x faster at matrix multiply.
> There are a lot of layers of indirection here and you really want to avoid
> data copying as much as possible.
>
> We could also consider swapping out BIDMat for Breeze, but that would be a
> big project and if we can figure out how to get breeze+cublas to comparable
> performance that would be a big win.
>
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
> [hidden email]<mailto: [hidden email]>> wrote:
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark.
> One way of doing this is to use the Scala Breeze library that is bundled
> with Spark. For matrix operations, it employs netlib-java, which has a Java
> wrapper for BLAS (Basic Linear Algebra Subprograms) and LAPACK native
> binaries if they are available on the worker node. It also has its own
> optimized Java implementation of BLAS. It is worth mentioning that native
> binaries provide better performance only for BLAS level 3, i.e.
> matrix-matrix operations or general matrix multiplication (GEMM). This is
> confirmed by the GEMM test on the netlib-java page
> https://github.com/fommil/netlib-java. I also confirmed it with my
> experiments with training of an artificial neural network
> https://github.com/apache/spark/pull/1290#issuecomment-70313952. However,
> I would like to boost performance more.
>
> GPU is supposed to work fast with linear algebra and there is an Nvidia CUDA
> implementation of BLAS, called cuBLAS. I have one Linux server with an
> Nvidia GPU and I was able to do the following. I linked cuBLAS (instead of
> CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so
> Breeze/Netlib is using it. Then I did some performance measurements with
> regard to artificial neural network batch learning in Spark MLlib that
> involves matrix-matrix multiplications. It turns out that for matrices of
> size less than ~1000x780 GPU cuBLAS has the same speed as CPU BLAS. cuBLAS
> becomes slower for bigger matrices. It is worth mentioning that it was not
> a test of ONLY multiplication, since there are other operations involved.
> One of the reasons for the slowdown might be the overhead of copying the
> matrices from computer memory to graphics card memory and back.
>
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is with copy overhead, are there any libraries that
> allow forcing intermediate results to stay in graphics card memory, thus
> removing the overhead?
> 3) Any other options to speed up linear algebra in Spark?
>
> Thank you, Alexander
>
> 
>
>
>
>
>


Thanks, Evan! It seems that the ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.
| A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
|-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------|
| 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
| 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
| 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
It turns out that precompiled MKL is faster than precompiled OpenBLAS on my machine. I will probably add two more columns, with locally compiled OpenBLAS and CUDA.
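One way to read the table above is to convert each timing into an effective floating-point rate. The sketch below assumes the timings are wall-clock seconds (the thread does not state the unit) and uses the standard 2n^3 operation count for a dense n x n by n x n multiply. The function names are illustrative only.

```python
def parse_decimal_comma(text):
    """Parse a number written with a decimal comma, e.g. '23,78046632'."""
    return float(text.replace(",", "."))


def gemm_gflops(n, seconds):
    """Effective GFLOP/s for an n x n by n x n multiply (2 * n^3 flops)."""
    return 2.0 * n**3 / seconds / 1e9


# 10000x10000 row of the table, copied as reported (unit assumed: seconds).
timings = {
    "BIDMat MKL": "23,78046632",
    "Breeze+Netlib-MKL from BIDMat": "32,94546697",
    "Breeze+Netlib-OpenBlas (native system)": "445,0935211",
    "Breeze+Netlib-f2jblas": "1569,233228",
}

rates = {
    name: gemm_gflops(10000, parse_decimal_comma(value))
    for name, value in timings.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} GFLOP/s")
```

If the unit assumption holds, the rate makes rows of different sizes directly comparable, which the raw timings are not.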
Alexander
From: Evan R. Sparks [mailto: [hidden email]]
Sent: Monday, February 09, 2015 6:06 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra
Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!)
- Evan
On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander < [hidden email]<mailto: [hidden email]>> wrote:
Hi Evan,
Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with netlib-java and run the benchmark again.
Do I understand correctly that the BIDMat binaries contain statically linked Intel MKL BLAS? It might be the reason why I am able to run BIDMat without having MKL BLAS installed on my server. If it is true, I wonder if it is OK, because Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS, given that BIDMat and netlib-java are supposed to be on par in JNI overheads.
Though, it might be interesting to link netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (netlib-java) would be interested in comparing their libraries.
Best regards, Alexander
From: Evan R. Sparks [mailto: [hidden email]<mailto: [hidden email]>]
Sent: Friday, February 06, 2015 5:58 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; [hidden email]<mailto: [hidden email]>
Subject: Re: Using CUDA within Spark / boosting linear algebra
I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g. ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.
To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
For some examples of getting netlib-java set up on an ec2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench
In particular - build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java.
In this way - you could probably get cuBLAS set up to be used by netlib-java as well.
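The search-path behaviour described above can be checked before launching the JVM. Below is a minimal sketch, assuming a Linux-style list of directories as used by `LD_LIBRARY_PATH` (the dynamic loader treats each entry as a directory, so in practice the variable names the directory that contains the library). `first_match` and the file name `libblas.so.3` are illustrative choices, not part of netlib-java, and the real loader also consults other locations (for example the ld.so cache) that this sketch ignores.

```python
import os


def first_match(search_path, filename):
    """Return the first file named `filename` found in the directories of a
    search path (entries separated by os.pathsep), or None if none has it."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Report which copy of the (assumed) BLAS file name would win.
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print(first_match(path, "libblas.so.3"))
```

Because the first directory that contains the file wins, putting a locally built OpenBLAS directory ahead of the system one is what changes the library that gets picked up.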
- Evan


Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that:
BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
Below is the link to the spreadsheet with full results.
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine’s RAM?
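On the open question of host-to-device copying, a rough cost model shows why it matters most for small matrices. The sketch below is an estimate, not a measurement: it assumes double-precision n x n operands, counts 2n^3 floating-point operations against 3n^2 values moved (two inputs in, one result out), and takes hypothetical transfer bandwidth and GPU throughput as parameters.

```python
def gemm_time_split(n, flops_per_sec, bytes_per_sec, bytes_per_value=8):
    """Estimate (compute_seconds, transfer_seconds) for one n x n GEMM whose
    operands are copied to the device and whose result is copied back."""
    compute = 2.0 * n**3 / flops_per_sec
    transfer = 3.0 * n**2 * bytes_per_value / bytes_per_sec
    return compute, transfer


def transfer_fraction(n, flops_per_sec, bytes_per_sec):
    """Fraction of the estimated total time spent copying data."""
    compute, transfer = gemm_time_split(n, flops_per_sec, bytes_per_sec)
    return transfer / (compute + transfer)


# Hypothetical hardware numbers, chosen only for illustration:
# 1 TFLOP/s of GPU throughput and 6 GB/s of host<->device bandwidth.
GPU_FLOPS = 1e12
LINK_BYTES = 6e9

for n in (100, 1000, 10000):
    share = transfer_fraction(n, GPU_FLOPS, LINK_BYTES)
    print(f"n={n}: about {share:.0%} of the estimated time is copying")
```

Under these assumptions the copy share falls as n grows, because the ratio of transfer time to compute time scales as 1/n. Keeping intermediate results on the device, as raised earlier in the thread, would remove the transfer term for every multiply after the first.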
-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, February 10, 2015 2:12 PM
To: Evan R. Sparks
Cc: Joseph Bradley; [hidden email]
Subject: RE: Using CUDA within Spark / boosting linear algebra
Thanks, Evan! It seems that the ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.
| A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
|-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------|
| 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
| 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
| 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
It turns out that precompiled MKL is faster than precompiled OpenBLAS on my machine. I will probably add two more columns, with locally compiled OpenBLAS and CUDA.
Alexander


Thanks for compiling all the data and running these benchmarks, Alex. The
big takeaways here can be seen with this chart:
https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g.
BIDMat+GPU) can provide substantial (but less than an order of magnitude)
benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
netlib-java+openblas-compiled).
2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
than a well-tuned CPU implementation, particularly for larger matrices
(netlib-f2jblas or netlib-ref). This is not to pick on netlib - this
basically agrees with the author's own benchmarks (
https://github.com/fommil/netlib-java).

I think that most of our users are in a situation where using GPUs may not
be practical - although we could consider having a good GPU backend
available as an option. However, *ALL* users of MLlib could benefit
(potentially tremendously) from using a well-tuned CPU-based BLAS
implementation. Perhaps we should consider updating the MLlib guide with a
more complete section on enabling high-performance binaries on OSX and
Linux? Or better, figure out a way for the system to fetch these
automatically.

- Evan
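The "orders of magnitude" in the takeaways can be checked against the 10000x10000 timings quoted further down in this thread. The snippet below is a small sketch: the dictionary keys mirror the table headers, decimal commas are converted to points, and the helper names are made up for this example.

```python
import math

# 10000x10000 * 10000x10000 timings quoted in this thread
# (decimal commas converted to points; units are assumed to be seconds).
timings = {
    "BIDMat MKL": 23.78046632,
    "Breeze+Netlib-MKL from BIDMat": 32.94546697,
    "Breeze+Netlib-OpenBlas (native system)": 445.0935211,
    "Breeze+Netlib-f2jblas": 1569.233228,
}

def slowdown(name, baseline="BIDMat MKL"):
    """How many times slower `name` is than the baseline configuration."""
    return timings[name] / timings[baseline]

def orders_of_magnitude(name, baseline="BIDMat MKL"):
    """Base-10 logarithm of the slowdown relative to the baseline."""
    return math.log10(slowdown(name, baseline))

for name in timings:
    print("%-40s %8.1fx  (10^%.2f)" % (name, slowdown(name), orders_of_magnitude(name)))
```

On these numbers both the yum-installed OpenBLAS and f2jblas land between one and two orders of magnitude behind BIDMat MKL, while netlib linked against the same MKL stays within a factor of two.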
On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander < [hidden email]>
wrote:
> Just to summarize this thread, I was finally able to make all performance
> comparisons that we discussed. It turns out that:
> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>
> Below is the link to the spreadsheet with full results.
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> One thing still needs exploration: does BIDMat-cublas perform copying
> to/from the machine’s RAM?
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Tuesday, February 10, 2015 2:12 PM
> To: Evan R. Sparks
> Cc: Joseph Bradley; [hidden email]
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Thanks, Evan! It seems that the ticket was marked as a duplicate, though the
> original one discusses a slightly different topic. I was able to link netlib
> with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
> 60MB library.
>
> A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas(native system) | Breeze+Netlib-f2jblas |
> ------------------------+-------------+-------------------------------+---------------------------------------+-----------------------+
> 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                            | 0,002556              |
> 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                            | 1,638475459           |
> 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                           | 1569,233228           |
>
> It turns out that precompiled MKL is faster than precompiled OpenBLAS on
> my machine. Probably, I’ll add two more columns with locally compiled
> OpenBLAS and CUDA.
>
> Alexander
>
> From: Evan R. Sparks [mailto: [hidden email]]
> Sent: Monday, February 09, 2015 6:06 PM
> To: Ulanov, Alexander
> Cc: Joseph Bradley; [hidden email]
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Great - perhaps we can move this discussion off-list and onto a JIRA
> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>
> It seems like this is going to be somewhat exploratory for a while (and
> there's probably only a handful of us who really care about fast linear
> algebra!)
>
> - Evan
>
> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander < [hidden email]
> <mailto: [hidden email]>> wrote:
> Hi Evan,
>
> Thank you for the explanation and the useful link. I am going to build
> OpenBLAS, link it with netlib-java and run the benchmark again.
>
> Do I understand correctly that the BIDMat binaries contain statically linked
> Intel MKL BLAS? That might be the reason why I am able to run BIDMat without
> having MKL BLAS installed on my server. If that is true, I wonder if it is OK,
> because Intel sells this library. Nevertheless, it seems that in my case
> precompiled MKL BLAS performs better than precompiled OpenBLAS, given that
> BIDMat and netlib-java are supposed to be on par in JNI overhead.
>
> Still, it might be interesting to link netlib-java with Intel MKL, as you
> suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
> (netlib-java) would be interested in comparing their libraries.
>
> Best regards, Alexander
>
> From: Evan R. Sparks [mailto: [hidden email]<mailto:
> [hidden email]>]
> Sent: Friday, February 06, 2015 5:58 PM
>
> To: Ulanov, Alexander
> Cc: Joseph Bradley; [hidden email]<mailto: [hidden email]>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I would build OpenBLAS yourself, since good BLAS performance comes from
> getting cache sizes, etc. set up correctly for your particular hardware -
> this is often a very tricky process (see, e.g. ATLAS), but we found that on
> relatively modern Xeon chips, OpenBLAS builds quickly and yields
> performance competitive with MKL.
>
> To make sure the right library is getting used, you have to make sure it's
> first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library/dir
> (the directory containing the .so) will do the trick here.
>
> For some examples of getting netlib-java set up on an EC2 node and some
> example benchmarking code we ran a while back, see:
> https://github.com/shivaram/matrix-bench
>
> In particular - build-openblas-ec2.sh shows you how to build the library
> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
> the path set up and get that library picked up by netlib-java.
>
> In this way - you could probably get cuBLAS set up to be used by
> netlib-java as well.
>
> - Evan
>
> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander < [hidden email]
> <mailto: [hidden email]>> wrote:
> Evan, could you elaborate on how to force BIDMat and netlib-java to
> load the right BLAS? For netlib, there are a few JVM flags, such as
> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
> force it to use the Java implementation. I am not sure I understand how to
> force the use of a specific BLAS (as opposed to a specific wrapper for BLAS).
>
> Btw. I have installed OpenBLAS (yum install openblas), so I suppose that
> netlib is using it.
>
> From: Evan R. Sparks [mailto: [hidden email]<mailto:
> [hidden email]>]
> Sent: Friday, February 06, 2015 5:19 PM
> To: Ulanov, Alexander
> Cc: Joseph Bradley; [hidden email]<mailto: [hidden email]>
>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Getting breeze to pick up the right blas library is critical for
> performance. I recommend using OpenBLAS (or MKL, if you already have it).
> It might make sense to force BIDMat to use the same underlying BLAS library
> as well.
>
> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander < [hidden email]
> <mailto: [hidden email]>> wrote:
> Hi Evan, Joseph
>
> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
> than netlib-java+breeze (sorry for the weird table formatting):
>
> A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> ------------------------+-------------+-----------------------------------------------+----------------------------+
> 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
> 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
> 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>
> Configuration: Intel(R) Xeon(R) CPU E3-1240 3.3 GHz, 6GB RAM, Fedora 19
> Linux, Scala 2.11.
>
> Later I will run tests with CUDA. I need to install a new CUDA version for
> this purpose.
>
> Do you have any ideas why breeze-netlib with native BLAS is so much slower
> than BIDMat MKL?
>
> Best regards, Alexander
>
> From: Joseph Bradley [mailto: [hidden email]<mailto:
> [hidden email]>]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; [hidden email]<mailto: [hidden email]>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alexander,
>
> Using GPUs with Spark would be very exciting. Small comment: Concerning
> your question earlier about keeping data stored on the GPU rather than
> having to move it between main memory and GPU memory on each iteration, I
> would guess this would be critical to getting good performance. If you
> could do multiple local iterations before aggregating results, then the
> cost of data movement to the GPU could be amortized (and I believe that is
> done in practice). Having Spark be aware of the GPU and using it as
> another part of memory sounds like a much bigger undertaking.
>
> Joseph
>
> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander < [hidden email]
> <mailto: [hidden email]>> wrote:
> Thank you for the explanation! I’ve watched the BIDMach presentation by John
> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>
> I am very interested to find out what will be better within Spark: BIDMat
> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
> benchmark them? Currently I do benchmarks on artificial neural networks in
> batch mode. While it is not a “pure” test of linear algebra, it involves
> some other things that are essential to machine learning.
>
> From: Evan R. Sparks [mailto: [hidden email]<mailto:
> [hidden email]>]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: [hidden email]<mailto: [hidden email]>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> layout and fewer levels of indirection - it's definitely a worthwhile
> experiment to run. The main speedups I've seen from using it come from
> highly optimized GPU code for linear algebra. I know that in the past Canny
> has gone as far as to write custom GPU kernels for performance-critical
> regions of code.[1]
>
> BIDMach is highly optimized for single node performance or performance on
> small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
> batched in that way) the performance tends to fall off. Canny argues for
> hardware/software co-design and as such prefers machine configurations that
> are quite different than what we find in most commodity cluster nodes -
> e.g. 10 disk channels and 4 GPUs.
>
> In contrast, MLlib was designed for horizontal scalability on commodity
> clusters and works best on very big datasets - order of terabytes.
>
> For the most part, these projects developed concurrently to address
> slightly different use cases. That said, there may be bits of BIDMach we
> could repurpose for MLlib - keep in mind we need to be careful about
> maintaining cross-language compatibility for our Java and Python users,
> though.
>
> - Evan
>
> [1] - http://arxiv.org/abs/1409.5402
> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>
> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander < [hidden email]
> <mailto: [hidden email]><mailto: [hidden email]<mailto:
> [hidden email]>>> wrote:
> Hi Evan,
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
> know what makes it faster than netlib-java?
>
> The same group has the BIDMach library, which implements machine learning.
> For some examples they use the Caffe convolutional neural network library,
> owned by another group at Berkeley. Could you elaborate on how these all
> might be connected with Spark MLlib? If you take BIDMat for linear algebra,
> why don’t you take BIDMach for optimization and learning?
>
> Best regards, Alexander
>
> From: Evan R. Sparks [mailto: [hidden email]<mailto:
> [hidden email]><mailto: [hidden email]<mailto:
> [hidden email]>>]
> Sent: Thursday, February 05, 2015 12:09 PM
> To: Ulanov, Alexander
> Cc: [hidden email]<mailto: [hidden email]><mailto:
> [hidden email]<mailto: [hidden email]>>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
> many cases.
>
> You might consider taking a look at the codepaths that BIDMat (
> https://github.com/BIDData/BIDMat) takes and comparing them to
> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
> to make this work really fast from Scala. I've run it on my laptop and
> compared to MKL and in certain cases it's 10x faster at matrix multiply.
> There are a lot of layers of indirection here and you really want to avoid
> data copying as much as possible.
>
> We could also consider swapping out BIDMat for Breeze, but that would be a
> big project and if we can figure out how to get breeze+cublas to comparable
> performance that would be a big win.
>
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
> [hidden email]<mailto: [hidden email]><mailto:
> [hidden email]<mailto: [hidden email]>>> wrote:
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark.
> One way of doing this is to use the Scala Breeze library that is bundled
> with Spark. For matrix operations, it employs netlib-java, a Java wrapper
> for BLAS (Basic Linear Algebra Subprograms) and LAPACK native binaries,
> which are used if they are available on the worker node. It also has its own
> optimized Java implementation of BLAS. It is worth mentioning that native
> binaries provide better performance only for BLAS level 3, i.e.
> matrix-matrix operations such as general matrix multiplication (GEMM). This
> is confirmed by the GEMM test on the netlib-java page
> https://github.com/fommil/netlib-java. I also confirmed it in my
> experiments with training an artificial neural network
> https://github.com/apache/spark/pull/1290#issuecomment-70313952. However,
> I would like to boost performance further.
>
> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
> implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia
> GPU, and I was able to do the following: I linked cuBLAS (instead of a
> CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so
> Breeze/netlib uses it. Then I did some performance measurements of
> artificial neural network batch learning in Spark MLlib, which involves
> matrix-matrix multiplications. It turns out that for matrices smaller than
> ~1000x780, GPU cuBLAS has the same speed as CPU BLAS. cuBLAS becomes slower
> for bigger matrices. It is worth mentioning that this was not a test of
> multiplication ONLY, since other operations are involved. One of the
> reasons for the slowdown might be the overhead of copying the matrices from
> main memory to graphics card memory and back.
>
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is copy overhead, are there any libraries that can force
> intermediate results to stay in graphics card memory, thus removing the
> overhead?
> 3) Are there any other options to speed up linear algebra in Spark?
>
> Thank you, Alexander


Better documentation for linking would be very helpful! Here's a JIRA:
https://issues.apache.org/jira/browse/SPARK-6019


Hey Alexander,
I don't quite understand the part where netlib-cublas is about 20x
slower than netlib-openblas. What is the overhead of using a GPU BLAS
with netlib-java?
CC'ed Sam, the author of netlib-java.
Best,
Xiangrui
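One way to put a number on overhead questions like this is to convert each timing into an effective GFLOP/s rate and compare it with what the hardware should sustain. The sketch below does this for timings quoted in the thread. It assumes the times are in seconds and counts 2*n^3 floating-point operations per n x n multiply. The netlib-cublas timings live in the linked spreadsheet and are not reproduced here, and `gemm_gflops` is a name invented for this example.

```python
def gemm_gflops(n, seconds):
    """Effective rate of an n x n by n x n multiply, counting 2*n^3 flops."""
    return 2.0 * n ** 3 / seconds / 1e9

# (configuration, n, time) taken from the tables quoted in this thread;
# decimal commas converted to points, units assumed to be seconds.
measured = [
    ("BIDMat MKL", 1000, 0.018320947),
    ("BIDMat MKL", 10000, 23.78046632),
    ("Breeze+Netlib-OpenBlas (native system)", 10000, 445.0935211),
    ("Breeze+Netlib-f2jblas", 10000, 1569.233228),
]

for name, n, seconds in measured:
    print("%-40s n=%-6d %7.2f GFLOP/s" % (name, n, gemm_gflops(n, seconds)))
```

Running the same conversion on a GPU-backed timing and comparing it with the card's expected GEMM rate would show how much of the wall-clock time goes to something other than the multiply itself.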
On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley < [hidden email]> wrote:
> Better documentation for linking would be very helpful! Here's a JIRA:
> https://issues.apache.org/jira/browse/SPARK-6019
>
> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks < [hidden email]>
> wrote:
>
>> Thanks for compiling all the data and running these benchmarks, Alex. The
>> big takeaways here can be seen with this chart:
>>
>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>
>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> BIDMat+GPU) can provide substantial (but less than an order of magnitude)
>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>> netlib-java+openblas-compiled).
>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
>> than a well-tuned CPU implementation, particularly for larger matrices
>> (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this
>> basically agrees with the author's own benchmarks (
>> https://github.com/fommil/netlib-java).
>>
>> I think that most of our users are in a situation where using GPUs may not
>> be practical - although we could consider having a good GPU backend
>> available as an option. However, *ALL* users of MLlib could benefit
>> (potentially tremendously) from using a well-tuned CPU-based BLAS
>> implementation. Perhaps we should consider updating the MLlib guide with a
>> more complete section on enabling high-performance binaries on OSX and
>> Linux? Or better, figure out a way for the system to fetch these
>> automatically.
>>
>> - Evan
>>
>>
>>
>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> [hidden email]> wrote:
>>
>>> Just to summarize this thread, I was finally able to make all performance
>>> comparisons that we discussed. It turns out that:
>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>
>>> Below is the link to the spreadsheet with full results.
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> One thing still needs exploration: does BIDMat-cublas perform copying
>>> to/from the machine’s RAM?
>>>
>>> -----Original Message-----
>>> From: Ulanov, Alexander
>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>> To: Evan R. Sparks
>>> Cc: Joseph Bradley; [hidden email]
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though the
>>> original one discusses a slightly different topic. I was able to link netlib
>>> with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a
>>> 60MB library.
>>>
>>> A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas
>>> ---------+------------+-------------------------------+----------------------------------------+----------------------
>>> 100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556
>>> 1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 | 1,638475459
>>> 10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228
>>>
>>> It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on
>>> my machine. Probably, I’ll add two more columns with locally compiled
>>> OpenBLAS and CUDA.
>>>
>>> Alexander
>>>
>>> From: Evan R. Sparks [mailto: [hidden email]]
>>> Sent: Monday, February 09, 2015 6:06 PM
>>> To: Ulanov, Alexander
>>> Cc: Joseph Bradley; [hidden email]
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>
>>> It seems like this is going to be somewhat exploratory for a while (and
>>> there's probably only a handful of us who really care about fast linear
>>> algebra!)
>>>
>>> - Evan
>>>
>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>> [hidden email]<mailto: [hidden email]>> wrote:
>>> Hi Evan,
>>>
>>> Thank you for the explanation and the useful link. I am going to build
>>> OpenBLAS, link it with Netlib-java and run the benchmark again.
>>>
>>> Do I understand correctly that BIDMat binaries contain statically linked
>>> Intel MKL BLAS? That might be why I am able to run BIDMat without
>>> having MKL BLAS installed on my server. If it is true, I wonder if it is OK
>>> because Intel sells this library. Nevertheless, it seems that in my case
>>> pre-compiled MKL BLAS performs better than pre-compiled OpenBLAS, given that
>>> BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>>>
>>> Though, it might be interesting to link Netlib-java with Intel MKL, as
>>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>>> (Netlib-java) are interested in comparing their libraries.
>>>
>>> Best regards, Alexander
>>>
>>> From: Evan R. Sparks [mailto: [hidden email]<mailto:
>>> [hidden email]>]
>>> Sent: Friday, February 06, 2015 5:58 PM
>>>
>>> To: Ulanov, Alexander
>>> Cc: Joseph Bradley; [hidden email]<mailto: [hidden email]>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> I would build OpenBLAS yourself, since good BLAS performance comes from
>>> getting cache sizes, etc. set up correctly for your particular hardware -
>>> this is often a very tricky process (see, e.g. ATLAS), but we found that on
>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>>> performance competitive with MKL.
>>>
>>> To make sure the right library is getting used, you have to make sure
>>> it's first on the search path - export
>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>
>>> For some examples of getting netlib-java set up on an ec2 node and some
>>> example benchmarking code we ran a while back, see:
>>> https://github.com/shivaram/matrix-bench
>>>
>>> In particular - build-openblas-ec2.sh shows you how to build the library
>>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
>>> the path set up and get that library picked up by netlib-java.
>>>
>>> In this way - you could probably get cuBLAS set up to be used by
>>> netlib-java as well.
>>>
>>> - Evan
>>>
>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>> [hidden email]<mailto: [hidden email]>> wrote:
>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>> load the right BLAS? For netlib, there are a few JVM flags, such as
>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
>>> force it to use the Java implementation. I am not sure I understand how to
>>> force the use of a specific BLAS (as opposed to a specific wrapper for BLAS).
>>>
>>> Btw, I have installed openblas (yum install openblas), so I suppose that
>>> netlib is using it.
>>>
>>> From: Evan R. Sparks [mailto: [hidden email]<mailto:
>>> [hidden email]>]
>>> Sent: Friday, February 06, 2015 5:19 PM
>>> To: Ulanov, Alexander
>>> Cc: Joseph Bradley; [hidden email]<mailto: [hidden email]>
>>>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> Getting breeze to pick up the right blas library is critical for
>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>> It might make sense to force BIDMat to use the same underlying BLAS library
>>> as well.
>>>
>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>> [hidden email]<mailto: [hidden email]>> wrote:
>>> Hi Evan, Joseph
>>>
>>> I ran a few matrix multiplication tests and BIDMat seems to be ~10x faster
>>> than netlib-java+breeze (sorry for weird table formatting):
>>>
>>> A*B size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas
>>> ---------+------------+-------------------------------------------------+---------------------------
>>> 100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556
>>> 1000x1000*1000x1000 | 0,018320947 | 0,51803557 | 1,638475459
>>> 10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228
>>>
>>> Configuration: Intel(R) Xeon(R) CPU E3-1240 3.3 GHz, 6GB RAM, Fedora 19
>>> Linux, Scala 2.11.
>>>
>>> Later I will run tests with CUDA. I need to install a new CUDA version for
>>> this purpose.
>>>
>>> Do you have any ideas why breeze-netlib with native BLAS is so much
>>> slower than BIDMat MKL?
>>>
>>> Best regards, Alexander
>>>
>>> From: Joseph Bradley [mailto: [hidden email]<mailto:
>>> [hidden email]>]
>>> Sent: Thursday, February 05, 2015 5:29 PM
>>> To: Ulanov, Alexander
>>> Cc: Evan R. Sparks; [hidden email]<mailto: [hidden email]>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> Hi Alexander,
>>>
>>> Using GPUs with Spark would be very exciting. Small comment: Concerning
>>> your question earlier about keeping data stored on the GPU rather than
>>> having to move it between main memory and GPU memory on each iteration, I
>>> would guess this would be critical to getting good performance. If you
>>> could do multiple local iterations before aggregating results, then the
>>> cost of data movement to the GPU could be amortized (and I believe that is
>>> done in practice). Having Spark be aware of the GPU and using it as
>>> another part of memory sounds like a much bigger undertaking.
>>>
>>> Joseph
>>>
>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>> [hidden email]<mailto: [hidden email]>> wrote:
>>> Thank you for the explanation! I’ve watched the BIDMach presentation by John
>>> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>>>
>>> I am very interested to find out what will be better within Spark: BIDMat
>>> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
>>> benchmark them? Currently I do benchmarks on artificial neural networks in
>>> batch mode. It is not a “pure” test of linear algebra, since it involves
>>> some other things that are essential to machine learning.
>>>
>>> From: Evan R. Sparks [mailto: [hidden email]<mailto:
>>> [hidden email]>]
>>> Sent: Thursday, February 05, 2015 1:29 PM
>>> To: Ulanov, Alexander
>>> Cc: [hidden email]<mailto: [hidden email]>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
>>> layout and fewer levels of indirection - it's definitely a worthwhile
>>> experiment to run. The main speedups I've seen from using it come from
>>> highly optimized GPU code for linear algebra. I know that in the past Canny
>>> has gone as far as to write custom GPU kernels for performance-critical
>>> regions of code.[1]
>>>
>>> BIDMach is highly optimized for single node performance or performance on
>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
>>> batched in that way) the performance tends to fall off. Canny argues for
>>> hardware/software co-design and as such prefers machine configurations that
>>> are quite different than what we find in most commodity cluster nodes -
>>> e.g. 10 disk channels and 4 GPUs.
>>>
>>> In contrast, MLlib was designed for horizontal scalability on commodity
>>> clusters and works best on very big datasets - order of terabytes.
>>>
>>> For the most part, these projects developed concurrently to address
>>> slightly different use cases. That said, there may be bits of BIDMach we
>>> could repurpose for MLlib - keep in mind we need to be careful about
>>> maintaining cross-language compatibility for our Java and Python users,
>>> though.
>>>
>>> - Evan
>>>
>>> [1] - http://arxiv.org/abs/1409.5402
>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>
>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>> [hidden email]<mailto: [hidden email]><mailto:
>>> [hidden email]<mailto: [hidden email]>>> wrote:
>>> Hi Evan,
>>>
>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
>>> know what makes it faster than netlib-java?
>>>
>>> The same group has the BIDMach library that implements machine learning. For
>>> some examples they use the Caffe convolutional neural network library owned by
>>> another group in Berkeley. Could you elaborate on how these all might be
>>> connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t
>>> you take BIDMach for optimization and learning?
>>>
>>> Best regards, Alexander
>>>
>>> From: Evan R. Sparks [mailto: [hidden email]<mailto:
>>> [hidden email]><mailto: [hidden email]<mailto:
>>> [hidden email]>>]
>>> Sent: Thursday, February 05, 2015 12:09 PM
>>> To: Ulanov, Alexander
>>> Cc: [hidden email]<mailto: [hidden email]><mailto:
>>> [hidden email]<mailto: [hidden email]>>
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
>>> many cases.
>>>
>>> You might consider taking a look at the codepaths that BIDMat (
>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
>>> to make this work really fast from Scala. I've run it on my laptop and
>>> compared to MKL and in certain cases it's 10x faster at matrix multiply.
>>> There are a lot of layers of indirection here and you really want to avoid
>>> data copying as much as possible.
>>>
>>> We could also consider swapping out Breeze for BIDMat, but that would be
>>> a big project and if we can figure out how to get breeze+cublas to
>>> comparable performance that would be a big win.
>>>
>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>> [hidden email]<mailto: [hidden email]><mailto:
>>> [hidden email]<mailto: [hidden email]>>> wrote:
>>> Dear Spark developers,
>>>
>>> I am exploring how to make linear algebra operations faster within Spark.
>>> One way of doing this is to use the Scala Breeze library that is bundled with
>>> Spark. For matrix operations, it employs Netlib-java, which has a Java
>>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
>>> binaries if they are available on the worker node. It also has its own
>>> optimized Java implementation of BLAS. It is worth mentioning that native
>>> binaries provide better performance only for BLAS level 3, i.e.
>>> matrix-matrix operations or general matrix multiplication (GEMM). This is
>>> confirmed by the GEMM test on the Netlib-java page
>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>> experiments with training of an artificial neural network
>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>> However, I would like to boost performance more.
>>>
>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
>>> implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia
>>> GPU and I was able to do the following. I linked cuBLAS (instead of
>>> CPU-based BLAS) with the Netlib-java wrapper and put it into Spark, so
>>> Breeze/Netlib is using it. Then I did some performance measurements with
>>> regard to artificial neural network batch learning in Spark MLlib, which
>>> involves matrix-matrix multiplications. It turns out that for matrices of
>>> size less than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS. cuBLAS
>>> becomes slower for bigger matrices. It is worth mentioning that it was not
>>> a test of ONLY multiplication, since there are other operations involved.
>>> One of the reasons for the slowdown might be the overhead of copying the
>>> matrices from computer memory to graphics card memory and back.
>>>
>>> So, a few questions:
>>> 1) Do these results with CUDA make sense?
>>> 2) If the problem is copy overhead, are there any libraries that can
>>> force intermediate results to stay in graphics card memory, thus
>>> removing the overhead?
>>> 3) Are there any other options to speed up linear algebra in Spark?
>>>
>>> Thank you, Alexander
>>>
>>
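
[Editor's sketch] The quoted discussion about forcing netlib-java onto a particular BLAS can be checked from inside the JVM. The sketch below reads the `com.github.fommil.netlib.BLAS` system property mentioned in the thread and, if netlib-java happens to be on the classpath, asks `BLAS.getInstance()` which implementation class was actually loaded. The class name `BlasBackendCheck` is hypothetical, and reflection is used only so the snippet compiles and runs without the library present.

```java
/** Reports which netlib-java BLAS backend (if any) the current JVM resolves. */
final class BlasBackendCheck {

    // System property used in the thread to select the netlib-java implementation.
    static final String BLAS_PROPERTY = "com.github.fommil.netlib.BLAS";

    /** Value of the override property, or a placeholder when it is not set. */
    static String requestedBackend() {
        return System.getProperty(BLAS_PROPERTY, "(not set; netlib-java picks its default)");
    }

    /** Class name of the loaded backend, or a note if netlib-java cannot be loaded. */
    static String loadedBackend() {
        try {
            Class<?> blas = Class.forName("com.github.fommil.netlib.BLAS");
            Object instance = blas.getMethod("getInstance").invoke(null);
            return instance.getClass().getName();
        } catch (ReflectiveOperationException | LinkageError e) {
            return "netlib-java not available: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println("requested: " + requestedBackend());
        System.out.println("loaded:    " + loadedBackend());
    }
}
```

Running this on a worker with the same classpath and environment as the Spark executor should show whether a native wrapper or the F2jBLAS fallback was picked up. It does not reveal which native library (OpenBLAS, MKL, cuBLAS) sits behind a native wrapper; that still depends on the system's library search path.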

To unsubscribe, email: [hidden email]
For additional commands, email: [hidden email]


Hi all,
I'm not surprised that the GPU is slow. The bottleneck is copying the
memory. Watch my talk, linked from the netlib-java GitHub page, to
understand further. The only way to currently make use of a GPU is to do
all the operations using the GPU's kernel. You can find some pre-packaged
high-level algorithms that do this, but it's extremely limiting.
I believe hardware will fix this problem eventually, so I still advocate
using the netlib primitives. I'm particularly interested in APU approaches
and I'm very interested in finding somebody to fund me to look into it.
It's too much work for a side project.
Look at the last few slides of my talk to see the potential performance
gains.
Best regards, Sam
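
[Editor's sketch] The copy bottleneck can be illustrated with a back-of-envelope model: one n x n double-precision GEMM moves roughly 3*n^2*8 bytes (A and B in, C out) and performs roughly 2*n^3 floating-point operations. The bandwidth and throughput figures below are assumed placeholders, not measurements from this thread, and the model ignores kernel launch latency and any extra JNI-side copies.

```java
/** Back-of-envelope model of host-to-GPU copy cost versus GEMM compute cost. */
final class CopyOverheadModel {

    /** Bytes moved for one n x n double GEMM: A and B in, C out. */
    static double bytesMoved(int n) {
        return 3.0 * n * n * Double.BYTES;
    }

    /** Approximate floating-point operations for one n x n GEMM. */
    static double flops(int n) {
        return 2.0 * n * n * n;
    }

    /** Seconds for `ops` multiplies when data is copied once and stays on the device. */
    static double residentSeconds(int n, int ops, double bytesPerSec, double flopsPerSec) {
        return bytesMoved(n) / bytesPerSec + ops * flops(n) / flopsPerSec;
    }

    /** Seconds for `ops` multiplies when every call copies its operands both ways. */
    static double copyEveryCallSeconds(int n, int ops, double bytesPerSec, double flopsPerSec) {
        return ops * (bytesMoved(n) / bytesPerSec + flops(n) / flopsPerSec);
    }

    public static void main(String[] args) {
        double bus = 6e9;   // ASSUMED effective host<->device bandwidth, bytes/s
        double gpu = 1e12;  // ASSUMED sustained GPU GEMM rate, flop/s
        int ops = 10;       // number of chained multiplies in one local iteration
        for (int n : new int[] {100, 1000, 10000}) {
            System.out.printf("n=%d  copy-every-call=%.6f s  resident=%.6f s%n",
                n, copyEveryCallSeconds(n, ops, bus, gpu), residentSeconds(n, ops, bus, gpu));
        }
    }
}
```

Under these assumptions the copy term matters most when matrices are small or when each call does little arithmetic per byte moved, which is consistent with the advice in this thread to keep intermediates on the device across several operations.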

