Using CUDA within Spark / boosting linear algebra

Re: Using CUDA within Spark / boosting linear algebra

jfcanny
Hi Reynold,
I left Chester with a copy of the slides, so I assume they'll be posted on the SF ML or Big Data sites. We have a draft paper under review. I can ask the co-authors about arxiv'ing it.

We have a few heuristics for power-law data. One of them is to keep the feature set sorted by frequency. Power-law data has roughly the same mass in each power-of-two range of feature frequency. By keeping the most frequent features together, you get a lot more value out of the caches on the device (even GPUs have them, albeit smaller ones). E.g. with 100 million features, 1/2 of the feature instances will be in the range 1,...,10,000. If they're consecutive, they will all hit a fast cache. Another 1/4 will be in 1,...,1,000,000, hitting the next cache, etc.
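For illustration, here is a minimal sketch (not the actual BIDMach code) of that remapping step, reassigning feature ids in descending order of frequency so the hottest features stay adjacent:

// Sketch: renumber features so the most frequent ones get the smallest ids.
// Hot features then sit next to each other and stay resident in fast caches.
def frequencyRemap(featureCounts: Map[Int, Long]): Map[Int, Int] =
  featureCounts.toSeq
    .sortBy { case (_, count) => -count }                // most frequent first
    .zipWithIndex
    .map { case ((oldId, _), newId) => oldId -> newId }
    .toMap

// Apply the remap once to every (featureId, value) pair before building the matrices.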

Another is to subdivide sparse matrices using the vector of elements rather than rows or columns. Splitting power-law matrices by either rows or columns gives very uneven splits. That means we store sparse matrices in coordinate form rather than compressed row or column format.
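For example (again a sketch, not our internal code), with coordinate storage you can cut the element arrays themselves, so every piece carries the same number of non-zeros no matter how skewed the rows and columns are:

// Sketch: coordinate (COO) storage and an even split by non-zero count.
case class CooMatrix(rows: Array[Int], cols: Array[Int], vals: Array[Double])

def splitByElements(m: CooMatrix, parts: Int): Seq[CooMatrix] = {
  val nnz = m.vals.length
  (0 until parts).map { p =>
    val from = p * nnz / parts        // each part gets ~nnz/parts elements,
    val to = (p + 1) * nnz / parts    // unlike a row or column split
    CooMatrix(m.rows.slice(from, to), m.cols.slice(from, to), m.vals.slice(from, to))
  }
}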

Other than that, rooflining gives you a goal that you should be able to reach. If you aren't at the limit, just knowing that gives you a target to aim at. You can try profiling the kernel to figure out why it's slower than it should be. There are a few common reasons (low occupancy, imbalanced thread blocks, thread divergence) that you can discover with the profiler. Then hopefully you can solve them.
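The bound itself is simple to compute. A sketch with illustrative numbers (the ~200 GB/s streaming figure appears in the quoted message below; the peak compute number is made up for illustration):

// Roofline model: attainable GFLOP/s = min(peak compute, bandwidth * intensity),
// where intensity is the number of flops per byte moved to or from memory.
def roofline(peakGflops: Double, bandwidthGBs: Double, flopsPerByte: Double): Double =
  math.min(peakGflops, bandwidthGBs * flopsPerByte)

// Sparse matrix-vector multiply has intensity well below 1 flop/byte, so it
// is bandwidth-bound: min(4000, 200 * 0.25) = 50 GFLOP/s is the target.
val spmvBound = roofline(4000.0, 200.0, 0.25)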

-John


On 3/12/2015 10:56 PM, rxin [via Apache Spark Developers List] wrote:
Thanks for chiming in, John. I missed your meetup last night - do you have
any writeups or slides about roofline design? In particular, I'm curious
about what optimizations are available for power-law dense * sparse? (I
don't have any background in optimizations)



On Thu, Mar 12, 2015 at 8:50 PM, jfcanny <[hidden email]> wrote:

> If you're contemplating GPU acceleration in Spark, it's important to look
> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
> datasets we've tested in BIDMach, and we've tried to make them
> representative of industry machine learning workloads. Unless you're
> crunching images or audio, the majority of data will be very sparse and
> power-law distributed. You need a good sparse BLAS, and in practice it
> seems like you need a sparse BLAS tailored for power-law data. We had to
> write our own, since the NVIDIA libraries didn't perform well on typical
> power-law data. The Intel MKL sparse BLAS also have issues, and we only
> use some of them.
>
> You also need 2D reductions, scan operations, slicing, element-wise
> transcendental functions and operators, many kinds of sort, random number
> generators, etc., and some kind of memory management strategy. Some of
> this was layered on top of Thrust in BIDMat, but most had to be written
> from scratch. It's all been rooflined, typically to the memory throughput
> of current GPUs (around 200 GB/s).
>
> When you have all this you can write learning algorithms with the same
> high-level primitives available in Breeze or Numpy/Scipy. It's literally
> the same in BIDMat, since the generic matrix operations are implemented on
> both CPU and GPU, so the same code runs on either platform.
>
> A lesser-known fact is that GPUs are around 10x faster for *all* those
> operations, not just dense BLAS. It's mostly due to faster streaming
> memory speeds, but some kernels (random number generation and
> transcendentals) are more than an order of magnitude faster, thanks to
> some specialized hardware for power series on the GPU chip.
>
> When you have all this there is no need to move data back and forth across
> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
> and feed them to the available GPUs. Most models fit comfortably in GPU
> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
> data through the GPU this way.
>

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
Hi,

I am trying to use nvblas with netlib-java from Spark. With LD_PRELOAD set as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, nvblas should intercept the current BLAS calls without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark. I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
In nvidia-smi I observe that Java is using the GPU:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      8873    C   bash                                            39MiB |
|    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
+-----------------------------------------------------------------------------+

In Spark shell I do matrix multiplication and see the following:
15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
So I am sure that netlib-native is loaded and cblas is supposedly used. However, matrix multiplication executes on the CPU: I see 16% CPU load and 0% GPU load. I also checked different matrix sizes, from 100x100 to 12000x12000.
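For reference, this is roughly what I run in the shell (a sketch; com.github.fommil.netlib.BLAS.getInstance() is netlib-java's entry point that reports the loaded implementation):

// In spark-shell: print which BLAS implementation netlib-java bound to,
// then time a large multiplication that Breeze routes to dgemm through it.
import breeze.linalg.DenseMatrix
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)

val n = 4096
val a = DenseMatrix.rand(n, n)
val b = DenseMatrix.rand(n, n)
val t0 = System.nanoTime
val c = a * b
println(s"gemm took ${(System.nanoTime - t0) / 1e9} s")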

Could you suggest why LD_PRELOAD might not affect the Spark shell?

Best regards, Alexander



From: Sam Halliday [mailto:[hidden email]]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]> wrote:
Hi Everyone, I've updated the benchmark as Xiangrui suggested: added a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code) and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-----Original Message-----
From: Sam Halliday [mailto:[hidden email]]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <[hidden email]> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]> wrote:
>> Better documentation for linking would be very helpful!  Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK-6019
>>
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <[hidden email]> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks,
>>> Alex. The big takeaways here can be seen with this chart:
>>>
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> BIDMat+GPU) can provide a substantial (but less than an order of
>>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>>> BIDMat+MKL or netlib-java+openblas-compiled).
>>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
>>> can be 1-2 orders of magnitude worse than a well-tuned CPU
>>> implementation, particularly for larger matrices. This is not to
>>> pick on netlib - this basically agrees with the author's own
>>> benchmarks (https://github.com/fommil/netlib-java)
>>>
>>> I think that most of our users are in a situation where using GPUs
>>> may not be practical - although we could consider having a good GPU
>>> backend available as an option. However, *ALL* users of MLlib could
>>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>> BLAS implementation. Perhaps we should consider updating the mllib
>>> guide with a more complete section for enabling high performance
>>> binaries on OSX and Linux? Or better, figure out a way for the
>>> system to fetch these automatically.
>>>
>>> - Evan
>>>
>>>
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>
>>>> Just to summarize this thread, I was finally able to make all
>>>> performance comparisons that we discussed. It turns out that:
>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>>
>>>> Below is the link to the spreadsheet with full results.
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> One thing still needs exploration: does BIDMat-cublas perform
>>>> copying to/from machine’s RAM?
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> To: Evan R. Sparks
>>>> Cc: Joseph Bradley; [hidden email]
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
>>>> the original one discusses a slightly different topic. I was able to
>>>> link netlib with the MKL from the BIDMat binaries. Indeed, MKL is
>>>> statically linked inside a 60MB library.
>>>>
>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>> |-------------------------|-------------|---------------------------------|----------------------------------------|-----------------------|
>>>> | 100x100*100x100         | 0.00205596  | 0.000381                        | 0.03810324                             | 0.002556              |
>>>> | 1000x1000*1000x1000     | 0.018320947 | 0.038316857                     | 0.51803557                             | 1.638475459           |
>>>> | 10000x10000*10000x10000 | 23.78046632 | 32.94546697                     | 445.0935211                            | 1569.233228           |
>>>>
>>>> It turns out that pre-compiled MKL is faster than pre-compiled
>>>> OpenBlas on my machine. I'll probably add two more columns with
>>>> locally compiled openblas and cuda.
>>>>
>>>> Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Great - perhaps we can move this discussion off-list and onto a
>>>> JIRA ticket? (Here's one:
>>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>>>
>>>> It seems like this is going to be somewhat exploratory for a while
>>>> (and there's probably only a handful of us who really care about
>>>> fast linear
>>>> algebra!)
>>>>
>>>> - Evan
>>>>
>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for the explanation and the useful link. I am going to build
>>>> OpenBLAS, link it with Netlib-java and perform the benchmark again.
>>>>
>>>> Do I understand correctly that the BIDMat binaries contain statically
>>>> linked Intel MKL BLAS? That might be the reason why I am able to run
>>>> BIDMat without having MKL BLAS installed on my server. If so, I
>>>> wonder if it is OK, because Intel sells this library. Nevertheless,
>>>> it seems that in my case precompiled MKL BLAS performs better than
>>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to have comparable JNI overheads.
>>>>
>>>> Though it might be interesting to link Netlib-java with Intel MKL,
>>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
>>>> Halliday (Netlib-java) are interested in comparing their libraries.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>>
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>>> from getting cache sizes, etc. set up correctly for your particular
>>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>>> quickly and yields performance competitive with MKL.
>>>>
>>>> To make sure the right library is getting used, you have to make
>>>> sure it's first on the search path - export
>>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>>
>>>> For some examples of getting netlib-java set up on an EC2 node and
>>>> some example benchmarking code we ran a while back, see:
>>>> https://github.com/shivaram/matrix-bench
>>>>
>>>> In particular - build-openblas-ec2.sh shows you how to build the
>>>> library and set up symlinks correctly, and scala/run-netlib.sh
>>>> shows you how to get the path setup and get that library picked up by netlib-java.
>>>>
>>>> In this way - you could probably get cuBLAS set up to be used by
>>>> netlib-java as well.
>>>>
>>>> - Evan
>>>>
>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>>> load the right blas? For netlib, there are a few JVM flags, such as
>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>>> which force it to use the Java implementation. I am not sure I
>>>> understand how to force the use of a specific blas (not a specific wrapper for blas).
>>>>
>>>> Btw. I have installed openblas (yum install openblas), so I suppose
>>>> that netlib is using it.
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; [hidden email]
>>>>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Getting breeze to pick up the right blas library is critical for
>>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>>> It might make sense to force BIDMat to use the same underlying BLAS
>>>> library as well.
>>>>
>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>> Hi Evan, Joseph
>>>>
>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>>>> faster than netlib-java+breeze (sorry for the weird table formatting):
>>>>
>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>> |-------------------------|-------------|-----------------------------------------------|----------------------------|
>>>> | 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
>>>> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
>>>> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>>>>
>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>>> 19 Linux, Scala 2.11.
>>>>
>>>> Later I will run tests with CUDA; I need to install a new CUDA
>>>> version for this purpose.
>>>>
>>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>> slower than BIDMat MKL?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Joseph Bradley [mailto:[hidden email]]
>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Evan R. Sparks; [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph
>>>>
>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
>>>> John Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>>>>
>>>> I am very interested to find out which will be better within Spark:
>>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>>>> fair way to benchmark them? Currently I run benchmarks on artificial
>>>> neural networks in batch mode. While that is not a “pure” test of
>>>> linear algebra, it involves some other things that are essential to machine learning.
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>>>> data layout and fewer levels of indirection - it's definitely a
>>>> worthwhile experiment to run. The main speedups I've seen from
>>>> using it come from highly optimized GPU code for linear algebra. I
>>>> know that in the past Canny has gone as far as to write custom GPU
>>>> kernels for performance-critical regions of code.[1]
>>>>
>>>> BIDMach is highly optimized for single-node performance or
>>>> performance on small clusters.[2] Once data doesn't fit easily in
>>>> GPU memory (or can't be batched in that way) the performance tends to
>>>> fall off. Canny argues for hardware/software codesign and as such
>>>> prefers machine configurations that are quite different from what
>>>> we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
>>>>
>>>> In contrast, MLlib was designed for horizontal scalability on
>>>> commodity clusters and works best on very big datasets - order of terabytes.
>>>>
>>>> For the most part, these projects developed concurrently to address
>>>> slightly different use cases. That said, there may be bits of
>>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>>> careful about maintaining cross-language compatibility for our Java
>>>> and Python users, though.
>>>>
>>>> - Evan
>>>>
>>>> [1] - http://arxiv.org/abs/1409.5402 [2] -
>>>> http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>
>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>>>> you know what makes it faster than netlib-java?
>>>>
>>>> The same group has the BIDMach library that implements machine
>>>> learning. For some examples they use the Caffe convolutional neural
>>>> network library owned by another group at Berkeley. Could you
>>>> elaborate on how these all might be connected with Spark MLlib? If
>>>> you take BIDMat for linear algebra, why don’t you take BIDMach for optimization and learning?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> To: Ulanov, Alexander
>>>> Cc: [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>>> blas in many cases.
>>>>
>>>> You might consider taking a look at the codepaths that BIDMat (
>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>>>> optimizing to make this work really fast from Scala. I've run it on
>>>> my laptop and compared to MKL, and in certain cases it's 10x faster at matrix multiply.
>>>> There are a lot of layers of indirection here and you really want
>>>> to avoid data copying as much as possible.
>>>>
>>>> We could also consider swapping in BIDMat for Breeze, but that
>>>> would be a big project, and if we can figure out how to get
>>>> breeze+cublas to comparable performance, that would be a big win.
>>>>
>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
>>>> Dear Spark developers,
>>>>
>>>> I am exploring how to make linear algebra operations faster within Spark.
>>>> One way of doing this is to use the Scala Breeze library that is
>>>> bundled with Spark. For matrix operations, it employs Netlib-java,
>>>> which has a Java wrapper for the BLAS (basic linear algebra
>>>> subprograms) and LAPACK native binaries if they are available on the
>>>> worker node. It also has its own optimized Java implementation of
>>>> BLAS. It is worth mentioning that native binaries provide better performance only for BLAS level 3, i.e.
>>>> matrix-matrix operations or general matrix multiplication (GEMM).
>>>> This is confirmed by GEMM test on Netlib-java page
>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>>> experiments with training of artificial neural network
>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>> However, I would like to boost performance more.
>>>>
>>>> GPUs are supposed to be fast at linear algebra, and there is an
>>>> Nvidia CUDA implementation of BLAS called cublas. I have one Linux
>>>> server with an Nvidia GPU, and I was able to do the following. I
>>>> linked cublas (instead of a CPU-based blas) with the Netlib-java
>>>> wrapper and put it into Spark, so Breeze/Netlib is using it. Then I
>>>> did some performance measurements of artificial neural network batch
>>>> learning in Spark MLlib, which involves matrix-matrix multiplications.
>>>> It turns out that for matrices smaller than ~1000x780, GPU cublas has
>>>> the same speed as CPU blas, and cublas becomes slower for bigger
>>>> matrices. It's worth mentioning that this was not a test of ONLY
>>>> multiplication, since other operations are involved as well. One
>>>> reason for the slowdown might be the overhead of copying the matrices
>>>> from computer memory to graphics card memory and back.
>>>>
>>>> So, a few questions:
>>>> 1) Do these results with CUDA make sense?
>>>> 2) If the problem is copy overhead, are there any libraries that
>>>> allow intermediate results to stay in graphics card memory, thus
>>>> removing the overhead?
>>>> 3) Any other options to speed up linear algebra in Spark?
>>>>
>>>> Thank you, Alexander
>>>>

--
Best regards,
Sam

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big Double matrices, faster than BIDMat-cuda with Float. But for smaller matrices, if you have to copy them to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with the original nvblas presentation at GPU Tech Conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
 
My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing 

Just in case: these tests are not meant to generalize the performance of the different libraries. I just want to pick the library that does dense matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it exposes Fortran blas functions, while netlib-java uses the C cblas functions. So one needs a cblas shared library to use nvblas through netlib-java. Fedora does not ship cblas (Debian and Ubuntu do), so I had to compile it. I could not use the cblas from Atlas or Openblas because they link to their own implementations and not to Fortran blas.

Best regards, Alexander


Re: Using CUDA within Spark / boosting linear algebra

Dmitriy Lyubimov
Alexander,

Does using netlib imply that one cannot switch between CPU and GPU BLAS
alternatives at will at the same time? The choice is always determined by
linking alternatives to libblas.so, right?
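If I read netlib-java right, the implementation is chosen once, when the BLAS class loads (from the -Dcom.github.fommil.netlib.BLAS property or the first native libblas it finds), so switching afterwards would mean restarting the JVM with a different LD_PRELOAD or alternatives setting. A one-line sketch to see what got picked up:

// Prints the implementation netlib-java bound to at class-load time,
// e.g. ...F2jBLAS or ...NativeSystemBLAS; it does not change afterwards.
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)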

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <[hidden email]>
wrote:

> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has
> exceptional performance for big matrices with Double, faster than
> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
> original nvblas presentation on GPU conf 2013 (slide 21):
> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results:
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case, these tests are not for generalization of performance of
> different libraries. I just want to pick a library that does at best dense
> matrices multiplication for my task.
>
> P.S. My previous issue with nvblas was the following: it has Fortran blas
> functions, at the same time netlib-java uses C cblas functions. So, one
> needs cblas shared library to use nvblas through netlib-java. Fedora does
> not have cblas (but Debian and Ubuntu have), so I needed to compile it. I
> could not use cblas from Atlas or Openblas because they link to their
> implementation and not to Fortran blas.
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. nvblas functions
> should replace current blas functions calls after executing LD_PRELOAD as
> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
> changes to netlib-java. It seems to work for simple Java example, but I
> cannot make it work with Spark. I run the following:
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
> --driver-memory 4G In nvidia-smi I observe that Java is to use GPU:
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU
> Memory |
> |  GPU       PID  Type  Process name                               Usage
>     |
>
> |=============================================================================|
> |    0      8873    C   bash
> 39MiB |
> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java
> 39MiB |
>
> +-----------------------------------------------------------------------------+
>
> In Spark shell I do matrix multiplication and see the following:
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
> So I am sure that netlib-native is loaded and cblas supposedly used.
> However, matrix multiplication does executes on CPU since I see 16% of CPU
> used and 0% of GPU used. I also checked different matrix sizes, from
> 100x100 to 12000x12000
>
> Could you suggest might the LD_PRELOAD not affect Spark shell?
>
> Best regards, Alexander
>
>
>
> From: Sam Halliday [mailto:[hidden email]]
> Sent: Monday, March 09, 2015 6:01 PM
> To: Ulanov, Alexander
> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
>
> Thanks so much for following up on this!
>
> Hmm, I wonder if we should have a concerted effort to chart performance on
> various pieces of hardware...
> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]<mailto:
> [hidden email]>> wrote:
> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the
> comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the
> support of Double in the current source code), did the test with BIDMat and
> CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Sam Halliday [mailto:[hidden email]<mailto:
> [hidden email]>]
> Sent: Tuesday, March 03, 2015 1:54 PM
> To: Xiangrui Meng; Joseph Bradley
> Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]<mailto:
> [hidden email]>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> BTW, is anybody on this list going to the London Meetup in a few weeks?
>
>
> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>
> Would be nice to meet other people working on the guts of Spark! :-)
>
>
> Xiangrui Meng <[hidden email]<mailto:[hidden email]>> writes:
>
> > Hey Alexander,
> >
> > I don't quite understand the part where netlib-cublas is about 20x
> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
> > with netlib-java?
> >
> > CC'ed Sam, the author of netlib-java.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]
> <mailto:[hidden email]>> wrote:
> >> Better documentation for linking would be very helpful!  Here's a JIRA:
> >> https://issues.apache.org/jira/browse/SPARK-6019
> >>
> >>
> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
> >> <[hidden email]<mailto:[hidden email]>>
> >> wrote:
> >>
> >>> Thanks for compiling all the data and running these benchmarks,
> >>> Alex. The big takeaways here can be seen with this chart:
> >>>
> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZ
> >>> Hl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>>
> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
> >>> BIDMat+GPU) can provide substantial (but less than an order of
> >>> BIDMat+magnitude)
> >>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
> >>> netlib-java+openblas-compiled).
> >>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
> >>> worse than a well-tuned CPU implementation, particularly for larger
> matrices.
> >>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
> >>> basically agrees with the authors own benchmarks (
> >>> https://github.com/fommil/netlib-java)
> >>>
> >>> I think that most of our users are in a situation where using GPUs
> >>> may not be practical - although we could consider having a good GPU
> >>> backend available as an option. However, *ALL* users of MLlib could
> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
> >>> BLAS implementation. Perhaps we should consider updating the mllib
> >>> guide with a more complete section for enabling high performance
> >>> binaries on OSX and Linux? Or better, figure out a way for the
> >>> system to fetch these automatically.
> >>>
> >>> - Evan
> >>>
> >>>
> >>>
> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
> >>> [hidden email]<mailto:[hidden email]>> wrote:
> >>>
> >>>> Just to summarize this thread, I was finally able to make all
> >>>> performance comparisons that we discussed. It turns out that:
> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >>>>
> >>>> Below is the link to the spreadsheet with full results.
> >>>>
> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx
> >>>> 378T9J5r7kwKSPkY/edit?usp=sharing
> >>>>
> >>>> One thing still needs exploration: does BIDMat-cublas perform
> >>>> copying to/from machine’s RAM?
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ulanov, Alexander
> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>>> To: Evan R. Sparks
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
> >>>> the original one discusses a slightly different topic. I was able to
> >>>> link netlib with MKL from the BIDMat binaries. Indeed, MKL is
> >>>> statically linked inside a 60MB library.
> >>>>
> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
> >>>> +------------------------+-------------+---------------------------------+----------------------------------------+-----------------------+
> >>>> |100x100*100x100         | 0.00205596  | 0.000381                        | 0.03810324                             | 0.002556              |
> >>>> |1000x1000*1000x1000     | 0.018320947 | 0.038316857                     | 0.51803557                             | 1.638475459           |
> >>>> |10000x10000*10000x10000 | 23.78046632 | 32.94546697                     | 445.0935211                            | 1569.233228           |
> >>>>
> >>>> It turns out that pre-compiled MKL is faster than pre-compiled
> >>>> OpenBlas on my machine. Probably I'll add two more columns with
> >>>> locally compiled OpenBLAS and CUDA.
> >>>>
> >>>> Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Monday, February 09, 2015 6:06 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Great - perhaps we can move this discussion off-list and onto a
> >>>> JIRA ticket? (Here's one:
> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
> >>>>
> >>>> It seems like this is going to be somewhat exploratory for a while
> >>>> (and there's probably only a handful of us who really care about
> >>>> fast linear
> >>>> algebra!)
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for explanation and useful link. I am going to build
> >>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
> >>>>
> >>>> Do I understand correctly that the BIDMat binaries contain statically
> >>>> linked Intel MKL BLAS? That might be the reason why I am able to run
> >>>> BIDMat without having MKL BLAS installed on my server. If so, I
> >>>> wonder whether that is OK, because Intel sells this library.
> >>>> Nevertheless, it seems that in my case precompiled MKL BLAS performs
> >>>> better than precompiled OpenBLAS, given that BIDMat and Netlib-java
> >>>> are supposed to be on par in JNI overheads.
> >>>>
> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
> >>>> Halliday (Netlib-java) would be interested in comparing their
> >>>> libraries.
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Friday, February 06, 2015 5:58 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
> >>>> from getting cache sizes, etc. set up correctly for your particular
> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
> >>>> quickly and yields performance competitive with MKL.
> >>>>
> >>>> To make sure the right library is getting used, you have to make
> >>>> sure it's first on the search path - export
> >>>> LD_LIBRARY_PATH=/path/to/blas (the directory containing the library)
> >>>> will do the trick here.
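
A minimal sanity check that the intended library was actually picked up, assuming only that netlib-java is on the classpath: print the implementation class it resolved at runtime (paste into spark-shell or any Scala REPL).

    // NativeSystemBLAS means a native library was found on the search path;
    // F2jBLAS means netlib-java fell back to its pure-Java implementation.
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)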
> >>>>
> >>>> For some examples of getting netlib-java set up on an ec2 node and
> >>>> some example benchmarking code we ran a while back, see:
> >>>> https://github.com/shivaram/matrix-bench
> >>>>
> >>>> In particular - build-openblas-ec2.sh shows you how to build the
> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
> >>>> shows you how to get the path set up and have that library picked up
> >>>> by netlib-java.
> >>>>
> >>>> In this way - you could probably get cuBLAS set up to be used by
> >>>> netlib-java as well.
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
> >>>> load the right blas? For netlib, there are a few JVM flags, such as
> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
> >>>> so I can force it to use the Java implementation. I am not sure I
> >>>> understand how to force the use of a specific blas (not a specific
> >>>> wrapper for blas).
> >>>>
> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
> >>>> that netlib is using it.
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Friday, February 06, 2015 5:19 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>>
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Getting breeze to pick up the right blas library is critical for
> >>>> performance. I recommend using OpenBLAS (or MKL, if you already
> >>>> have it).
> >>>> It might make sense to force BIDMat to use the same underlying BLAS
> >>>> library as well.
> >>>>
> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Hi Evan, Joseph
> >>>>
> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
> >>>>
> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> >>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
> >>>> |100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
> >>>> |1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
> >>>> |10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
> >>>>
> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
> >>>> 19 Linux, Scala 2.11.
> >>>>
> >>>> Later I will run tests with CUDA. I need to install a new CUDA
> >>>> version for this purpose.
> >>>>
> >>>> Do you have any ideas why breeze-netlib with native blas is so much
> >>>> slower than BIDMat MKL?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Joseph Bradley [mailto:[hidden email]]
> >>>> Sent: Thursday, February 05, 2015 5:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Evan R. Sparks; [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> Using GPUs with Spark would be very exciting.  Small comment:
> >>>> Concerning your question earlier about keeping data stored on the
> >>>> GPU rather than having to move it between main memory and GPU
> >>>> memory on each iteration, I would guess this would be critical to
> >>>> getting good performance.  If you could do multiple local
> >>>> iterations before aggregating results, then the cost of data
> >>>> movement to the GPU could be amortized (and I believe that is done
> >>>> in practice).  Having Spark be aware of the GPU and using it as
> >>>> another part of memory sounds like a much bigger undertaking.
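
A rough sketch of that amortization idea; every type and method here is assumed for illustration, not an existing Spark or BIDMat API:

    trait GpuBuffer { def free(): Unit }

    trait Gpu {
      def upload(data: Array[Double]): GpuBuffer   // host-to-device copy
      def localStep(buf: GpuBuffer, w: Array[Double]): Array[Double]
    }

    // Pay the transfer once per partition, run several local iterations on
    // the device, and only then ship the weights back for aggregation.
    def localIterations(gpu: Gpu, data: Array[Double],
                        w0: Array[Double], steps: Int): Array[Double] = {
      val buf = gpu.upload(data)
      try (1 to steps).foldLeft(w0)((w, _) => gpu.localStep(buf, w))
      finally buf.free()
    }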
> >>>>
> >>>> Joseph
> >>>>
> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Thank you for the explanation! I've watched the BIDMach presentation
> >>>> by John Canny and I am really inspired by his talk and comparisons
> >>>> with Spark MLlib.
> >>>>
> >>>> I am very interested to find out what will be better within Spark:
> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
> >>>> neural networks in batch mode. While it is not a “pure” test of
> >>>> linear algebra, it involves some other things that are essential to
> >>>> machine learning.
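
On the JVM a fair comparison also needs JIT warm-up before timing; a minimal sketch of such a harness for the GEMM case, using Breeze (the size and repeat counts are arbitrary):

    import breeze.linalg.DenseMatrix

    def timeGemm(n: Int, warmup: Int = 5, reps: Int = 10): Double = {
      val a = DenseMatrix.rand(n, n)
      val b = DenseMatrix.rand(n, n)
      for (_ <- 1 to warmup) a * b               // let the JIT settle first
      val start = System.nanoTime()
      for (_ <- 1 to reps) a * b
      (System.nanoTime() - start) / 1e9 / reps   // seconds per multiply
    }

    println(f"1000x1000 GEMM: ${timeGemm(1000)}%.4f s")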
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Thursday, February 05, 2015 1:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
> >>>> data layout and fewer levels of indirection - it's definitely a
> >>>> worthwhile experiment to run. The main speedups I've seen from
> >>>> using it come from highly optimized GPU code for linear algebra. I
> >>>> know that in the past Canny has gone as far as to write custom GPU
> >>>> kernels for performance-critical regions of code.[1]
> >>>>
> >>>> BIDMach is highly optimized for single-node performance or
> >>>> performance on small clusters.[2] Once data doesn't fit easily in
> >>>> GPU memory (or can't be batched in that way) the performance tends
> >>>> to fall off. Canny argues for hardware/software codesign and as such
> >>>> prefers machine configurations that are quite different from what
> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels and
> >>>> 4 GPUs.
> >>>>
> >>>> In contrast, MLlib was designed for horizontal scalability on
> >>>> commodity clusters and works best on very big datasets - on the
> >>>> order of terabytes.
> >>>>
> >>>> For the most part, these projects developed concurrently to address
> >>>> slightly different use cases. That said, there may be bits of
> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
> >>>> careful about maintaining cross-language compatibility for our Java
> >>>> and Python users, though.
> >>>>
> >>>> - Evan
> >>>>
> >>>> [1] http://arxiv.org/abs/1409.5402
> >>>> [2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>>
> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
> >>>> you know what makes it faster than netlib-java?
> >>>>
> >>>> The same group has the BIDMach library that implements machine
> >>>> learning. For some examples they use the Caffe convolutional neural
> >>>> network library developed by another group at Berkeley. Could you
> >>>> elaborate on how all of these might be connected with Spark MLlib? If
> >>>> you take BIDMat for linear algebra, why don't you take BIDMach for
> >>>> optimization and learning?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Thursday, February 05, 2015 12:09 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
> >>>> blas in many cases.
> >>>>
> >>>> You might consider taking a look at the codepaths that BIDMat (
> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
> >>>> optimizing to make this work really fast from Scala. I've run it on
> >>>> my laptop and compared to MKL, and in certain cases it's 10x faster
> >>>> at matrix multiply.
> >>>> There are a lot of layers of indirection here and you really want
> >>>> to avoid data copying as much as possible.
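
To make the data-copying point concrete: a Breeze DenseMatrix is backed by a flat column-major array, so one can hand that array straight to dgemm with no intermediate copies. A sketch, assuming plain (non-transposed, zero-offset) matrices:

    import breeze.linalg.DenseMatrix
    import com.github.fommil.netlib.BLAS

    def gemm(a: DenseMatrix[Double], b: DenseMatrix[Double]): DenseMatrix[Double] = {
      require(a.cols == b.rows, "dimension mismatch")
      val c = DenseMatrix.zeros[Double](a.rows, b.cols)
      // C := 1.0 * A * B + 0.0 * C, written straight into c's backing array
      BLAS.getInstance().dgemm("N", "N", a.rows, b.cols, a.cols,
        1.0, a.data, a.rows, b.data, b.rows, 0.0, c.data, c.rows)
      c
    }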
> >>>>
> >>>> We could also consider swapping out BIDMat for Breeze, but that
> >>>> would be a big project and if we can figure out how to get
> >>>> breeze+cublas to comparable performance that would be a big win.
> >>>>
> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Dear Spark developers,
> >>>>
> >>>> I am exploring how to make linear algebra operations faster within
> >>>> Spark. One way of doing this is to use the Scala Breeze library that
> >>>> is bundled with Spark. For matrix operations, it employs Netlib-java,
> >>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
> >>>> and LAPACK native binaries if they are available on the worker node.
> >>>> It also has its own optimized Java implementation of BLAS. It is
> >>>> worth mentioning that native binaries provide better performance only
> >>>> for BLAS level 3, i.e. matrix-matrix operations or general matrix
> >>>> multiplication (GEMM).
> >>>> This is confirmed by GEMM test on Netlib-java page
> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my
> >>>> experiments with training of artificial neural network
> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
> >>>> However, I would like to boost performance more.
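
Concretely, the level-1 vs level-3 distinction looks like this in Breeze (a small sketch; the sizes are arbitrary):

    import breeze.linalg.{DenseMatrix, DenseVector}

    val x = DenseVector.rand(1000000)
    val y = DenseVector.rand(1000000)
    val z = x + y     // level-1 style work: memory-bound, natives barely help

    val a = DenseMatrix.rand(2000, 2000)
    val b = DenseMatrix.rand(2000, 2000)
    val c = a * b     // level 3 (GEMM): compute-bound, native BLAS pays off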
> >>>>
> >>>> GPUs are supposed to be fast at linear algebra, and there is an
> >>>> Nvidia CUDA implementation of BLAS, called cuBLAS. I have one Linux
> >>>> server with an Nvidia GPU and I was able to do the following. I
> >>>> linked cublas (instead of cpu-based blas) with the Netlib-java
> >>>> wrapper and put it into Spark, so Breeze/Netlib is using it. Then I
> >>>> did some performance measurements with regards to artificial neural
> >>>> network batch learning in Spark MLlib that involves matrix-matrix
> >>>> multiplications. It turns out that for matrices of size less than
> >>>> ~1000x780, GPU cublas has the same speed as CPU blas. Cublas becomes
> >>>> slower for bigger matrices. It is worth mentioning that this was not
> >>>> a test of ONLY multiplication, since there are other operations
> >>>> involved. One of the reasons for the slowdown might be the overhead
> >>>> of copying the matrices from main memory to graphics card memory and
> >>>> back.
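
A back-of-the-envelope check of that copy-overhead hypothesis; the GPU and PCIe figures below are assumed round numbers, not measurements:

    val n = 1000.0
    val flops = 2 * n * n * n             // ~2*n^3 multiply-adds for C = A * B
    val bytes = 3 * n * n * 8             // ship A and B to the GPU, C back
    val gpuFlopsPerSec  = 1e12            // assumed ~1 Tflop/s device throughput
    val pcieBytesPerSec = 8e9             // assumed ~8 GB/s effective PCIe bandwidth
    val computeMs  = flops / gpuFlopsPerSec * 1000
    val transferMs = bytes / pcieBytesPerSec * 1000
    println(f"compute ~ $computeMs%.1f ms, transfer ~ $transferMs%.1f ms")
    // Around n = 1000 the copies cost as much as the multiply itself, which
    // matches the ~1000x780 crossover reported above.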
> >>>>
> >>>> So, a few questions:
> >>>> 1) Do these results with CUDA make sense?
> >>>> 2) If the problem is the copy overhead, are there any libraries that
> >>>> allow forcing intermediate results to stay in graphics card memory,
> >>>> thus removing the overhead?
> >>>> 3) Any other options to speed up linear algebra in Spark?
> >>>>
> >>>> Thank you, Alexander
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [hidden email]
> >>>> For additional commands, e-mail: [hidden email]
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
>
> --
> Best regards,
> Sam
>
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

Evan R. Sparks
In reply to this post by Ulanov, Alexander
Alex - great stuff, and the nvblas numbers are pretty remarkable (almost
too good... did you check the results for correctness? - also, is it
possible that the "unified memory model" of nvblas is somehow hiding PCI
transfer time?)

This last bit (getting nvblas + netlib-java to play together) sounds like
it's non-trivial and took you a while to figure out! Would you mind posting
a gist or something of maybe the shell scripts/exports you used to make
this work - I can imagine it being highly useful for others in the future.

Thanks!
Evan

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <[hidden email]>
wrote:

> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has
> exceptional performance for big matrices with Double, faster than
> BIDMat-cuda with Float. But for smaller matrices, if you copy them
> to/from the GPU, OpenBlas or MKL might be a better choice. This
> correlates with the original nvblas presentation at GPU conf 2013
> (slide 21):
> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results:
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case: these tests are not meant to generalize the performance of
> different libraries. I just want to pick the library that is best at
> dense matrix multiplication for my task.
>
> P.S. My previous issue with nvblas was the following: it has Fortran blas
> functions, while netlib-java uses C cblas functions. So one needs a cblas
> shared library to use nvblas through netlib-java. Fedora does not ship
> cblas (but Debian and Ubuntu do), so I needed to compile it. I could not
> use cblas from Atlas or Openblas because they link to their own
> implementation and not to Fortran blas.
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. nvblas functions
> should replace the current blas function calls after executing LD_PRELOAD,
> as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any
> changes to netlib-java. It seems to work for a simple Java example, but I
> cannot make it work with Spark. I run the following:
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
> In nvidia-smi I observe that Java is set to use the GPU:
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                                    Usage |
> |=============================================================================|
> |    0      8873    C   bash                                            39MiB |
> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
> +-----------------------------------------------------------------------------+
>
> In the Spark shell I do a matrix multiplication and see the following:
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
> So I am sure that netlib-native is loaded and cblas is supposedly used.
> However, the matrix multiplication executes on the CPU, since I see 16%
> CPU usage and 0% GPU usage. I also checked different matrix sizes, from
> 100x100 to 12000x12000.
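
For what it's worth, a small spark-shell paste that keeps GEMM running long enough to watch utilization in nvidia-smi (a sketch; the size and iteration count are arbitrary):

    import breeze.linalg.DenseMatrix

    val n = 4096
    val a = DenseMatrix.rand(n, n)
    val b = DenseMatrix.rand(n, n)
    // Each multiply is roughly 0.14 Tflop; watch nvidia-smi while this runs.
    (1 to 20).foreach { i =>
      val c = a * b
      println(s"iteration $i: c(0,0) = ${c(0, 0)}")
    }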
>
> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>
> Best regards, Alexander
>
>
>
> From: Sam Halliday [mailto:[hidden email]]
> Sent: Monday, March 09, 2015 6:01 PM
> To: Ulanov, Alexander
> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
>
> Thanks so much for following up on this!
>
> Hmm, I wonder if we should have a concerted effort to chart performance on
> various pieces of hardware...
> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]> wrote:
> [...]
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

fommil
If you write it up I'll add it to the netlib-java wiki :-)

BTW, does it automatically flip between CPU/GPU? I have a project called
MultiBLAS which was going to do this; it should be easy (but boring to
write).
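
For reference, the "flip" could be a simple size-based dispatch; a sketch of the idea (MultiBLAS is unreleased, so the trait, class, and threshold below are all assumed):

    trait Gemm {
      def dgemm(m: Int, n: Int, k: Int,
                a: Array[Double], b: Array[Double], c: Array[Double]): Unit
    }

    // Route to the GPU only when the arithmetic work (~2*m*n*k flops) is
    // large enough to amortize copying the three matrices over the PCIe bus.
    class MultiGemm(cpu: Gemm, gpu: Gemm, flopThreshold: Long = 1L << 30) extends Gemm {
      def dgemm(m: Int, n: Int, k: Int,
                a: Array[Double], b: Array[Double], c: Array[Double]): Unit = {
        val flops = 2L * m * n * k
        if (flops >= flopThreshold) gpu.dgemm(m, n, k, a, b, c)
        else cpu.dgemm(m, n, k, a, b, c)
      }
    }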
On 25 Mar 2015 22:00, "Evan R. Sparks" <[hidden email]> wrote:

> Alex - great stuff, and the nvblas numbers are pretty remarkable (almost
> too good... did you check the results for correctness? - also, is it
> possible that the "unified memory model" of nvblas is somehow hiding pci
> transfer time?)
>
> this last bit (getting nvblas + netlib-java to play together) sounds like
> it's non-trivial and took you a while to figure out! Would you mind posting
> a gist or something of maybe the shell scripts/exports you used to make
> this work - I can imagine it being highly useful for others in the future.
>
> Thanks!
> Evan
>
> On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <
> [hidden email]> wrote:
>
>> Hi again,
>>
>> I finally managed to use nvblas within Spark+netlib-java. It has
>> exceptional performance for big matrices with Double, faster than
>> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
>> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
>> original nvblas presentation on GPU conf 2013 (slide 21):
>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>
>> My results:
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Just in case, these tests are not for generalization of performance of
>> different libraries. I just want to pick a library that does at best dense
>> matrices multiplication for my task.
>>
>> P.S. My previous issue with nvblas was the following: it has Fortran blas
>> functions, at the same time netlib-java uses C cblas functions. So, one
>> needs cblas shared library to use nvblas through netlib-java. Fedora does
>> not have cblas (but Debian and Ubuntu have), so I needed to compile it. I
>> could not use cblas from Atlas or Openblas because they link to their
>> implementation and not to Fortran blas.
>>
>> Best regards, Alexander
>>
>> -----Original Message-----
>> From: Ulanov, Alexander
>> Sent: Tuesday, March 24, 2015 6:57 PM
>> To: Sam Halliday
>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi,
>>
>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>> should replace current blas functions calls after executing LD_PRELOAD as
>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
>> changes to netlib-java. It seems to work for simple Java example, but I
>> cannot make it work with Spark. I run the following:
>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
>> --driver-memory 4G In nvidia-smi I observe that Java is to use GPU:
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU
>> Memory |
>> |  GPU       PID  Type  Process name                               Usage
>>     |
>>
>> |=============================================================================|
>> |    0      8873    C   bash
>> 39MiB |
>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java
>> 39MiB |
>>
>> +-----------------------------------------------------------------------------+
>>
>> In Spark shell I do matrix multiplication and see the following:
>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>> So I am sure that netlib-native is loaded and cblas supposedly used.
>> However, matrix multiplication does executes on CPU since I see 16% of CPU
>> used and 0% of GPU used. I also checked different matrix sizes, from
>> 100x100 to 12000x12000
>>
>> Could you suggest might the LD_PRELOAD not affect Spark shell?
>>
>> Best regards, Alexander
>>
>>
>>
>> From: Sam Halliday [mailto:[hidden email]]
>> Sent: Monday, March 09, 2015 6:01 PM
>> To: Ulanov, Alexander
>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>>
>> Thanks so much for following up on this!
>>
>> Hmm, I wonder if we should have a concerted effort to chart performance
>> on various pieces of hardware...
>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]<mailto:
>> [hidden email]>> wrote:
>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the
>> comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the
>> support of Double in the current source code), did the test with BIDMat and
>> CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>>
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Best regards, Alexander
>>
>> -----Original Message-----
>> From: Sam Halliday [mailto:[hidden email]<mailto:
>> [hidden email]>]
>> Sent: Tuesday, March 03, 2015 1:54 PM
>> To: Xiangrui Meng; Joseph Bradley
>> Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]<mailto:
>> [hidden email]>
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>
>>
>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>
>> Would be nice to meet other people working on the guts of Spark! :-)
>>
>>
>> Xiangrui Meng <[hidden email]<mailto:[hidden email]>> writes:
>>
>> > Hey Alexander,
>> >
>> > I don't quite understand the part where netlib-cublas is about 20x
>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>> > with netlib-java?
>> >
>> > CC'ed Sam, the author of netlib-java.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]
>> <mailto:[hidden email]>> wrote:
>> >> Better documentation for linking would be very helpful!  Here's a JIRA:
>> >> https://issues.apache.org/jira/browse/SPARK-6019
>> >>
>> >>
>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> >> <[hidden email]<mailto:[hidden email]>>
>> >> wrote:
>> >>
>> >>> Thanks for compiling all the data and running these benchmarks,
>> >>> Alex. The big takeaways here can be seen with this chart:
>> >>>
>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZ
>> >>> Hl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>> >>>
>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> >>> BIDMat+GPU) can provide substantial (but less than an order of
>> >>> BIDMat+magnitude)
>> >>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>> >>> netlib-java+openblas-compiled).
>> >>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>> >>> worse than a well-tuned CPU implementation, particularly for larger
>> matrices.
>> >>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>> >>> basically agrees with the authors own benchmarks (
>> >>> https://github.com/fommil/netlib-java)
>> >>>
>> >>> I think that most of our users are in a situation where using GPUs
>> >>> may not be practical - although we could consider having a good GPU
>> >>> backend available as an option. However, *ALL* users of MLlib could
>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>> >>> BLAS implementation. Perhaps we should consider updating the mllib
>> >>> guide with a more complete section for enabling high performance
>> >>> binaries on OSX and Linux? Or better, figure out a way for the
>> >>> system to fetch these automatically.
>> >>>
>> >>> - Evan
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> >>> [hidden email]<mailto:[hidden email]>> wrote:
>> >>>
>> >>>> Just to summarize this thread, I was finally able to make all
>> >>>> performance comparisons that we discussed. It turns out that:
>> >>>> BIDMat-cublas>>BIDMat
>> >>>> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo=
>> >>>> =netlib-cublas>netlib-blas>f2jblas
>> >>>>
>> >>>> Below is the link to the spreadsheet with full results.
>> >>>>
>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx
>> >>>> 378T9J5r7kwKSPkY/edit?usp=sharing
>> >>>>
>> >>>> One thing still needs exploration: does BIDMat-cublas perform
>> >>>> copying to/from machine’s RAM?
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Ulanov, Alexander
>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >>>> To: Evan R. Sparks
>> >>>> Cc: Joseph Bradley;
>> >>>> [hidden email]<mailto:[hidden email]>
>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Thanks, Evan! It seems that ticket was marked as duplicate though
>> >>>> the original one discusses slightly different topic. I was able to
>> >>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is
>> >>>> statically linked inside a 60MB library.
>> >>>>
>> >>>> |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
>> >>>> Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
>> >>>>
>> +-----------------------------------------------------------------------+
>> >>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
>> >>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
>> >>>> |1,638475459 |
>> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 |445,0935211 |
>> >>>> 1569,233228 |
>> >>>>
>> >>>> It turn out that pre-compiled MKL is faster than precompiled
>> >>>> OpenBlas on my machine. Probably, I’ll add two more columns with
>> >>>> locally compiled openblas and cuda.
>> >>>>
>> >>>> Alexander
>> >>>>
>> >>>> From: Evan R. Sparks
>> >>>> [mailto:[hidden email]<mailto:[hidden email]>]
>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley;
>> >>>> [hidden email]<mailto:[hidden email]>
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Great - perhaps we can move this discussion off-list and onto a
>> >>>> JIRA ticket? (Here's one:
>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>> >>>>
>> >>>> It seems like this is going to be somewhat exploratory for a while
>> >>>> (and there's probably only a handful of us who really care about
>> >>>> fast linear
>> >>>> algebra!)
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>> >>>> [hidden email]<mailto:[hidden email]><mailto:
>> [hidden email]<mailto:[hidden email]>>> wrote:
>> >>>> Hi Evan,
>> >>>>
>> >>>> Thank you for explanation and useful link. I am going to build
>> >>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
>> >>>>
>> >>>> Do I understand correctly that BIDMat binaries contain statically
>> >>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>> >>>> BIDMat not having MKL BLAS installed on my server. If it is true, I
>> >>>> wonder if it is OK because Intel sells this library. Nevertheless,
>> >>>> it seems that in my case precompiled MKL BLAS performs better than
>> >>>> precompiled OpenBLAS given that BIDMat and Netlib-java are supposed
>> to be on par with JNI overheads.
>> >>>>
>> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>> >>>> as you suggested. I wonder, are John Canny (BIDMat) and Sam
>> >>>> Halliday
>> >>>> (Netlib-java) interested to compare their libraries.
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
>> [hidden email]><mailto:
>> >>>> [hidden email]<mailto:[hidden email]>>]
>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>> >>>>
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley;
>> >>>> [hidden email]<mailto:[hidden email]><mailto:dev@spark.
>> >>>> apache.org<mailto:[hidden email]>>
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>> >>>> from getting cache sizes, etc. set up correctly for your particular
>> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>> >>>> quickly and yields performance competitive with MKL.
>> >>>>
>> >>>> To make sure the right library is getting used, you have to make
>> >>>> sure it's first on the search path - export
>> >>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>> >>>>
>> >>>> For some examples of getting netlib-java setup on an ec2 node and
>> >>>> some example benchmarking code we ran a while back, see:
>> >>>> https://github.com/shivaram/matrix-bench
>> >>>>
>> >>>> In particular - build-openblas-ec2.sh shows you how to build the
>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
>> >>>> shows you how to get the path setup and get that library picked up
>> by netlib-java.
>> >>>>
>> >>>> In this way - you could probably get cuBLAS set up to be used by
>> >>>> netlib-java as well.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>> >>>> [hidden email]<mailto:[hidden email]><mailto:
>> [hidden email]<mailto:[hidden email]>>> wrote:
>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>> >>>> force loading the right blas? For netlib, I there are few JVM
>> >>>> flags, such as
>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>> >>>> so I can force it to use Java implementation. Not sure I understand
>> how to force use a specific blas (not specific wrapper for blas).
>> >>>>
>> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
>> >>>> that netlib is using it.
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
>> [hidden email]><mailto:
>> >>>> [hidden email]<mailto:[hidden email]>>]
>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley;
>> >>>> [hidden email]<mailto:[hidden email]><mailto:dev@spark.
>> >>>> apache.org<mailto:[hidden email]>>
>> >>>>
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Getting breeze to pick up the right blas library is critical for
>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already have
>> it).
>> >>>> It might make sense to force BIDMat to use the same underlying BLAS
>> >>>> library as well.
>> >>>>
>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>> >>>> [hidden email]<mailto:[hidden email]><mailto:
>> [hidden email]<mailto:[hidden email]>>> wrote:
>> >>>> Hi Evan, Joseph
>> >>>>
>> >>>> I did few matrix multiplication test and BIDMat seems to be ~10x
>> >>>> faster than netlib-java+breeze (sorry for weird table formatting):
>> >>>>
>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>> >>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>> >>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>> >>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>> >>>>
>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>> >>>> 19 Linux, Scala 2.11.
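>> >>>>
>> >>>> For reference, a minimal version of such a timing run looks like
>> >>>> this (a sketch, not the exact harness behind the numbers above; a
>> >>>> real benchmark should warm up the JIT and average several runs):
>> >>>>
>> >>>> import breeze.linalg.DenseMatrix
>> >>>>
>> >>>> // Time one n x n GEMM through Breeze, which dispatches to whatever
>> >>>> // BLAS netlib-java has loaded. Note the 10000x10000 case needs
>> >>>> // several GB of heap.
>> >>>> def timeGemm(n: Int): Double = {
>> >>>>   val a = DenseMatrix.rand(n, n)
>> >>>>   val b = DenseMatrix.rand(n, n)
>> >>>>   val t0 = System.nanoTime
>> >>>>   val c = a * b
>> >>>>   (System.nanoTime - t0) / 1e9
>> >>>> }
>> >>>>
>> >>>> Seq(100, 1000, 10000).foreach(n => println(s"$n: ${timeGemm(n)} s"))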
>> >>>>
>> >>>> Later I will run the tests with CUDA. I need to install a new CUDA
>> >>>> version for this purpose.
>> >>>>
>> >>>> Do you have any ideas why breeze-netlib with native blas is so much
>> >>>> slower than BIDMat MKL?
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Joseph Bradley [mailto:[hidden email]]
>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Evan R. Sparks; [hidden email]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Hi Alexander,
>> >>>>
>> >>>> Using GPUs with Spark would be very exciting.  Small comment:
>> >>>> Concerning your question earlier about keeping data stored on the
>> >>>> GPU rather than having to move it between main memory and GPU
>> >>>> memory on each iteration, I would guess this would be critical to
>> >>>> getting good performance.  If you could do multiple local
>> >>>> iterations before aggregating results, then the cost of data
>> >>>> movement to the GPU could be amortized (and I believe that is done
>> >>>> in practice).  Having Spark be aware of the GPU and using it as
>> another part of memory sounds like a much bigger undertaking.
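>> >>>>
>> >>>> In pseudo-Spark terms the amortized pattern would look something
>> >>>> like the sketch below (localStep, wBroadcast, blocks and numBlocks
>> >>>> are illustrative stand-ins, not MLlib API):
>> >>>>
>> >>>> import breeze.linalg.{DenseMatrix, DenseVector}
>> >>>>
>> >>>> // A toy local update; in a GPU setting this is the part that runs
>> >>>> // entirely on the device, on data that was copied over once.
>> >>>> def localStep(w: DenseVector[Double], x: DenseMatrix[Double]): DenseVector[Double] =
>> >>>>   w - (x.t * (x * w)) * 1e-4
>> >>>>
>> >>>> val k = 10  // local iterations per communication round
>> >>>> val wNext = blocks.mapPartitions { iter =>  // blocks: RDD[DenseMatrix[Double]]
>> >>>>   iter.map { x =>
>> >>>>     var w = wBroadcast.value                // one copy in per round
>> >>>>     for (_ <- 1 to k) w = localStep(w, x)   // k steps amortize the transfer
>> >>>>     w                                       // one copy out per round
>> >>>>   }
>> >>>> }.reduce(_ + _) * (1.0 / numBlocks)         // average the local results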
>> >>>>
>> >>>> Joseph
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
>> >>>> Thank you for the explanation! I’ve watched the BIDMach presentation
>> >>>> by John Canny and I am really inspired by his talk and comparisons
>> >>>> with Spark MLlib.
>> >>>>
>> >>>> I am very interested to find out what will be better within Spark:
>> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
>> >>>> neural networks in batch mode. While it is not a “pure” test of
>> >>>> linear algebra, it involves some other things that are essential to
>> machine learning.
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: [hidden email]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>> >>>> data layout and fewer levels of indirection - it's definitely a
>> >>>> worthwhile experiment to run. The main speedups I've seen from
>> >>>> using it come from highly optimized GPU code for linear algebra. I
>> >>>> know that in the past Canny has gone as far as to write custom GPU
>> >>>> kernels for performance-critical regions of code.[1]
>> >>>>
>> >>>> BIDMach is highly optimized for single node performance or
>> >>>> performance on small clusters.[2] Once data doesn't fit easily in
>> >>>> GPU memory (or can be batched in that way) the performance tends to
>> >>>> fall off. Canny argues for hardware/software codesign and as such
>> >>>> prefers machine configurations that are quite different from what
>> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels
>> >>>> and 4 GPUs.
>> >>>>
>> >>>> In contrast, MLlib was designed for horizontal scalability on
>> >>>> commodity clusters and works best on very big datasets - order of
>> terabytes.
>> >>>>
>> >>>> For the most part, these projects developed concurrently to address
>> >>>> slightly different use cases. That said, there may be bits of
>> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>> >>>> careful about maintaining cross-language compatibility for our Java
>> >>>> and Python users, though.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> [1] - http://arxiv.org/abs/1409.5402 [2] -
>> >>>> http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
>> >>>> Hi Evan,
>> >>>>
>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>> >>>> you know what makes it faster than netlib-java?
>> >>>>
>> >>>> The same group has the BIDMach library that implements machine
>> >>>> learning. For some examples they use the Caffe convolutional neural
>> >>>> network library developed by another group at Berkeley. Could you
>> >>>> elaborate on how these all might be connected with Spark MLlib? If
>> >>>> you take BIDMat for linear algebra, why don’t you take BIDMach for
>> >>>> optimization and learning?
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: [hidden email]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>> >>>> blas in many cases.
>> >>>>
>> >>>> You might consider taking a look at the codepaths that BIDMat (
>> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>> >>>> optimizing to make this work really fast from Scala. I've run it on
>> >>>> my laptop and compared to MKL and in certain cases it's 10x faster
>> at matrix multiply.
>> >>>> There are a lot of layers of indirection here and you really want
>> >>>> to avoid data copying as much as possible.
>> >>>>
>> >>>> We could also consider swapping out Breeze for BIDMat, but that
>> >>>> would be a big project, and if we can figure out how to get
>> >>>> breeze+cublas to comparable performance that would be a big win.
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
>> >>>> Dear Spark developers,
>> >>>>
>> >>>> I am exploring how to make linear algebra operations faster within
>> Spark.
>> >>>> One way of doing this is to use Scala Breeze library that is
>> >>>> bundled with Spark. For matrix operations, it employs Netlib-java
>> >>>> that has a Java wrapper for BLAS (basic linear algebra subprograms)
>> >>>> and LAPACK native binaries if they are available on the worker
>> >>>> node. It also has its own optimized Java implementation of BLAS. It
>> >>>> is worth mentioning that native binaries provide better performance
>> >>>> only for BLAS level 3, i.e.
>> >>>> matrix-matrix operations or general matrix multiplication (GEMM).
>> >>>> This is confirmed by GEMM test on Netlib-java page
>> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>> >>>> experiments with training of artificial neural network
>> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> >>>> However, I would like to boost performance more.
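>> >>>>
>> >>>> To isolate the GEMM itself from everything else, one can also call
>> >>>> it directly through netlib-java (a sketch; the size is arbitrary
>> >>>> and the matrices are column-major double[] arrays):
>> >>>>
>> >>>> import com.github.fommil.netlib.BLAS
>> >>>>
>> >>>> // C := 1.0*A*B + 0.0*C for n x n matrices - exactly the level-3
>> >>>> // call for which native binaries pay off.
>> >>>> val n = 2000
>> >>>> val a = Array.fill(n * n)(math.random)
>> >>>> val b = Array.fill(n * n)(math.random)
>> >>>> val c = new Array[Double](n * n)
>> >>>> val t0 = System.nanoTime
>> >>>> BLAS.getInstance.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
>> >>>> println(s"dgemm: ${(System.nanoTime - t0) / 1e9} s")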
>> >>>>
>> >>>> GPU is supposed to work fast with linear algebra and there is an
>> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
>> >>>> server with Nvidia GPU and I was able to do the following. I linked
>> >>>> cublas (instead of cpu-based blas) with Netlib-java wrapper and put
>> >>>> it into Spark, so Breeze/Netlib is using it. Then I did some
>> >>>> performance measurements with regards to artificial neural network
>> >>>> batch learning in Spark MLlib that involves matrix-matrix
>> >>>> multiplications. It turns out that for matrices of size less than
>> >>>> ~1000x780, GPU cublas has the same speed as CPU blas. Cublas becomes
>> >>>> slower for bigger matrices. It is worth mentioning that it was not a
>> >>>> test of ONLY multiplication, since there are other operations involved.
>> >>>> One of the reasons for slowdown might be the overhead of copying
>> >>>> the matrices from computer memory to graphic card memory and back.
>> >>>>
>> >>>> So, a few questions:
>> >>>> 1) Do these results with CUDA make sense?
>> >>>> 2) If the problem is the copy overhead, are there any libraries
>> >>>> that allow forcing intermediate results to stay in graphic card
>> >>>> memory, thus removing the overhead?
>> >>>> 3) Any other options to speed up linear algebra in Spark?
>> >>>>
>> >>>> Thank you, Alexander
>> >>>>
>> --
>> Best regards,
>> Sam
>>
>
>

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
In reply to this post by Dmitriy Lyubimov
Netlib knows nothing about the GPU (or CPU); it just uses cblas symbols from the provided libblas.so.3 library at runtime. So you can switch at runtime by providing another library. Sam, please suggest if there is another way.

From: Dmitriy Lyubimov [mailto:[hidden email]]
Sent: Wednesday, March 25, 2015 2:55 PM
To: Ulanov, Alexander
Cc: Sam Halliday; [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra

Alexander,

Does using netlib imply that one cannot switch between CPU and GPU blas alternatives at will at the same time? The choice is always determined by linking alternatives to libblas.so, right?

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <[hidden email]> wrote:
Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with the original nvblas presentation at GPU conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant to generalize the performance of the different libraries. I just want to pick the library that performs dense matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it exposes Fortran blas functions, while netlib-java uses C cblas functions. So one needs a cblas shared library to use nvblas through netlib-java. Fedora does not have cblas (but Debian and Ubuntu have it), so I needed to compile it. I could not use cblas from Atlas or Openblas because they link to their own implementations and not to Fortran blas.

Best regards, Alexander

-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: [hidden email]<mailto:[hidden email]>; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. nvblas functions should replace the current blas function calls after executing LD_PRELOAD as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark. I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
In nvidia-smi I observe that Java is set to use the GPU:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      8873    C   bash                                            39MiB |
|    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
+-----------------------------------------------------------------------------+

In Spark shell I do matrix multiplication and see the following:
15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
So I am sure that netlib-native is loaded and cblas is supposedly used. However, matrix multiplication executes on the CPU, since I see 16% of the CPU used and 0% of the GPU used. I also checked different matrix sizes, from 100x100 to 12000x12000.

Could you suggest why the LD_PRELOAD might not affect the Spark shell?
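
A quick sanity check from inside the shell (a minimal sketch) is to confirm that the variable at least reached the JVM's environment:

// Should print /usr/local/cuda-6.5/lib64/libnvblas.so if the preload
// propagated from the launcher script to the actual java process.
println(System.getenv("LD_PRELOAD"))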

Best regards, Alexander



From: Sam Halliday [mailto:[hidden email]]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: [hidden email]<mailto:[hidden email]>; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]> wrote:
Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code), and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-----Original Message-----
From: Sam Halliday [mailto:[hidden email]]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <[hidden email]> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]> wrote:
>> Better documentation for linking would be very helpful!  Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK-6019
>>
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <[hidden email]> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks,
>>> Alex. The big takeaways here can be seen with this chart:
>>>
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZ
>>> Hl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> BIDMat+GPU) can provide a substantial (but less than an order of
>>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>>> BIDMat+MKL or netlib-java+openblas-compiled).
>>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>>> worse than a well-tuned CPU implementation, particularly for larger matrices.
>>> (netlib-f2jblas or netlib-ref.) This is not to pick on netlib - this
>>> basically agrees with the author's own benchmarks (
>>> https://github.com/fommil/netlib-java)
>>>
>>> I think that most of our users are in a situation where using GPUs
>>> may not be practical - although we could consider having a good GPU
>>> backend available as an option. However, *ALL* users of MLlib could
>>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>> BLAS implementation. Perhaps we should consider updating the mllib
>>> guide with a more complete section for enabling high performance
>>> binaries on OSX and Linux? Or better, figure out a way for the
>>> system to fetch these automatically.
>>>
>>> - Evan
>>>
>>>
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>
>>>> Just to summarize this thread, I was finally able to make all
>>>> performance comparisons that we discussed. It turns out that:
>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>>
>>>> Below is the link to the spreadsheet with full results.
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx
>>>> 378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> One thing still needs exploration: does BIDMat-cublas perform
>>>> copying to/from machine’s RAM?
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> To: Evan R. Sparks
>>>> Cc: Joseph Bradley; [hidden email]
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks, Evan! It seems that the ticket was marked as a duplicate,
>>>> though the original one discusses a slightly different topic. I was
>>>> able to link netlib with the MKL from the BIDMat binaries. Indeed,
>>>> MKL is statically linked inside a 60MB library.
>>>>
>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>>>
>>>> It turns out that pre-compiled MKL is faster than precompiled
>>>> OpenBlas on my machine. Probably I’ll add two more columns with
>>>> locally compiled openblas and cuda.
>>>>
>>>> Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Great - perhaps we can move this discussion off-list and onto a
>>>> JIRA ticket? (Here's one:
>>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>>>
>>>> It seems like this is going to be somewhat exploratory for a while
>>>> (and there's probably only a handful of us who really care about
>>>> fast linear
>>>> algebra!)
>>>>
>>>> - Evan
>>>>
>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for the explanation and the useful link. I am going to build
>>>> OpenBLAS, link it with Netlib-java and perform the benchmark again.
>>>>
>>>> Do I understand correctly that BIDMat binaries contain statically
>>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>>>> BIDMat not having MKL BLAS installed on my server. If it is true, I
>>>> wonder if it is OK because Intel sells this library. Nevertheless,
>>>> it seems that in my case precompiled MKL BLAS performs better than
>>>> precompiled OpenBLAS given that BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>>>>
>>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
>>>> Halliday (Netlib-java) would be interested in comparing their libraries.
>>>>
>>>> Best regards, Alexander


RE: Using CUDA within Spark / boosting linear algebra

fommil
Yeah, MultiBLAS... it is dynamic.

Except, I haven't written it yet :-P
On 25 Mar 2015 22:06, "Ulanov, Alexander" <[hidden email]> wrote:

> Netlib knows nothing about the GPU (or CPU); it just uses cblas symbols
> from the provided libblas.so.3 library at runtime. So you can switch at
> runtime by providing another library. Sam, please suggest if there is
> another way.
> >>>> multiplications. It turns out that for matrices of size less than
> >>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
> >>>> slower for bigger matrices. It worth mentioning that it is was not a
> test for ONLY multiplication since there are other operations involved.
> >>>> One of the reasons for slowdown might be the overhead of copying
> >>>> the matrices from computer memory to graphic card memory and back.
> >>>>
> >>>> So, few questions:
> >>>> 1) Do these results with CUDA make sense?
> >>>> 2) If the problem is with copy overhead, are there any libraries
> >>>> that allow to force intermediate results to stay in graphic card
> >>>> memory thus removing the overhead?
> >>>> 3) Any other options to speed-up linear algebra in Spark?
> >>>>
> >>>> Thank you, Alexander
> >>>>
> >>>> -------------------------------------------------------------------
> >>>> -- To unsubscribe, e-mail: [hidden email]<mailto:
> [hidden email]><mailto:
> >>>> [hidden email]<mailto:[hidden email]
> >>>> e.org>><mailto:[hidden email]<mailto:dev-unsubscribe@sp
> >>>> ark.apac> he.org<http://he.org>
> >>>> <mailto:[hidden email]<mailto:dev-unsubscribe@spa
> >>>> rk.apache.org>>> For additional commands, e-mail:
> >>>> [hidden email]<mailto:[hidden email]><mailto:
> >>>> [hidden email]<mailto:[hidden email]>><mailto:
> [hidden email]<mailto:[hidden email]><mailto:
> >>>> [hidden email]<mailto:[hidden email]>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
>
> --
> Best regards,
> Sam
>
>
>

Re: Using CUDA within Spark / boosting linear algebra

jfcanny
In reply to this post by Ulanov, Alexander
Alex,
I think you should recheck your numbers. Both BIDMat and nvblas are wrappers for cublas. The speeds are identical, except on machines that have multiple GPUs, which nvblas exploits and cublas doesn't.

It would be a good idea to add a column with Gflop throughput. Your numbers for the BIDMat 10k x 10k multiply give about 300 single-precision Gflops, which seems about right for a Quadro 4000 (current-generation devices are more than 10x faster than a 4000).

Your numbers for netlib-nvblas would indicate a double-precision throughput of 8 Tflops, which is physically impossible on that device.

It shouldn't matter which interface you use if you have a single GPU.
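
As a sanity check, throughput is easy to derive from the timings already in the spreadsheet: a dense n x n by n x n multiply costs about 2*n^3 floating-point operations. A minimal Scala sketch, plugging in the 10k x 10k BIDMat MKL timing from the table earlier in this thread (no new measurements here):

    object GflopsCheck {
      def main(args: Array[String]): Unit = {
        val n = 10000L               // 10k x 10k times 10k x 10k GEMM
        val seconds = 23.78          // BIDMat MKL timing reported earlier in the thread
        val flops = 2.0 * n * n * n  // ~2n^3 flops for a dense GEMM
        val gflops = flops / seconds / 1e9
        println(f"Achieved throughput: $gflops%.1f Gflops")  // ~84 Gflops
      }
    }

Any column whose implied throughput exceeds the device's peak (single- or double-precision as appropriate) points to a benchmarking artifact, for example an asynchronous call being timed before the result has actually been computed.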

-John

On 3/25/2015 2:34 PM, Ulanov, Alexander [via Apache Spark Developers List] wrote:
Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with the original nvblas presentation at GPU conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant to generalize the performance of the different libraries. I just want to pick the library that does dense matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS functions, while netlib-java uses C CBLAS functions. So one needs a CBLAS shared library to use nvblas through netlib-java. Fedora does not ship CBLAS (Debian and Ubuntu do), so I needed to compile it. I could not use the CBLAS from ATLAS or OpenBLAS because those link to their own implementation and not to the Fortran BLAS.

Best regards, Alexander
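
For anyone wanting to reproduce comparisons like the spreadsheet above, a minimal Breeze timing harness along these lines should do (a sketch only: the sizes mirror the earlier tables, and which BLAS actually runs is whatever libblas.so.3 / LD_PRELOAD resolves to, so the same code exercises OpenBLAS, MKL, or nvblas):

    import breeze.linalg.DenseMatrix

    object GemmBench {
      def main(args: Array[String]): Unit = {
        for (n <- Seq(100, 1000, 10000)) {
          val a = DenseMatrix.rand[Double](n, n)
          val b = DenseMatrix.rand[Double](n, n)
          a * b                                 // warm-up: loads the JNI natives
          val t0 = System.nanoTime()
          val c = a * b                         // dispatched to dgemm via netlib-java
          val secs = (System.nanoTime() - t0) / 1e9
          val gflops = 2.0 * n * n * n / secs / 1e9
          // print an element of c so the multiply cannot be optimized away
          println(f"$n%5d: $secs%9.4f s, $gflops%8.1f Gflops (c00=${c(0, 0)}%.3f)")
        }
      }
    }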


Re: Using CUDA within Spark / boosting linear algebra

Dmitriy Lyubimov
In reply to this post by fommil
Sam,

Would it be easier to hack netlib-java to allow multiple (configurable) library contexts, and so enable third-party configurations and optimizers to make their own choices until then?
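
For concreteness, here is the gap this question points at. Today netlib-java binds a single implementation per JVM, chosen at class-load time via a system property (the -Dcom.github.fommil.netlib.BLAS flag mentioned earlier in the thread); a "multiple contexts" variant would replace that singleton with a registry. The trait and registry below are a hypothetical sketch, not the existing netlib-java API:

    import scala.collection.mutable

    // Hypothetical sketch only -- NOT the existing netlib-java API.
    trait BlasContext {
      // Only dgemm is shown; a real context would expose the whole BLAS surface.
      def dgemm(transa: String, transb: String, m: Int, n: Int, k: Int,
                alpha: Double, a: Array[Double], lda: Int,
                b: Array[Double], ldb: Int,
                beta: Double, c: Array[Double], ldc: Int): Unit
    }

    object BlasRegistry {
      private val contexts = mutable.Map.empty[String, BlasContext]
      def register(name: String, ctx: BlasContext): Unit = contexts(name) = ctx
      // A third-party optimizer could then pick a backend per call site, e.g.
      // BlasRegistry("nvblas") for big GEMMs, BlasRegistry("openblas") otherwise.
      def apply(name: String): BlasContext = contexts(name)
    }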

On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday <[hidden email]> wrote:

> Yeah, MultiBLAS... it is dynamic.
>
> Except, I haven't written it yet :-P

Re: Using CUDA within Spark / boosting linear algebra

fommil
That would be a difficult task, and it would only benefit users of netlib-java. MultiBLAS is easily implemented (although it's a lot of boilerplate) and benefits all BLAS users on the system.

If anyone knows of a funding route for it, I'd love to hear from them, because it's too much work for me to take on at the moment as a hobby.
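
For contrast with the per-JVM registry idea above, MultiBLAS would live below the JVM as a libblas replacement, which is why it would help every BLAS user on the system. Below is a toy Scala illustration of the size-based dispatch idea only -- not Sam's actual design, and the real thing would be native code exporting the Fortran BLAS symbols:

    object DispatchingGemm {
      // Pick a backend for an m x k times k x n multiply. GPU transfer is
      // O(n^2) while compute is O(n^3), so shipping data to the GPU only
      // pays off for large matrices (cf. Alexander's benchmarks above).
      def pickBackend(m: Int, n: Int, k: Int): String = {
        val flops = 2.0 * m * n * k
        if (flops > 2e11) "nvblas" else "openblas"  // threshold is illustrative
      }
    }

For example, pickBackend(10000, 10000, 10000) returns "nvblas", while pickBackend(100, 100, 100) returns "openblas".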
On 25 Mar 2015 22:16, "Dmitriy Lyubimov" <[hidden email]> wrote:

> Sam,
>
> Would it be easier to hack netlib-java to allow multiple (configurable)
> library contexts, and so enable third-party configurations and optimizers
> to make their own choices until then?
>
> On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday <[hidden email]>
> wrote:
>
>> Yeah, MultiBLAS... it is dynamic.
>>
>> Except, I haven't written it yet :-P
>> On 25 Mar 2015 22:06, "Ulanov, Alexander" <[hidden email]>
>> wrote:
>>
>>>  Netlib knows nothing about GPU (or CPU), it just uses cblas symbols
>>> from the provided libblas.so.3 library at the runtime. So, you can switch
>>> at the runtime by providing another library. Sam, please suggest if there
>>> is another way.
>>>
>>>
>>>
>>> *From:* Dmitriy Lyubimov [mailto:[hidden email]]
>>> *Sent:* Wednesday, March 25, 2015 2:55 PM
>>> *To:* Ulanov, Alexander
>>> *Cc:* Sam Halliday; [hidden email]; Xiangrui Meng; Joseph
>>> Bradley; Evan R. Sparks; jfcanny
>>> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>>>
>>>
>>>
>>> Alexander,
>>>
>>>
>>>
>>> does using netlib imply that one cannot switch between CPU and GPU blas
>>> alternatives at will at the same time? the choice is always determined by
>>> linking aliternatives to libblas.so, right?
>>>
>>>
>>>
>>> On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <
>>> [hidden email]> wrote:
>>>
>>> Hi again,
>>>
>>> I finally managed to use nvblas within Spark+netlib-java. It has
>>> exceptional performance for big matrices with Double, faster than
>>> BIDMat-cuda with Float. But for smaller matrices, if you copy them
>>> to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with
>>> the original nvblas presentation at GPU conf 2013 (slide 21):
>>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>>
>>> My results:
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> Just in case: these tests are not meant to generalize the performance of
>>> different libraries. I just want to pick the library that performs dense
>>> matrix multiplication best for my task.
>>>
>>> P.S. My previous issue with nvblas was the following: it exposes Fortran
>>> blas functions, while netlib-java uses C cblas functions. So,
>>> one needs a cblas shared library to use nvblas through netlib-java. Fedora
>>> does not have cblas (but Debian and Ubuntu do), so I needed to compile
>>> it. I could not use cblas from Atlas or Openblas because they link to their
>>> own implementation and not to Fortran blas.
>>>
>>> Best regards, Alexander
>>>
>>> -----Original Message-----
>>> From: Ulanov, Alexander
>>>
>>> Sent: Tuesday, March 24, 2015 6:57 PM
>>> To: Sam Halliday
>>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Hi,
>>>
>>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>>> should replace the current blas function calls when LD_PRELOAD is set, as
>>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any
>>> changes to netlib-java. It seems to work for a simple Java example, but I
>>> cannot make it work with Spark. I run the following:
>>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>>> In nvidia-smi I observe that Java is set to use the GPU:
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                       GPU Memory |
>>> |  GPU       PID  Type  Process name                               Usage      |
>>> |=============================================================================|
>>> |    0      8873    C   bash                                          39MiB   |
>>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java              39MiB   |
>>> +-----------------------------------------------------------------------------+
>>>
>>> In Spark shell I do matrix multiplication and see the following:
>>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>>> So I am sure that netlib-native is loaded and cblas is supposedly used.
>>> However, matrix multiplication executes on the CPU, since I see 16% of CPU
>>> used and 0% of GPU used. I also checked different matrix sizes, from
>>> 100x100 to 12000x12000.
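
For reference, a multiplication of that kind can be driven from
spark-shell with a few lines of Breeze (a minimal sketch; the size is
arbitrary, and Breeze is assumed on the classpath, as it is in Spark):

// Drive one large double-precision GEMM through Breeze/netlib while
// watching nvidia-smi; if nvblas were intercepting the call, GPU
// utilisation would rise above 0%.
import breeze.linalg.DenseMatrix
val n = 8192
val a = DenseMatrix.rand(n, n)   // Double entries by default
val b = DenseMatrix.rand(n, n)
val t0 = System.nanoTime
val c = a * b                    // dispatches to the loaded BLAS dgemm
println(s"gemm took ${(System.nanoTime - t0) / 1e9} s; c(0,0) = ${c(0, 0)}")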
>>>
>>> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>>>
>>> Best regards, Alexander
>>>
>>>
>>>
>>> From: Sam Halliday [mailto:[hidden email]]
>>> Sent: Monday, March 09, 2015 6:01 PM
>>> To: Ulanov, Alexander
>>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>>
>>> Thanks so much for following up on this!
>>>
>>> Hmm, I wonder if we should have a concerted effort to chart performance
>>> on various pieces of hardware...
>>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]> wrote:
>>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the
>>> comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see
>>> support for Double in the current source code), and did the test with
>>> BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>>>
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> Best regards, Alexander
>>>
>>> -----Original Message-----
>>> From: Sam Halliday [mailto:[hidden email]]
>>> Sent: Tuesday, March 03, 2015 1:54 PM
>>> To: Xiangrui Meng; Joseph Bradley
>>> Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>>
>>>
>>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>>
>>> Would be nice to meet other people working on the guts of Spark! :-)
>>>
>>>
>>> Xiangrui Meng <[hidden email]> writes:
>>>
>>> > Hey Alexander,
>>> >
>>> > I don't quite understand the part where netlib-cublas is about 20x
>>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>>> > with netlib-java?
>>> >
>>> > CC'ed Sam, the author of netlib-java.
>>> >
>>> > Best,
>>> > Xiangrui
>>> >
>>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]> wrote:
>>> >> Better documentation for linking would be very helpful!  Here's a
>>> JIRA:
>>> >> https://issues.apache.org/jira/browse/SPARK-6019
>>> >>
>>> >>
>>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <[hidden email]> wrote:
>>> >>
>>> >>> Thanks for compiling all the data and running these benchmarks,
>>> >>> Alex. The big takeaways here can be seen with this chart:
>>> >>>
>>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>> >>>
>>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> >>> BIDMat+GPU) can provide substantial (but less than an order of
>>> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>>> >>> BIDMat+MKL or netlib-java+openblas-compiled).
>>> >>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>>> >>> worse than a well-tuned CPU implementation, particularly for larger
>>> matrices.
>>> >>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>>> >>> basically agrees with the author's own benchmarks (
>>> >>> https://github.com/fommil/netlib-java)
>>> >>>
>>> >>> I think that most of our users are in a situation where using GPUs
>>> >>> may not be practical - although we could consider having a good GPU
>>> >>> backend available as an option. However, *ALL* users of MLlib could
>>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>> >>> BLAS implementation. Perhaps we should consider updating the mllib
>>> >>> guide with a more complete section for enabling high performance
>>> >>> binaries on OSX and Linux? Or better, figure out a way for the
>>> >>> system to fetch these automatically.
>>> >>>
>>> >>> - Evan
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>
>>> >>>> Just to summarize this thread, I was finally able to make all
>>> >>>> performance comparisons that we discussed. It turns out that:
>>> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>> >>>>
>>> >>>> Below is the link to the spreadsheet with full results.
>>> >>>>
>>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>> >>>>
>>> >>>> One thing still needs exploration: does BIDMat-cublas perform
>>> >>>> copying to/from machine’s RAM?
>>> >>>>
>>> >>>> -----Original Message-----
>>> >>>> From: Ulanov, Alexander
>>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>> >>>> To: Evan R. Sparks
>>> >>>> Cc: Joseph Bradley; [hidden email]
>>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
>>> >>>> the original one discusses a slightly different topic. I was able to
>>> >>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is
>>> >>>> statically linked inside a 60MB library.
>>> >>>>
>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>> >>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>> >>>> |100x100*100x100         | 0.00205596  | 0.000381                      | 0.03810324                             | 0.002556              |
>>> >>>> |1000x1000*1000x1000     | 0.018320947 | 0.038316857                   | 0.51803557                             | 1.638475459           |
>>> >>>> |10000x10000*10000x10000 | 23.78046632 | 32.94546697                   | 445.0935211                            | 1569.233228           |
>>> >>>>
>>> >>>> It turns out that pre-compiled MKL is faster than pre-compiled
>>> >>>> OpenBlas on my machine. Probably I’ll add two more columns with
>>> >>>> locally compiled openblas and cuda.
>>> >>>>
>>> >>>> Alexander
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Joseph Bradley; [hidden email]
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Great - perhaps we can move this discussion off-list and onto a
>>> >>>> JIRA ticket? (Here's one:
>>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>> >>>>
>>> >>>> It seems like this is going to be somewhat exploratory for a while
>>> >>>> (and there's probably only a handful of us who really care about
>>> >>>> fast linear
>>> >>>> algebra!)
>>> >>>>
>>> >>>> - Evan
>>> >>>>
>>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>> Hi Evan,
>>> >>>>
>>> >>>> Thank you for the explanation and useful link. I am going to build
>>> >>>> OpenBLAS, link it with Netlib-java, and perform the benchmark again.
>>> >>>>
>>> >>>> Do I understand correctly that BIDMat binaries contain statically
>>> >>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>>> >>>> BIDMat without having MKL BLAS installed on my server. If it is true, I
>>> >>>> wonder if it is OK, because Intel sells this library. Nevertheless,
>>> >>>> it seems that in my case precompiled MKL BLAS performs better than
>>> >>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed
>>> to be on par with JNI overheads.
>>> >>>>
>>> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>>> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
>>> >>>> Halliday
>>> >>>> (Netlib-java) would be interested in comparing their libraries.
>>> >>>>
>>> >>>> Best regards, Alexander
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>>> >>>>
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Joseph Bradley; [hidden email]
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>> >>>> from getting cache sizes, etc. set up correctly for your particular
>>> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>> >>>> quickly and yields performance competitive with MKL.
>>> >>>>
>>> >>>> To make sure the right library is getting used, you have to make
>>> >>>> sure it's first on the search path - export
>>> >>>> LD_LIBRARY_PATH=/path/to/blas/lib/dir will do the trick here.
>>> >>>>
>>> >>>> For some examples of getting netlib-java setup on an ec2 node and
>>> >>>> some example benchmarking code we ran a while back, see:
>>> >>>> https://github.com/shivaram/matrix-bench
>>> >>>>
>>> >>>> In particular - build-openblas-ec2.sh shows you how to build the
>>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
>>> >>>> shows you how to get the path setup and get that library picked up
>>> by netlib-java.
>>> >>>>
>>> >>>> In this way - you could probably get cuBLAS set up to be used by
>>> >>>> netlib-java as well.
>>> >>>>
>>> >>>> - Evan
>>> >>>>
>>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java
>>> >>>> to load the right blas? For netlib, there are a few JVM
>>> >>>> flags, such as
>>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>> >>>> so I can force it to use the Java implementation. Not sure I understand
>>> how to force the use of a specific blas (not a specific wrapper for blas).
>>> >>>>
>>> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
>>> >>>> that netlib is using it.
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Joseph Bradley; [hidden email]
>>> >>>>
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Getting breeze to pick up the right blas library is critical for
>>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already
>>> have it).
>>> >>>> It might make sense to force BIDMat to use the same underlying BLAS
>>> >>>> library as well.
>>> >>>>
>>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>> Hi Evan, Joseph
>>> >>>>
>>> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>>> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
>>> >>>>
>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>> >>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>>> >>>> |100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
>>> >>>> |1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
>>> >>>> |10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>>> >>>>
>>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>> >>>> 19 Linux, Scala 2.11.
>>> >>>>
>>> >>>> Later I will run tests with Cuda. I need to install a new Cuda
>>> >>>> version for this purpose.
>>> >>>>
>>> >>>> Do you have any ideas why breeze-netlib with native blas is so much
>>> >>>> slower than BIDMat MKL?
>>> >>>>
>>> >>>> Best regards, Alexander
>>> >>>>
>>> >>>> From: Joseph Bradley [mailto:[hidden email]]
>>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Evan R. Sparks; [hidden email]
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Hi Alexander,
>>> >>>>
>>> >>>> Using GPUs with Spark would be very exciting.  Small comment:
>>> >>>> Concerning your question earlier about keeping data stored on the
>>> >>>> GPU rather than having to move it between main memory and GPU
>>> >>>> memory on each iteration, I would guess this would be critical to
>>> >>>> getting good performance.  If you could do multiple local
>>> >>>> iterations before aggregating results, then the cost of data
>>> >>>> movement to the GPU could be amortized (and I believe that is done
>>> >>>> in practice).  Having Spark be aware of the GPU and using it as
>>> another part of memory sounds like a much bigger undertaking.
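
A conceptual Scala sketch of that batching idea (every helper here is a
hypothetical stand-in, not MLlib or BIDMat API):

// Trivial placeholders so the sketch compiles; a real version would do a
// GPU-backed gradient step and a proper weighted combine.
def gradientStep(w: Array[Double], data: Array[Array[Double]]): Array[Double] = w
def average(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => (x + y) / 2 }

// Amortise host<->GPU transfers: take k local steps per partition before
// aggregating, instead of one device round-trip per global iteration.
import org.apache.spark.rdd.RDD
def train(data: RDD[Array[Double]], w0: Array[Double], k: Int): Array[Double] = {
  val bw = data.sparkContext.broadcast(w0)
  data.mapPartitions { part =>
    val local = part.toArray      // would be copied to the device once
    var w = bw.value
    for (_ <- 1 to k)             // k local iterations, no extra transfers
      w = gradientStep(w, local)
    Iterator(w)
  }.reduce(average)
}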
>>> >>>>
>>> >>>> Joseph
>>> >>>>
>>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>> Thank you for explanation! I’ve watched the BIDMach presentation by
>>> >>>> John Canny and I am really inspired by his talk and comparisons
>>> with Spark MLlib.
>>> >>>>
>>> >>>> I am very interested to find out what will be better within Spark:
>>> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>>> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>> >>>> neural networks in batch mode. While it is not a “pure” test of
>>> >>>> linear algebra, it involves some other things that are essential to
>>> machine learning.
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: [hidden email]
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>>> >>>> data layout and fewer levels of indirection - it's definitely a
>>> >>>> worthwhile experiment to run. The main speedups I've seen from
>>> >>>> using it come from highly optimized GPU code for linear algebra. I
>>> >>>> know that in the past Canny has gone as far as to write custom GPU
>>> >>>> kernels for performance-critical regions of code.[1]
>>> >>>>
>>> >>>> BIDMach is highly optimized for single node performance or
>>> >>>> performance on small clusters.[2] Once data doesn't fit easily in
>>> >>>> GPU memory (or can't be batched in that way), the performance tends to
>>> >>>> fall off. Canny argues for hardware/software codesign and as such
>>> >>>> prefers machine configurations that are quite different from what
>>> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels and
>>> 4 GPUs.
>>> >>>>
>>> >>>> In contrast, MLlib was designed for horizontal scalability on
>>> >>>> commodity clusters and works best on very big datasets - order of
>>> terabytes.
>>> >>>>
>>> >>>> For the most part, these projects developed concurrently to address
>>> >>>> slightly different use cases. That said, there may be bits of
>>> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>> >>>> careful about maintaining cross-language compatibility for our Java
>>> >>>> and Python users, though.
>>> >>>>
>>> >>>> - Evan
>>> >>>>
>>> >>>> [1] - http://arxiv.org/abs/1409.5402 [2] -
>>> >>>> http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>> >>>>
>>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>> Hi Evan,
>>> >>>>
>>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>>> >>>> you know what makes it faster than netlib-java?
>>> >>>>
>>> >>>> The same group has the BIDMach library that implements machine
>>> >>>> learning. For some examples they use the Caffe convolutional neural
>>> >>>> network library developed by another group in Berkeley. Could you
>>> >>>> elaborate on how these all might be connected with Spark MLlib? If
>>> >>>> you take BIDMat for linear algebra, why don’t you take BIDMach for
>>> optimization and learning?
>>> >>>>
>>> >>>> Best regards, Alexander
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: [hidden email]
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>> >>>> blas in many cases.
>>> >>>>
>>> >>>> You might consider taking a look at the codepaths that BIDMat (
>>> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>>> >>>> optimizing to make this work really fast from Scala. I've run it on
>>> >>>> my laptop and compared to MKL and in certain cases it's 10x faster
>>> at matrix multiply.
>>> >>>> There are a lot of layers of indirection here and you really want
>>> >>>> to avoid data copying as much as possible.
>>> >>>>
>>> >>>> We could also consider swapping out BIDMat for Breeze, but that
>>> >>>> would be a big project and if we can figure out how to get
>>> >>>> breeze+cublas to comparable performance that would be a big win.
>>> >>>>
>>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>> Dear Spark developers,
>>> >>>>
>>> >>>> I am exploring how to make linear algebra operations faster within
>>> Spark.
>>> >>>> One way of doing this is to use the Scala Breeze library that is
>>> >>>> bundled with Spark. For matrix operations, it employs Netlib-java,
>>> >>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
>>> >>>> and LAPACK native binaries if they are available on the worker
>>> >>>> node. It also has its own optimized Java implementation of BLAS. It
>>> >>>> is worth mentioning that native binaries provide better
>>> performance only for BLAS level 3, i.e.
>>> >>>> matrix-matrix operations or general matrix multiplication (GEMM).
>>> >>>> This is confirmed by GEMM test on Netlib-java page
>>> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>> >>>> experiments with training of artificial neural network
>>> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>> >>>> However, I would like to boost performance more.
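
The level-3 vs. lower-level contrast is easy to see with a rough Breeze
timing sketch (sizes arbitrary; an illustration of the claim, not a
proper benchmark):

// Native BLAS helps GEMM (level 3, compute-bound) far more than gemv
// (level 2, memory-bound) - roughly what the netlib-java page shows.
import breeze.linalg.{DenseMatrix, DenseVector}
val n = 2000
val a = DenseMatrix.rand(n, n)
val b = DenseMatrix.rand(n, n)
val x = DenseVector.rand(n)
def time[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime
  val r = f
  println(f"$label%s: ${(System.nanoTime - t0) / 1e9}%.3f s")
  r
}
time("gemm (level 3)")(a * b)
time("gemv (level 2)")(a * x)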
>>> >>>>
>>> >>>> GPUs are supposed to be fast at linear algebra, and there is an
>>> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
>>> >>>> server with an Nvidia GPU and I was able to do the following. I linked
>>> >>>> cublas (instead of cpu-based blas) with the Netlib-java wrapper and put
>>> >>>> it into Spark, so Breeze/Netlib is using it. Then I did some
>>> >>>> performance measurements with regards to artificial neural network
>>> >>>> batch learning in Spark MLlib that involves matrix-matrix
>>> >>>> multiplications. It turns out that for matrices of size less than
>>> >>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
>>> >>>> slower for bigger matrices. It's worth mentioning that it was not
>>> a test for ONLY multiplication since there are other operations involved.
>>> >>>> One of the reasons for slowdown might be the overhead of copying
>>> >>>> the matrices from computer memory to graphic card memory and back.
>>> >>>>
>>> >>>> So, a few questions:
>>> >>>> 1) Do these results with CUDA make sense?
>>> >>>> 2) If the problem is with copy overhead, are there any libraries
>>> >>>> that allow forcing intermediate results to stay in graphic card
>>> >>>> memory, thus removing the overhead?
>>> >>>> 3) Any other options to speed up linear algebra in Spark?
>>> >>>>
>>> >>>> Thank you, Alexander
>>> >>>>
>>> >>>> ---------------------------------------------------------------------
>>> >>>> To unsubscribe, e-mail: [hidden email]
>>> >>>> For additional commands, e-mail: [hidden email]
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>
>>>
>>> --
>>> Best regards,
>>> Sam
>>>
>>>
>>>
>>
>
Reply | Threaded
Open this post in threaded view
|

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
In reply to this post by fommil
Sure, I will write a how-to after I re-check the results.

-----Original Message-----
From: Sam Halliday [mailto:[hidden email]]
Sent: Wednesday, March 25, 2015 3:04 PM
To: Evan R. Sparks; [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra

If you write it up I'll add it to the netlib-java wiki :-)

BTW, does it automatically flip between CPU/GPU? I've a project called
MultiBLAS which was going to do this; it should be easy (but boring to write).
On 25 Mar 2015 22:00, "Evan R. Sparks" <[hidden email]> wrote:

> Alex - great stuff, and the nvblas numbers are pretty remarkable
> (almost too good... did you check the results for correctness? - also,
> is it possible that the "unified memory model" of nvblas is somehow
> hiding PCI transfer time?)
>
> this last bit (getting nvblas + netlib-java to play together) sounds
> like it's non-trivial and took you a while to figure out! Would you
> mind posting a gist or something of maybe the shell scripts/exports
> you used to make this work - I can imagine it being highly useful for others in the future.
>
> Thanks!
> Evan
>
> On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <
> [hidden email]> wrote:
>
>> Hi again,
>>
>> I finally managed to use nvblas within Spark+netlib-java. It has
>> exceptional performance for big matrices with Double, faster than
>> BIDMat-cuda with Float. But for smaller matrices, if you copy
>> them to/from the GPU, OpenBlas or MKL might be a better choice. This
>> correlates with the original nvblas presentation at GPU conf 2013 (slide 21):
>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>
>> My results:
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Just in case: these tests are not meant to generalize the performance
>> of different libraries. I just want to pick the library that performs
>> dense matrix multiplication best for my task.
>>
>> P.S. My previous issue with nvblas was the following: it exposes Fortran
>> blas functions, while netlib-java uses C cblas functions.
>> So, one needs a cblas shared library to use nvblas through netlib-java.
>> Fedora does not have cblas (but Debian and Ubuntu do), so I needed
>> to compile it. I could not use cblas from Atlas or Openblas because
>> they link to their own implementation and not to Fortran blas.
>>
>> Best regards, Alexander
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

Reza Zadeh
In reply to this post by Ulanov, Alexander
These are awesome (and surprising) results, Alex. I've been following this
thread and am really surprised by the improvement over BIDMat-cuda - almost
20x faster.

Any chance you could send scripts or github gist for reproduction?

Thanks,
Reza

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <[hidden email]>
wrote:

> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has
> exceptional performance for big matrices with Double, faster than
> BIDMat-cuda with Float. But for smaller matrices, if you copy them
> to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with
> the original nvblas presentation at GPU conf 2013 (slide 21):
> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results:
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case: these tests are not meant to generalize the performance of
> different libraries. I just want to pick the library that performs dense
> matrix multiplication best for my task.
>
> P.S. My previous issue with nvblas was the following: it exposes Fortran blas
> functions, while netlib-java uses C cblas functions. So, one
> needs a cblas shared library to use nvblas through netlib-java. Fedora does
> not have cblas (but Debian and Ubuntu do), so I needed to compile it. I
> could not use cblas from Atlas or Openblas because they link to their
> own implementation and not to Fortran blas.
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. nvblas functions
> should replace current blas functions calls after executing LD_PRELOAD as
> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
> changes to netlib-java. It seems to work for simple Java example, but I
> cannot make it work with Spark. I run the following:
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
> --driver-memory 4G In nvidia-smi I observe that Java is to use GPU:
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU
> Memory |
> |  GPU       PID  Type  Process name                               Usage
>     |
>
> |=============================================================================|
> |    0      8873    C   bash
> 39MiB |
> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java
> 39MiB |
>
> +-----------------------------------------------------------------------------+
>
> In Spark shell I do matrix multiplication and see the following:
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
> So I am sure that netlib-native is loaded and cblas supposedly used.
> However, matrix multiplication does executes on CPU since I see 16% of CPU
> used and 0% of GPU used. I also checked different matrix sizes, from
> 100x100 to 12000x12000
>
> Could you suggest might the LD_PRELOAD not affect Spark shell?
>
> Best regards, Alexander
>
>
>
> From: Sam Halliday [mailto:[hidden email]]
> Sent: Monday, March 09, 2015 6:01 PM
> To: Ulanov, Alexander
> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
>
> Thanks so much for following up on this!
>
> Hmm, I wonder if we should have a concerted effort to chart performance on
> various pieces of hardware...
> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]<mailto:
> [hidden email]>> wrote:
> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the
> comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the
> support of Double in the current source code), did the test with BIDMat and
> CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Sam Halliday [mailto:[hidden email]<mailto:
> [hidden email]>]
> Sent: Tuesday, March 03, 2015 1:54 PM
> To: Xiangrui Meng; Joseph Bradley
> Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]<mailto:
> [hidden email]>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> BTW, is anybody on this list going to the London Meetup in a few weeks?
>
>
> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>
> Would be nice to meet other people working on the guts of Spark! :-)
>
>
> Xiangrui Meng <[hidden email]<mailto:[hidden email]>> writes:
>
> > Hey Alexander,
> >
> > I don't quite understand the part where netlib-cublas is about 20x
> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
> > with netlib-java?
> >
> > CC'ed Sam, the author of netlib-java.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]
> <mailto:[hidden email]>> wrote:
> >> Better documentation for linking would be very helpful!  Here's a JIRA:
> >> https://issues.apache.org/jira/browse/SPARK-6019
> >>
> >>
> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <[hidden email]>
> >> wrote:
> >>
> >>> Thanks for compiling all the data and running these benchmarks,
> >>> Alex. The big takeaways here can be seen with this chart:
> >>>
> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>>
> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
> >>> BIDMat+GPU) can provide substantial (but less than an order of
> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
> >>> BIDMat+MKL or netlib-java+openblas-compiled).
> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
> >>> can be 1-2 orders of magnitude worse than a well-tuned CPU
> >>> implementation, particularly for larger matrices. This is not to
> >>> pick on netlib - this basically agrees with the author's own
> >>> benchmarks (
> >>> https://github.com/fommil/netlib-java)
> >>>
> >>> I think that most of our users are in a situation where using GPUs
> >>> may not be practical - although we could consider having a good GPU
> >>> backend available as an option. However, *ALL* users of MLlib could
> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
> >>> BLAS implementation. Perhaps we should consider updating the mllib
> >>> guide with a more complete section for enabling high performance
> >>> binaries on OSX and Linux? Or better, figure out a way for the
> >>> system to fetch these automatically.
> >>>
> >>> - Evan
> >>>
> >>>
> >>>
> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>
> >>>> Just to summarize this thread, I was finally able to make all
> >>>> performance comparisons that we discussed. It turns out that:
> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >>>>
> >>>> Below is the link to the spreadsheet with full results.
> >>>>
> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> >>>>
> >>>> One thing still needs exploration: does BIDMat-cublas perform
> >>>> copying to/from machine’s RAM?
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ulanov, Alexander
> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>>> To: Evan R. Sparks
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
> >>>> the original one discusses a slightly different topic. I was able to
> >>>> link netlib with MKL from the BIDMat binaries. Indeed, MKL is
> >>>> statically linked inside a 60MB library.
> >>>>
> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
> >>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
> >>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
> >>>>
> >>>> It turns out that pre-compiled MKL is faster than precompiled
> >>>> OpenBlas on my machine. Probably I'll add two more columns with
> >>>> locally compiled openblas and cuda.
> >>>>
> >>>> Alexander
> >>>>
> >>>> From: Evan R. Sparks
> >>>> [mailto:[hidden email]<mailto:[hidden email]>]
> >>>> Sent: Monday, February 09, 2015 6:06 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Great - perhaps we can move this discussion off-list and onto a
> >>>> JIRA ticket? (Here's one:
> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
> >>>>
> >>>> It seems like this is going to be somewhat exploratory for a while
> >>>> (and there's probably only a handful of us who really care about
> >>>> fast linear
> >>>> algebra!)
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for explanation and useful link. I am going to build
> >>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
> >>>>
> >>>> Do I understand correctly that the BIDMat binaries contain statically
> >>>> linked Intel MKL BLAS? That might be the reason why I am able to run
> >>>> BIDMat without having MKL BLAS installed on my server. If it is true,
> >>>> I wonder if it is OK, because Intel sells this library. Nevertheless,
> >>>> it seems that in my case precompiled MKL BLAS performs better than
> >>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed
> >>>> to have comparable JNI overheads.
> >>>>
> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
> >>>> Halliday (Netlib-java) would be interested in comparing their
> >>>> libraries.
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Friday, February 06, 2015 5:58 PM
> >>>>
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
> >>>> from getting cache sizes, etc. set up correctly for your particular
> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
> >>>> quickly and yields performance competitive with MKL.
> >>>>
> >>>> To make sure the right library is getting used, you have to make
> >>>> sure it's first on the search path - export
> >>>> LD_LIBRARY_PATH=/path/to/blas (the directory containing your
> >>>> library.so) will do the trick here.
> >>>>
> >>>> For some examples of getting netlib-java setup on an ec2 node and
> >>>> some example benchmarking code we ran a while back, see:
> >>>> https://github.com/shivaram/matrix-bench
> >>>>
> >>>> In particular - build-openblas-ec2.sh shows you how to build the
> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
> >>>> shows you how to get the path setup and get that library picked up by
> >>>> netlib-java.
> >>>>
> >>>> In this way - you could probably get cuBLAS set up to be used by
> >>>> netlib-java as well.
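> >>>>
> >>>> To confirm which wrapper netlib-java actually instantiated, a quick
> >>>> check against netlib-java's public getInstance API (a sketch; run it
> >>>> in the Spark shell):
> >>>>
> >>>> import com.github.fommil.netlib.BLAS
> >>>> // Prints com.github.fommil.netlib.NativeSystemBLAS when a native
> >>>> // library was picked up, or ...F2jBLAS on the pure-Java fallback.
> >>>> println(BLAS.getInstance().getClass.getName)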
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
> >>>> load the right blas? For netlib, there are a few JVM flags, such as
> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
> >>>> so I can force it to use the Java implementation. I am not sure I
> >>>> understand how to force the use of a specific blas (as opposed to a
> >>>> specific wrapper for blas).
> >>>>
> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
> >>>> that netlib is using it.
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Friday, February 06, 2015 5:19 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; [hidden email]
> >>>>
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Getting breeze to pick up the right blas library is critical for
> >>>> performance. I recommend using OpenBLAS (or MKL, if you already have
> >>>> it).
> >>>> It might make sense to force BIDMat to use the same underlying BLAS
> >>>> library as well.
> >>>>
> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Hi Evan, Joseph
> >>>>
> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
> >>>>
> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> >>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
> >>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
> >>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
> >>>>
> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
> >>>> 19 Linux, Scala 2.11.
> >>>>
> >>>> Later I will run tests with Cuda. I need to install a new Cuda
> >>>> version for this purpose.
> >>>>
> >>>> Do you have any ideas why breeze-netlib with native blas is so much
> >>>> slower than BIDMat MKL?
> >>>>
> >>>> Best regards, Alexander
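> >>>>
> >>>> For anyone reproducing numbers like these, a minimal Breeze timing
> >>>> sketch (not the exact harness used for the table above; warm-up
> >>>> iterations are omitted for brevity):
> >>>>
> >>>> import breeze.linalg._
> >>>>
> >>>> def timeSec[A](f: => A): Double = {
> >>>>   val t0 = System.nanoTime(); f; (System.nanoTime() - t0) / 1e9
> >>>> }
> >>>>
> >>>> val n = 1000
> >>>> val a = DenseMatrix.rand(n, n) // DenseMatrix[Double], uniform [0,1)
> >>>> val b = DenseMatrix.rand(n, n)
> >>>> // a * b dispatches to whichever BLAS netlib-java has loaded.
> >>>> println(s"${n}x$n GEMM: " + timeSec(a * b) + " s")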
> >>>>
> >>>> From: Joseph Bradley [mailto:[hidden email]]
> >>>> Sent: Thursday, February 05, 2015 5:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Evan R. Sparks; [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> Using GPUs with Spark would be very exciting.  Small comment:
> >>>> Concerning your question earlier about keeping data stored on the
> >>>> GPU rather than having to move it between main memory and GPU
> >>>> memory on each iteration, I would guess this would be critical to
> >>>> getting good performance.  If you could do multiple local
> >>>> iterations before aggregating results, then the cost of data
> >>>> movement to the GPU could be amortized (and I believe that is done
> >>>> in practice).  Having Spark be aware of the GPU and using it as
> >>>> another part of memory sounds like a much bigger undertaking.
> >>>>
> >>>> Joseph
> >>>>
> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Thank you for the explanation! I've watched the BIDMach presentation
> >>>> by John Canny and I am really inspired by his talk and comparisons
> >>>> with Spark MLlib.
> >>>>
> >>>> I am very interested to find out what will be better within Spark:
> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
> >>>> neural networks in batch mode. While it is not a “pure” test of
> >>>> linear algebra, it involves some other things that are essential to
> >>>> machine learning.
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Thursday, February 05, 2015 1:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
> >>>> data layout and fewer levels of indirection - it's definitely a
> >>>> worthwhile experiment to run. The main speedups I've seen from
> >>>> using it come from highly optimized GPU code for linear algebra. I
> >>>> know that in the past Canny has gone as far as to write custom GPU
> >>>> kernels for performance-critical regions of code.[1]
> >>>>
> >>>> BIDMach is highly optimized for single node performance or
> >>>> performance on small clusters.[2] Once data doesn't fit easily in
> >>>> GPU memory (or can be batched in that way) the performance tends to
> >>>> fall off. Canny argues for hardware/software codesign and as such
> >>>> prefers machine configurations that are quite different than what
> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels and
> >>>> 4 GPUs.
> >>>>
> >>>> In contrast, MLlib was designed for horizontal scalability on
> >>>> commodity clusters and works best on very big datasets - order of
> >>>> terabytes.
> >>>>
> >>>> For the most part, these projects developed concurrently to address
> >>>> slightly different use cases. That said, there may be bits of
> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
> >>>> careful about maintaining cross-language compatibility for our Java
> >>>> and Python users, though.
> >>>>
> >>>> - Evan
> >>>>
> >>>> [1] - http://arxiv.org/abs/1409.5402 [2] -
> >>>> http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>>
> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for suggestion! BIDMat seems to have terrific speed. Do
> >>>> you know what makes them faster than netlib-java?
> >>>>
> >>>> The same group has the BIDMach library that implements machine
> >>>> learning. For some examples they use the Caffe convolutional neural
> >>>> network library, developed by another group at Berkeley. Could you
> >>>> elaborate on how these all might be connected with Spark MLlib? If
> >>>> you take BIDMat for linear algebra, why don't you take BIDMach for
> >>>> optimization and learning?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:[hidden email]]
> >>>> Sent: Thursday, February 05, 2015 12:09 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: [hidden email]
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
> >>>> blas in many cases.
> >>>>
> >>>> You might consider taking a look at the codepaths that BIDMat (
> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
> >>>> optimizing to make this work really fast from Scala. I've run it on
> >>>> my laptop and compared to MKL, and in certain cases it's 10x faster
> >>>> at matrix multiply.
> >>>> There are a lot of layers of indirection here and you really want
> >>>> to avoid data copying as much as possible.
> >>>>
> >>>> We could also consider swapping out BIDMat for Breeze, but that
> >>>> would be a big project and if we can figure out how to get
> >>>> breeze+cublas to comparable performance that would be a big win.
> >>>>
> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
> >>>> Dear Spark developers,
> >>>>
> >>>> I am exploring how to make linear algebra operations faster within
> >>>> Spark.
> >>>> One way of doing this is to use the Scala Breeze library that is
> >>>> bundled with Spark. For matrix operations, it employs Netlib-java,
> >>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
> >>>> and LAPACK native binaries if they are available on the worker
> >>>> node. It also has its own optimized Java implementation of BLAS. It
> >>>> is worth mentioning that native binaries provide better performance
> >>>> only for BLAS level 3, i.e.
> >>>> matrix-matrix operations or general matrix multiplication (GEMM).
> >>>> This is confirmed by the GEMM test on the Netlib-java page
> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my
> >>>> experiments with training of artificial neural network
> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
> >>>> However, I would like to boost performance more.
> >>>>
> >>>> GPU is supposed to work fast with linear algebra, and there is an
> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
> >>>> server with Nvidia GPU and I was able to do the following. I linked
> >>>> cublas (instead of cpu-based blas) with Netlib-java wrapper and put
> >>>> it into Spark, so Breeze/Netlib is using it. Then I did some
> >>>> performance measurements with regards to artificial neural network
> >>>> batch learning in Spark MLlib that involves matrix-matrix
> >>>> multiplications. It turns out that for matrices of size less than
> >>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
> >>>> slower for bigger matrices. It is worth mentioning that it was not a
> >>>> test of ONLY multiplication, since there are other operations involved.
> >>>> One of the reasons for slowdown might be the overhead of copying
> >>>> the matrices from computer memory to graphic card memory and back.
> >>>>
> >>>> So, a few questions:
> >>>> 1) Do these results with CUDA make sense?
> >>>> 2) If the problem is copy overhead, are there any libraries that
> >>>> allow forcing intermediate results to stay in graphics card memory,
> >>>> thus removing the overhead?
> >>>> 3) Any other options to speed-up linear algebra in Spark?
> >>>>
> >>>> Thank you, Alexander
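> >>>>
> >>>> On question 2, a back-of-envelope cost model (a sketch with
> >>>> illustrative parameters, not a measurement) makes the observed
> >>>> ~1000x780 crossover plausible: an n x n GEMM does 2n^3 flops, while
> >>>> the PCIe transfer moves only 3n^2 values (A and B in, C out), so for
> >>>> small n the copies dominate any GPU speedup:
> >>>>
> >>>> // Rough estimate of when GPU GEMM pays off despite copy overhead.
> >>>> def gpuPaysOff(n: Long, cpuGflops: Double, gpuGflops: Double,
> >>>>                pcieGBperSec: Double): Boolean = {
> >>>>   val flops   = 2.0 * n * n * n
> >>>>   val cpuSec  = flops / (cpuGflops * 1e9)
> >>>>   val gpuSec  = flops / (gpuGflops * 1e9)
> >>>>   val copySec = 3.0 * n * n * 8 / (pcieGBperSec * 1e9) // 8B doubles
> >>>>   gpuSec + copySec < cpuSec
> >>>> }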
> >>>>
>
> --
> Best regards,
> Sam
>
Reply | Threaded
Open this post in threaded view
|

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
In reply to this post by jfcanny
John,

Thanks for your suggestion; it really does seem strange. Right now I have no idea what's wrong, since I use exactly the same script for testing. I would appreciate any suggestions.
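
One check I plan to try (a sketch, Linux-specific): list which BLAS-related shared objects the shell's JVM has actually mapped, after running one multiplication, since the native libraries load lazily:

import scala.io.Source

// Every mapped path mentioning blas or mkl. If LD_PRELOAD took effect,
// /usr/local/cuda-6.5/lib64/libnvblas.so should show up here; if it
// doesn't, LD_PRELOAD never took effect for this process.
Source.fromFile("/proc/self/maps").getLines()
  .map(_.trim.split("\\s+").last)
  .filter(p => p.contains("blas") || p.contains("mkl"))
  .toSet
  .foreach(println)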

Best regards, Alexander

-----Original Message-----
From: jfcanny [mailto:[hidden email]]
Sent: Wednesday, March 25, 2015 3:09 PM
To: [hidden email]
Subject: Re: Using CUDA within Spark / boosting linear algebra

Alex,
I think you should recheck your numbers. Both BIDMat and nvblas are wrappers for cublas. The speeds are identical, except on machines that have multiple GPUs, which nvblas exploits and cublas doesn't.

It would be a good idea to add a column with Gflop throughput. Your numbers for the BIDMat 10kx10k multiply give about 300 single-precision Gflops, which seems about right for a Quadro 4000 (current-generation devices are >10x faster than a 4000).

Your numbers for netlib-nvblas would indicate a double-precision throughput of 8 Tflops, which is physically impossible on that device.
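
For reference, that sanity check is just the standard 2n^3 flop count for an n x n GEMM divided by wall-clock time; a minimal sketch (the 0.25 s figure in the comment is illustrative, not taken from the spreadsheet):

def gemmGflops(n: Long, seconds: Double): Double =
  2.0 * n * n * n / seconds / 1e9

// A 10000x10000 multiply is 2e12 flops, so e.g. a 0.25 s timing would
// imply gemmGflops(10000, 0.25) == 8000 Gflops, i.e. 8 Tflops. Compare
// that against the device's peak before believing a benchmark.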

It shouldn't matter which interface you use if you have a single GPU.

-John

On 3/25/2015 2:34 PM, Ulanov, Alexander [via Apache Spark Developers List] wrote:

> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has
> exceptional performance for big matrices with Double, faster than
> BIDMat-cuda with Float. But for smaller matrices, if you copy them
> to/from the GPU, OpenBlas or MKL might be a better choice. This
> correlates with the original nvblas presentation at GPU conf 2013
> (slide 21):
> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results:
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
>
> Just in case: these tests are not meant to generalize the performance
> of the different libraries. I just want to pick the library that best
> does dense matrix multiplication for my task.
>
> P.S. My previous issue with nvblas was the following: it provides the
> Fortran blas functions, while netlib-java uses the C cblas functions.
> So one needs a cblas shared library to use nvblas through netlib-java.
> Fedora does not have cblas (but Debian and Ubuntu do), so I needed to
> compile it. I could not use the cblas from Atlas or Openblas, because
> they link to their own blas implementations and not to the Fortran blas.
>
> Best regards, Alexander
>
Reply | Threaded
Open this post in threaded view
|

RE: Using CUDA within Spark / boosting linear algebra

Ulanov, Alexander
In reply to this post by Ulanov, Alexander
As everyone suggested, the results were too good to be true, so I double-checked them. It turns out that nvblas did not do the multiplication, due to the parameter NVBLAS_TILE_DIM from "nvblas.conf", and returned a zero matrix. My previously posted results with nvblas therefore measured matrix copying only. The default NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I handpicked other values that worked. As a result, netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration.
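
That failure mode - a silently returned zero matrix - is cheap to guard against in future runs; a minimal Breeze sketch (my addition here, not part of the original test script):

import breeze.linalg._

val n = 1000
val a = DenseMatrix.rand(n, n) // uniform in [0, 1)
val b = DenseMatrix.rand(n, n)
val c = a * b
// With random inputs in [0, 1) the entries of C should all be positive,
// so a zero sum means the multiply silently failed.
require(sum(c) > 0.0, "GEMM returned a zero matrix - check nvblas.conf")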

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
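
Until the how-to is up, the shape of a working nvblas.conf is sketched below. The keywords are from NVIDIA's nvblas documentation; the library path and the 1024 tile size are illustrative, not the exact values I ended up with:

# CPU BLAS that nvblas falls back to for small or unsupported calls (required)
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
# GPUs nvblas may use
NVBLAS_GPU_LIST ALL
# Tile size for splitting GEMMs; the 2048 default was too big for this card
NVBLAS_TILE_DIM 1024
NVBLAS_LOGFILE nvblas.log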



-----Original Message-----
From: Ulanov, Alexander
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you will copy them to/from GPU, OpenBlas or MKL might be a better choice. This correlates with original nvblas presentation on GPU conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
 
My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing 

Just in case, these tests are not for generalization of performance of different libraries. I just want to pick a library that does at best dense matrices multiplication for my task.

P.S. My previous issue with nvblas was the following: it has Fortran blas functions, at the same time netlib-java uses C cblas functions. So, one needs cblas shared library to use nvblas through netlib-java. Fedora does not have cblas (but Debian and Ubuntu have), so I needed to compile it. I could not use cblas from Atlas or Openblas because they link to their implementation and not to Fortran blas.

Best regards, Alexander

-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. nvblas functions should replace current blas functions calls after executing LD_PRELOAD as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any changes to netlib-java. It seems to work for simple Java example, but I cannot make it work with Spark. I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G In nvidia-smi I observe that Java is to use GPU:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      8873    C   bash                                            39MiB |
|    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
+-----------------------------------------------------------------------------+

In Spark shell I do matrix multiplication and see the following:
15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
So I am sure that netlib-native is loaded and cblas supposedly used. However, matrix multiplication does executes on CPU since I see 16% of CPU used and 0% of GPU used. I also checked different matrix sizes, from 100x100 to 12000x12000

Could you suggest might the LD_PRELOAD not affect Spark shell?

Best regards, Alexander



From: Sam Halliday [mailto:[hidden email]]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra


Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]<mailto:[hidden email]>> wrote:
Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the support of Double in the current source code), did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-----Original Message-----
From: Sam Halliday [mailto:[hidden email]<mailto:[hidden email]>]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]<mailto:[hidden email]>
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)


Xiangrui Meng <[hidden email]<mailto:[hidden email]>> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]<mailto:[hidden email]>> wrote:
>> Better documentation for linking would be very helpful!  Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK-6019
>>
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> <[hidden email]<mailto:[hidden email]>>
>> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks,
>>> Alex. The big takeaways here can be seen with this chart:
>>>
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZ
>>> Hl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> BIDMat+GPU) can provide substantial (but less than an order of
>>> BIDMat+magnitude)
>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>>> netlib-java+openblas-compiled).
>>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>>> worse than a well-tuned CPU implementation, particularly for larger matrices.
>>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>>> basically agrees with the authors own benchmarks (
>>> https://github.com/fommil/netlib-java)
>>>
>>> I think that most of our users are in a situation where using GPUs
>>> may not be practical - although we could consider having a good GPU
>>> backend available as an option. However, *ALL* users of MLlib could
>>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>> BLAS implementation. Perhaps we should consider updating the mllib
>>> guide with a more complete section for enabling high performance
>>> binaries on OSX and Linux? Or better, figure out a way for the
>>> system to fetch these automatically.
>>>
>>> - Evan
>>>
>>>
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>> [hidden email]<mailto:[hidden email]>> wrote:
>>>
>>>> Just to summarize this thread, I was finally able to make all
>>>> performance comparisons that we discussed. It turns out that:
>>>> BIDMat-cublas>>BIDMat
>>>> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo=
>>>> =netlib-cublas>netlib-blas>f2jblas
>>>>
>>>> Below is the link to the spreadsheet with full results.
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx
>>>> 378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> One thing still needs exploration: does BIDMat-cublas perform
>>>> copying to/from machine’s RAM?
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> To: Evan R. Sparks
>>>> Cc: Joseph Bradley;
>>>> [hidden email]<mailto:[hidden email]>
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks, Evan! It seems that ticket was marked as duplicate though
>>>> the original one discusses slightly different topic. I was able to
>>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is
>>>> statically linked inside a 60MB library.
>>>>
>>>> |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
>>>> Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
>>>> +-----------------------------------------------------------------------+
>>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
>>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
>>>> |1,638475459 |
>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 |445,0935211 |
>>>> 1569,233228 |
>>>>
>>>> It turn out that pre-compiled MKL is faster than precompiled
>>>> OpenBlas on my machine. Probably, I’ll add two more columns with
>>>> locally compiled openblas and cuda.
>>>>
>>>> Alexander
>>>>
>>>> From: Evan R. Sparks
>>>> [mailto:[hidden email]<mailto:[hidden email]>]
>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley;
>>>> [hidden email]<mailto:[hidden email]>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Great - perhaps we can move this discussion off-list and onto a
>>>> JIRA ticket? (Here's one:
>>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>>>
>>>> It seems like this is going to be somewhat exploratory for a while
>>>> (and there's probably only a handful of us who really care about
>>>> fast linear
>>>> algebra!)
>>>>
>>>> - Evan
>>>>
>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>>> [hidden email]<mailto:[hidden email]><mailto:[hidden email]<mailto:[hidden email]>>> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for explanation and useful link. I am going to build
>>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
>>>>
>>>> Do I understand correctly that BIDMat binaries contain statically
>>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>>>> BIDMat not having MKL BLAS installed on my server. If it is true, I
>>>> wonder if it is OK because Intel sells this library. Nevertheless,
>>>> it seems that in my case precompiled MKL BLAS performs better than
>>>> precompiled OpenBLAS given that BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>>>>
>>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>>>> as you suggested. I wonder, are John Canny (BIDMat) and Sam
>>>> Halliday
>>>> (Netlib-java) interested to compare their libraries.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:[hidden email]><mailto:
>>>> [hidden email]<mailto:[hidden email]>>]
>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>>
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley;
>>>> [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>>> from getting cache sizes, etc. set up correctly for your particular
>>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>>> quickly and yields performance competitive with MKL.
>>>>
>>>> To make sure the right library is getting used, you have to make
>>>> sure it's first on the search path - export
>>>> LD_LIBRARY_PATH=/path/to/blas (the directory containing the library) will do the trick here.
>>>>
>>>> For some examples of getting netlib-java setup on an ec2 node and
>>>> some example benchmarking code we ran a while back, see:
>>>> https://github.com/shivaram/matrix-bench
>>>>
>>>> In particular - build-openblas-ec2.sh shows you how to build the
>>>> library and set up symlinks correctly, and scala/run-netlib.sh
>>>> shows you how to get the path setup and get that library picked up by netlib-java.
>>>>
>>>> In this way - you could probably get cuBLAS set up to be used by
>>>> netlib-java as well.
>>>>
>>>> - Evan
>>>>
>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>>> [hidden email]> wrote:
>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>>> load the right BLAS? For netlib, there are a few JVM flags, such as
>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>>> so I can force it to use the Java implementation. I am not sure how to force the use of a specific BLAS (not a specific wrapper for BLAS).
>>>>
>>>> Btw. I have installed openblas (yum install openblas), so I suppose
>>>> that netlib is using it.
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley;
>>>> [hidden email]
>>>>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Getting breeze to pick up the right blas library is critical for
>>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>>> It might make sense to force BIDMat to use the same underlying BLAS
>>>> library as well.
>>>>
>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>>> [hidden email]> wrote:
>>>> Hi Evan, Joseph
>>>>
>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>>>> faster than netlib-java+breeze (sorry for the weird table formatting):
>>>>
>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java (native_system_linux_x86-64) | Breeze+Netlib-java f2jblas |
>>>> +-------------------------+-------------+--------------------------------------------------+----------------------------+
>>>> | 100x100*100x100         | 0.00205596  | 0.03810324                                       | 0.002556                   |
>>>> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                       | 1.638475459                |
>>>> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                      | 1569.233228                |
>>>>
>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>>> 19 Linux, Scala 2.11.
>>>>
>>>> Later I will run tests with CUDA. I need to install a new CUDA
>>>> version for this purpose.
>>>>
>>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>> slower than BIDMat MKL?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Joseph Bradley [mailto:[hidden email]]
>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Evan R. Sparks;
>>>> [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph
>>>>
>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>>> [hidden email]> wrote:
>>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
>>>> John Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>>>>
>>>> I am very interested to find out what will be better within Spark:
>>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>>> neural networks in batch mode. While it is not a “pure” test of
>>>> linear algebra, it involves some other things that are essential to machine learning.
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc:
>>>> [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>>>> data layout and fewer levels of indirection - it's definitely a
>>>> worthwhile experiment to run. The main speedups I've seen from
>>>> using it come from highly optimized GPU code for linear algebra. I
>>>> know that in the past Canny has gone as far as to write custom GPU
>>>> kernels for performance-critical regions of code.[1]
>>>>
>>>> BIDMach is highly optimized for single node performance or
>>>> performance on small clusters.[2] Once data doesn't fit easily in
>>>> GPU memory (or can be batched in that way) the performance tends to
>>>> fall off. Canny argues for hardware/software codesign and as such
>>>> prefers machine configurations that are quite different than what
>>>> we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
>>>>
>>>> In contrast, MLlib was designed for horizontal scalability on
>>>> commodity clusters and works best on very big datasets - order of terabytes.
>>>>
>>>> For the most part, these projects developed concurrently to address
>>>> slightly different use cases. That said, there may be bits of
>>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>>> careful about maintaining cross-language compatibility for our Java
>>>> and Python-users, though.
>>>>
>>>> - Evan
>>>>
>>>> [1] http://arxiv.org/abs/1409.5402
>>>> [2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>
>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>>> [hidden email]> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for suggestion! BIDMat seems to have terrific speed. Do
>>>> you know what makes them faster than netlib-java?
>>>>
>>>> The same group has the BIDMach library that implements machine
>>>> learning. For some examples they use the Caffe convolutional neural
>>>> network library maintained by another group at Berkeley. Could you
>>>> elaborate on how these all might be connected with Spark MLlib? If
>>>> you take BIDMat for linear algebra, why don’t you take BIDMach for optimization and learning?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> To: Ulanov, Alexander
>>>> Cc: [hidden email]
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>>> blas in many cases.
>>>>
>>>> You might consider taking a look at the codepaths that BIDMat (
>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>>>> optimizing to make this work really fast from Scala. I've run it on
>>>> my laptop and compared to MKL, and in certain cases it's 10x faster at matrix multiply.
>>>> There are a lot of layers of indirection here and you really want
>>>> to avoid data copying as much as possible.
>>>>
>>>> We could also consider swapping out BIDMat for Breeze, but that
>>>> would be a big project and if we can figure out how to get
>>>> breeze+cublas to comparable performance that would be a big win.
>>>>
>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>>> [hidden email]> wrote:
>>>> Dear Spark developers,
>>>>
>>>> I am exploring how to make linear algebra operations faster within Spark.
>>>> One way of doing this is to use Scala Breeze library that is
>>>> bundled with Spark. For matrix operations, it employs Netlib-java
>>>> that has a Java wrapper for BLAS (basic linear algebra subprograms)
>>>> and LAPACK native binaries if they are available on the worker
>>>> node. It also has its own optimized Java implementation of BLAS. It
>>>> is worth mentioning that native binaries provide better performance only for BLAS level 3, i.e.
>>>> matrix-matrix operations or general matrix multiplication (GEMM).
>>>> This is confirmed by GEMM test on Netlib-java page
>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>>> experiments with training of artificial neural network
>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>> However, I would like to boost performance more.
>>>>
>>>> GPU is supposed to work fast with linear algebra and there is
>>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
>>>> server with Nvidia GPU and I was able to do the following. I linked
>>>> cublas (instead of cpu-based blas) with Netlib-java wrapper and put
>>>> it into Spark, so Breeze/Netlib is using it. Then I did some
>>>> performance measurements with regards to artificial neural network
>>>> batch learning in Spark MLlib that involves matrix-matrix
>>>> multiplications. It turns out that for matrices of size less than
>>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
>>>> slower for bigger matrices. It is worth mentioning that it was not a test of ONLY multiplication, since there are other operations involved.
>>>> One of the reasons for slowdown might be the overhead of copying
>>>> the matrices from computer memory to graphic card memory and back.
>>>>
>>>> So, a few questions:
>>>> 1) Do these results with CUDA make sense?
>>>> 2) If the problem is the copy overhead, are there any libraries
>>>> that allow forcing intermediate results to stay in graphics card
>>>> memory, thus removing the overhead?
>>>> 3) Any other options to speed-up linear algebra in Spark?
>>>>
>>>> Thank you, Alexander
>>>>

--
Best regards,
Sam

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Using CUDA within Spark / boosting linear algebra

Evan R. Sparks
Yeah, much more reasonable - nice to know that we can get full GPU
performance from breeze/netlib-java - meaning there's no compelling
performance reason to switch out our current linear algebra library (at
least as far as this benchmark is concerned).

Instead, it looks like a user guide for configuring Spark/MLlib to use the
right BLAS library will get us most of the way there. Or, would it make
sense to finally ship openblas compiled for some common platforms (64-bit
linux, windows, mac) directly with Spark - hopefully eliminating the jblas
warnings once and for all for most users? (Licensing is BSD) Or am I
missing something?
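
As a stopgap before such a guide exists, one quick way to see which backend netlib-java actually bound to is to ask it directly from the Spark shell. A minimal sketch (the class is on MLlib's classpath transitively; the printed names are netlib-java's implementation classes):

    // Prints e.g. "com.github.fommil.netlib.NativeSystemBLAS" when a native
    // library was found, or "com.github.fommil.netlib.F2jBLAS" after the
    // pure-Java fallback kicked in.
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)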


Re: Using CUDA within Spark / boosting linear algebra

fommil
I'm not at all surprised ;-) I fully expect the GPU performance to get
better automatically as the hardware improves.

Netlib natives still need to be shipped separately. I'd also oppose any
move to make OpenBLAS the default - it is not always better, and I think
natives really need DevOps buy-in. It's not the right solution for
everybody.
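
netlib-java also lets a deployment pin the implementation per JVM rather than changing any global default. A sketch using its documented system property (the NativeSystemBLAS value assumes a system-wide libblas is installed on that machine):

    ./spark-shell --driver-java-options \
      "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS"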
On 26 Mar 2015 01:23, "Evan R. Sparks" <[hidden email]> wrote:

> Yeah, much more reasonable - nice to know that we can get full GPU
> performance from breeze/netlib-java - meaning there's no compelling
> performance reason to switch out our current linear algebra library (at
> least as far as this benchmark is concerned).
>
> Instead, it looks like a user guide for configuring Spark/MLlib to use the
> right BLAS library will get us most of the way there. Or, would it make
> sense to finally ship openblas compiled for some common platforms (64-bit
> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
> warnings once and for all for most users? (Licensing is BSD) Or am I
> missing something?
>
> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
> [hidden email]> wrote:
>
>> As everyone suggested, the results were too good to be true, so I
>> double-checked them. It turns out that nvblas did not do the multiplication,
>> due to the NVBLAS_TILE_DIM parameter in "nvblas.conf", and returned a zero
>> matrix. My previously posted results with nvblas therefore measured matrix
>> copying only. The default NVBLAS_TILE_DIM==2048 is too big for my graphics
>> card / matrix size. I handpicked other values that worked. As a result,
>> netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a
>> how-to for nvblas configuration.
>>
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
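
A minimal nvblas.conf sketch with the parameters discussed here; the values are illustrative assumptions, not recommendations - in particular the tile size must be hand-tuned per card, as found above, and the CPU fallback path varies by machine:

    NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so   # CPU fallback library; path is an assumption
    NVBLAS_GPU_LIST ALL                             # route eligible BLAS3 calls to all GPUs
    NVBLAS_TILE_DIM 1024                            # the default 2048 was too big for the card above
    NVBLAS_AUTOPIN_MEM_ENABLED                      # pin host memory to speed up transfers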
>>
>>
>>
>> -----Original Message-----
>> From: Ulanov, Alexander
>> Sent: Wednesday, March 25, 2015 2:31 PM
>> To: Sam Halliday
>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks;
>> jfcanny
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi again,
>>
>> I finally managed to use nvblas within Spark+netlib-java. It has
>> exceptional performance for big matrices with Double, faster than
>> BIDMat-cuda with Float. But for smaller matrices, if you copy them
>> to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates
>> with the original nvblas presentation at the GPU Tech Conference 2013 (slide 21):
>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>
>> My results:
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Just in case: these tests are not meant to generalize the performance of
>> different libraries. I just want to pick the library that performs dense
>> matrix multiplication best for my task.
>>
>> P.S. My previous issue with nvblas was the following: it exposes Fortran blas
>> functions, while netlib-java uses the C cblas functions. So, one needs a
>> cblas shared library to use nvblas through netlib-java. Fedora does not
>> ship cblas (but Debian and Ubuntu do), so I needed to compile it. I
>> could not use the cblas from Atlas or OpenBLAS because they link to their
>> own implementations and not to the Fortran blas.
>>
>> Best regards, Alexander
>>
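
Given the silent zero-matrix failure described above, a cheap correctness guard is worth running before trusting any timing. A sketch in Scala with Breeze (an all-ones product, where every entry of the result must equal n; n = 512 is an arbitrary size):

    import breeze.linalg.{DenseMatrix, sum}

    val n = 512
    val ones = DenseMatrix.ones[Double](n, n)
    val c = ones * ones   // goes through whichever BLAS is currently loaded
    // the sum is over n*n entries that must each be exactly n
    assert(sum(c) == n.toDouble * n * n, "BLAS returned a wrong (possibly zero) product")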
>> -----Original Message-----
>> From: Ulanov, Alexander
>> Sent: Tuesday, March 24, 2015 6:57 PM
>> To: Sam Halliday
>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi,
>>
>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>> should replace the current blas function calls after setting LD_PRELOAD, as
>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any
>> changes to netlib-java. It seems to work for a simple Java example, but I
>> cannot make it work with Spark. I run the following:
>>
>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>>
>> In nvidia-smi I observe that Java is using the GPU:
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID  Type  Process name                                    Usage |
>> |=============================================================================|
>> |    0      8873    C   bash                                            39MiB |
>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
>> +-----------------------------------------------------------------------------+
>>
>> In Spark shell I do matrix multiplication and see the following:
>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>> So I am sure that netlib-native is loaded and cblas is supposedly used.
>> However, matrix multiplication still executes on the CPU, since I see 16%
>> CPU usage and 0% GPU usage. I also checked different matrix sizes, from
>> 100x100 to 12000x12000.
>>
>> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>>
>> Best regards, Alexander
>>
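
For cluster runs, note also that an LD_PRELOAD exported in the interactive shell reaches the driver JVM but not executors on other machines. A sketch of making it explicit on both sides, with the CUDA path as used above:

    # conf/spark-env.sh on every node
    export LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so

    # or per application, for the executors only
    ./spark-shell --conf spark.executorEnv.LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so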
>>
>>
>> From: Sam Halliday [mailto:[hidden email]]
>> Sent: Monday, March 09, 2015 6:01 PM
>> To: Ulanov, Alexander
>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>>
>> Thanks so much for following up on this!
>>
>> Hmm, I wonder if we should have a concerted effort to chart performance
>> on various pieces of hardware...
>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]> wrote:
>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added the
>> comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see
>> support for Double in the current source code), and did the test with BIDMat
>> and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>>
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Best regards, Alexander
>>
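
For anyone reproducing numbers like those in the spreadsheet, a minimal single-multiply timing sketch in Scala with Breeze (the size and the single warm-up run are arbitrary choices; a real benchmark would repeat and average):

    import breeze.linalg.DenseMatrix

    val n = 1000
    val a = DenseMatrix.rand(n, n)
    val b = DenseMatrix.rand(n, n)
    a * b                                    // warm-up: loads natives, lets the JIT settle
    val t0 = System.nanoTime()
    val c = a * b
    println(s"$n x $n GEMM: ${(System.nanoTime() - t0) / 1e9} s")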
>> -----Original Message-----
>> From: Sam Halliday [mailto:[hidden email]]
>> Sent: Tuesday, March 03, 2015 1:54 PM
>> To: Xiangrui Meng; Joseph Bradley
>> Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>
>>
>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>
>> Would be nice to meet other people working on the guts of Spark! :-)
>>
>>
>> Xiangrui Meng <[hidden email]> writes:
>>
>> > Hey Alexander,
>> >
>> > I don't quite understand the part where netlib-cublas is about 20x
>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>> > with netlib-java?
>> >
>> > CC'ed Sam, the author of netlib-java.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]> wrote:
>> >> Better documentation for linking would be very helpful!  Here's a JIRA:
>> >> https://issues.apache.org/jira/browse/SPARK-6019
>> >>
>> >>
>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> >> <[hidden email]<mailto:[hidden email]>>
>> >> wrote:
>> >>
>> >>> Thanks for compiling all the data and running these benchmarks,
>> >>> Alex. The big takeaways here can be seen with this chart:
>> >>>
>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>> >>>
>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> >>> BIDMat+GPU) can provide a substantial (but less than an order of
>> >>> magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL
>> >>> or netlib-java+openblas-compiled).
>> >>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>> >>> worse than a well-tuned CPU implementation, particularly for larger
>> >>> matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib -
>> >>> this basically agrees with the author's own benchmarks (
>> >>> https://github.com/fommil/netlib-java)
>> >>>
>> >>> I think that most of our users are in a situation where using GPUs
>> >>> may not be practical - although we could consider having a good GPU
>> >>> backend available as an option. However, *ALL* users of MLlib could
>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>> >>> BLAS implementation. Perhaps we should consider updating the mllib
>> >>> guide with a more complete section for enabling high performance
>> >>> binaries on OSX and Linux? Or better, figure out a way for the
>> >>> system to fetch these automatically.
>> >>>
>> >>> - Evan
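
The usual symptom of the poorly tuned case is a pair of startup warnings roughly like the following (wording may differ across netlib-java/Spark versions, so treat it as approximate), after which MLlib silently falls back to f2jblas:

    WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS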
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <[hidden email]> wrote:
>> >>>
>> >>>> Just to summarize this thread, I was finally able to make all
>> >>>> performance comparisons that we discussed. It turns out that:
>> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>> >>>>
>> >>>> Below is the link to the spreadsheet with full results.
>> >>>>
>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>> >>>>
>> >>>> One thing still needs exploration: does BIDMat-cublas perform
>> >>>> copying to/from machine’s RAM?
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Ulanov, Alexander
>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >>>> To: Evan R. Sparks
>> >>>> Cc: Joseph Bradley; [hidden email]
>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
>> >>>> the original one discusses a slightly different topic. I was able to
>> >>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is
>> >>>> statically linked inside a 60MB library.
>> >>>>
>> >>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>> >>>> +-------------------------+-------------+---------------------------------+----------------------------------------+-----------------------+
>> >>>> | 100x100*100x100         | 0.00205596  | 0.000381                        | 0.03810324                             | 0.002556              |
>> >>>> | 1000x1000*1000x1000     | 0.018320947 | 0.038316857                     | 0.51803557                             | 1.638475459           |
>> >>>> | 10000x10000*10000x10000 | 23.78046632 | 32.94546697                     | 445.0935211                            | 1569.233228           |
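
A useful sanity check on any entry in such a table: an n x n GEMM costs about 2n^3 floating point operations, so time converts directly to throughput. A quick sketch, assuming the timings are in seconds (the units are not stated above):

    val n = 10000.0
    val seconds = 23.78046632                 // the 10000x10000 BIDMat MKL entry
    val gflops = 2 * n * n * n / seconds / 1e9
    println(f"$gflops%.1f GFLOP/s")           // ~84 GFLOP/s, plausible for a quad-core Xeon E3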
>> >>>>
>> >>>> It turns out that precompiled MKL is faster than precompiled
>> >>>> OpenBLAS on my machine. I will probably add two more columns with
>> >>>> locally compiled OpenBLAS and CUDA.
>> >>>>
>> >>>> Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley;
>> >>>> [hidden email]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Great - perhaps we can move this discussion off-list and onto a
>> >>>> JIRA ticket? (Here's one:
>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>> >>>>
>> >>>> It seems like this is going to be somewhat exploratory for a while
>> >>>> (and there's probably only a handful of us who really care about
>> >>>> fast linear
>> >>>> algebra!)
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>> >>>> [hidden email]> wrote:
>> >>>> Hi Evan,
>> >>>>
>> >>>> Thank you for explanation and useful link. I am going to build
>> >>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
>> >>>>
>> >>>> Do I understand correctly that BIDMat binaries contain statically
>> >>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>> >>>> BIDMat without having MKL BLAS installed on my server. If it is true, I
>> >>>> wonder if it is OK because Intel sells this library. Nevertheless,
>> >>>> it seems that in my case precompiled MKL BLAS performs better than
>> >>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed
>> >>>> to be on par with JNI overheads.
>> >>>>
>> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>> >>>> (Netlib-java) would be interested in comparing their libraries.
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>> >>>>
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley;
>> >>>> [hidden email]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>> >>>> from getting cache sizes, etc. set up correctly for your particular
>> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>> >>>> quickly and yields performance competitive with MKL.
>> >>>>
>> >>>> To make sure the right library is getting used, you have to make
>> >>>> sure it's first on the search path - export
>> >>>> LD_LIBRARY_PATH=/path/to/blas (the directory containing the library) will do the trick here.
>> >>>>
>> >>>> For some examples of getting netlib-java setup on an ec2 node and
>> >>>> some example benchmarking code we ran a while back, see:
>> >>>> https://github.com/shivaram/matrix-bench
>> >>>>
>> >>>> In particular - build-openblas-ec2.sh shows you how to build the
>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
>> >>>> shows you how to get the path setup and get that library picked up
>> >>>> by netlib-java.
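
The symlink step amounts to pointing the generic BLAS/LAPACK sonames at the OpenBLAS build so the loader finds it. A sketch with illustrative paths (the install prefix and soname locations are assumptions that vary by distro):

    sudo ln -sf /opt/openblas/lib/libopenblas.so /usr/lib64/libblas.so.3
    sudo ln -sf /opt/openblas/lib/libopenblas.so /usr/lib64/liblapack.so.3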
>> >>>>
>> >>>> In this way - you could probably get cuBLAS set up to be used by
>> >>>> netlib-java as well.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>> >>>> [hidden email]> wrote:
>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>> >>>> load the right BLAS? For netlib, there are a few JVM flags, such as
>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>> >>>> so I can force it to use the Java implementation. I am not sure how to
>> >>>> force the use of a specific BLAS (not a specific wrapper for BLAS).
>> >>>>
>> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
>> >>>> that netlib is using it.
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley;
>> >>>> [hidden email]
>> >>>>
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Getting breeze to pick up the right blas library is critical for
>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>> >>>> It might make sense to force BIDMat to use the same underlying BLAS
>> >>>> library as well.
>> >>>>
>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>> >>>> [hidden email]> wrote:
>> >>>> Hi Evan, Joseph
>> >>>>
>> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
>> >>>>
>> >>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java (native_system_linux_x86-64) | Breeze+Netlib-java f2jblas |
>> >>>> +-------------------------+-------------+--------------------------------------------------+----------------------------+
>> >>>> | 100x100*100x100         | 0.00205596  | 0.03810324                                       | 0.002556                   |
>> >>>> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                       | 1.638475459                |
>> >>>> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                      | 1569.233228                |
>> >>>>
>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>> >>>> 19 Linux, Scala 2.11.
>> >>>>
>> >>>> Later I will run tests with CUDA. I need to install a new CUDA
>> >>>> version for this purpose.
>> >>>>
>> >>>> Do you have any ideas why breeze-netlib with native blas is so much
>> >>>> slower than BIDMat MKL?
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Joseph Bradley [mailto:[hidden email]]
>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Evan R. Sparks;
>> >>>> [hidden email]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Hi Alexander,
>> >>>>
>> >>>> Using GPUs with Spark would be very exciting.  Small comment:
>> >>>> Concerning your question earlier about keeping data stored on the
>> >>>> GPU rather than having to move it between main memory and GPU
>> >>>> memory on each iteration, I would guess this would be critical to
>> >>>> getting good performance.  If you could do multiple local
>> >>>> iterations before aggregating results, then the cost of data
>> >>>> movement to the GPU could be amortized (and I believe that is done
>> >>>> in practice).  Having Spark be aware of the GPU and using it as
>> >>>> another part of memory sounds like a much bigger undertaking.
>> >>>>
>> >>>> Joseph
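
Joseph's amortization idea, sketched in plain Spark + Breeze terms (illustrative only: the per-element update below is a placeholder rule, and materializing the partition stands in for the one-time host-to-GPU copy whose cost gets amortized):

    import org.apache.spark.rdd.RDD
    import breeze.linalg.DenseVector

    def train(data: RDD[DenseVector[Double]], w0: DenseVector[Double],
              localIters: Int): DenseVector[Double] = {
      val models = data.mapPartitions { iter =>
        val batch = iter.toArray                 // pay the expensive transfer once
        val w = w0.copy
        for (_ <- 1 to localIters)               // many cheap local iterations...
          batch.foreach(x => w :+= x * 1e-3)     // ...with a placeholder update rule
        Iterator(w)
      }
      models.reduce(_ + _) * (1.0 / data.partitions.length)  // ...then one aggregation
    }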
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>> >>>> [hidden email]> wrote:
>> >>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
>> >>>> John Canny and I am really inspired by his talk and comparisons with
>> >>>> Spark MLlib.
>> >>>>
>> >>>> I am very interested in finding out what will work better within Spark:
>> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
>> >>>> neural networks in batch mode. While it is not a “pure” test of
>> >>>> linear algebra, it involves some other things that are essential to
>> >>>> machine learning.
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: [hidden email]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>> >>>> data layout and fewer levels of indirection - it's definitely a
>> >>>> worthwhile experiment to run. The main speedups I've seen from
>> >>>> using it come from highly optimized GPU code for linear algebra. I
>> >>>> know that in the past Canny has gone as far as to write custom GPU
>> >>>> kernels for performance-critical regions of code.[1]
>> >>>>
>> >>>> BIDMach is highly optimized for single node performance or
>> >>>> performance on small clusters.[2] Once data doesn't fit easily in
>> >>>> GPU memory (or can be batched in that way) the performance tends to
>> >>>> fall off. Canny argues for hardware/software codesign and as such
>> >>>> prefers machine configurations that are quite different than what
>> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels and
>> >>>> 4 GPUs.
>> >>>>
>> >>>> In contrast, MLlib was designed for horizontal scalability on
>> >>>> commodity clusters and works best on very big datasets - order of
>> terabytes.
>> >>>>
>> >>>> For the most part, these projects developed concurrently to address
>> >>>> slightly different use cases. That said, there may be bits of
>> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>> >>>> careful about maintaining cross-language compatibility for our Java
>> >>>> and Python-users, though.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> [1] - http://arxiv.org/abs/1409.5402 [2] -
>> >>>> http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
>> >>>> Hi Evan,
>> >>>>
>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>> >>>> you know what makes it faster than netlib-java?
>> >>>>
>> >>>> The same group has the BIDMach library that implements machine
>> >>>> learning. For some examples they use the Caffe convolutional neural
>> >>>> network library developed by another group at Berkeley. Could you
>> >>>> elaborate on how all of these might be connected with Spark Mllib? If
>> >>>> you take BIDMat for linear algebra, why don’t you take BIDMach for
>> >>>> optimization and learning?
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: [hidden email]
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>> >>>> blas in many cases.
>> >>>>
>> >>>> You might consider taking a look at the codepaths that BIDMat (
>> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>> >>>> optimizing to make this work really fast from Scala. I've run it on
>> >>>> my laptop and compared to MKL and in certain cases it's 10x faster
>> at matrix multiply.
>> >>>> There are a lot of layers of indirection here and you really want
>> >>>> to avoid data copying as much as possible.
>> >>>>
>> >>>> We could also consider swapping out Breeze for BIDMat, but that
>> >>>> would be a big project, and if we can figure out how to get
>> >>>> breeze+cublas to comparable performance that would be a big win.
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
>> >>>> Dear Spark developers,
>> >>>>
>> >>>> I am exploring how to make linear algebra operations faster within
>> Spark.
>> >>>> One way of doing this is to use the Scala Breeze library that is
>> >>>> bundled with Spark. For matrix operations, it employs Netlib-java,
>> >>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
>> >>>> and LAPACK native binaries if they are available on the worker
>> >>>> node. It also has its own optimized Java implementation of BLAS. It
>> >>>> is worth mentioning that native binaries provide better performance
>> >>>> only for BLAS level 3, i.e.
>> >>>> matrix-matrix operations or general matrix multiplication (GEMM).
>> >>>> This is confirmed by GEMM test on Netlib-java page
>> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>> >>>> experiments with training of artificial neural network
>> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> >>>> However, I would like to boost performance more.
>> >>>>
>> >>>> GPU is supposed to work fast with linear algebra, and there is an
>> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
>> >>>> server with Nvidia GPU and I was able to do the following. I linked
>> >>>> cublas (instead of cpu-based blas) with Netlib-java wrapper and put
>> >>>> it into Spark, so Breeze/Netlib is using it. Then I did some
>> >>>> performance measurements with regards to artificial neural network
>> >>>> batch learning in Spark MLlib that involves matrix-matrix
>> >>>> multiplications. It turns out that for matrices of size less than
>> >>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
>> >>>> slower for bigger matrices. It is worth mentioning that it was not a
>> >>>> test for ONLY multiplication, since there are other operations involved.
>> >>>> One of the reasons for slowdown might be the overhead of copying
>> >>>> the matrices from computer memory to graphic card memory and back.
>> >>>>
>> >>>> So, a few questions:
>> >>>> 1) Do these results with CUDA make sense?
>> >>>> 2) If the problem is with copy overhead, are there any libraries
>> >>>> that allow forcing intermediate results to stay in graphic card
>> >>>> memory, thus removing the overhead?
>> >>>> 3) Any other options to speed up linear algebra in Spark?
>> >>>>
>> >>>> Thank you, Alexander
>> >>>>
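One way to probe the copy-overhead crossover Alexander describes is to time
GEMMs of growing size through Breeze (which Spark already bundles) and see
where the currently loaded BLAS pulls ahead. A minimal sketch for the
spark-shell; the sizes are illustrative only:

import breeze.linalg.DenseMatrix

for (n <- Seq(100, 500, 1000, 2000, 4000)) {
  val a = DenseMatrix.rand(n, n)
  val b = DenseMatrix.rand(n, n)
  val t0 = System.nanoTime
  val c = a * b   // dispatches to whatever BLAS netlib-java has loaded
  println(f"n=$n%5d: ${(System.nanoTime - t0) / 1e9}%.3f s (c(0,0)=${c(0, 0)}%.3f)")
}
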
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe, e-mail: [hidden email]
>> >>>> For additional commands, e-mail: [hidden email]
>> >>>>
>> >>>>
>>
>> --
>> Best regards,
>> Sam
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

fommil
Btw, OpenBLAS requires GPL runtime binaries which are typically considered
"system libraries" (and these fall under something similar to the Java
classpath exception rule)... so it's basically impossible to distribute
OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
Spark right now to clear up something of this nature.

On a more technical level, I'd recommend watching my talk at ScalaX which
explains in detail why high performance only comes from machine optimised
binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
on the CPU, not OpenBLAS).

On an even deeper level, using natives has consequences for JIT and GC that
aren't suitable for everybody, and we'd really like people to go into that
with their eyes wide open.
On 26 Mar 2015 07:43, "Sam Halliday" <[hidden email]> wrote:

> I'm not at all surprised ;-) I fully expect the GPU performance to get
> better automatically as the hardware improves.
>
> Netlib natives still need to be shipped separately. I'd also oppose any
> move to make OpenBLAS the default - it's not always better, and I think
> natives really need DevOps buy-in. It's not the right solution for
> everybody.
> On 26 Mar 2015 01:23, "Evan R. Sparks" <[hidden email]> wrote:
>
>> Yeah, much more reasonable - nice to know that we can get full GPU
>> performance from breeze/netlib-java - meaning there's no compelling
>> performance reason to switch out our current linear algebra library (at
>> least as far as this benchmark is concerned).
>>
>> Instead, it looks like a user guide for configuring Spark/MLlib to use
>> the right BLAS library will get us most of the way there. Or, would it make
>> sense to finally ship openblas compiled for some common platforms (64-bit
>> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
>> warnings once and for all for most users? (Licensing is BSD) Or am I
>> missing something?
>>
>> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <[hidden email]> wrote:
>>
>>> As everyone suggested, the results were too good to be true, so I
>>> double-checked them. It turns out that nvblas did not do the multiplication,
>>> due to the parameter NVBLAS_TILE_DIM from "nvblas.conf", and returned a zero
>>> matrix. My previously posted results with nvblas reflect matrix copying only.
>>> The default NVBLAS_TILE_DIM==2048 is too big for my graphic card/matrix size. I
>>> handpicked other values that worked. As a result, netlib+nvblas is on par
>>> with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
>>> configuration.
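The silent zero-matrix failure is a good argument for validating results, not
just timing them. A small Breeze sketch of the idea (size and tolerance are
illustrative):

import breeze.linalg.{DenseMatrix, sum}
import breeze.numerics.abs

// Multiply by the identity: a misconfigured BLAS that silently returns
// zeros, as nvblas did here, fails this check immediately.
val n = 512
val a = DenseMatrix.rand(n, n)
val c = a * DenseMatrix.eye[Double](n)
require(sum(abs(c - a)) < 1e-6, "BLAS returned a wrong result")
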
>>>
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
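For reference, the knobs involved live in nvblas.conf. A hypothetical sketch
of such a file - the directive names are NVIDIA's documented ones, but the
values and the CPU BLAS path below are illustrative assumptions, not the
configuration used in these tests:

# nvblas.conf (illustrative values only)
NVBLAS_LOGFILE      nvblas.log
# CPU BLAS to fall back on for small/unsupported calls (required)
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
NVBLAS_GPU_LIST     ALL
# the 2048 default was too large for this card/matrix size
NVBLAS_TILE_DIM     1024
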
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Ulanov, Alexander
>>> Sent: Wednesday, March 25, 2015 2:31 PM
>>> To: Sam Halliday
>>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R.
>>> Sparks; jfcanny
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Hi again,
>>>
>>> I finally managed to use nvblas within Spark+netlib-java. It has
>>> exceptional performance for big matrices with Double, faster than
>>> BIDMat-cuda with Float. But for smaller matrices, if you copy them
>>> to/from the GPU, OpenBlas or MKL might be a better choice. This correlates
>>> with the original nvblas presentation from GPU conf 2013 (slide 21):
>>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>>
>>> My results:
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> Just in case: these tests are not meant as a generalization of the
>>> performance of different libraries. I just want to pick the library that
>>> best handles dense matrix multiplication for my task.
>>>
>>> P.S. My previous issue with nvblas was the following: it has Fortran
>>> blas functions, while netlib-java uses C cblas functions. So, one needs a
>>> cblas shared library to use nvblas through netlib-java. Fedora does not
>>> have cblas (but Debian and Ubuntu have it), so I needed to compile it. I
>>> could not use cblas from Atlas or Openblas because they link to their own
>>> implementation and not to Fortran blas.
>>>
>>> Best regards, Alexander
>>>
>>> -----Original Message-----
>>> From: Ulanov, Alexander
>>> Sent: Tuesday, March 24, 2015 6:57 PM
>>> To: Sam Halliday
>>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Hi,
>>>
>>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>>> should replace the current blas function calls after setting LD_PRELOAD, as
>>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any
>>> changes to netlib-java. It seems to work for a simple Java example, but I
>>> cannot make it work with Spark. I run the following:
>>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>>> In nvidia-smi I observe that the Java process is using the GPU:
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                       GPU Memory |
>>> |  GPU       PID  Type  Process name                               Usage      |
>>> |=============================================================================|
>>> |    0      8873    C   bash                                       39MiB      |
>>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java           39MiB      |
>>> +-----------------------------------------------------------------------------+
>>>
>>> In Spark shell I do matrix multiplication and see the following:
>>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>>> So I am sure that netlib-native is loaded and cblas is supposedly used.
>>> However, matrix multiplication executes on the CPU, since I see 16% CPU
>>> usage and 0% GPU usage. I also checked different matrix sizes, from
>>> 100x100 to 12000x12000.
>>>
>>> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>>>
>>> Best regards, Alexander
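A quick Linux-only check of whether LD_PRELOAD actually reached the shell's
JVM - a sketch that can be pasted straight into spark-shell:

import scala.io.Source

// If LD_PRELOAD took effect, libnvblas should be mapped into this process.
val preloaded = Source.fromFile("/proc/self/maps")
  .getLines().exists(_.contains("libnvblas"))
println(s"libnvblas mapped into this JVM: $preloaded")
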
>>>
>>>
>>>
>>> From: Sam Halliday [mailto:[hidden email]]
>>> Sent: Monday, March 09, 2015 6:01 PM
>>> To: Ulanov, Alexander
>>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>>
>>> Thanks so much for following up on this!
>>>
>>> Hmm, I wonder if we should have a concerted effort to chart performance
>>> on various pieces of hardware...
>>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]> wrote:
>>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the
>>> comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the
>>> support of Double in the current source code), did the test with BIDMat and
>>> CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>>>
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> Best regards, Alexander
>>>
>>> -----Original Message-----
>>> From: Sam Halliday [mailto:[hidden email]]
>>> Sent: Tuesday, March 03, 2015 1:54 PM
>>> To: Xiangrui Meng; Joseph Bradley
>>> Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>>
>>>
>>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>>
>>> Would be nice to meet other people working on the guts of Spark! :-)
>>>
>>>
>>> Xiangrui Meng <[hidden email]> writes:
>>>
>>> > Hey Alexander,
>>> >
>>> > I don't quite understand the part where netlib-cublas is about 20x
>>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>>> > with netlib-java?
>>> >
>>> > CC'ed Sam, the author of netlib-java.
>>> >
>>> > Best,
>>> > Xiangrui
>>> >
>>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <[hidden email]> wrote:
>>> >> Better documentation for linking would be very helpful!  Here's a
>>> JIRA:
>>> >> https://issues.apache.org/jira/browse/SPARK-6019
>>> >>
>>> >>
>>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <[hidden email]> wrote:
>>> >>
>>> >>> Thanks for compiling all the data and running these benchmarks,
>>> >>> Alex. The big takeaways here can be seen with this chart:
>>> >>>
>>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>> >>>
>>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> >>> BIDMat+GPU) can provide a substantial (but less than an order of
>>> >>> magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL
>>> >>> or netlib-java+openblas-compiled).
>>> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref) can
>>> >>> be 1-2 orders of magnitude worse than a well-tuned CPU implementation,
>>> >>> particularly for larger matrices. This is not to pick on netlib - this
>>> >>> basically agrees with the author's own benchmarks (
>>> >>> https://github.com/fommil/netlib-java)
>>> >>>
>>> >>> I think that most of our users are in a situation where using GPUs
>>> >>> may not be practical - although we could consider having a good GPU
>>> >>> backend available as an option. However, *ALL* users of MLlib could
>>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>> >>> BLAS implementation. Perhaps we should consider updating the mllib
>>> >>> guide with a more complete section for enabling high performance
>>> >>> binaries on OSX and Linux? Or better, figure out a way for the
>>> >>> system to fetch these automatically.
>>> >>>
>>> >>> - Evan
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>
>>> >>>> Just to summarize this thread, I was finally able to make all the
>>> >>>> performance comparisons that we discussed. It turns out that:
>>> >>>> BIDMat-cublas >> BIDMat-MKL == netlib-mkl == netlib-openblas-compiled >
>>> >>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>> >>>>
>>> >>>> Below is the link to the spreadsheet with full results.
>>> >>>>
>>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>> >>>>
>>> >>>> One thing still needs exploration: does BIDMat-cublas perform
>>> >>>> copying to/from machine’s RAM?
>>> >>>>
>>> >>>> -----Original Message-----
>>> >>>> From: Ulanov, Alexander
>>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>> >>>> To: Evan R. Sparks
>>> >>>> Cc: Joseph Bradley; [hidden email]
>>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
>>> >>>> the original one discusses a slightly different topic. I was able to
>>> >>>> link netlib with MKL from the BIDMat binaries. Indeed, MKL is
>>> >>>> statically linked inside a 60MB library.
>>> >>>>
>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>> >>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>> >>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>> >>>>
>>> >>>> It turns out that pre-compiled MKL is faster than precompiled
>>> >>>> OpenBlas on my machine. Probably, I’ll add two more columns with
>>> >>>> locally compiled openblas and cuda.
>>> >>>>
>>> >>>> Alexander
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Joseph Bradley; [hidden email]
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> Great - perhaps we can move this discussion off-list and onto a
>>> >>>> JIRA ticket? (Here's one:
>>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>> >>>>
>>> >>>> It seems like this is going to be somewhat exploratory for a while
>>> >>>> (and there's probably only a handful of us who really care about
>>> >>>> fast linear
>>> >>>> algebra!)
>>> >>>>
>>> >>>> - Evan
>>> >>>>
>>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <[hidden email]> wrote:
>>> >>>> Hi Evan,
>>> >>>>
>>> >>>> Thank you for explanation and useful link. I am going to build
>>> >>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
>>> >>>>
>>> >>>> Do I understand correctly that BIDMat binaries contain statically
>>> >>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>>> >>>> BIDMat without having MKL BLAS installed on my server. If that is true,
>>> >>>> I wonder if it is OK, because Intel sells this library. Nevertheless,
>>> >>>> it seems that in my case precompiled MKL BLAS performs better than
>>> >>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are
>>> >>>> supposed to be on par with JNI overheads.
>>> >>>>
>>> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>>> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>>> >>>> (Netlib-java) would be interested in comparing their libraries.
>>> >>>>
>>> >>>> Best regards, Alexander
>>> >>>>
>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>>> >>>>
>>> >>>> To: Ulanov, Alexander
>>> >>>> Cc: Joseph Bradley; [hidden email]
>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>> >>>>
>>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>> >>>> from getting cache sizes, etc. set up correctly for your particular
>>> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>> >>>> quickly and yields performance competitive with MKL.
>>> >>>>
>>> >>>> To make sure the right library is getting used, you have to make
>>> >>>> sure it's first on the search path - export
>>> >>>> LD_LIBRARY_PATH=/path/to/blas/dir will do the trick here.
>>> >>>>
>>> >>>> For some examples of getting netlib-java setup on an ec2 node and
>>> >>>> some example benchmarking code we ran a while back, see:
>>> >>>> https://github.com/shivaram/matrix-bench
>>> >>>>
>>> >>>> In particular - build-openblas-ec2.sh shows you how to build the
>>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
>>> >>>> shows you how to get the path setup and get that library picked up
>>> by netlib-java.
>>> >>>>
>>> >>>> In this way - you could probably get cuBLAS set up to be used by
>>> >>>> netlib-java as well.
>>> >>>>
>>> >>>> - Evan
>>> >>>>
>>> >>>> […]
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

Evan R. Sparks
Alright Sam - you are the expert here. If the GPL issues are unavoidable,
that's fine - what is the exact bit of code that is GPL?

The suggestion to use OpenBLAS is not to say it's the best option, but that
it's a *free, reasonable default* for many users - keep in mind the most
common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
Additionally, for many of the problems we're targeting, this reasonable
default can provide a 1-2 orders of magnitude improvement in performance
over the f2jblas implementation that netlib-java falls back on.

The JVM issues are trickier, I agree - so it sounds like a good user guide
explaining the tradeoffs and configuration procedures as they relate to
Spark is a reasonable way forward.

[1] -
https://gigaom.com/2015/01/27/a-few-interesting-numbers-about-apache-spark/
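To put a number on that f2jblas gap on a particular box, one can time the same
DGEMM against the pure-Java fallback and against whatever BLAS netlib-java
bound. A minimal sketch using netlib-java's API; the size n is illustrative:

import com.github.fommil.netlib.{BLAS, F2jBLAS}

object GemmCompare {
  // Time one DGEMM, C := A*B, on column-major n x n matrices.
  def time(blas: BLAS, n: Int): Double = {
    val a, b, c = Array.fill(n * n)(math.random)
    val t0 = System.nanoTime
    blas.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
    (System.nanoTime - t0) / 1e9
  }

  def main(args: Array[String]): Unit = {
    val n = 2048
    println(f"default (${BLAS.getInstance.getClass.getSimpleName}): ${time(BLAS.getInstance, n)}%.3f s")
    println(f"fallback (F2jBLAS): ${time(new F2jBLAS, n)}%.3f s")
  }
}
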

On Thu, Mar 26, 2015 at 12:54 AM, Sam Halliday <[hidden email]>
wrote:

> Btw, OpenBLAS requires GPL runtime binaries which are typically considered
> "system libraries" (and these fall under something similar to the Java
> classpath exception rule)... so it's basically impossible to distribute
> OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
> Spark right now to clear up something of this nature.
>
> On a more technical level, I'd recommend watching my talk at ScalaX which
> explains in detail why high performance only comes from machine optimised
> binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
> on the CPU, not OpenBLAS).
>
> On an even deeper level, using natives has consequences to JIT and GC
> which isn't suitable for everybody and we'd really like people to go into
> that with their eyes wide open.
> On 26 Mar 2015 07:43, "Sam Halliday" <[hidden email]> wrote:
>
>> I'm not at all surprised ;-) I fully expect the GPU performance to get
>> better automatically as the hardware improves.
>>
>> Netlib natives still need to be shipped separately. I'd also oppose any
>> move to make Open BLAS the default - is not always better and I think
>> natives really need DevOps buy-in. It's not the right solution for
>> everybody.
>> On 26 Mar 2015 01:23, "Evan R. Sparks" <[hidden email]> wrote:
>>
>>> Yeah, much more reasonable - nice to know that we can get full GPU
>>> performance from breeze/netlib-java - meaning there's no compelling
>>> performance reason to switch out our current linear algebra library (at
>>> least as far as this benchmark is concerned).
>>>
>>> Instead, it looks like a user guide for configuring Spark/MLlib to use
>>> the right BLAS library will get us most of the way there. Or, would it make
>>> sense to finally ship openblas compiled for some common platforms (64-bit
>>> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
>>> warnings once and for all for most users? (Licensing is BSD) Or am I
>>> missing something?
>>>
>>> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
>>> [hidden email]> wrote:
>>>
>>>> As everyone suggested, the results were too good to be true, so I
>>>> double-checked them. It turns that nvblas did not do multiplication due to
>>>> parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My
>>>> previously posted results with nvblas are matrices copying only. The
>>>> default NVBLAS_TILE_DIM==2048 is too big for my graphic card/matrix size. I
>>>> handpicked other values that worked. As a result, netlib+nvblas is on par
>>>> with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
>>>> configuration.
>>>>
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Wednesday, March 25, 2015 2:31 PM
>>>> To: Sam Halliday
>>>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R.
>>>> Sparks; jfcanny
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi again,
>>>>
>>>> I finally managed to use nvblas within Spark+netlib-java. It has
>>>> exceptional performance for big matrices with Double, faster than
>>>> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
>>>> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
>>>> original nvblas presentation on GPU conf 2013 (slide 21):
>>>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>>>
>>>> My results:
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> Just in case, these tests are not for generalization of performance of
>>>> different libraries. I just want to pick a library that does at best dense
>>>> matrices multiplication for my task.
>>>>
>>>> P.S. My previous issue with nvblas was the following: it has Fortran
>>>> blas functions, at the same time netlib-java uses C cblas functions. So,
>>>> one needs cblas shared library to use nvblas through netlib-java. Fedora
>>>> does not have cblas (but Debian and Ubuntu have), so I needed to compile
>>>> it. I could not use cblas from Atlas or Openblas because they link to their
>>>> implementation and not to Fortran blas.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, March 24, 2015 6:57 PM
>>>> To: Sam Halliday
>>>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi,
>>>>
>>>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>>>> should replace current blas functions calls after executing LD_PRELOAD as
>>>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
>>>> changes to netlib-java. It seems to work for simple Java example, but I
>>>> cannot make it work with Spark. I run the following:
>>>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>>>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
>>>> --driver-memory 4G In nvidia-smi I observe that Java is to use GPU:
>>>>
>>>> +-----------------------------------------------------------------------------+
>>>> | Processes:                                                       GPU
>>>> Memory |
>>>> |  GPU       PID  Type  Process name
>>>>  Usage      |
>>>>
>>>> |=============================================================================|
>>>> |    0      8873    C   bash
>>>> 39MiB |
>>>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java
>>>> 39MiB |
>>>>
>>>> +-----------------------------------------------------------------------------+
>>>>
>>>> In Spark shell I do matrix multiplication and see the following:
>>>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>>>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>>>> So I am sure that netlib-native is loaded and cblas supposedly used.
>>>> However, matrix multiplication does executes on CPU since I see 16% of CPU
>>>> used and 0% of GPU used. I also checked different matrix sizes, from
>>>> 100x100 to 12000x12000
>>>>
>>>> Could you suggest might the LD_PRELOAD not affect Spark shell?
>>>>
>>>> Best regards, Alexander
>>>>
>>>>
>>>>
>>>> From: Sam Halliday [mailto:[hidden email]]
>>>> Sent: Monday, March 09, 2015 6:01 PM
>>>> To: Ulanov, Alexander
>>>> Cc: [hidden email]; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>>
>>>> Thanks so much for following up on this!
>>>>
>>>> Hmm, I wonder if we should have a concerted effort to chart performance
>>>> on various pieces of hardware...
>>>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <[hidden email]
>>>> <mailto:[hidden email]>> wrote:
>>>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added
>>>> the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see
>>>> the support of Double in the current source code), did the test with BIDMat
>>>> and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>>>>
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> Best regards, Alexander
>>>>
>>>> -----Original Message-----
>>>> From: Sam Halliday [mailto:[hidden email]<mailto:
>>>> [hidden email]>]
>>>> Sent: Tuesday, March 03, 2015 1:54 PM
>>>> To: Xiangrui Meng; Joseph Bradley
>>>> Cc: Evan R. Sparks; Ulanov, Alexander; [hidden email]<mailto:
>>>> [hidden email]>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>>>
>>>>
>>>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>>>
>>>> Would be nice to meet other people working on the guts of Spark! :-)
>>>>
>>>>
>>>> Xiangrui Meng <[hidden email]<mailto:[hidden email]>> writes:
>>>>
>>>> > Hey Alexander,
>>>> >
>>>> > I don't quite understand the part where netlib-cublas is about 20x
>>>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>>>> > with netlib-java?
>>>> >
>>>> > CC'ed Sam, the author of netlib-java.
>>>> >
>>>> > Best,
>>>> > Xiangrui
>>>> >
>>>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <
>>>> [hidden email]<mailto:[hidden email]>> wrote:
>>>> >> Better documentation for linking would be very helpful!  Here's a
>>>> JIRA:
>>>> >> https://issues.apache.org/jira/browse/SPARK-6019
>>>> >>
>>>> >>
>>>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>>>> >> <[hidden email]<mailto:[hidden email]>>
>>>> >> wrote:
>>>> >>
>>>> >>> Thanks for compiling all the data and running these benchmarks,
>>>> >>> Alex. The big takeaways here can be seen with this chart:
>>>> >>>
>>>> >>>
>>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZ
>>>> >>> Hl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>> >>>
>>>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>>> >>> BIDMat+GPU) can provide substantial (but less than an order of
>>>> >>> BIDMat+magnitude)
>>>> >>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>>>> >>> netlib-java+openblas-compiled).
>>>> >>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>>>> >>> worse than a well-tuned CPU implementation, particularly for larger
>>>> matrices.
>>>> >>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>>>> >>> basically agrees with the authors own benchmarks (
>>>> >>> https://github.com/fommil/netlib-java)
>>>> >>>
>>>> >>> I think that most of our users are in a situation where using GPUs
>>>> >>> may not be practical - although we could consider having a good GPU
>>>> >>> backend available as an option. However, *ALL* users of MLlib could
>>>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>>> >>> BLAS implementation. Perhaps we should consider updating the mllib
>>>> >>> guide with a more complete section for enabling high performance
>>>> >>> binaries on OSX and Linux? Or better, figure out a way for the
>>>> >>> system to fetch these automatically.
>>>> >>>
>>>> >>> - Evan
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>>> >>> [hidden email]<mailto:[hidden email]>> wrote:
>>>> >>>
>>>> >>>> Just to summarize this thread, I was finally able to make all
>>>> >>>> performance comparisons that we discussed. It turns out that:
>>>> >>>> BIDMat-cublas>>BIDMat
>>>> >>>> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo=
>>>> >>>> =netlib-cublas>netlib-blas>f2jblas
>>>> >>>>
>>>> >>>> Below is the link to the spreadsheet with full results.
>>>> >>>>
>>>> >>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx
>>>> >>>> 378T9J5r7kwKSPkY/edit?usp=sharing
>>>> >>>>
>>>> >>>> One thing still needs exploration: does BIDMat-cublas perform
>>>> >>>> copying to/from machine’s RAM?
>>>> >>>>
>>>> >>>> -----Original Message-----
>>>> >>>> From: Ulanov, Alexander
>>>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> >>>> To: Evan R. Sparks
>>>> >>>> Cc: Joseph Bradley;
>>>> >>>> [hidden email]<mailto:[hidden email]>
>>>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Thanks, Evan! It seems that ticket was marked as duplicate though
>>>> >>>> the original one discusses slightly different topic. I was able to
>>>> >>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is
>>>> >>>> statically linked inside a 60MB library.
>>>> >>>>
>>>> >>>> |A*B  size | BIDMat MKL | Breeze+Netlib-MKL  from BIDMat|
>>>> >>>> Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas |
>>>> >>>>
>>>> +-----------------------------------------------------------------------+
>>>> >>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
>>>> >>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557
>>>> >>>> |1,638475459 |
>>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 |445,0935211 |
>>>> >>>> 1569,233228 |
>>>> >>>>
>>>> >>>> It turn out that pre-compiled MKL is faster than precompiled
>>>> >>>> OpenBlas on my machine. Probably, I’ll add two more columns with
>>>> >>>> locally compiled openblas and cuda.
>>>> >>>>
>>>> >>>> Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks
>>>> >>>> [mailto:[hidden email]<mailto:[hidden email]>]
>>>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley;
>>>> >>>> [hidden email]<mailto:[hidden email]>
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Great - perhaps we can move this discussion off-list and onto a
>>>> >>>> JIRA ticket? (Here's one:
>>>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>>> >>>>
>>>> >>>> It seems like this is going to be somewhat exploratory for a while
>>>> >>>> (and there's probably only a handful of us who really care about
>>>> >>>> fast linear
>>>> >>>> algebra!)
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>>> >>>> [hidden email]<mailto:[hidden email]><mailto:
>>>> [hidden email]<mailto:[hidden email]>>> wrote:
>>>> >>>> Hi Evan,
>>>> >>>>
>>>> >>>> Thank you for explanation and useful link. I am going to build
>>>> >>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
>>>> >>>>
>>>> >>>> Do I understand correctly that BIDMat binaries contain statically
>>>> >>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>>>> >>>> BIDMat not having MKL BLAS installed on my server. If it is true, I
>>>> >>>> wonder if it is OK because Intel sells this library. Nevertheless,
>>>> >>>> it seems that in my case precompiled MKL BLAS performs better than
>>>> >>>> precompiled OpenBLAS given that BIDMat and Netlib-java are
>>>> supposed to be on par with JNI overheads.
>>>> >>>>
>>>> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>>>> >>>> as you suggested. I wonder, are John Canny (BIDMat) and Sam
>>>> >>>> Halliday
>>>> >>>> (Netlib-java) interested to compare their libraries.
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
>>>> [hidden email]><mailto:
>>>> >>>> [hidden email]<mailto:[hidden email]>>]
>>>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>>>> >>>>
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley;
>>>> >>>> [hidden email]<mailto:[hidden email]><mailto:dev@spark
>>>> .
>>>> >>>> apache.org<mailto:[hidden email]>>
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>>> >>>> from getting cache sizes, etc. set up correctly for your particular
>>>> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>>> >>>> quickly and yields performance competitive with MKL.
>>>> >>>>
>>>> >>>> To make sure the right library is getting used, you have to make
>>>> >>>> sure it's first on the search path - export
>>>> >>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>> >>>>
>>>> >>>> For some examples of getting netlib-java setup on an ec2 node and
>>>> >>>> some example benchmarking code we ran a while back, see:
>>>> >>>> https://github.com/shivaram/matrix-bench
>>>> >>>>
>>>> >>>> In particular - build-openblas-ec2.sh shows you how to build the
>>>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
>>>> >>>> shows you how to get the path setup and get that library picked up
>>>> by netlib-java.
>>>> >>>>
>>>> >>>> In this way - you could probably get cuBLAS set up to be used by
>>>> >>>> netlib-java as well.
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>>> >>>> [hidden email]<mailto:[hidden email]><mailto:
>>>> [hidden email]<mailto:[hidden email]>>> wrote:
>>>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>>> >>>> force loading the right blas? For netlib, I there are few JVM
>>>> >>>> flags, such as
>>>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>>> >>>> so I can force it to use Java implementation. Not sure I
>>>> understand how to force use a specific blas (not specific wrapper for blas).
>>>> >>>>
>>>> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
>>>> >>>> that netlib is using it.
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:[hidden email]<mailto:
>>>> [hidden email]><mailto:
>>>> >>>> [hidden email]<mailto:[hidden email]>>]
>>>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley;
>>>> >>>> [hidden email]<mailto:[hidden email]><mailto:dev@spark
>>>> .
>>>> >>>> apache.org<mailto:[hidden email]>>
>>>> >>>>
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Getting breeze to pick up the right blas library is critical for
>>>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already
>>>> have it).
>>>> >>>> It might make sense to force BIDMat to use the same underlying BLAS
>>>> >>>> library as well.
>>>> >>>>
>>>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>>> >>>> [hidden email]<mailto:[hidden email]><mailto:
>>>> [hidden email]<mailto:[hidden email]>>> wrote:
>>>> >>>> Hi Evan, Joseph
>>>> >>>>
>>>> >>>> I did few matrix multiplication test and BIDMat seems to be ~10x
>>>> >>>> faster than netlib-java+breeze (sorry for weird table formatting):
>>>> >>>>
>>>> >>>> |A*B  size | BIDMat MKL | Breeze+Netlib-java
>>>> >>>> |native_system_linux_x86-64|
>>>> >>>> Breeze+Netlib-java f2jblas |
>>>> >>>>
>>>> +-----------------------------------------------------------------------+
>>>> >>>> |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
>>>> >>>> |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
>>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228
>>>> >>>> ||
>>>> >>>>
>>>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>>> >>>> 19 Linux, Scala 2.11.
>>>> >>>>
>>>> >>>> Later I will make tests with Cuda. I need to install new Cuda
>>>> >>>> version for this purpose.
>>>> >>>>
>>>> >>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>> >>>> slower than BIDMat MKL?
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Joseph Bradley [mailto:[hidden email]<mailto:
>>>> [hidden email]><mailto:
>>>> >>>> [hidden email]<mailto:[hidden email]>>]
>>>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Evan R. Sparks;
>>>> >>>> [hidden email]<mailto:[hidden email]><mailto:dev@spark
>>>> .
>>>> >>>> apache.org<mailto:[hidden email]>>
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Hi Alexander,
>>>> >>>>
>>>> >>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> >>>> Concerning your question earlier about keeping data stored on the
>>>> >>>> GPU rather than having to move it between main memory and GPU
>>>> >>>> memory on each iteration, I would guess this would be critical to
>>>> >>>> getting good performance.  If you could do multiple local
>>>> >>>> iterations before aggregating results, then the cost of data
>>>> >>>> movement to the GPU could be amortized (and I believe that is done
>>>> >>>> in practice).  Having Spark be aware of the GPU and using it as
>>>> another part of memory sounds like a much bigger undertaking.
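A structural sketch of that amortization pattern (purely illustrative: the
"device" here is a plain array stand-in and gpuStep is a placeholder, since
Spark has no GPU-resident memory API; the point is one upload per
aggregation round rather than one per iteration):

    import org.apache.spark.rdd.RDD

    def trainRound(data: RDD[Array[Double]], w0: Array[Double],
                   localIters: Int): Array[Double] = {
      // Placeholder for a GPU-resident update step on already-uploaded data.
      def gpuStep(batch: Array[Array[Double]], w: Array[Double]): Array[Double] =
        w.map(_ * 0.99)

      val models = data.mapPartitions { part =>
        val onDevice = part.toArray          // stand-in for one host->GPU copy
        var w = w0
        for (_ <- 1 to localIters)
          w = gpuStep(onDevice, w)           // iterations reuse uploaded data
        Iterator(w)
      }.collect()
      models.transpose.map(ws => ws.sum / ws.length)  // average local models
    }
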
>>>> >>>>
>>>> >>>> Joseph
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>> >>>> Thank you for the explanation! I’ve watched the BIDMach presentation
>>>> >>>> by John Canny and I am really inspired by his talk and his
>>>> >>>> comparisons with Spark MLlib.
>>>> >>>>
>>>> >>>> I am very interested to find out what will be better within Spark:
>>>> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>>>> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>>> >>>> neural networks in batch mode. While it is not a “pure” test of
>>>> >>>> linear algebra, it involves some other things that are essential
>>>> to machine learning.
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: [hidden email]
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>> >>>> netlib-java+OpenBLAS, but if it is much faster, that is probably due
>>>> >>>> to data layout and fewer levels of indirection - it's definitely a
>>>> >>>> worthwhile experiment to run. The main speedups I've seen from
>>>> >>>> using it come from highly optimized GPU code for linear algebra. I
>>>> >>>> know that in the past Canny has gone as far as writing custom GPU
>>>> >>>> kernels for performance-critical regions of code.[1]
>>>> >>>>
>>>> >>>> BIDMach is highly optimized for single-node performance or
>>>> >>>> performance on small clusters.[2] Once data doesn't fit easily in
>>>> >>>> GPU memory (or can't be batched in that way), the performance tends
>>>> >>>> to fall off. Canny argues for hardware/software codesign and as such
>>>> >>>> prefers machine configurations that are quite different from what
>>>> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels
>>>> >>>> and 4 GPUs.
>>>> >>>>
>>>> >>>> In contrast, MLlib was designed for horizontal scalability on
>>>> >>>> commodity clusters and works best on very big datasets - on the
>>>> >>>> order of terabytes.
>>>> >>>>
>>>> >>>> For the most part, these projects developed concurrently to address
>>>> >>>> slightly different use cases. That said, there may be bits of
>>>> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>>> >>>> careful about maintaining cross-language compatibility for our Java
>>>> >>>> and Python users, though.
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> [1] http://arxiv.org/abs/1409.5402
>>>> >>>> [2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[hidden email]> wrote:
>>>> >>>> Hi Evan,
>>>> >>>>
>>>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed.
>>>> >>>> Do you know what makes it faster than netlib-java?
>>>> >>>>
>>>> >>>> The same group has the BIDMach library that implements machine
>>>> >>>> learning. For some examples they use the Caffe convolutional neural
>>>> >>>> network library, developed by another group at Berkeley. Could you
>>>> >>>> elaborate on how all of these might be connected with Spark MLlib?
>>>> >>>> If you take BIDMat for linear algebra, why not take BIDMach for
>>>> >>>> optimization and learning?
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:[hidden email]]
>>>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: [hidden email]
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>>> >>>> BLAS in many cases.
>>>> >>>>
>>>> >>>> You might consider taking a look at the codepaths that BIDMat (
>>>> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>>>> >>>> optimizing to make this work really fast from Scala. I've run it on
>>>> >>>> my laptop, compared it to MKL, and in certain cases it's 10x faster
>>>> >>>> at matrix multiply. There are a lot of layers of indirection here
>>>> >>>> and you really want to avoid data copying as much as possible.
>>>> >>>>
>>>> >>>> We could also consider swapping out Breeze for BIDMat, but that
>>>> >>>> would be a big project, and if we can figure out how to get
>>>> >>>> breeze+cublas to comparable performance, that would be a big win.
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[hidden email]> wrote:
>>>> >>>> Dear Spark developers,
>>>> >>>>
>>>> >>>> I am exploring how to make linear algebra operations faster within
>>>> >>>> Spark. One way of doing this is to use the Scala Breeze library that
>>>> >>>> is bundled with Spark. For matrix operations, it employs netlib-java,
>>>> >>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
>>>> >>>> and LAPACK native binaries, if they are available on the worker node.
>>>> >>>> It also has its own optimized Java implementation of BLAS. It is
>>>> >>>> worth mentioning that native binaries provide better performance only
>>>> >>>> for BLAS level 3, i.e. matrix-matrix operations or general matrix
>>>> >>>> multiplication (GEMM). This is confirmed by the GEMM test on the
>>>> >>>> netlib-java page https://github.com/fommil/netlib-java. I also
>>>> >>>> confirmed it in my experiments with training an artificial neural
>>>> >>>> network https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>> >>>> However, I would like to boost performance further.
>>>> >>>>
>>>> >>>> GPUs are supposed to be fast at linear algebra, and there is an
>>>> >>>> NVIDIA CUDA implementation of BLAS called cublas. I have one Linux
>>>> >>>> server with an NVIDIA GPU and I was able to do the following. I
>>>> >>>> linked cublas (instead of a CPU-based BLAS) with the netlib-java
>>>> >>>> wrapper and put it into Spark, so Breeze/netlib is using it. Then I
>>>> >>>> did some performance measurements on artificial neural network
>>>> >>>> batch learning in Spark MLlib, which involves matrix-matrix
>>>> >>>> multiplications. It turns out that for matrices of size less than
>>>> >>>> ~1000x780, GPU cublas has the same speed as CPU BLAS, and cublas
>>>> >>>> becomes slower for bigger matrices. It is worth mentioning that this
>>>> >>>> was not a test of ONLY multiplication, since other operations were
>>>> >>>> involved. One of the reasons for the slowdown might be the overhead
>>>> >>>> of copying the matrices from main memory to graphics card memory and back.
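The copy-overhead explanation is consistent with a back-of-envelope
estimate (round figures assumed here, not taken from this thread: ~6 GB/s
effective PCIe bandwidth and ~1 TFLOP/s double-precision GPU throughput):

    // For an n x n double-precision GEMM, A and B go in and C comes out.
    val n = 1000.0
    val transferSec = 3 * n * n * 8 / 6e9    // 24 MB over ~6 GB/s  ~= 4 ms
    val computeSec  = 2 * n * n * n / 1e12   // 2n^3 flops at ~1 TFLOP/s ~= 2 ms
    // Transfer alone exceeds compute at this size, so cublas cannot win
    // unless intermediate results stay resident in GPU memory.
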
>>>> >>>>
>>>> >>>> So, a few questions:
>>>> >>>> 1) Do these results with CUDA make sense?
>>>> >>>> 2) If the problem is the copy overhead, are there any libraries that
>>>> >>>> allow forcing intermediate results to stay in graphics card memory,
>>>> >>>> thus removing the overhead?
>>>> >>>> 3) Any other options to speed up linear algebra in Spark?
>>>> >>>>
>>>> >>>> Thank you, Alexander
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>
>>>>
>>>> --
>>>> Best regards,
>>>> Sam
>>>>
>>>
>>>
Reply | Threaded
Open this post in threaded view
|

Re: Using CUDA within Spark / boosting linear algebra

jfcanny
In reply to this post by Evan R. Sparks
I mentioned this earlier in the thread, but I'll put it out again: dense
BLAS are not very important for most machine learning workloads, at
least for non-image workloads in industry (and for image processing you
would probably want a deep learning/SGD solution with convolution
kernels). E.g. dense BLAS were relevant to only 1/7 of our recent
benchmarks, which should be a reasonable sample. What really matters is
sparse BLAS performance, and BIDMat is still an order of magnitude faster
there. Those kernels are only in BIDMat, since NVIDIA's sparse BLAS don't
perform well on power-law data.

It's also the case that the overall performance of an algorithm is
determined by the slowest kernel, not the fastest. If the goal is to get
closer to BIDMach's performance on typical problems, you need to make
sure that every kernel runs at a comparable speed. So the real question is
how much faster MLlib routines are on a complete problem with/without GPU
acceleration. For BIDMach, it's close to a factor of 10. But that
required running entirely on the GPU, and making sure every kernel is
close to its limit.
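
The "slowest kernel" point is Amdahl's law; a quick sanity check with
made-up fractions (a sketch, not measured data):

    // If one kernel is left un-accelerated and takes fraction f of the
    // runtime, speeding everything else up by s caps the overall gain.
    def overallSpeedup(f: Double, s: Double): Double =
      1.0 / (f + (1.0 - f) / s)

    println(overallSpeedup(0.2, 10.0))  // ~= 3.6x, far below the 10x per kernel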

-John

If you think nvblas would be helpful, you should try it in some